27 Jul

A.A.A.R.: dealing with long extraction times on big Alfresco repositories

During my support activities on the A.A.A.R. solution I often receive the question below. It usually comes with a request for help in optimizing the data extraction of A.A.A.R. and reducing the time it takes.

The script ran for hours but the data extraction process didn’t complete.
In the log I always see something like this:

Cmis Input modified document.0 – Cmis Input – Retrieved n.0 results from item n.714 on a total of n.967 results.
Cmis Input modified document.0 – Cmis Input – Retrieved n.0 results from item n.714 on a total of n.967 results.
Cmis Input modified document.0 – Cmis Input – Retrieved n.0 results from item n.714 on a total of n.967 results.

Francesco, could you give me support please?

In this post I would like to address this issue, describing the reasons for this behaviour and focusing on the solution (because there is a solution), so that you can test and use A.A.A.R. with satisfaction in your Alfresco installations.

Why does this issue happen?

In those cases, I usually explain to users that the extraction process is developed to retrieve, by default, all the available information about audits, the repository (with all the details of the data structure) and workflows. This was a design choice, made to show all the possible analytics, dashboards and reports on all the information stored in your Alfresco instances. I think this is definitely a good choice, but when the Alfresco repositories contain a big (or huge) quantity of data, or when the resources are not enough, this massive extraction could run for hours and hours and hours… and in some cases this is not acceptable.

You are probably thinking that I should define more precisely what I mean by “big (or huge) quantity of data” and “resources are not enough”. The correct answer for those two key concepts would be a benchmark of the A.A.A.R. solution and a statement of the minimum resources required. Instead of those (relevant) topics, I would like to focus this post on the “not acceptable duration” of the extraction process, with the goal of describing how to deal with it and solve it.

The possible solutions to the issue

These are the main solutions I always suggest:

  1. Improve the performance of the extraction process, optimizing the “longest” tasks with a development activity.
  2. Tune the extraction using the available parameters (because A.A.A.R. has some interesting parameters for this purpose).

The first solution is my favourite because it solves the issue for good, even if we all know that it requires development and, as a consequence, effort. The second solution is the concrete and practical one, and the easiest, even if we have to accept that every tuning choice could have an impact on the analytics and the results.

Description of the A.A.A.R. parameters

If you choose not to develop an optimization of the extraction process, my suggestion is to take a look at the parameters of the AAAR_Extract script, in particular the ones shown below.

..
 # Set by writeConfiguration.kjb.
 GET_AUDIT="true"
 GET_REPOSITORY="true"
 GET_PARENTS="false"
 GET_WORKFLOWS="false"
 ...
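Before changing anything, it can help to check which values the script currently holds. The following is only a minimal sketch, assuming that the script is a shell file and that the four parameters are plain NAME="value" assignments as in the listing above. The helper name show_extract_flags and the file name AAAR_Extract.sh are assumptions made for illustration, so adjust them to your installation.

```shell
# Hypothetical helper, not part of A.A.A.R.:
# print the GET_* assignments found in the script file passed as first argument.
show_extract_flags() {
    grep -E '^[[:space:]]*GET_(AUDIT|REPOSITORY|PARENTS|WORKFLOWS)=' "$1"
}

# Usage (the path is an assumption):
# show_extract_flags ./AAAR_Extract.sh
```

Since the listing says the values are set by writeConfiguration.kjb, a manual change may be overwritten when that job runs again, so it is worth checking the values right before launching the extraction.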

Below is a brief description of the parameters and their use.

GET_AUDIT := "true" | "false"

This parameter requests the extraction of the audit data from Alfresco. If you are not interested in the audit analytics, set it to false and the extraction process will be faster. By default the value is set to true.

GET_REPOSITORY := "true" | "false"

This parameter requests the extraction of the repository data from Alfresco. If you are not interested in the repository analytics, set it to false and the extraction process will be faster. By default the value is set to true.

GET_PARENTS := "true" | "false"

This parameter requests the extraction of the repository’s structure from Alfresco (NB: it affects only the repository’s structure). If you are not interested in analytics on the repository’s structure (for example: the content of a site, which documents are stored in a folder and its subfolders, etc.), set it to false and the extraction process will be faster. By default the value is set to true. The parameter is not used if GET_REPOSITORY is set to false.

HINT: With a big Alfresco repository, structured in a huge number of folders/subfolders, this setting could be crucial and save 95% of your extraction time. 😉
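The dependency between GET_PARENTS and GET_REPOSITORY can be sketched as a small shell function. This is only an illustration of the rule described above and not the code used by A.A.A.R.; the function name parents_extraction_runs is hypothetical.

```shell
# Hypothetical helper, not A.A.A.R. code.
# Arguments: the value of GET_REPOSITORY, then the value of GET_PARENTS.
# Prints "true" only when both are "true", because GET_PARENTS
# is not used when GET_REPOSITORY is set to false.
parents_extraction_runs() {
    get_repository="$1"
    get_parents="$2"
    if [ "$get_repository" = "true" ] && [ "$get_parents" = "true" ]; then
        echo "true"
    else
        echo "false"
    fi
}

# Repository data requested, structure skipped.
parents_extraction_runs "true" "false"
# GET_PARENTS is ignored because the repository extraction is off.
parents_extraction_runs "false" "true"
```

In other words, on a big repository you can keep GET_REPOSITORY set to true and switch off only GET_PARENTS, giving up the analytics on the folder structure and keeping the other repository analytics.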

GET_WORKFLOWS := "true" | "false"

This parameter requests the extraction of the workflow data from Alfresco. If you are not interested in the workflow analytics, set it to false and the extraction process will be faster. By default the value is set to true.

Conclusion

In this post I shared some relevant parameters of the A.A.A.R. extraction process. They are useful for tuning the solution and become crucial in some practical use cases, for example with big Alfresco repositories.
