A.A.A.R. – How to use A.A.A.R. – Extract from Alfresco using Pentaho

With this task you are going to extract information from Alfresco directly into the A.A.A.R. Data Mart using the Pentaho suite. This task is mandatory for the whole A.A.A.R. solution to work properly with your data. The extraction is executed using a command-line script (called AAAR_Extract), usually scheduled once per day, during the night.


You can find the script in the <biserver-ce>/pentaho-solutions/system/AAAR/endpoints/kettle/script folder.

The first time, I suggest you execute the AAAR_Extract script manually. In a production environment you can schedule it with the silent parameter to have a real ETL process from your Alfresco to the A.A.A.R. Data Mart.
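As a rough illustration of the nightly scheduling described above, the sketch below builds a crontab line that would run the script once per day. The script directory placeholder, the `.sh` file name, the literal `silent` argument, the log path and the 02:00 start time are all assumptions made for the example; the advanced configuration remains the reference for the real parameter syntax.

```python
def nightly_cron_line(script_dir: str, hour: int, minute: int = 0,
                      script_name: str = "AAAR_Extract.sh",
                      args: str = "silent",
                      log_file: str = "/var/log/aaar_extract.log") -> str:
    """Return a crontab entry that runs the extraction once per day.

    script_name, args and log_file are hypothetical defaults, not values
    confirmed by the A.A.A.R. documentation.
    """
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # crontab fields: minute hour day-of-month month day-of-week command
    command = f"cd {script_dir} && ./{script_name} {args} >> {log_file} 2>&1"
    return f"{minute} {hour} * * * {command}"


if __name__ == "__main__":
    # "/path/to/kettle/script" is a placeholder for your own script folder.
    print(nightly_cron_line("/path/to/kettle/script", hour=2))
```

The printed line can be pasted into `crontab -e` for the user that owns the Pentaho installation, once the path and the argument have been adapted to your environment.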

As you can see, everything is really automatic.

The script accepts several parameters to customize your import process. Please read the advanced configuration to see which settings are available.

Once the extraction has been executed, you can check the executed tasks and the Data Quality checker in the dashboard below (available from A.A.A.R. v4.4).

data quality check

You can read further details about the data quality dashboard in this post.


12 thoughts on “A.A.A.R. – How to use A.A.A.R. – Extract from Alfresco using Pentaho”

  1. Good day sir, I just followed your instructions and the previous step went fine. But now I have a problem during the extract and publish tasks. Am I correct that Alfresco and Pentaho must both be running at the same time? I can't manage that, even though the two now use different port numbers: I can't run Alfresco and Pentaho at the same time. What should I do? Thanks!

    • Francesco Corti

      Hi Medel,
      Yes, Alfresco and Pentaho should run on different ports, on the same server or on two different ones. I'm sorry, but I don't understand why you cannot do it.

      • I don't know either. When the Alfresco server is running on port 8081 and I try to run the 'start-pentaho' command, which uses port 8080, it gives an error mentioning a memory leak. I don't know why.

  2. Francesco Corti

    Hi Medel,
    Yes, the two processes are using more memory than you have on your server.
    In both cases you can tune the JAVA_OPTS variable for that.
    You can find several tutorials around the web.

  3. Hi Medel,
    I had the same problem. In my case, changing all the ports helped: in $ALF_HOME/tomcat/conf/server.xml, for example, the redirectPort, the ports of the AJP 1.3 Connector and so on.
    Something like this.
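To make the port advice in this thread concrete, here is a minimal sketch that lists the listening ports declared in two Tomcat `server.xml` files and reports the ones they share. The two XML snippets are invented sample configurations, not the files shipped with Alfresco or Pentaho; in practice you would read the real files from each installation.

```python
import xml.etree.ElementTree as ET


def listening_ports(server_xml: str) -> set:
    """Collect the ports a Tomcat server.xml binds to.

    Covers the shutdown port on <Server> and the port of every <Connector>.
    redirectPort is skipped on purpose: it only points at another connector,
    which is already counted through its own port attribute if it exists.
    """
    root = ET.fromstring(server_xml)
    candidates = [root.get("port")]
    candidates += [c.get("port") for c in root.iter("Connector")]
    # Ignore missing attributes and disabled ports such as port="-1".
    return {int(p) for p in candidates if p and int(p) > 0}


# Hypothetical sample configurations, for illustration only.
ALFRESCO_XML = """<Server port="8005" shutdown="SHUTDOWN">
  <Service name="Catalina">
    <Connector port="8081" protocol="HTTP/1.1" redirectPort="8443"/>
    <Connector port="8009" protocol="AJP/1.3" redirectPort="8443"/>
  </Service>
</Server>"""

PENTAHO_XML = """<Server port="8005" shutdown="SHUTDOWN">
  <Service name="Catalina">
    <Connector port="8080" protocol="HTTP/1.1" redirectPort="8443"/>
    <Connector port="8009" protocol="AJP/1.3" redirectPort="8443"/>
  </Service>
</Server>"""

clashes = sorted(listening_ports(ALFRESCO_XML) & listening_ports(PENTAHO_XML))
print("Ports declared by both servers:", clashes)
```

In this sample the HTTP ports already differ, yet the shutdown and AJP ports still collide, which matches the experience reported above: changing only the HTTP port is not always enough.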

  4. Ciao Francesco,
    Very good job so far, thank you very much for sharing your valuable work!
    My questions (sorry, I'm a Pentaho newbie):
    * We have many customers with several million nodes (1 to 25 million). My first try on a repo with about 2 million nodes already takes 12 hours. CMIS may not be the most performant option. Do you crawl the whole repo on every run, or only on the first run, with later changes handled by the audit events?
    I see batch sizes of 100 nodes in the AAAR_Extract script. How can I increase this number? I tried changing values in some XML files in the kettle dir, but without success.
    * Why not query the repo DB directly? This would be much faster.

    thanks in advance!

    • Francesco Corti

      Hi hi-ko,
      Thank you for your feedback.
      As usual from you, what you say is wise. 😉
      Just some (partial) answers…
      Audit extraction is incremental.
      Repository extraction is not incremental (if you take a look at the log, the duration comes from this).
      I would like to bring this post to your attention:

      The topics you touch on are very relevant… I apologize to the community, but I'll reply privately.



  5. Hi Francesco
    I ran AAAR_Extract.sh, it finished with an error, and I saw this in the log:

    “Call REST in json file – ERROR (version, build 1 from 2015-10-07 13.27.43 by buildguy) : I was unable to save the HTTP result to file because of a I/O error:{“types”:[],”aspects”:[]}&skip=0&limit=50000
    2016/10/27 14:57:32 – Call REST in json file – ERROR (version, build 1 from 2015-10-07 13.27.43 by buildguy) : java.io.FileNotFoundException:{“types”:[],”aspects”:[]}&skip=0&limit=50000″

    The AAAR dashboard only has details for Audit data, with Actions per day and Actions per path.
    In Dashboard–>Repository, the Folder info box shows “Error processing component (folderComponent)”.
    There are no details for folders and documents.
    Do you have any suggestions about this?
    I use Pentaho 6.0 with AAAR 4.2 on Alfresco CE 5.1.

    Thanks in advance!!

    • Francesco Corti


      What happens if you execute the “{“types”:[],”aspects”:[]}&skip=0&limit=50000” request directly in the web browser?

      Does Alfresco reply or not?
      If not, please take a look at the catalina.out log file in Alfresco for further details.
      If you need more support, please write to me at fcorti at gmail dot com.


  6. Hi Francesco,
    I executed AAAR_Extract.bat, looked at the Extractions page and saw an error in Nodes Staging and Workflows Staging.
    There was also a KO for Actions in the Data quality table.

    I looked for details of the error in AAAR.log but I didn't find anything.
    Right now I can only see the reports related to Action types, Top ten users and Login analysis per day. When I open the other reports, they say that there is no data.

    That's why I wanted to know how I can solve this, or in which log file I should search for the details of these errors.

    Thanks in advance 🙂

    • Francesco Corti

      Hi Natalia,

      You can find the logs in the catalina.out file in the /tomcat/logs folder.
      Please post the issue on the community.alfresco.com portal so I can help you directly there.
      I hope it will help you.
