A.A.A.R. – Pentaho Data Integration

[Figure: A.A.A.R. flows]

Pentaho Data Integration (http://community.pentaho.com) is used in A.A.A.R. as a typical E.T.L. system. In particular, Pentaho Data Integration is used to extract the Alfresco audit data and the repository composition into the Data Mart, and to create the defined reports, uploading them back to Alfresco.

It’s easy to understand that audit data are constantly created while the Alfresco E.C.M. is in use. Each user access, each read of content, each creation or modification of content defines a new audit action that A.A.A.R. must extract. For that reason the two E.T.L. should be executed periodically to update the Data Mart and, consequently, the reports in Alfresco. According to Data Warehousing techniques, the update could be performed once a day, typically during the night, when the servers are less loaded on average. To keep the processes lighter and more scalable, all the E.T.L. are developed with an “incremental” strategy where only the new actions since the last import are extracted into the Data Mart. Scheduling the execution of the two E.T.L. is very easy thanks to the features of Pentaho Data Integration (http://wiki.pentaho.com/display/EAI/Kitchen+User+Documentation).
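As a sketch of how the nightly schedule could be implemented, the small Python wrapper below runs the two jobs in sequence through Kitchen; the installation path, job names and repository credentials are assumptions used only for illustration.

```python
#!/usr/bin/env python
"""Nightly wrapper for the two A.A.A.R. E.T.L. jobs (hypothetical names and paths)."""
import subprocess
import sys

KITCHEN = "/opt/pentaho/data-integration/kitchen.sh"  # assumption: PDI install path
JOBS = ["Extract", "Report"]  # assumption: job names inside the A.A.A.R._Kettle repository

for job in JOBS:
    # Run each job against the PDI repository; stop if one fails, because the
    # report job depends on the freshly updated Data Mart.
    result = subprocess.call([
        KITCHEN,
        "-rep", "A.A.A.R._Kettle",  # repository name taken from this documentation
        "-user", "admin",           # assumption: default repository credentials
        "-pass", "admin",
        "-job", job,
    ])
    if result != 0:
        sys.exit("Job '%s' failed with exit code %d" % (job, result))
```

A cron entry pointing at a script like this (or at two direct Kitchen calls) once a day is enough to keep the Data Mart and the reports up to date.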

Talking about the first E.T.L., the one that extracts the Alfresco audit data and repository composition into the Data Mart, Data Warehousing techniques suggest managing the flow of information in steps. The steps are ordered and cannot be executed in parallel. The steps are:

  1. Upload of the audit data and repository composition into a staging area.
  2. Reconciliation of the audit data and the repository’s structure.
  3. Upload into the Star Schema.

In the first step, the upload into the staging area, the audit data and the documents/folders information are loaded into DBMS tables with the same structure as the source. Thanks to Pentaho Data Integration it is very easy to get audit data from a REST call (the Alfresco web script) directly into a DBMS table matching its structure. Regarding the repository composition, the CMIS Input plugin is able to extract all the available information very easily from inside the Pentaho environment.
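As a rough illustration of what this staging step does (A.A.A.R. implements it inside a PDI transformation, not in Python), the sketch below queries the Alfresco audit web script and loads the raw entries into a hypothetical staging table; the endpoint parameters, credentials, response fields and table layout are assumptions.

```python
"""Sketch of the staging upload: REST call to the Alfresco audit web script,
raw rows loaded unchanged into a staging table of the Data Mart DBMS."""
import requests
import psycopg2

ALFRESCO = "http://localhost:8080/alfresco"
AUTH = ("admin", "admin")        # assumption: default credentials
LAST_IMPORT = 1388534400000      # epoch millis of the last incremental import

# Incremental strategy: only the entries created after the last import are requested.
resp = requests.get(
    ALFRESCO + "/service/api/audit/query/alfresco-access",
    params={"fromTime": LAST_IMPORT, "verbose": "true"},
    auth=AUTH,
)
entries = resp.json().get("entries", [])

# Load the raw entries, with the same structure as the source, into the staging table.
conn = psycopg2.connect(dbname="AAAR_DataMart", user="postgres")  # assumption: DB names
cur = conn.cursor()
for e in entries:
    cur.execute(
        "INSERT INTO stage_audit (audit_id, audit_user, audit_time, audit_values) "
        "VALUES (%s, %s, %s, %s)",
        (e["id"], e.get("user"), e["time"], str(e.get("values"))),
    )
conn.commit()
```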

In the second step, the reconciliation of the data, the rows are “cleaned” of the not relevant actions, “enriched” with derived information (for example the descriptions of days, months, etc.) and “expanded” where they were compacted during the extraction (for example, arrays are exploded into multiple rows). The goal is to make the audit data and documents/folders information ready for insertion into the Star Schema.
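The following sketch shows the same cleaning, enrichment and expansion logic in plain Python; the ignored actions, field names and output layout are assumptions, since A.A.A.R. performs these operations with PDI transformations on the staging tables.

```python
"""Sketch of the reconciliation step: clean, enrich and expand staged audit rows."""
from datetime import datetime

IGNORED_ACTIONS = {"login", "logout"}  # assumption: actions not relevant for the reports

def reconcile(staged_rows):
    """Yield one clean, enriched row per (entry, value) pair."""
    for row in staged_rows:
        if row["action"] in IGNORED_ACTIONS:   # "clean": drop the not relevant actions
            continue
        when = datetime.fromtimestamp(row["time"] / 1000.0)
        for value in row["values"]:            # "expand": one output row per array element
            yield {
                "user": row["user"],
                "action": row["action"],
                "value": value,
                # "enrich": derived date descriptions for the date dimension
                "day": when.strftime("%Y-%m-%d"),
                "month_name": when.strftime("%B"),
                "day_name": when.strftime("%A"),
            }
```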

In the third and last step, the expanded audit data and documents/folders information are uploaded into the Star Schema (representing the Data Mart). Thanks to Pentaho Data Integration, and according to Data Warehousing techniques, it is very easy to read data from the DBMS tables and upload it into the Star Schema, defined with dimension tables and fact tables.
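A minimal sketch of this last load is shown below, assuming hypothetical dimension and fact tables; in A.A.A.R. the same result is obtained with the standard PDI dimension lookup and table output steps.

```python
"""Sketch of the Star Schema load: dimension key lookup/insert, then fact insert."""
import psycopg2

def dim_user_key(cur, user_name):
    """Return the surrogate key of the user dimension, inserting the user if new."""
    cur.execute("SELECT user_key FROM dim_user WHERE user_name = %s", (user_name,))
    found = cur.fetchone()
    if found:
        return found[0]
    cur.execute("INSERT INTO dim_user (user_name) VALUES (%s) RETURNING user_key", (user_name,))
    return cur.fetchone()[0]

# Sample row as produced by the reconciliation step (values are illustrative).
reconciled_rows = [{"user": "admin", "action": "READ", "day": "2014-01-01"}]

conn = psycopg2.connect(dbname="AAAR_DataMart", user="postgres")  # assumption: DB names
cur = conn.cursor()
for row in reconciled_rows:
    cur.execute(
        "INSERT INTO fact_audit (user_key, action, audit_day) VALUES (%s, %s, %s)",
        (dim_user_key(cur, row["user"]), row["action"], row["day"]),
    )
conn.commit()
```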

For more details about the two developed E.T.L. we suggest reading the dedicated chapter later in the documentation.

Both E.T.L. are developed to work with parameters, all stored in DBMS tables and not in the E.T.L. sources. We probably haven’t thought of every need in terms of parameters, but in the future we could extend them according to user requests. As things stand today, it is possible to (see the sketch after the list):

  • Have multiple Alfresco sources (one for each row in ‘dm_dim_alfresco’).
  • Have multiple and customized reports (one for each row in ‘dm_reports’).
  • Upload each report to a different Alfresco target and a different Alfresco path (according to each row in ‘dm_reports’).
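As an illustration of how the DBMS-stored parameters could drive the processing, the sketch below simply iterates over the rows of ‘dm_dim_alfresco’ and ‘dm_reports’; apart from the table names and the node_limit field, the column names are assumptions about the actual schema.

```python
"""Sketch: one extraction per row of dm_dim_alfresco, one report per row of dm_reports."""
import psycopg2

conn = psycopg2.connect(dbname="AAAR_DataMart", user="postgres")  # assumption: DB names
cur = conn.cursor()

# One Alfresco source per row of dm_dim_alfresco.
cur.execute("SELECT alfresco_url, node_limit FROM dm_dim_alfresco")
for alfresco_url, node_limit in cur.fetchall():
    print("Extracting from %s, %d nodes per iteration" % (alfresco_url, node_limit))

# One report per row of dm_reports, each with its own Alfresco target and path.
cur.execute("SELECT report_name, target_url, target_path FROM dm_reports")
for report_name, target_url, target_path in cur.fetchall():
    print("Publishing %s to %s%s" % (report_name, target_url, target_path))
```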


Both E.T.L. are developed as Jobs and Transformations, all stored in a Pentaho Data Integration Repository. For those of you who don’t feel confident with Pentaho Data Integration: the E.T.L. are all stored in a DBMS called ‘A.A.A.R._Kettle’ and can be opened with the Spoon application of the Pentaho Data Integration Suite (http://wiki.pentaho.com/display/EAI/Spoon+User+Guide).

The A.A.A.R. E.T.L. are developed with Pentaho Data Integration Community Edition v4.3 Stable. Compatibility with older versions cannot be guaranteed, even if the modules used in the jobs and transformations are more or less the most common ones. The only exception could be the Pentaho Reporting Output step (http://wiki.pentaho.com/display/EAI/Pentaho+Reporting+Output), used for the creation of the static reports inside a transformation and available in PDI starting from version 4.1.0 (http://forums.pentaho.com/showthread.php?79085-Announcing-Pentaho-Reporting-Output-step-for-PDI).

For the Pentaho Data Integration Repository, PostgreSQL v9.0 has been used. Older versions should work because no particular features of the DBMS are used. Please report limits and tests so it can work in other contexts too.



5 thoughts on “A.A.A.R. – Pentaho Data Integration”

  1. Hi Francesco Corti,

    Thanks for the excellent work on AAAR.
    As per the documentation I have installed everything on my machine, but I didn’t understand the E.T.L. part that extracts the audit and repository data. How are the E.T.L. jobs implemented for this requirement?
    Could you please help me with that?

    Thanks in advance,
    Sayagoud

  2. Hi Francesco,

    I was trying to extract data from the repository with AAAR_Extract.sh when the process stopped with a memory heap error.
    I tried again assigning 8 GB of memory (inside Kitchen.sh) but it doesn’t work.
    The other extractions run well; only GET_REPOSITORY fails.
    Could you tell me what could be happening, please?

    Ubuntu 14.04
    java 1.8
    pentaho 6.1
    PDI 5.0.1
    AAAR 4.4

    thanks,
    Iñigo

    • Francesco Corti

      Hi Iñigo,

      Sorry for my late reply.
      My suggestion is to align the PDI and BA Server versions of Pentaho, first of all.
      My suspicion is that this is the real issue in your use case.

      You can then reduce the number of nodes extracted at each iteration by customizing the node_limit field of the dm_dim_alfresco table (set by default to 50000).

      Please let me know how it goes.

      Cheers.
