In this task you extract information from Alfresco directly into the A.A.A.R. Data Mart using the Pentaho suite. This task is mandatory for the whole A.A.A.R. solution to work properly with your data. The extraction is executed by a command-line script (called AAAR_Extract), usually scheduled once per day, during the night.
You can find the script in the <biserver-ce>/pentaho-solutions/system/AAAR/endpoints/kettle/script folder.
The first time, I suggest you execute the AAAR_Extract script manually, but in a production environment you could schedule it with the silent parameter to have a real ETL process from your Alfresco to the A.A.A.R. Data Mart.
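As a rough illustration of a scheduled, unattended run, here is a minimal Python sketch that builds the command line and launches the script from its own folder. The installation path is a placeholder, and passing silent as a plain positional argument is an assumption; please check the advanced configuration for the exact syntax.

```python
import subprocess
from pathlib import Path

# Assumption: placeholder location, replace with your real <biserver-ce> folder.
SCRIPT_DIR = Path("/opt/biserver-ce/pentaho-solutions/system/AAAR/endpoints/kettle/script")


def build_extract_command(script_dir: Path, silent: bool = True) -> list[str]:
    """Return the argument list for AAAR_Extract.sh, optionally with 'silent'."""
    command = [str(script_dir / "AAAR_Extract.sh")]
    if silent:
        # Assumption: 'silent' is passed as a plain positional argument.
        command.append("silent")
    return command


def run_extract(script_dir: Path = SCRIPT_DIR) -> int:
    """Run the extraction from its own folder and return the exit code."""
    result = subprocess.run(build_extract_command(script_dir), cwd=script_dir)
    return result.returncode
```

A nightly scheduler (cron, for example) could call run_extract() and raise an alert when the returned exit code is not zero.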
As you can see, everything is really automatic.
The script accepts several parameters to customize your import process. Please read the advanced configuration to understand what can be set up.
Once the extraction has been executed, you can check the execution tasks and the Data Quality checker in the dashboard (available since A.A.A.R. v4.4).
You can read further details about the data quality dashboard in this post.
Good day sir, I just followed your instructions and the previous step went well. But I now have a problem during the extract and publish tasks. Am I correct that Alfresco and Pentaho must both be running at the same time? Even though the two now use different port numbers, I can’t run Alfresco and Pentaho at the same time. What should I do? Thanks!
Hi Medel,
Yes, Alfresco and Pentaho should run on different ports, on the same server or on two different ones. I’m sorry, but I don’t understand why you cannot do it.
I don’t know either. When I’m running the Alfresco server on port ‘8081’ and try to run the ‘start-pentaho’ command on port ‘8080’, it gives an error like a memory leak. I don’t know why.
Hi Medel,
Yes, the two processes together are using more memory than your server has.
In both cases you can tune the JAVA_OPTS variable for that.
You can find several tutorials around the web.
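To make the JAVA_OPTS suggestion concrete, here is a minimal Python sketch that builds a value with the standard JVM heap flags (-Xms for the initial heap, -Xmx for the maximum) and places it in the environment of a child process. The sizes are only examples, and whether each startup script honours an inherited JAVA_OPTS or overrides it with its own settings is an assumption to verify in the script itself.

```python
import os


def java_opts(min_heap_mb: int, max_heap_mb: int) -> str:
    """Build a JAVA_OPTS value with initial (-Xms) and maximum (-Xmx) heap sizes."""
    if min_heap_mb > max_heap_mb:
        raise ValueError("initial heap cannot exceed maximum heap")
    return f"-Xms{min_heap_mb}m -Xmx{max_heap_mb}m"


def env_with_java_opts(min_heap_mb: int, max_heap_mb: int) -> dict[str, str]:
    """Copy the current environment and set JAVA_OPTS for a child process."""
    env = dict(os.environ)
    env["JAVA_OPTS"] = java_opts(min_heap_mb, max_heap_mb)
    return env


# Example sizes only: the two maximum heaps together should fit in physical RAM.
print(java_opts(512, 1024))
```

The returned dictionary can be passed as the env argument of subprocess.run when launching either server.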
Hi Medel,
I had the same problem. In my case, changing all the ports helped: in $ALF_HOME/tomcat/conf/server.xml, for example, the redirectPort, the ports of the AJP 1.3 Connector and so on…
Something like that.
Good job, Olimpian.
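For anyone who wants to check this before editing by hand, here is a minimal Python sketch that lists the ports two Tomcat server.xml files would both try to listen on. The two sample documents are illustrative only (they reuse the Tomcat default shutdown and AJP ports) and are not copied from a real Alfresco or Pentaho installation. The redirectPort attribute is left out because it only names a redirect target and does not open a socket by itself.

```python
import xml.etree.ElementTree as ET


def listening_ports(root: ET.Element) -> set[int]:
    """Collect the Server shutdown port and every Connector port."""
    ports = set()
    for element in [root, *root.iter("Connector")]:
        value = element.get("port")
        # A shutdown port of -1 means "disabled", so it cannot collide.
        if value and int(value) > 0:
            ports.add(int(value))
    return ports


def conflicting_ports(first: ET.Element, second: ET.Element) -> set[int]:
    """Return the ports that both configurations would try to bind."""
    return listening_ports(first) & listening_ports(second)


# Illustrative samples: only the HTTP port differs between the two.
ALFRESCO_XML = """<Server port="8005" shutdown="SHUTDOWN"><Service name="Catalina">
<Connector port="8081" protocol="HTTP/1.1" redirectPort="8443"/>
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443"/>
</Service></Server>"""
PENTAHO_XML = """<Server port="8005" shutdown="SHUTDOWN"><Service name="Catalina">
<Connector port="8080" protocol="HTTP/1.1" redirectPort="8443"/>
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443"/>
</Service></Server>"""

conflicts = conflicting_ports(ET.fromstring(ALFRESCO_XML), ET.fromstring(PENTAHO_XML))
print(sorted(conflicts))  # prints [8005, 8009]
```

For real files, ET.parse(path).getroot() gives the root element to pass in. In the sample, changing only the HTTP port still leaves the shutdown and AJP ports shared, which matches the experience described above.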
Ciao Francesco,
very good job so far – thank you very much for sharing your valuable work!
My questions (sorry, I’m a Pentaho newbie):
* We have many customers with several million nodes (1-25 million). My first try on a repo with ~2 million nodes already took 12 hours. CMIS may not be the most performant option. Do you crawl the whole repo on every run, or only on the first run, with later changes handled by the audit events?
I see batch sizes of 100 nodes in the AAAR_Extract script. How can I increase this number? I tried to change values in some XML files in the kettle dir, but without success…
* Why not query the repo DB directly? This would be much faster.
thanks in advance!
Hi hi-ko,
Thank you for your feedback.
As usual from you: what you say is wise. 😉
Just some (partial) answers…
Audit extraction is incremental.
Repository extraction is not incremental (if you take a look at the log, the duration is related to this).
I would like to bring this post to your attention:
http://fcorti.com/2015/07/27/aaar-long-time-extraction/
The topics you touch on are very relevant… I apologize to the community, but I’ll reply privately.
Cheers,
-F
Hi Francesco
I ran AAAR_Extract.sh; it finished with errors, and I saw this in the log:
“Call REST in json file – ERROR (version 6.0.0.0-353, build 1 from 2015-10-07 13.27.43 by buildguy) : I was unable to save the HTTP result to file because of a I/O error: http://172.25.174.209:8080/alfresco/service/AAAR/getNodesModifiedAfter?baseType=cm:content&dt=2001-01-01&customProperties={“types”:[],”aspects”:[]}&skip=0&limit=50000
2016/10/27 14:57:32 – Call REST in json file – ERROR (version 6.0.0.0-353, build 1 from 2015-10-07 13.27.43 by buildguy) : java.io.FileNotFoundException: http://172.25.174.209:8080/alfresco/service/AAAR/getNodesModifiedAfter?baseType=cm:content&dt=2001-01-01&customProperties={“types”:[],”aspects”:[]}&skip=0&limit=50000″
The AAAR dashboard only has the Audit data details, with Actions per day and Actions per path.
In Dashboard–>Repository, the Folder info box shows “Error processing component (folderComponent)”.
There are no details of folders and documents.
Do you have any suggestions about this?
I use Pentaho 6.0 with AAAR 4.2 on Alfresco CE 5.1
Thanks in advance!!
Hi,
What happens if you execute the “http://172.25.174.209:8080/alfresco/service/AAAR/getNodesModifiedAfter?baseType=cm:content&dt=2001-01-01&customProperties={“types”:[],”aspects”:[]}&skip=0&limit=50000” request directly in the web browser?
Does Alfresco reply or not?
If not, please take a look at the catalina.out log file in Alfresco for further details.
If you need more support, please write me at fcorti at gmail dot com.
Cheers.
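If a browser is not at hand, the same check can be sketched in a few lines of Python. The query parameters are the ones visible in the log above; the base URL is a placeholder, and no credentials are sent, so a 401 answer may simply mean that Alfresco wants a login. As a side note, Java typically reports an HTTP 404 as a FileNotFoundException, so a 404 here would suggest that Alfresco did not find that web script URL.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def nodes_modified_after_url(base_url: str, skip: int = 0, limit: int = 50000) -> str:
    """Build the getNodesModifiedAfter URL with properly encoded parameters."""
    query = urllib.parse.urlencode({
        "baseType": "cm:content",
        "dt": "2001-01-01",
        # The JSON value contains braces and quotes, so it must be URL-encoded.
        "customProperties": json.dumps({"types": [], "aspects": []}, separators=(",", ":")),
        "skip": skip,
        "limit": limit,
    })
    return f"{base_url}/service/AAAR/getNodesModifiedAfter?{query}"


def http_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status code, including 4xx and 5xx answers."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


# Placeholder base URL: replace host and port with your Alfresco instance.
print(nodes_modified_after_url("http://localhost:8080/alfresco"))
```

Calling http_status() on the printed URL returns the status code; a connection that cannot be opened at all raises urllib.error.URLError instead.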
Hi Francesco,
I executed AAAR_Extract.bat, looked at the Extractions page, and saw there was an error in Nodes Staging and Workflows Staging.
There was also a KO in Actions in the Data quality table.
I looked for details of the error in AAAR.log but I didn’t find anything.
Right now I can only see the reports related to Actions types, Top ten users and Login analysis per day. When I open the other reports, they say that there is no data.
That’s why I wanted to know how I can solve this, or in which log file I should search for the details of these errors.
Thanks in advance 🙂
Hi Natalia,
You can find the logs in the catalina.out file in the /tomcat/logs folder.
Please post the issue on the community.alfresco.com portal so I can help you directly there.
I hope this helps.
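When the log files are large, a small filter can help to find the relevant lines quickly. Here is a minimal Python sketch that keeps only the lines containing a marker; the default marker ERROR is an assumption based on the log excerpts quoted on this page.

```python
from pathlib import Path


def error_lines(log_text: str, marker: str = "ERROR") -> list[str]:
    """Return the lines of a log that contain the marker."""
    return [line for line in log_text.splitlines() if marker in line]


def error_lines_in_file(path: Path, marker: str = "ERROR") -> list[str]:
    """Read a log file leniently (bad bytes are replaced) and filter it."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return error_lines(text, marker)
```

The same function can be pointed at catalina.out and at AAAR.log in turn; Java stack traces usually continue on the lines right after a match, so it is worth reading those in the original file as well.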
Hi, Francesco Corti
I have a problem extracting data with the file “AAAR_Extract.sh”.
After running the file, the process does not extract any info to the database (previously it did extract data to the database).
I read the log file located at:
“/LOCAL_ROUTE/AAAR/pentaho-server/pentaho-solutions/system/AAAR/endpoints/kettle/logs/AAAR.log” and found the following errors:
Error 1.
Abort.0 – ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : Row nr 1 causing abort : [null], [{INFORMATION TO MY DATABASE CONECTION}], [409]
Abort.0 – ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : Aborting after having seen 1 rows.
Abort.0 – Finished processing (I=0, O=0, R=1, W=1, U=0, E=1)
_pentahoAddDatasource – Transformation detected one or more steps with errors.
_pentahoAddDatasource – Transformation is killing the other steps!
_pentahoAddDatasource – ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : Errors detected!
Switch / Case.0 – Finished processing (I=0, O=0, R=1, W=1, U=0, E=0)
HTTP Post.0 – Finished processing (I=0, O=0, R=1, W=0, U=0, E=0)
_pentahoAddDatasource – ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : Errors detected!
_installFromConfiguration – Finished job entry [_pentahoAddDatasource] (result=[false])
_installFromConfiguration – Finished job entry [_installPostgreSql] (result=[false])
_installFromConfiguration – Finished job entry [Check databaseType = ‘PostgreSql’] (result=[false])
_installFromConfiguration – Finished job entry [_readConfigurationInVariables] (result=[false])
_installFromConfiguration – Finished job entry [Checks if files exist] (result=[false])
installFromConfiguration – Finished job entry [_installFromConfiguration] (result=[false])
installFromConfiguration – Job execution finished
Kitchen – Finished!
Kitchen – ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : Finished with errors
Error 2.
HTTP tasks – Start of HTTP job entry.
HTTP tasks – Start of HTTP job entry.
HTTP tasks – Connecting to URL: http://HostToMySite.com:8080/alfresco/service/api/workflow-instances/activiti$2687?includeTasks=true
HTTP tasks – ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : I was unable to save the HTTP result to file because of a I/O error: Server returned HTTP response code: 500 for URL: http://HostToMySite.com:8080/alfresco/service/api/workflow-instances/activiti$2687?includeTasks=true
HTTP tasks – ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : java.io.IOException: Server returned HTTP response code: 500 for URL: http://HostToMySite.com:8080/alfresco/service/api/workflow-instances/activiti$2687?includeTasks=true
Error 3.
HTTP tasks – Start of HTTP job entry.
HTTP tasks – Connecting to URL: http://HostToMySite.com:8080/alfresco/service/api/workflow-instances/activiti$2774?includeTasks=true
HTTP tasks – ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : I was unable to save the HTTP result to file because of a I/O error: Server returned HTTP response code: 500 for URL: http://HostToMySite.com:8080/alfresco/service/api/workflow-instances/activiti$2774?includeTasks=true
HTTP tasks – ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : java.io.IOException: Server returned HTTP response code: 500 for URL: http://HostToMySite:8080/alfresco/service/api/workflow-instances/activiti$2774?includeTasks=true
Error 4.
HTTP tasks – Start of HTTP job entry.
HTTP tasks – Connecting to URL: http://HostToMySite.com:8080/alfresco/service/api/workflow-instances/activiti$3570?includeTasks=true
HTTP tasks – ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : I was unable to save the HTTP result to file because of a I/O error: Server returned HTTP response code: 500 for URL: http://HostToMySite.com:8080/alfresco/service/api/workflow-instances/activiti$3570?includeTasks=true
HTTP tasks – ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : java.io.IOException: Server returned HTTP response code: 500 for URL: http://HostToMySite.com:8080/alfresco/service/api/workflow-instances/activiti$3570?includeTasks=true
Error 5.
HTTP tasks – Start of HTTP job entry.
HTTP tasks – Connecting to URL: http://HostToMySite.com:8080/alfresco/service/api/workflow-instances/activiti$3701?includeTasks=true
HTTP tasks – ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : I was unable to save the HTTP result to file because of a I/O error: Server returned HTTP response code: 500 for URL: http://HostToMySite.com:8080/alfresco/service/api/workflow-instances/activiti$3701?includeTasks=true
HTTP tasks – ERROR (version 7.0.0.0-25, build 1 from 2016-11-05 15.35.36 by buildguy) : java.io.IOException: Server returned HTTP response code: 500 for URL: http://HostToMySite.com:8080/alfresco/service/api/workflow-instances/activiti$3701?includeTasks=true
The error I have the most trouble with is Error 1. I really have no idea what to do about it.
Thanks for your help, cheers.
Note: I’m learning English; if my question was not clear, excuse me and tell me so I can clarify. Thanks.
Hi Javier,
Can you please post the same question in community.alfresco.com?
It will be easier to follow the thread.
Many thanks,
-F
PS: Don’t worry about your English, few of us are native English speakers. 😉
Hello,
please help me.
My AAAR shows “no data found”.
I checked the log files catalina.out and AAAR.log. I get “ERROR (Version 8.1.0.0.-365 from 2018-04-30… by builguy)”.
Any idea?
Check the reason for the error first.
Feel free to open a question on community.alfresco.com and point me to it.
I’ll be happy to jump into the discussion and try to help you.
Hi Francesco
I ran “AAAR_Extract.sh” and got this error:
Processing stopped because of an error:
Error connecting to the repository!
Error occurred while trying to connect to the database
Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
ERROR: Kitchen can’t continue because the job couldn’t be loaded.
Pentaho 7.0
Alfresco CE 5.1
I see that it can’t connect to the repository. How can I solve this error?
Thanks in advance.
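The message quoted above asks to check that the hostname and port are correct and that the database accepts TCP/IP connections. A minimal Python sketch for that check follows; the host and port to test are assumptions to replace with the values from your own configuration (5432 is the PostgreSQL default port).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

For example, port_is_open("localhost", 5432) returning False would suggest that PostgreSQL is not running on that machine, listens on another port, or does not accept TCP/IP connections from there.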