25 Nov

How to solve the audit data explosion in the Alfresco database

It’s a long time that I have contacts from the Alfresco community with a similar question like the one described below.

After some time the audit trail is enabled in my Alfresco instance, the database dimension explodes. Is the audit trail connected in some way with the data explosion? How can I solve it?

Most of the times this question is connected to the A.A.A.R. analytics that is able to collect (and analyze) audit data together with repository data and workflow data. The bad news is that this issue is completely true. Of course if you request to track every action/event, you cannot avoid to store a lot amount of data. The good news is that there are some ways to solve this problem, directly with the bundled services in Alfresco.

In this post I would like to share a way to solve the problem, deleting the “old” actions audited in a very simple manner.

Description of the development environment

Before sharing the solution, let’s see how the development environment is defined. In a vanilla installation of Ubuntu 14.04 LTS with a brand new Alfresco CE 5.1-EA on board, let’s enable the audit trail following the instructions detailed here. Last but not least, pgadmin has been installed with the simple command below.

sudo apt-get install pgadmin3

How to check the Alfresco audit table

Using pgadmin3, let’s open the alfresco database and point to the alf_audit_entry table. As you can see, without any access or action into Alfresco, the table is empty.

alf_audit_entry_empty

At every action in Alfresco, one or more rows are stored in the table. More actions and users act on Alfresco, more data increases its dimension.

How to populate the audit trail

Now that we know where the audit trail is stored, let’s act into Alfresco accessing with a successful login several times. As you can see below, the new screenshot shows the alf_audit_entry table filled with some rows.

alf_audit_entry_full

From another point of view, using the bundled REST service to retrieve the audit trail in Alfresco, we can see the audit actions using a browser. Below you can see the enquiry of the alfresco-access application, used to audit every access to Alfresco.

audit_trail_full

How to clean the audit trail

Now that the audit trail is filled with lots of actions, we can start the core discussion on how to reduce it. In other words: how to clean it. Taking a look to the Alfresco documentation, you can find another bundled REST service, used in particular to delete partially or entirely the actions. The goal is clear: executing the REST clean service to delete (partially or entirely) the audit trail.

restClientLogoTo develop the task we use a Google Chromium addon called Advanced REST Client. The reason why this tool is used is the method requested from the Alfresco REST service. Infact, the Alfresco audit service clear, requests a POST method.

Below you can see the invocation of the Alfresco audit service clear. In our case without any limit (developing in that way an entire cleaning of the actions) but you can take a look to the Alfresco documentation, to easily develop an partial cleaning of the audit trail.

clear_rest

As you can see from the returning result of the service, in JSON, the cleared actions are totally six but the audited actions were only five. The sixth action deleted is the one generated during the access of the REST service itself.

Now that the audit trail is cleaned, let’s check again the alf_audit_entry table and we can see that it is empty.

alf_audit_entry_empty

Other possibile solutions

To complete the discussion, I would like to submit to the attention other solutions to reach the same goal.

1. http://docs.alfresco.com/5.1/concepts/prop-tables.html

Directly in the official Alfresco documentation you can find a description of how to schedule or manually trigger the propTablesCleanupTrigger script to clean up audit and property values tables (alf_audit_ and alf_prop_ tables). This is a nice approach, if you want to periodically apply the audit trail cleaning.

2. https://gist.github.com/AFaust/7513792

Axel Faust explains how to use an SQL script for properly clearing Alfresco unused audit data / property values. This solution is available for PostgreSQL and MySQL.

3. http://docs.alfresco.com/5.0/references/RESTful-AuditEntryDelete.html

Probably less useful in practice but with this Alfresco REST service you can delete a single audit actions.

Impact on the A.A.A.R. analytics

aaarBefore the conclusions, I would like to briefly discuss the impact of audit trail cleaning in case of analytics using A.A.A.R. What to say is simple: the A.A.A.R. extraction of data is incremental. This means that the actions extracted are all and only the ones generated after the last extraction (based on the id value). For this reason the cleaning of the audit trail is completely safe, under the condition that the AAAR_Extract script is executed periodically and with success.

Conclusions

In this post is shared a way to delete the audit actions from an Alfresco instance, in particular from the Alfresco database. This solution is one of the possibile solutions into Alfresco and it’s very useful when the audit trail generates an explosion of data (very common in practice). This solution could be used also when A.A.A.R. analytics is installed, because A.A.A.R. Data Mart stores the historical data without the need to maintain the same data into Alfresco.

2 thoughts on “How to solve the audit data explosion in the Alfresco database

  1. Great topic, thanks for writing this up!

    Please let folks know to be careful when cleaning audit entries. If you clean too many at once, you can end up with immense redo logs in the database, potentially reducing response time for users significantly. Depending on DB infrastructure, I’d try to keep the number of entries deleted per batch in the high hundreds or low thousands. Luckily it’s pretty simple to script this with something like Python or even bash.

    Secondly, it is worth nothing that when you delete the audit entries with the clear REST service, you are deleting the high level audit records, but the detailed data is stored in properties and is orphaned when ylthese audit entries are removed.

    Axel’s (AFaust) script for MySQL is one brute force way of cleaning these zombies properties. I do not recommend that folks use this script, especially if they are on a database other than MySQL. The prop tables cleanup job you reference actually wraps up AXel’s script in a way that works with all supported databases and also does batching to reduce database load.

    If folks have a lot of audit data to clean up it may be worth creating a custom job that cleans some audit entries and then triggers property cleanup before clearing more entries.

  2. Francesco Corti

    Hi Bindu,
    Thank you for sharing the interesting details.
    I have received contacts many times about this topic and, looking at the accesses to this article, it’s definitely an interesting topic for many persons.
    About the “careful use of cleaning”, it was obvious for me but I’m agree it’s extremely important to evaluate the task in terms of used resources.
    About the “orphan properties” I would like to share more informations.
    I’m pretty sure it could interesting for a lot of Alfresco maintainers.
    What about writing a post on this topic? (if you don’t have a blog, I can host you here).
    What about submitting a talk about this issue at the incoming BeeCon in Brussels (http://beecon.buzz/)?

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.