28 Jul

Building Hadoop clusters review

If you are interested in Hadoop, this video course is worth evaluating. As you probably know, Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. All Hadoop modules are designed on the assumption that hardware failures are common and should therefore be handled automatically in software by the framework.

The course content can be divided into three main sections:
1. how to create and set up a three-machine cluster using Amazon EC2,
2. how to install a Hadoop cluster using Apache Ambari,
3. how to start using the Hadoop cluster, in particular through Hue (Hadoop User Experience), its web interface.

The topics are described clearly and well (Sean Mikha, the author, did a good job). Each relevant topic is first explained in terms of its logical structure and approach, and only then demonstrated in practice.

The creation of the virtual machines on Amazon EC2 is also useful for other purposes. The step-by-step walkthrough is not limited to creating the servers: it also covers security and connection details, for example using the PuTTY SSH client.

In my opinion, the greatest value of this video course lies in the hidden details of the Hadoop cluster installation process. As you will see if you decide to follow it, the tasks are quite easy to carry out (probably to Sean’s credit), but the configuration details and settings are very important if you want it to work in practice. By following the hints, I’m sure every newcomer will save days of work and many nights of googling. 😉

Enjoy your Hadoop cluster video course… as usual, by Packt Publishing.

01 May

Alfresco returns at most 1,000 results, or stops the query after a couple of minutes, with Apache Lucene

Did you know that Alfresco returns a maximum of 1,000 results for a single query when using Apache Lucene?

Did you know that Alfresco stops a single query after a couple of minutes when using Apache Lucene?

This is not strange, and it’s well known to most Alfresco experts. Some discussions about it:

  1. Alfresco Forum
  2. Stack Overflow

The reason is clear: Alfresco limits the number of results and the duration of a single query to ensure good performance of the whole system in the average case. To do that, Alfresco stops executing a query as soon as either of the two limits is reached.

For that reason you can get fewer than 1,000 results, even if the complete result set should contain 1,000 or more, because the time limit was reached first. Or you can get at most 1,000 results even if your query returns them immediately.
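
To make the "whichever limit comes first" behaviour concrete, here is a minimal sketch in Python. It is only an illustration of the logic described above, not Alfresco's actual implementation: the function name, the simulated per-check cost, and the returned reason strings are all my own assumptions, and the default values simply mirror the two properties discussed in this post.

```python
def limited_search(candidates, is_permitted, check_cost_ms,
                   max_checks=1000, max_time_ms=10000):
    """Collect permitted results, stopping at whichever limit is hit first.

    Hypothetical illustration: time is simulated by adding a fixed
    cost per permission check instead of reading a real clock.
    """
    results = []
    elapsed_ms = 0
    for checks_done, item in enumerate(candidates):
        if checks_done >= max_checks:
            return results, "max checks reached"
        if elapsed_ms >= max_time_ms:
            return results, "time limit reached"
        elapsed_ms += check_cost_ms  # simulated cost of one permission check
        if is_permitted(item):
            results.append(item)
    return results, "complete"


# Fast checks: the count limit is reached long before the time limit.
fast_results, fast_reason = limited_search(range(5000), lambda n: True, check_cost_ms=1)

# Slow checks: the time limit is reached after only 200 checks.
slow_results, slow_reason = limited_search(range(5000), lambda n: True, check_cost_ms=50)

print(len(fast_results), fast_reason)
print(len(slow_results), slow_reason)
```

The second call is the surprising case: the query matches thousands of items, but you only see the ones that were checked before the time budget ran out.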

These defaults are reasonable in most cases, but sometimes they become a limit that you would like to change to solve your problem. To change the maximum number of results and/or the maximum duration of a query, edit the properties file:

<alfresco>/tomcat/webapps/alfresco/WEB-INF/classes/alfresco/repository.properties

In particular, these two settings (the time limit is expressed in milliseconds, so 10000 means 10 seconds):

#
# Properties to limit resources spent on individual searches
#
# The maximum time spent pruning results
system.acl.maxPermissionCheckTimeMillis=10000
# The maximum number of search results to perform permission checks against
system.acl.maxPermissionChecks=1000

But be careful with this tuning: the limits exist to protect the performance of the whole system…

Another suggested solution is to switch to the Apache Solr engine, but not everyone thinks it’s a good idea.

01 Apr

Deploy, manage, and orchestrate computer systems with Ansible

Are you tired of installing every new environment from scratch, losing hours and hours and hours and hours…

Start in minutes and do not repeat yourself!

Ansible is a radically simple IT orchestration solution that automates configuration, software deployment, and other IT needs. Ansible models your IT infrastructure by looking at the comprehensive architecture of how all of your systems inter-relate, rather than just managing one system at a time. It uses no agents and no additional custom security infrastructure, so it’s exceedingly easy to deploy. Most importantly, it uses a very simple language (called playbooks) that allows describing your automation in plain English, rather than writing things that have the complexity of software code. By using Ansible, you’ll be faster at automating your IT, but also be able to achieve new capabilities you haven’t been able to before.

This post describes how to prepare a concrete architecture that you can use as the starting point for more or less any work. The architecture is composed of two virtual machines: an Ansible server and a target server. Of course you can use physical servers, in house or in the cloud, simply by accessing them over ssh.

The role of the Ansible server is to manage the scripts (called playbooks) for configuration, software deployment, and other IT needs on the target server. It is very important to underline that Ansible activities are designed to be idempotent: executing them multiple times always produces the same result.
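
If idempotency sounds abstract, this small Python sketch shows the idea. It is not Ansible code: `ensure_line` is a hypothetical helper I wrote for illustration, loosely imitating how a task reports whether it changed anything. The first run modifies the file, the second run finds nothing to do, and the final state is the same either way.

```python
import os
import tempfile


def ensure_line(path, line):
    """Make sure `line` is present in the file and report whether it changed.

    Hypothetical helper for illustration only, not part of Ansible.
    """
    content = ""
    if os.path.exists(path):
        with open(path) as handle:
            content = handle.read()
    if line in content.splitlines():
        return {"changed": False}  # desired state already reached
    with open(path, "a") as handle:
        if content and not content.endswith("\n"):
            handle.write("\n")  # avoid gluing onto an unterminated last line
        handle.write(line + "\n")
    return {"changed": True}


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "example.conf")
    first_run = ensure_line(target, "setting=on")
    second_run = ensure_line(target, "setting=on")
    with open(target) as handle:
        final_content = handle.read()

print(first_run, second_run, repr(final_content))
```

Describing the desired state ("this line is present") instead of the action ("append this line") is what makes repeated runs safe.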

Preparing the Ansible server

First of all, let’s prepare the Ansible server using Oracle VirtualBox.

Create a new virtual machine with 512 MB of RAM and 8 GB of disk space. The connection to the LAN should be “bridged” so that the server is visible on the network.

Start the brand new virtual machine and install Xubuntu 12.10. This environment uses Xubuntu instead of Ubuntu because it is lighter while offering the same features and capabilities.

During the installation, create the user ‘ansible’, which we will use to manage the installation on the target server.

When the installation finishes, reboot the system as requested.

Log in as the ansible user, open a terminal, and execute:

sudo add-apt-repository ppa:rquillo/ansible
sudo apt-get update
sudo apt-get upgrade

It’s advisable to reboot the system and run the upgrade again, repeating until no more updates are offered. Finally:

sudo apt-get install ansible

Ansible is now successfully installed!

Let’s configure the ssh key to make connecting to the target server easier.

ssh-keygen -t rsa
...: /home/ansible/.ssh/id_rsa
...: no passphrase
ssh-agent bash
ssh-add ~/.ssh/id_rsa

Now your Ansible server is ready!

Preparing the target server

Create a new virtual machine with the connection to the LAN “bridged” so that the server is reachable.

Start the brand new virtual machine and install Xubuntu 12.10. During the installation, create the user ‘ansible’, which we will use to manage the installation. When the installation finishes, reboot the system as requested.

Log in as the ansible user, open a terminal, and execute:

sudo apt-get update
sudo apt-get upgrade

It’s advisable to reboot the system and run the upgrade again, repeating until no more updates are offered. Finally, install the ssh server so that the Ansible server can connect.

sudo apt-get install openssh-server
mkdir /home/ansible/.ssh

Configuring the Ansible server to connect to the target server

As the ansible user on the Ansible server, open a terminal and execute:

cd /home/ansible/.ssh
scp id_rsa.pub ansible@<ip of the target server>:/home/ansible/.ssh
scp id_rsa.pub ansible@<ip of the target server>:/home/ansible/.ssh/authorized_keys

Let’s configure the Ansible server to work on the target server.

sudo chown ansible:ansible /etc/ansible/hosts
nano /etc/ansible/hosts

Add this group at the end of the file.

[TargetServers]
<ip of the target server>

Now it’s time to connect to the target server from the Ansible server.

ssh ansible@<ip of the target server>

IMPORTANT: this command should open an ssh session without asking for a password.
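
If you prefer to check this from a script rather than by eye, here is a minimal Python sketch. The function names and the example address are hypothetical; the part doing the work is the standard OpenSSH option `BatchMode=yes`, which makes `ssh` fail instead of prompting when key authentication does not work.

```python
import subprocess


def passwordless_ssh_command(host, user="ansible"):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never ask for a password or passphrase
        "-o", "ConnectTimeout=5",   # give up quickly if the host is unreachable
        f"{user}@{host}",
        "true",                     # harmless remote command, exits with 0
    ]


def can_login_without_password(host, user="ansible"):
    """Return True when key-based login works (needs the ssh client installed)."""
    result = subprocess.run(passwordless_ssh_command(host, user),
                            capture_output=True)
    return result.returncode == 0


# 192.0.2.10 is a placeholder address; use the IP of your target server.
print(" ".join(passwordless_ssh_command("192.0.2.10")))
```

Calling `can_login_without_password("<ip of the target server>")` from the Ansible server should return True once the key has been copied correctly.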

Now that everything is configured, let’s test the connection using ansible.

ansible TargetServers -m ping -u ansible

With the result:

<ip of the target server> | success >> {
    "changed": false,
    "ping": "pong"
}
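
Should you want to consume this result from a script, the sketch below splits the line into host, status, and JSON payload. It assumes exactly the output layout shown above, which belongs to the Ansible version used in this post (other versions print it differently), and both the function name and the sample address are hypothetical.

```python
import json


def parse_adhoc_result(output):
    """Split '<host> | <status> >> {json}' into its three parts."""
    header, _, body = output.partition(" >> ")
    host, _, status = header.partition(" | ")
    return host.strip(), status.strip(), json.loads(body)


# Sample text in the same layout as above, with a placeholder address.
sample = """192.0.2.10 | success >> {
    "changed": false,
    "ping": "pong"
}"""

host, status, payload = parse_adhoc_result(sample)
print(host, status, payload["ping"])
```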

Enjoy Ansible…