Aggregating web server logs for an Apache cluster

One way of scaling a heavy-traffic LAMP web application is to turn the single server into a cluster of servers. Some may take the easy path and buy an overpriced load-balancing appliance, but the most daring [and budget-restrained] will go for free software solutions such as pound or haproxy.

Although excellent performers, these free balancers lack many features found in their commercial counterparts. One of the most embarrassing gaps is the lack of flexibility in producing decent access logs. Both pound (LogLevel 4) and haproxy (option httplog) can generate Apache-like logs in their own log files or in syslog, but neither offers the level of customization found in Apache. Basically, you're left with using the logs from the cluster nodes. These logs present a couple of problems:

- the originating IP is always the internal IP of the balancer
- there is one log per node, while log analysis tools can usually cope with only a single log file per report

The first problem is relatively easy to solve. Start by activating the X-Forwarded-For header in the balancing software: for instance, configure haproxy with option forwardfor. A relatively unknown Apache module called mod_rpaf then handles the tedious task of extracting the remote IP from the X-Forwarded-For header and copying it into the remote address field of the Apache logs. For Debian Linux fans, it's nice to note that libapache-mod-rpaf is available via apt.
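A quick way to check that the header is being honoured is to count requests per client address in one node's log. The sketch below assumes the client address is the first field of each line (as with %h); the function name top_clients and the example path are only illustrations. If the balancer's internal IP still dominates the list, the X-Forwarded-For chain is probably not wired up yet.

```shell
#!/bin/sh
# top_clients FILE [N]: print the N most frequent client addresses,
# taken from the first whitespace-separated field of each log line.
top_clients() {
    awk '{ print $1 }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-5}"
}

# Example call; the path is an assumption, point it at one node's access log.
# top_clients /var/log/apache/access.log 10
```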

Now that you have N realistic Apache web logs, one per cluster node, you just have to combine them into a form your log analysis tools understand. Simply cat-ing them into one big file won't cut it [arf], because new records will appear in different regions of the file instead of being appended chronologically at its tail. The easiest solution is to sort the logs. Although I am aware of the vague possibility of sorting on the Apache datetime field, even taking the locale into account, I confess my profound inability to find the right combination of parameters. Instead, I chose to add a custom field to the Apache log, using the following log format:

LogFormat "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\" %c %T \"%{%Y%m%d%H%M%S}t\"" combined

where %{%Y%m%d%H%M%S}t is a strftime-style rendering of the request time as an easily sortable integer, for instance 20050925120000, the equivalent of 25 Sep 2005 12:00:00. Now, treating the double quote as the field separator in the Apache log format, it is easy to sort on this custom field [the 10th]:

sort -b -t '"' -T /opt -k 10 /logpath/access?.log > /logpath/full.log

And there you are, having this nice huge log file to munch on. On a standard P4 with 1GB of RAM it takes less than a minute to obtain a 2GB log file…
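To see why the 10th quote-delimited field is the right sort key, here is a small self-contained sketch. The two log files contain made-up sample lines in the custom format (the addresses, host name and requests are not real traffic), and the temporary directory stands in for /logpath.

```shell
#!/bin/sh
# Build two tiny per-node logs in the custom format (invented sample data).
tmp=$(mktemp -d)
cat > "$tmp/access1.log" <<'EOF'
10.1.1.1 www.example.com - [25/Sep/2005:12:00:00 +0200] "GET / HTTP/1.1" 200 512 "-" "Mozilla" "-" + 0 "20050925120000"
10.1.1.3 www.example.com - [25/Sep/2005:12:00:02 +0200] "GET /c HTTP/1.1" 200 512 "-" "Mozilla" "-" + 0 "20050925120002"
EOF
cat > "$tmp/access2.log" <<'EOF'
10.1.1.2 www.example.com - [25/Sep/2005:12:00:01 +0200] "GET /b HTTP/1.1" 200 512 "-" "Mozilla" "-" + 0 "20050925120001"
EOF

# Same sort as above: the double quote is the separator, field 10 the key.
sort -b -t '"' -k 10 "$tmp"/access?.log > "$tmp/full.log"

# Show the sortable timestamp and the request line of each merged record;
# the three requests should come out in timestamp order.
awk -F '"' '{ print $10, $2 }' "$tmp/full.log"
```

Splitting on the double quote only works while no logged value contains a quote of its own; a referer or user agent with an embedded quote would shift the field count for that line.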

If the web traffic is really big and the log analysis impacts the existing web activity, use a separate machine instead of overloading one of the cluster nodes. To automate the transfer of log files, generate an ssh key pair on the web analytics server and install its public key on every cluster node, in the account that owns the web log files, so that the analytics server can log in without a password. To minimize traffic between these machines, install rsync on all of them and then run rsync over ssh:

rsync -e ssh -Cavz www-data@node1:/var/log/apache/access.log /logpath/access1.log
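Put together, the fetch and the sort can be wrapped in one small script run from the analytics server. This is only a sketch: the node names and the variables NODES, REMOTE_USER, REMOTE_LOG, LOGDIR and DRY_RUN are assumptions for illustration, and it starts in dry-run mode so that it merely prints the rsync commands it would execute.

```shell
#!/bin/sh
# Hypothetical settings: node names, account and paths are assumptions.
NODES="node1 node2 node3"
REMOTE_USER="www-data"
REMOTE_LOG="/var/log/apache/access.log"
LOGDIR="/logpath"
DRY_RUN=1   # 1 = print the commands only; set to 0 to really run them

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Fetch node K's log into $LOGDIR/accessK.log, numbering nodes from 1.
fetch_logs() {
    i=1
    for node in $NODES; do
        run rsync -e ssh -Cavz "$REMOTE_USER@$node:$REMOTE_LOG" "$LOGDIR/access$i.log" || return 1
        i=$((i + 1))
    done
}

# Merge the fetched logs on the quote-delimited timestamp (10th field).
merge_logs() {
    sort -b -t '"' -T /opt -k 10 "$LOGDIR"/access?.log > "$LOGDIR/full.log"
}

fetch_logs
if [ "$DRY_RUN" != 1 ]; then
    merge_logs
fi
```

Note that the access?.log pattern matches a single character, so this naming scheme covers at most nine nodes; a larger cluster would need a different pattern.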

Now you know all the steps required to fully automate the log aggregation and its processing. One may ask why all the fuss, when a simple subscription to an ASP-style web analytics provider should suffice. Yes, it's true, however… The cluster that I've recently configured with this procedure gets a few million hits per week. Yes, we're talking about page hits. At this level of traffic, the cost of a web analytics service starts at $10,000/year. It's certainly a nice amount of money, which will allow you to afford your own analytics tool [such as, for instance, Urchin v5] and keep some cash from the first year. Some might say that commercial tools of this kind have their own load balancer analysis techniques. Sure, but it all comes at a cost. In the case of Urchin, you just saved $695/node and earned some bragging rights with your mates. Relax and enjoy.

PS: Yes, we're talking about millions of page hits on a LAMP solution, not J2EE… Maybe I'll get into the details on another occasion, assuming that somebody is interested. Leave a comment, send a mail or something.