perl at aaroncrane.co.uk
Wed Mar 8 10:49:39 GMT 2006
Jonathan McKeown writes:
> A 10K-line CGI script, with most variables global and including its own
> CGI parameter parsing.
We looked at Awstats, to the extent of actually running it for a while.
Then we stopped; we'd found plenty of reasons to avoid it:
- The known vulnerabilities in its CGI mode may have been fixed, but
spaghetti code like that is just too hard and/or unpleasant to
audit. I can't even say with confidence that letting Awstats parse
your log files off-line is definitely safe.
- It can't actually parse Apache logs. Since 1.3.25, Apache has used
a backslash escaping scheme for things like user-agents and
referrers, so that you can actually parse log lines where the client
sent a double-quote in one of those. Awstats doesn't care about
that, so it misparses those lines.
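To illustrate the escaping problem: a naive parser that grabs everything between double-quotes splits the line in the wrong place as soon as a client sends a literal quote in its User-Agent, because Apache (since 1.3.25) writes it as \". A sketch of a parser that handles this, in Python rather than Perl, with a regex alternation that consumes escaped characters as part of the quoted field (the field layout assumed here is the standard combined log format):

```python
import re

# A quoted field may contain backslash-escaped characters, e.g. a literal
# double-quote in a User-Agent is logged as \" by Apache 1.3.25 and later.
# The alternation (non-backslash-non-quote | backslash-anything) consumes
# escapes correctly; a naive "[^"]*" would stop at the escaped quote.
QUOTED = r'"((?:[^"\\]|\\.)*)"'

# Combined log format:
#   host ident user [time] "request" status bytes "referer" "agent"
LINE_RE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] '
    + QUOTED + r' (\d{3}) (\S+) '
    + QUOTED + r' ' + QUOTED + r'$'
)

def parse_line(line):
    """Return a dict of fields, or None if the line doesn't parse."""
    m = LINE_RE.match(line.rstrip('\n'))
    if not m:
        return None
    host, ident, user, when, request, status, size, referer, agent = m.groups()
    return {
        'host': host, 'time': when, 'request': request,
        'status': int(status), 'referer': referer, 'agent': agent,
    }
```

The captured agent field still contains the backslashes; whether to unescape them is up to the caller.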
- You can't just point it at a batch of log files; instead, you have
to configure it to know where you store your log files, and the
pattern used for the filenames. That means you can't prime it with
the last month's (or year's) worth of logs -- you just have to run
it for a month before it can give you any real history.
- It really really wants each vhost analysed to have exactly one log
file. In each time period, we have one log file per public-facing
server, each containing results for several vhosts. It wants us to
split log files up by vhost, but then merge them by public-facing
server, before we even have it look at them.
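The split half of that pre-pass is mechanical enough to sketch. Assuming a LogFormat that prefixes each line with the vhost name (e.g. one starting with %v, which is an assumption about the setup, not something Apache does by default), bucketing by vhost is a one-pass job:

```python
from collections import defaultdict

def split_by_vhost(lines):
    """Bucket log lines by virtual host, assuming each line begins with
    the vhost name (a LogFormat starting with %v).  This is the sort of
    pre-pass Awstats forces on a multi-vhost, multi-server setup: split
    each per-server file by vhost, then merge each vhost's streams."""
    buckets = defaultdict(list)
    for line in lines:
        vhost, _, rest = line.partition(' ')
        buckets[vhost].append(rest)
    return buckets
```

The merge half (interleaving one vhost's lines from several servers back into timestamp order) is just a k-way merge on the timestamp field.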
- It doesn't seem particularly fast. Admittedly, we generate about 4
GiB of uncompressed logs in a day, but our home-grown stuff (which
does actually parse, you know, Apache log files) seems rather faster
at the basic work of parsing logs, throwing away robotic traffic,
and aggregating data from the rest.
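That basic work is a single streaming pass, which is why it should be fast. A minimal sketch of the filter-and-aggregate step, with a hypothetical handful of robot patterns standing in for a real robot list (the ABCE list mentioned below is far larger), counting surviving hits per status code:

```python
import re
from collections import Counter

# Hypothetical robot patterns for illustration only; a production list
# (such as the ABCE one) contains many hundreds of entries.
ROBOT_AGENTS = re.compile(r'(?i)bot|crawler|spider|slurp')

def aggregate(records):
    """One streaming pass: discard robot traffic by User-Agent, then
    tally the remaining hits per HTTP status code.  `records` are dicts
    with at least 'agent' and 'status' keys, as produced by a log parser."""
    counts = Counter()
    for rec in records:
        if ROBOT_AGENTS.search(rec['agent']):
            continue
        counts[rec['status']] += 1
    return counts
```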
It's possible it's not as bad for other people. In particular, to
handle the vhost/server issue, we were effectively making Awstats
run through our logs once per vhost. But I became convinced that
the time complexity of Awstats is super-linear in the number of
requests anyway. As it gathered more data over the course of a
month, it became apparent that it was soon going to need more CPU
time than we had available. That's when we turned it off.
In general, Awstats seems to be a tool that's intended for relatively
small sites, hosted by low-end providers, with limited or no shell
access, and exactly one log file per customer. If you don't fall into
that category, I don't think Awstats is going to be particularly useful.
> Is this really the best option, or can anyone suggest an alternative
> which can parse Apache logfiles and successfully separate out robots
> and spiders (about 80-90% of our hits) from real users?
We wrote our own, sad to say. We use the ABCE robot list; I've looked
at CPANning our code, but most of it's the data file, and I think ABCE
own the copyright on the list.
Note also that having home-grown log analysis stuff does mean that we
can do things that a general-purpose tool couldn't. For example, our
software can examine popularity of site sections, rather than just of
individual pages.
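One way to do that kind of section-level aggregation (a sketch, not our actual code: here the "section" is simply the first path component, and a real scheme would likely be configurable):

```python
from collections import Counter
from urllib.parse import urlsplit

def section_counts(paths):
    """Count hits per top-level site section rather than per page, so
    '/news/2006/03/08.html' counts toward 'news'.  The section is taken
    to be the first path component; bare '/' hits go to '(root)'."""
    counts = Counter()
    for p in paths:
        segs = [s for s in urlsplit(p).path.split('/') if s]
        counts[segs[0] if segs else '(root)'] += 1
    return counts
```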