Aaron Crane perl at
Wed Mar 8 10:49:39 GMT 2006

Jonathan McKeown writes:
> A 10Kline CGI script, with most variables global and including its own
> CGI parameter parsing.

We looked at Awstats, to the extent of actually running it for a while.
Then we stopped; we'd found plenty of reasons to avoid it:

  - The known vulnerabilities in its CGI mode may have been fixed, but
    spaghetti code like that is just too hard and/or unpleasant to
    audit.  I can't even say with confidence that letting Awstats parse
    your log files off-line is definitely safe.

  - It can't actually parse Apache logs.  Since 1.3.25, Apache has used
    a backslash escaping scheme for things like user-agents and
    referrers, so that you can actually parse log lines where the client
    sent a double-quote in one of those.  Awstats doesn't care about
    that, so it misparses those lines.
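    To make the escaping issue concrete, here's a minimal sketch (in
    Python rather than Perl, purely for illustration) of parsing a
    combined-format line where the client sent a double-quote in its
    user-agent.  The regex for a quoted field has to allow
    backslash-escaped characters, which is exactly what Awstats's
    parsing doesn't do:

    ```python
    import re

    # A quoted field that allows backslash escapes: Apache >= 1.3.25
    # writes embedded double-quotes as \" and backslashes as \\ in the
    # request, referrer and user-agent fields.
    QUOTED = r'"((?:[^"\\]|\\.)*)"'

    LINE_RE = re.compile(
        r'(\S+) \S+ \S+ \[([^\]]+)\] '   # host, ident, user, timestamp
        + QUOTED                         # request line
        + r' (\d{3}) (\S+) '             # status, response size
        + QUOTED + ' ' + QUOTED          # referrer, user-agent
    )

    def parse_line(line):
        """Parse one combined-format log line, or return None."""
        m = LINE_RE.match(line)
        if not m:
            return None
        host, ts, request, status, size, referrer, agent = m.groups()
        # Undo Apache's backslash escaping in the quoted fields.
        unescape = lambda s: re.sub(r'\\(.)', r'\1', s)
        return {
            'host': host,
            'time': ts,
            'request': unescape(request),
            'status': int(status),
            'referrer': unescape(referrer),
            'agent': unescape(agent),
        }
    ```

    A naive parser that just grabs everything between two `"`
    characters truncates the user-agent at the first escaped quote;
    the alternation above consumes `\"` as part of the field instead.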

  - You can't just point it at a batch of log files; instead, you have
    to configure it to know where you store your log files, and the
    pattern used for the filenames.  That means you can't prime it with
    the last month's (or year's) worth of logs -- you just have to run
    it for a month before it can give you any real history.

  - It really really wants each vhost analysed to have exactly one log
    file.  In each time period, we have one log file per public-facing
    server, each containing results for several vhosts.  It wants us to
    split log files up by vhost, but then merge them by public-facing
    server, before we even have it look at them.
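    The pre-splitting step itself is trivial if you log the vhost name
    as a prefix on each line ("%v ..." in the LogFormat directive) --
    here's a hypothetical sketch, again in Python for illustration,
    assuming that layout:

    ```python
    def split_by_vhost(lines):
        """Split per-server log lines into per-vhost lists.

        Assumes each line starts with the vhost name followed by the
        usual combined-format fields; the prefix is stripped so each
        bucket holds plain combined-format lines.
        """
        per_vhost = {}
        for line in lines:
            vhost, _, rest = line.partition(' ')
            per_vhost.setdefault(vhost, []).append(rest)
        return per_vhost
    ```

    The annoyance isn't that this is hard to write -- it's that
    Awstats forces you to do it at all, and then re-reads the
    resulting files once per vhost.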

  - It doesn't seem particularly fast. Admittedly, we generate about 4
    GiB of uncompressed logs in a day, but our home-grown stuff (which
    does actually parse, you know, Apache log files) seems rather faster
    at the basic work of parsing logs, throwing away robotic traffic,
    and aggregating data from the rest.

    It's possible it's not as bad for other people.  In particular, to
    handle the vhost/server issue, we were effectively making Awstats
    run through our logs once per vhost.  But I became convinced that
    the time complexity of Awstats is supra-linear in the number of
    requests anyway.  As it gathered more data over the course of a
    month, it became apparent that it was soon going to need more CPU
    time than we had available.  That's when we turned it off.
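    The basic parse/filter/aggregate pipeline really should be a
    single linear pass over the data.  A minimal sketch (Python for
    illustration; the robot substrings are a made-up stand-in for a
    proper robot list such as ABCE's):

    ```python
    from collections import Counter

    # Hypothetical stand-in for a real robot list, which is far larger.
    ROBOT_SUBSTRINGS = ('googlebot', 'slurp', 'msnbot', 'crawler', 'spider')

    def is_robot(agent):
        """Crude user-agent check against the substring list."""
        a = agent.lower()
        return any(s in a for s in ROBOT_SUBSTRINGS)

    def aggregate(records):
        """One pass over parsed records: drop robots, count hits per vhost.

        Each record is a dict with at least 'vhost' and 'agent' keys.
        The work stays linear in the number of requests, with no
        per-period state that grows the cost of later runs.
        """
        hits = Counter()
        for rec in records:
            if is_robot(rec['agent']):
                continue
            hits[rec['vhost']] += 1
        return hits
    ```

    Anything that gets slower per request as the month's accumulated
    data grows is doing something beyond this kind of streaming pass.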

In general, Awstats seems to be a tool that's intended for relatively
small sites, hosted by low-end providers, with limited or no shell
access, and exactly one log file per customer.  If you don't fall into
that category, I don't think Awstats is going to be particularly useful.

> Is this really the best option, or can anyone suggest an alternative
> which can parse Apache logfiles and successfully separate out robots
> and spiders (about 80-90% of our hits) from real users?

We wrote our own, sad to say.  We use the ABCE robot list; I've looked
at CPANning our code, but most of it's the data file, and I think ABCE
own the copyright on the list.

Note also that having home-grown log analysis stuff does mean that we
can do things that a general-purpose tool couldn't.  For example, our
software can examine the popularity of site sections, rather than just
of individual pages.

Aaron Crane
