Simon Wilcox essuu at
Wed Mar 8 11:57:37 GMT 2006

On Wed, 8 Mar 2006, Aaron Crane wrote:

> We looked at Awstats, to the extent of actually running it for a while.
> Then we stopped; we'd found plenty of reasons to avoid it:
>   - The known vulnerabilities in its CGI mode may have been fixed, but
>     spaghetti code like that is just too hard and/or unpleasant to
>     audit.  I can't even say with confidence that letting Awstats parse
>     your log files off-line is definitely safe.

Agreed. Every time I look at the code I want to scream. It's just crying
out to be refactored into decent modules.

>   - It can't actually parse Apache logs.  Since 1.3.25, Apache has used
>     a backslash escaping scheme for things like user-agents and
>     referrers, so that you can actually parse log lines where the client
>     sent a double-quote in one of those.  Awstats doesn't care about
>     that, so it misparses those lines.

I've not found this to be a problem with 6.4, but perhaps I'm not looking in
the right place. Which version did you try?
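For what it's worth, handling that escaping isn't hard — here's a minimal sketch (not Awstats code, just an illustration) of parsing a Combined Log Format line where quoted fields may contain the backslash-escaped quotes Apache has written since 1.3.25:

```python
import re

# A quoted field: any run of non-quote/non-backslash characters or
# backslash-escaped characters (Apache >= 1.3.25 writes \" for a
# literal double-quote sent by the client).
QUOTED = r'"((?:[^"\\]|\\.)*)"'

LINE_RE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] '   # host, ident, user, timestamp
    + QUOTED                              # request line
    + r' (\d{3}) (\S+) '                  # status, response size
    + QUOTED + ' ' + QUOTED + '$'         # referrer, user-agent
)

def unescape(field):
    """Undo Apache's backslash escaping inside a quoted log field."""
    return re.sub(r'\\(.)', r'\1', field)

def parse_line(line):
    """Parse one Combined Log Format line; return None on mismatch."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    host, ident, user, ts, request, status, size, referrer, agent = m.groups()
    return {
        'host': host,
        'timestamp': ts,
        'request': unescape(request),
        'status': int(status),
        'referrer': unescape(referrer),
        'agent': unescape(agent),
    }
```

A naive parser that just grabs everything between double-quotes would misparse a client-supplied `"` in the user-agent; the escaped-character alternative in the regex is what avoids that.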

>   - You can't just point it at a batch of log files; instead, you have
>     to configure it to know where you store your log files, and the
>     pattern used for the filenames.  That means you can't prime it with
>     the last month's (or year's) worth of logs -- you just have to run
>     it for a month before it can give you any real history.

A simple shell script allows you to iterate over as many logs as you want. We
rotate logs weekly and have had to rerun a whole year's worth before now.
Wildcards would be nice, though.

>   - It really really wants each vhost analysed to have exactly one log
>     file.  In each time period, we have one log file per public-facing
>     server, each containing results for several vhosts.  It wants us to
>     split log files up by vhost, but then merge them by public-facing
>     server, before we even have it look at them.

Kinda. You do need to merge the logs into timestamp order, but you can look
for specific vhosts with the %v modifier in the log format.
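For reference, the %v approach looks something like this in the Apache config (paths and format name are illustrative):

```apache
# Prefix each line with the canonical vhost name (%v), so several
# vhosts can share one log file and be split apart afterwards
# (e.g. with Apache's split-logfile support script).
LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
CustomLog /var/log/apache/access.log vhost_combined
```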

>   - It doesn't seem particularly fast. Admittedly, we generate about 4
>     GiB of uncompressed logs in a day, but our home-grown stuff (which
>     does actually parse, you know, Apache log files) seems rather faster
>     at the basic work of parsing logs, throwing away robotic traffic,
>     and aggregating data from the rest.

It's not very fast, and it admits as much, but it's fast enough on our logs,
which come to about 450 MB/week.

>     It's possible it's not as bad for other people.  In particular, to
>     handle the vhost/server issue, we were effectively making Awstats
>     run through our logs once per vhost.  But I became convinced that
>     the time complexity of Awstats is supra-linear in the number of
>     requests anyway.  As it gathered more data over the course of a
>     month, it became apparent that it was soon going to need more CPU
>     time than we had available.  That's when we turned it off.
> In general, Awstats seems to be a tool that's intended for relatively
> small sites, hosted by low-end providers, with limited or no shell
> access, and exactly one log file per customer.  If you don't fall into
> that category, I don't think Awstats is going to be particularly
> convenient.

I would agree with that. It's definitely not up to the job of managing
large sites.

> > Is this really the best option, or can anyone suggest an alternative
> > which can parse Apache logfiles and successfully separate out robots
> > and spiders (about 80-90% of our hits) from real users?
> We wrote our own, sad to say.  We use the ABCE robot list; I've looked
> at CPANning our code, but most of it's the data file, and I think ABCE
> own the copyright on the list.

We're tending towards doing this too. I just looked at WebTrends, and it's
almost $10,000 for the licence we need.
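The robot-separation part of a home-grown tool can be surprisingly small. A sketch, using a tiny illustrative list of user-agent substrings as a stand-in for a real list such as ABCE's (which, as noted above, isn't freely redistributable):

```python
from collections import Counter

# Hypothetical robot User-Agent substrings, for illustration only.
ROBOT_SUBSTRINGS = ('googlebot', 'slurp', 'msnbot', 'crawler', 'spider')

def is_robot(agent):
    """True if the user-agent string matches a known robot substring."""
    a = agent.lower()
    return any(s in a for s in ROBOT_SUBSTRINGS)

def split_traffic(records):
    """Count robot vs. human hits from parsed log records
    (each a dict with an 'agent' key)."""
    counts = Counter()
    for rec in records:
        counts['robot' if is_robot(rec['agent']) else 'human'] += 1
    return counts
```

The hard part, of course, is not the matching but keeping the robot list current.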

> Note also that having home-grown log analysis stuff does mean that we
> can do things that a general-purpose tool couldn't.  For example, our
> software can examine popularity of site sections, rather than just of
> URLs.

This is the problem we're now experiencing with Awstats: we need granularity
that it doesn't have.
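Section-level reporting of the kind Aaron describes is one example of that granularity. A sketch, with a hypothetical prefix-to-section mapping (a real site would drive this from its own configuration):

```python
from collections import Counter

# Hypothetical mapping from URL prefix to site section.
SECTIONS = [
    ('/news/', 'News'),
    ('/sport/', 'Sport'),
    ('/weather/', 'Weather'),
]

def section_of(path):
    """Map a request path to a site section, defaulting to 'Other'."""
    for prefix, name in SECTIONS:
        if path.startswith(prefix):
            return name
    return 'Other'

def section_popularity(paths):
    """Aggregate hit counts by site section rather than by URL."""
    return Counter(section_of(p) for p in paths)
```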


"You've really gotta know where your towel is."
