perl at aaroncrane.co.uk
Thu Mar 9 23:42:50 GMT 2006
Simon Wilcox writes:
> On Wed, 8 Mar 2006, Aaron Crane wrote:
> > - It can't actually parse Apache logs. Since 1.3.25, Apache has used
> > a backslash escaping scheme for things like user-agents and
> > referrers, so that you can actually parse log lines where the client
> > sent a double-quote in one of those. Awstats doesn't care about
> > that, so it misparses those lines.
> I've not found this to be a problem with 6.4 but perhaps I'm not looking in
> the right place. Which version did you try?
I'm pretty sure it was 6.4. Our logs contain both referrer and
user-agent, and occasionally stupid clients include a double-quote
character in one or (worse) both. Something that ignores backslashes in
those fields therefore can't reliably work out where they end. I'm
afraid I don't have any notes on that. But I do have a (fairly clear,
though still possibly flawed) recollection that testing Awstats on a
sample from our real logs revealed a small percentage of log lines which
weren't accurately parsed.
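To make the escaping point concrete, here is a minimal sketch in Python of a parser that honours backslashes inside the quoted fields. It assumes the standard "combined" log format and assumes only \" and \\ need undoing; the names QUOTED, LINE_RE, unescape and parse_line are mine, invented for illustration, and the sample line is made up rather than taken from our logs.

```python
import re

# One double-quoted field: a run of characters that are neither a quote nor
# a backslash, or a backslash followed by any single character.
QUOTED = r'"((?:[^"\\]|\\.)*)"'

# Assumed layout (combined format):
# host ident user [time] "request" status bytes "referrer" "user-agent"
LINE_RE = re.compile(
    r'(\S+) (\S+) (\S+) \[([^\]]+)\] '
    + QUOTED + r' (\d{3}) (\S+) ' + QUOTED + r' ' + QUOTED + r'$'
)

def unescape(field):
    # Undo only \" and \\ ; any other escape sequence is left untouched.
    return re.sub(r'\\(["\\])', r'\1', field)

def parse_line(line):
    # Returns a dict of fields, or None if the line does not match.
    m = LINE_RE.match(line.rstrip('\n'))
    if m is None:
        return None
    host, ident, user, when, request, status, size, referrer, agent = m.groups()
    return {
        'host': host,
        'time': when,
        'request': unescape(request),
        'status': status,
        'size': size,
        'referrer': unescape(referrer),
        'agent': unescape(agent),
    }

# Made-up sample with escaped quotes in both referrer and user-agent.
sample = (r'1.2.3.4 - - [09/Mar/2006:23:42:50 +0000] "GET / HTTP/1.1" 200 512 '
          r'"http://example.com/?q=\"x\"" "Evil \"Bot\" 1.0"')
record = parse_line(sample)
```

A pattern such as "([^"]*)" for those fields stops at the first escaped quote, which is the sort of misparse I mean.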
> > - It really really wants each vhost analysed to have exactly one
> > log file. In each time period, we have one log file per
> > public-facing server, each containing results for several vhosts.
> > It wants us to split log files up by vhost, but then merge them by
> > public-facing server, before we even have it look at them.
> Kinda. You do need to merge the logs into timestamp order but you can
> look for specific vhosts with the %v modifier in the log format.
We have one file per public-facing server per hour; they get pulled from
each server to our log-processing server hourly, and put into a
reasonable place. A few years ago, we had a much more complicated
scheme, where logs from a given time period across all servers were
merged into one file, sorted by time. That was so much pain to deal
with that I'm particularly unwilling to go back to it at all, let alone
just for something as patently cruddy as Awstats. As I say, we generate
4 GiB (uncompressed) of logs per day; it probably isn't a great idea to
sort all of that data if you don't have to.
One other note: you can't really guarantee that a single Apache log file
contains no out-of-order lines. Even though Apache opens log files with
O_APPEND, you're at the mercy of scheduling vagaries. Sometimes, the
kernel will context-switch away from a process (or thread, if you're
that way inclined) immediately after it's generated the line to write to
the log. And if that process doesn't get scheduled again until after
the next clock tick, well, there you go.
So if Awstats really requires logs to be in timestamp order, that's
potentially awkward. I've just looked at a random recent log file from
one of our servers. There are hundreds of lines (out of 160,000 ish)
that are out of sequence by 2 to 10 seconds, and a few going into the
tens-of-seconds range, and that's just within a single hourly file.
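For anyone who wants to run the same check on their own logs, this is roughly how such a count can be made; it is a sketch in Python, not the script I actually used. It assumes the first [...] on each line is the usual Apache timestamp (e.g. 09/Mar/2006:23:42:50 +0000) and that month names parse in the C locale; the function name out_of_order is invented for illustration, and the sample lines are made up.

```python
import re
from datetime import datetime

# First bracketed chunk on the line; assumed to be the request timestamp.
TIME_RE = re.compile(r'\[([^\]]+)\]')

def out_of_order(lines):
    # For each line older than the newest timestamp seen so far,
    # record how many seconds behind it is.
    latest = None
    lags = []
    for line in lines:
        m = TIME_RE.search(line)
        if m is None:
            continue  # no timestamp found; skip the line
        try:
            when = datetime.strptime(m.group(1), '%d/%b/%Y:%H:%M:%S %z')
        except ValueError:
            continue  # bracketed text was not a timestamp
        if latest is not None and when < latest:
            lags.append((latest - when).total_seconds())
        else:
            latest = when
    return lags

# Made-up sample: the third and fifth lines are behind the running maximum.
sample_lines = [
    '1.2.3.4 - - [09/Mar/2006:23:00:10 +0000] "GET /a HTTP/1.1" 200 1',
    '1.2.3.4 - - [09/Mar/2006:23:00:12 +0000] "GET /b HTTP/1.1" 200 1',
    '1.2.3.4 - - [09/Mar/2006:23:00:11 +0000] "GET /c HTTP/1.1" 200 1',
    '1.2.3.4 - - [09/Mar/2006:23:00:15 +0000] "GET /d HTTP/1.1" 200 1',
    '1.2.3.4 - - [09/Mar/2006:23:00:05 +0000] "GET /e HTTP/1.1" 200 1',
]
lags = out_of_order(sample_lines)
```

On a real file you would pass the open file handle instead of a list; len(lags) is the number of out-of-sequence lines and max(lags) is the worst case in seconds.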