I had a discussion with one of the regulars here about how we find new referrers, which means either new people linking to us or new spammers. Shrugs…
Anyway, he used to look at the stats, while I only check the first 10, rather quickly.
In my opinion, the problem with website stats is that they’re for the whole month. And if you want to check out what’s happened since yesterday, you’ll have to slog through the whole list, going googly-eyed in the process, trying to remember which ones are new.
So, here’s what I do:
I download my raw log files. Not necessarily the whole file. I might grep for the last two days and download gzipped versions of those. If you’re on a cPanel webhost without shell access, use cron for that. Here are some pointers that can be adapted. But really, it’s as simple as:
grep '19/Feb/' /path/to/yourlog | gzip -9 > /home/username/19feb.gz
Remember that paths are different from host to host, and you may need some time to figure out yours.
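If you want both of the last two days in one go, something like this should do it. It's only a sketch: extract_days is a name I made up for illustration, the paths are placeholders, and it assumes the dates show up in your log the same way as in the grep line above.

```shell
# extract_days LOGFILE OUTFILE DAY [DAY...]
# Hypothetical helper: copies every line containing one of the given
# date strings (e.g. 18/Feb/) into one gzipped file.
extract_days() {
    log=$1
    out=$2
    shift 2
    for day in "$@"; do
        # -F treats the date as plain text, not as a regular expression
        grep -F "$day" "$log"
    done | gzip -9 > "$out"
}

# Example call (placeholder paths, adjust to your host):
# extract_days /path/to/yourlog /home/username/feb18-19.gz '18/Feb/' '19/Feb/'
```

The days come out in the order you list them, so list them oldest first if you want the file to stay chronological.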
OK, so, then I unzip them and copy the contents into one file.
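If you have a shell handy, the unzip-and-copy step can be done in one line. A minimal sketch, assuming the downloads are ordinary gzip files; combine_logs and the file names are made up for the example.

```shell
# combine_logs OUTFILE FILE.gz [FILE.gz...]
# Hypothetical helper: unpacks several gzipped logs into one plain file.
combine_logs() {
    out=$1
    shift
    # gunzip -c writes the unpacked data to stdout and leaves the .gz files alone
    gunzip -c "$@" > "$out"
}

# Example call (file names are placeholders):
# combine_logs combined.log 18feb.gz 19feb.gz
```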
And then I fire up TextHarvest
(this only works on Windows machines. For *NIX and Mac I recommend grep and shell scripts, though it requires more coding).
I start by removing anything from the /Keep list
and then add, one by one, the referrers I don’t need to be reminded of to the /Delete list
Start each keyword with \
I think the default is /, but that doesn’t work with log files, because there are too many instances of / in them. \ is my favorite. It hasn’t broken yet with log files.
The trick here is to keep the list in a text doc, because it will grow over time. TextHarvest manages a very large list of exclusions, but if you enter several K worth of keywords, it’ll barf.
When you’ve run the query and browsed the results, you can add more keywords to the list. Here’s a small part of mine:
\annelisabeth.com\""\"-"\W3CRobot\metafilter\403 \kuro5hin
What you want to filter out depends on what you’re looking for. New linkers or spammers. I like to look for anything I haven’t seen before. So almost everything gets added to my list with time.
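For the *NIX and Mac folks without TextHarvest, the same kind of exclusion list can be approximated with grep. This is my guess at a rough equivalent, not a description of how TextHarvest works. It assumes you keep the keywords in a text file, one per line, without the \ separators, and that your log is in the usual Apache combined format. The function names and file names are invented for the example.

```shell
# new_referrers LOGFILE EXCLUDEFILE
# Hypothetical helper: prints only the log lines that contain none of
# the keywords listed in EXCLUDEFILE.
new_referrers() {
    # -v  print lines that do NOT match
    # -F  treat keywords as plain text, not regular expressions
    # -f  read the keywords from a file, one per line
    grep -v -F -f "$2" "$1"
}

# referrer_field: reads log lines on stdin and prints just the referrer.
# Assumption: combined log format, where the referrer is the second
# double-quoted field (the fourth piece when splitting on ").
referrer_field() {
    awk -F'"' '{ print $4 }'
}

# Example call (file names are placeholders):
# new_referrers combined.log exclude.txt | referrer_field | sort | uniq -c
```

One thing to watch: don't leave a blank line in the keyword file. With -F an empty keyword matches every line, so -v would then throw everything away.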
But the beauty of keeping this list in a text doc is that at any time you can delete the list from TextHarvest and just search for, say, the error code 403. Remember to put a space afterwards, or you’ll get a lot of false positives. Most of our .htaccess blocks produce 403 errors, so it’s a nice way of keeping track of the spamming activities of the Bulgarians and Alexander.
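The 403 search works with grep too. A small sketch, assuming the status code sits between two spaces in your log lines (as it does in the common Apache formats) and that the first field is the visitor's IP. I put a space on both sides of the number here, which is a bit stricter than the trailing space alone. blocked_by_ip is a made-up name.

```shell
# blocked_by_ip LOGFILE
# Hypothetical helper: counts 403 responses per IP address, busiest first.
blocked_by_ip() {
    # ' 403 ' with spaces on both sides skips things like a 14035-byte
    # response or a file called 403.html
    grep -F ' 403 ' "$1" |
        awk '{ print $1 }' |   # keep only the first field (the IP)
        sort | uniq -c |       # count hits per IP
        sort -rn               # highest count first
}

# Example call (file name is a placeholder):
# blocked_by_ip combined.log
```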
Any questions?