I noticed that one of my servers was using quite a bit more CPU than normal, yet my analytics software wasn’t showing a spike in traffic. I have a rather large Apache access_log file, and I wanted to see how many times a particular bot had scraped my web pages. Looking through it by hand isn’t practical, since the log is over 1 GB in size.
Instead, I ran this simple grep command:
grep -c "myregex" access_log
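Inside the quotes, I put the actual string that I was searching for. The -c flag stands for “count”: instead of printing the matching lines, grep prints how many lines match the regular expression. Since an access log has one request per line, that is the number of matching requests.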
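To make the counting behavior concrete, here is a small sketch run against a throwaway sample log. The log lines and the bot name ExampleBot are made up for illustration; they are not from my real log. It also shows -F, which treats the pattern as a fixed string (handy when the search string contains dots or slashes), and -o, which can be used when you want to count every occurrence rather than every matching line.

```shell
# Build a tiny sample log (hypothetical data) so the commands can be tried safely.
log=$(mktemp)
printf '%s\n' \
  '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleBot/1.0 (+ExampleBot)"' \
  '198.51.100.4 - - [10/Oct/2023:13:55:37 +0000] "GET /b HTTP/1.1" 200 128 "-" "Mozilla/5.0"' \
  '203.0.113.7 - - [10/Oct/2023:13:55:38 +0000] "GET /c HTTP/1.1" 200 256 "-" "ExampleBot/1.0"' \
  > "$log"

# -c counts matching lines; -F treats the pattern as a fixed string, not a regex.
line_count=$(grep -cF 'ExampleBot' "$log")

# -o prints each match on its own line, so wc -l counts every occurrence.
# tr strips the padding that some wc implementations add.
match_count=$(grep -oF 'ExampleBot' "$log" | wc -l | tr -d ' ')

echo "lines: $line_count, matches: $match_count"   # lines: 2, matches: 3
rm -f "$log"
```

The first sample line mentions the bot name twice, which is why the two counts differ. For counting requests, the per-line count from -c is the one you want.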
In this case, the scraping program that I thought was the culprit had downloaded fewer than 100 web pages, but the true culprit had downloaded many more. It was sending a browser’s User-Agent string, so it was either a really active visitor, a browser plugin, or a spider spoofing a real browser. To resolve this, I used iptables to block its IP address. Problem solved.
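If you don’t already know which client to look for, one common approach is to rank clients by request count. The sketch below is not the exact command I ran; it assumes the client IP is the first field of each line (as in Apache’s common and combined log formats) and uses made-up sample data.

```shell
# Build a tiny sample log (hypothetical data); the IP is the first field.
log=$(mktemp)
printf '%s\n' \
  '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /a HTTP/1.1" 200 512' \
  '198.51.100.4 - - [10/Oct/2023:13:55:37 +0000] "GET /b HTTP/1.1" 200 128' \
  '203.0.113.7 - - [10/Oct/2023:13:55:38 +0000] "GET /c HTTP/1.1" 200 256' \
  '203.0.113.7 - - [10/Oct/2023:13:55:39 +0000] "GET /d HTTP/1.1" 200 64' \
  > "$log"

# Extract the IP column, count requests per IP, and sort busiest first.
ranking=$(awk '{print $1}' "$log" | sort | uniq -c | sort -rn)
echo "$ranking"

# The busiest client is the second column of the first row.
top_ip=$(echo "$ranking" | head -n 1 | awk '{print $2}')
echo "busiest client: $top_ip"
rm -f "$log"
```

On a real server, replace "$log" with the path to access_log. A typical way to drop traffic from the offending address is a rule of the form iptables -A INPUT -s ADDRESS -j DROP, run as root, where ADDRESS is a placeholder for the IP you found; check your existing firewall rules before adding one.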