Mint Forum

Crawlers Pepper

mls
Minted
Posted on Feb 21, '07 at 06:37 pm

I thought I’d share a pepper that I’ve been working on recently. It tracks web bots/crawlers that visit your site and displays them in Mint.

If anyone would be interested in me releasing the Pepper or has any ideas on features to add please let me know :)

Right now it’s pretty basic, but here’s a screenshot:

[Screenshot: Crawlers Pepper]

Umm, yeah! Sign me up!

I was just wishing the other day that I was able to see when my site gets spidered.

Does it also allow you to enter custom agents? Is MSNBot included?

Either way, I’ll take it :) One can never have enough peppers!

dwilkinsjr AAATTT dwilkinsjr DOT COM

fork
Minted
Posted on Feb 21, '07 at 09:13 pm

Great idea.

Is there a way to show how many pages were crawled per session?

Seeing as how Slurp wants to spend every waking moment on my site, is there a way to throttle that puppy down, since every page view tends to set off a flurry of activity?

(I’ve actually got a crawl delay of 60 seconds per page just to keep them at bay)
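For anyone wanting to try the same throttle, here is a minimal sketch in Python of what a 60-second crawl delay could look like and how to check that it parses. The robots.txt content is an assumed example, not fork's actual file, and `crawl_delay_for` is a hypothetical helper. Note that `Crawl-delay` is a non-standard directive that not every crawler honors.

```python
from urllib.robotparser import RobotFileParser

# Assumed example robots.txt: ask Slurp to wait 60 seconds between requests
# and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 60

User-agent: *
Disallow:
"""


def crawl_delay_for(agent_token, robots_txt=ROBOTS_TXT):
    """Return the Crawl-delay (in seconds) that applies to agent_token, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent_token)


if __name__ == "__main__":
    print("Slurp:", crawl_delay_for("Slurp"))
    print("Googlebot:", crawl_delay_for("Googlebot"))
```

The parser matches on the crawler's short product token (here `Slurp`), not on the full User-Agent header.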

Otherwise, I’m interested in what you say, and I would like to subscribe to your newsletter.

Sam Brown
Third-Party Pepper Developer
Posted on Feb 22, '07 at 05:32 am

Nice one mls. Can’t wait to get my paws on this!

Me too please, also with the crawled page count.

mls
Minted
Posted on Feb 22, '07 at 07:06 am

As I said it’s very basic right now… I’ve added in the ability to see which pages are being crawled. I’ll add an updated screenshot later today so you can see some more of my progress.

MSNBot is included in the tracking, it just hasn’t hit my site yet.

What I’ve noticed is the really weird way crawlers visit. At least from what I’ve seen so far, there are sometimes 5-15 minute breaks between page hits, so it’s very hard to tell what a “session” is. Sometimes the crawlers will hit pages that do NOT directly link to each other. And what makes it even harder to track a session is that back-to-back hits sometimes come from different IPs.

Maybe it’s just the way the crawlers hit my site, but I can’t tell a real pattern between sessions so far. Sometimes they’ll just hit one page an hour or so, very strange. I’ll have to do a little bit more research about how they operate.
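One common way to approximate a "session" under these conditions is to group hits by crawler name (not by IP, since the IP changes) and start a new session whenever the gap between hits exceeds a threshold. The sketch below is in Python and is only an illustration, not how the pepper works; the function name `group_sessions` and the 30-minute cutoff are assumptions, with the cutoff picked only because it is longer than the 5-15 minute breaks described above.

```python
from datetime import datetime, timedelta


def group_sessions(hit_times, max_gap=timedelta(minutes=30)):
    """Split one crawler's hit timestamps into sessions.

    A new session starts whenever the gap since the previous hit
    is longer than max_gap (an assumed, tunable threshold).
    """
    sessions = []
    for hit in sorted(hit_times):
        if sessions and hit - sessions[-1][-1] <= max_gap:
            sessions[-1].append(hit)  # close enough: same session
        else:
            sessions.append([hit])    # first hit or long gap: new session
    return sessions


if __name__ == "__main__":
    hits = [
        datetime(2007, 2, 22, 6, 0),
        datetime(2007, 2, 22, 6, 10),
        datetime(2007, 2, 22, 6, 22),
        datetime(2007, 2, 22, 8, 0),
    ]
    for number, session in enumerate(group_sessions(hits), start=1):
        print(f"session {number}: {len(session)} page(s)")
```

With a crawler that only hits one page an hour, every hit ends up in its own session, which matches the observation that no clear pattern exists; the pages-per-session count is then only as meaningful as the threshold.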

Ronald Heft
Third-Party Pepper Developer
Posted on Feb 22, '07 at 11:47 am

I don’t think all crawlers are JavaScript aware, which may be the problem.

mls
Minted
Posted on Feb 22, '07 at 03:14 pm

I’m actually not using JavaScript to do this because, as you said, to my knowledge almost no crawlers are JS aware. I think it’s just more accurate to not use JS in this instance.

To track crawler hits I include a PHP file site-wide, and if the User-Agent matches one of the set crawlers it inserts its information into the database to be read by Mint.
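The server-side approach described above can be sketched roughly as follows. This is a Python illustration, not the pepper's PHP code: the token table, the `crawler_hits` table, and the function names `match_crawler` and `record_hit` are all hypothetical, and an in-memory SQLite database stands in for Mint's database.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed substrings to look for in the User-Agent header (illustrative subset).
KNOWN_CRAWLERS = {
    "Googlebot": "googlebot",
    "Yahoo! Slurp": "slurp",
    "MSNBot": "msnbot",
}


def match_crawler(user_agent):
    """Return the crawler name whose token appears in user_agent, else None."""
    ua = user_agent.lower()
    for name, token in KNOWN_CRAWLERS.items():
        if token in ua:
            return name
    return None


def record_hit(db, user_agent, path):
    """Store one row per crawler hit; ignore ordinary visitors."""
    name = match_crawler(user_agent)
    if name is None:
        return False
    db.execute(
        "INSERT INTO crawler_hits (crawler, path, hit_at) VALUES (?, ?, ?)",
        (name, path, datetime.now(timezone.utc).isoformat()),
    )
    return True


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE crawler_hits (crawler TEXT, path TEXT, hit_at TEXT)")
    record_hit(db, "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", "/about/")
    record_hit(db, "Mozilla/5.0 (Windows; U) Firefox/2.0", "/about/")
    print(db.execute("SELECT crawler, path FROM crawler_hits").fetchall())
```

Matching on a lowercase substring keeps the check cheap enough to run on every request, at the cost of trusting whatever User-Agent the client sends.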

Here’s an updated screenshot for anyone interested. After you expand one of the crawlers it will show you the locations of pages that it crawled and at what time:

[Screenshot: Crawlers Pepper]

Sam Brown
Third-Party Pepper Developer
Posted on Feb 22, '07 at 04:08 pm

Again, looking great mls, can’t wait to get my hands on a copy!

If it’s gonna segregate it like that, it’s gonna be great. I trawl for scrapers by looking through the raw server logs, so if it’s picking up activity by somebody trying to run stealthy through the pages, this would be a boon to my activity.

I look forward to it as well. Sounds like a great pepper!

Could be some great info from this pepper. Excellent work :)

Looking forward to this too.

Will it have the hit counters too?

mls
Minted
Posted on Feb 23, '07 at 03:55 pm

Just another quick update… I’m looking for people to test out this pepper. If you’re interested, send me an email at admin [at] mlslatest.com

Here’s a screenshot of another tab I added in: Crawlers Pepper - Most Crawled Pages

@MatthewM I’m not sure what you mean by “the hit counters”. Are you talking about how many hits each page gets?

One more thing… here’s a list of the crawlers that it detects right now. If anyone has any more to add to the pepper before release, just post a reply.

  • Googlebot
  • Yahoo! Slurp
  • MSNBot
  • Ask/Teoma
  • ia_archiver (Alexa)
  • archive.org_bot (Wayback Machine)
  • Gigabot (Gigablast.com)
  • mozDex

“@MatthewM I’m not sure what you mean by “the hit counters”. Are you talking about how many hits each page gets?”

Yes I am.

SDJL
Minted
Posted on Feb 25, '07 at 08:44 am

I think that’s what the latest tab added by mls will now do. Look at the above screenshot.

livid
Minted
Posted on Feb 25, '07 at 11:36 am

And how about this spider (UA string)?

Baiduspider+(+http://www.baidu.com/search/spider.htm)

SDJL
Minted
Posted on Feb 25, '07 at 12:07 pm

It would be nice to be able to add our own crawlers to a text file or via the preferences interface.

I’ve sent you an email so I can test this out :)

mls
Minted
Posted on Feb 25, '07 at 05:21 pm

I found a major flaw in the pepper, so anyone who is testing it out please remove the tracking code from your site to prevent any problems with robots crawling your site.

I’ll work on a fix and send out an updated version.

Thanks

SDJL
Minted
Posted on Feb 25, '07 at 08:15 pm

Just out of curiosity, what was the major flaw you found?

mls
Minted
Posted on Feb 25, '07 at 10:14 pm

Crawlers are getting randomly redirected to the homepage on certain pages.

I say major because crawlers will be indexing your pages and just be redirected to your homepage instead of being able to access the requested page.

Any luck fixing the problem? This pepper completes my mint install :-)

mls
Minted
Posted on Mar 01, '07 at 03:02 pm

I have not had a lot of time to look into it. Hopefully this weekend I can fix it and get a public beta out to everyone.
