Scivillage.com Casual Discussion Science Forum
Online Bots/Spiders - Printable Version


Online Bots/Spiders - stryder - Nov 25, 2014

This is just a list of the current Bots/Spiders that frequent the site. (At the moment there are more of them than us... let's hope they don't get too clever or we'll be in for some chop.)
I've included a link for most of the bots that have a specific page that addresses their usage.
You will notice on the "Who's Online" list now that Bots [Image: bot.png], Spiders [Image: spider.png] and Monitoring Agents [Image: monitor.png] have icons with their names to aid in working out whether it's another poster or some automaton trying to pull the electronic wool over your eyes. (P.K. Dick pun)


Edit:
I've changed how the bots are listed (along with the members); they are now shown using Font Awesome icons, which hopefully will look nicer.

Known:
Google Bot
Yahoo Bot
Bing Bot (includes Bingpreview and MSNBot, which is soon to be retired)
Alexa Bot
archive.org_bot
Ask Bot
AddThis Bot
Twitter Bot
Facebook Bot
Yandex Bot
Baidu Bot
Jetmon

Lesser Known:

Ahrefs Bot
Meanpath Bot
DotBot
Exabot
Easou Spider
CCBot (Uses Nutch)
XoviBot
Blekko Bot
Majestic12 Bot (Currently has problems since it's still using an older HTTP protocol)
360Spider
SemRushBot
SeznamBot
spbot
oBot
DomainSigmaCrawler
AiHitBot
SurveyBot
HubPages
SISTRIX
Sogou
magpie-crawler
SiteChecker
linkdex

Poorly Behaved:
Semalt Crawler (Doesn't identify itself as a bot in the Agent information, only in the Referrer URL)
WebTarantula Crawler (Lacks a URL in the Agent information)
XYZBot (Seems to be a bot searching for torrents)
NerdyBot (Lacks a URL in the Agent information)
Nutch Bot (A bot that can be run by anyone, which means it can do things it's not supposed to)
BLEXbot (Creates rubbish URLs and attempts to load them from the site)
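
For anyone wondering how the "Who's Online" icons mentioned above work out who's who, the idea boils down to matching the visitor's User-Agent string against a list of known names, roughly like the Python sketch below (the substrings and categories here are examples for illustration, not the forum's actual lookup table):

Code:
KNOWN_AGENTS = {
    # substring of the User-Agent string -> category shown on "Who's Online"
    "Googlebot": "bot",
    "bingbot": "bot",
    "YandexBot": "bot",
    "Easou Spider": "spider",
    "360Spider": "spider",
    "jetmon": "monitor",
}

def classify_agent(user_agent):
    """Return 'bot', 'spider' or 'monitor', or None for an unrecognised agent."""
    ua = user_agent.lower()
    for needle, category in KNOWN_AGENTS.items():
        if needle.lower() in ua:
            return category
    return None

# Most crawlers announce themselves plainly in their User-Agent string:
print(classify_agent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
# -> bot

A simple substring match goes a long way here because most bots announce themselves in the User-Agent string; the poorly behaved ones above are the exception.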


RE: Online Bots/Spiders - C C - Nov 26, 2014

Nice to have some of their anonymity removed.


RE: Online Bots/Spiders - stryder - Nov 26, 2014

On a slightly different subject (but generally the same topic): I've altered part of the forum code to add a rel="nofollow" attribute to all URLs that point externally. This is something that's usually done for search engine optimisation, since it pretty much suggests to bots not to wander off too far.
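
In rough terms the change does something like the following Python sketch (illustrative only; the real forum code works on its own markup, so the function and regex here are simplified assumptions rather than the actual implementation):

Code:
import re

SITE_HOST = "www.scivillage.com"   # links back to our own domain are left alone

def add_nofollow(html):
    """Add rel="nofollow" to anchor tags whose href points off-site."""
    def patch(match):
        tag, href = match.group(0), match.group(1)
        if SITE_HOST in href or "rel=" in tag:
            return tag                        # internal link, or rel already present
        return tag[:-1] + ' rel="nofollow">'
    return re.sub(r'<a\s[^>]*href="(https?://[^"]+)"[^>]*>', patch, html)

print(add_nofollow('<a href="https://example.org/page">external link</a>'))
# -> <a href="https://example.org/page" rel="nofollow">external link</a>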

Normally such trivial pieces aren't mentioned too much on forums, since it's all "backend" related. However, it's a little experimental, so tell me if it breaks.
Edit:
I'm still in the process of tweaking the site to be easily usable by humans while keeping abuse at bay.

Originally I was blocking/banning anything attempting to connect via protocols older than HTTP/1.1. The problem with this is that some of the spiders still use the HTTP/1.0 protocol for spidering websites, so the site was "invisible" to them (well, a 403 page telling them that the site was forbidden). This wasn't an issue for any human coming here and posting, just for the spiders that populate the search engines. So in essence it was hurting the forum's coverage.

I've just completed a fix for this particular quirk: the spiders are now able to access the site using HTTP/1.0; however, I've blocked any POST requests made with anything older than HTTP/1.1.
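
The rule itself boils down to something like this (a simplified Python sketch of the logic; on the live site the check is done at the server level, and the plain string comparison of protocol versions is just for illustration):

Code:
def allow_request(method, protocol):
    """Old spiders can still read pages over HTTP/1.0, but anything trying to
    POST with a protocol older than HTTP/1.1 gets a 403 Forbidden."""
    if method == "POST" and protocol < "HTTP/1.1":
        return 403          # most primitive spam bots fall over here
    return 200              # everything else is served as normal

print(allow_request("GET", "HTTP/1.0"))    # 200 - spiders can still index the site
print(allow_request("POST", "HTTP/1.0"))   # 403 - blocked
print(allow_request("POST", "HTTP/1.1"))   # 200 - normal posting is unaffected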

What this should mean is fewer robot spammers being able to post; however, it could mean some quirky errors for those who post here on the forum, so please keep an eye out for any errors that occur (I'll keep an eye on the logs too).


RE: Online Bots/Spiders - stryder - Sep 1, 2018

A new page covers which bots are currently identified by this site (it doesn't contain links to their HQs, or information about their behaviour, just the name and the User-Agent string that is used to identify them):
https://www.scivillage.com/bots.php

Spiders that access this website are limited to READ-ONLY access and should ideally follow the robots.txt entry (either the generic catch-all '*' or the one for their specific named bot); there's a rough sketch of that check at the end of this post.
I'm currently looking into new methods of handling bots, which will likely cause a bit of an upset in how they operate in relation to this (and potentially other) website(s).
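
For reference, the robots.txt check mentioned above is the sort of thing a well-behaved spider performs before fetching anything; here's a small Python sketch using urllib.robotparser (the rules shown are just an example, not the actual contents of this site's robots.txt):

Code:
from urllib.robotparser import RobotFileParser

# Example rules only - not this site's real robots.txt.
rules = """User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("*", "https://www.scivillage.com/thread-326.html"))          # True
print(parser.can_fetch("ExampleBot", "https://www.scivillage.com/thread-326.html")) # False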