Thread: Online Bots/Spiders (/thread-326.html)
Online Bots/Spiders - stryder - Nov 25, 2014

This is just a list of the current Bots/Spiders that frequent the site. (At the moment there are more of them than us... let's hope they don't get too clever or we'll be in for some chop.)

I've included a link for most of the bots that have a specific page that addresses their usage.

You will notice on the "Who's online" list now that Bots, Spiders and Monitoring Agents have icons with their names, to aid in working out if it's another poster or some automaton trying to pull the electronic wool over your eyes. (P.K.Dick pun)

Edit: I've changed how the bots are listed (along with the members). They are now shown using Font Awesome fonts, so hopefully it will look nicer.

Known:
Google Bot
Yahoo Bot
Bing Bot (includes Bingpreview and MSNBot, soon to be retired)
Alexa Bot
archive.org_bot
Ask Bot
AddThis Bot
Twitter Bot
Facebook Bot
Yandex Bot
Baidu Bot
Jetmon

Lesser Known:
Ahrefs Bot
Meanpath Bot
DotBot
Exabot
Easou Spider
CCBot (Uses Nutch)
XoviBot
Blekko Bot
Majestic12 Bot (Currently has problems since it's still using an older HTTP protocol version)
360Spider
SemRushBot
SeznamBot
spbot
oBot
DomainSigmaCrawler
AiHitBot
SurveyBot
HubPages
SISTRIX
Sogou
magpie-crawler
SiteChecker
linkdex

Poorly Behaved:
Semalt Crawler (Doesn't identify itself as a bot in the Agent information, only in the Referrer URL)
WebTarantula Crawler (Lacks a URL in the Agent information)
XYZBot (Seems to be a bot searching for Torrents)
NerdyBot (Lacks a URL in the Agent information)
Nutch Bot (A bot that can be run by anyone, which means it can do things it's not supposed to)
BLEXbot (Creating rubbish URLs and attempting to load them from the site)

RE: Online Bots/Spiders - C C - Nov 26, 2014

Nice to have some of their anonymity removed.

RE: Online Bots/Spiders - stryder - Nov 26, 2014

On a slightly different subject (but generally the same topic).
I've altered a part of the forum code to add a rel=nofollow attribute to all URLs that point externally. This is something that's usually done for search engine optimisation, since it pretty much suggests to bots not to wander off too far. Normally such trivial pieces aren't mentioned too much on forums, since it's all "backend" related. However it's a little experimental, so tell me if it breaks.

Edit: I'm still in the middle of tweaking the site to be easily usable by humans while keeping abuse at bay. Originally I was blocking/banning anything attempting to connect via protocol versions older than HTTP/1.1. The problem with this is that some of the spiders still use the HTTP/1.0 protocol for spidering websites, so the site was "invisible" to them (well, they got a 403 page telling them that the site was forbidden). This wasn't an issue for any human coming here and posting, just for the spiders that populate the search engines. So in essence it was hurting the forum's coverage.

I've just completed a fix for this particular quirk. The spiders are now able to access the site using HTTP/1.0, however I've blocked any POST requests using less than HTTP/1.1. What this should mean is fewer robot spammers being able to post, however it could mean some quirky errors for those that post here on the forum.
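
For anyone curious what that rule looks like in practice, here is a minimal sketch in Python. It is not the forum's actual code (which isn't shown in this thread); the function names `parse_http_version` and `status_for` are made up for illustration, and treating an unparseable protocol string as "older than HTTP/1.1" is an assumption.

```python
from typing import Tuple


def parse_http_version(protocol: str) -> Tuple[int, int]:
    """Turn 'HTTP/1.1' into (1, 1). Unrecognised strings become (0, 0)."""
    name, _, version = protocol.partition("/")
    if name.upper() != "HTTP":
        return (0, 0)
    major, _, minor = version.partition(".")
    try:
        return (int(major), int(minor or 0))
    except ValueError:
        return (0, 0)


def status_for(method: str, protocol: str) -> int:
    """Hypothetical gate: reads are allowed on any version,
    but a POST over anything older than HTTP/1.1 gets a 403."""
    if method.upper() == "POST" and parse_http_version(protocol) < (1, 1):
        return 403
    return 200


if __name__ == "__main__":
    for method, protocol in [("GET", "HTTP/1.0"), ("POST", "HTTP/1.0"), ("POST", "HTTP/1.1")]:
        print(method, protocol, status_for(method, protocol))
```

Comparing versions as tuples rather than strings keeps the check correct for newer protocol strings such as HTTP/2, which would otherwise need special handling.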
So please keep an eye out for any errors that occur (I'll keep an eye on the logs too).

RE: Online Bots/Spiders - stryder - Sep 1, 2018

A new page covers which bots are currently identified by this site (it doesn't contain links to their HQs or information about their behaviour, just the name and the User-Agent string that is used to identify them):
https://www.scivillage.com/bots.php

Spiders that access this website are limited to READ-ONLY access and ideally should follow the robots.txt entry (either the generic catch-all '*' or the one for their specific named bot).

I'm currently looking into new methods of handling bots, which will likely cause a bit of an upset in how they operate in relation to this (and potentially other) website(s).
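
As a rough illustration of the two ideas above (matching a bot by its User-Agent string, then applying either its named robots.txt entry or the catch-all '*'), here is a small Python sketch using the standard library's `urllib.robotparser`. The token table and the robots.txt rules below are invented for the example; they are not the contents of bots.php or of this site's real robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical token -> display name table (the real list lives on bots.php).
KNOWN_BOTS = {
    "Googlebot": "Google Bot",
    "bingbot": "Bing Bot",
    "AhrefsBot": "Ahrefs Bot",
}

# Made-up robots.txt: one named entry plus the generic catch-all.
ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def identify_bot(user_agent):
    """Return the matching bot token if the User-Agent contains one, else None."""
    lowered = user_agent.lower()
    for token in KNOWN_BOTS:
        if token.lower() in lowered:
            return token
    return None


def may_fetch(bot_token, path):
    """True if robots.txt lets this bot read the path.
    A named entry wins; otherwise the '*' entry applies."""
    return rules.can_fetch(bot_token, path)


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"
    token = identify_bot(ua)
    print(token, may_fetch(token, "/thread-326.html"))
```

Note that `can_fetch` is given the short bot token rather than the full User-Agent header, because the parser only looks at the part of the string before the first "/".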