SEO: Blocking certain words from being crawled

#1
stryder Offline

Search engines have a nasty habit of scraping a lot of data from sites, and most of the time that data is full of repetition: usernames, navigation words, timestamps, post counts and so on.
It's a lot of junk data that shouldn't really be used to measure your keywords. However, some search engines aren't particularly intelligent at stripping out such information, so you end up with your main keywords being something like "login", "forum" or "discussion".

Not much has really been said or done about how to deal with this problem. It's not like the W3C has added a "nose" (no search engines) attribute to every tag to let you toggle which elements are spiderable (and even if it had, it would increase a webpage's markup considerably).


So that leaves it down to what search engines provide in the form of webmaster tools. Some search engine companies let you log in to such tools and remove common words from being weighed. Other companies went with adding their own exclusion method; in Yahoo's case that's a "robots-nocontent" class that tells the search engine which content to skip. (This, however, can again lead to some seriously inflated markup.)


Method using CSS ::before
That's why I've devised a method that I think works (it still needs testing by the world at large to see how well it holds up).

It considers certain points:
  • It's based on CSS2, so it should be compatible with most browsers.
  • Because CSS classes can be named anything (there isn't a restricted/reserved list), a class name should not be relied upon to tell a search engine to skip content. (If a class name they provide clashes with one your site already uses, it could cause issues.)
  • Most HTML tags accept the title attribute, and this method relies upon that.
The following is how this method is applied. (You can change the class name to anything you want. I've kept it small because if you are adding this to multiple elements in a document then it's going to increase the markup size. There might be a way of using JavaScript to insert the class through DOM manipulation if you wanted to further reduce document size.)

Firstly a CSS style is applied:
Code:
.x:before{
content:attr(title);
display:inline;
}

This CSS literally says that a tag should output the content of its title attribute before whatever the tag itself contains.

Its application in HTML would look something like this:
Code:
<ul>
  <li><a href="#"><span class="x" title="Homepage"></span></a></li>
  <li><a href="#"><span class="x" title="Contact" /></a></li>
</ul>

This causes the words Homepage and Contact to be shown as links, even though the span tags don't contain any text.

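Homepage uses a standard open-and-close tag, whereas Contact uses a self-closing tag. A validator will flag the self-closing form as invalid because <span> isn't a void element; browsers parsing the page as HTML ignore the trailing slash and treat it as an opening tag, so it still works here, but the explicit closing tag is the safer choice.
I'd suggest not using this class directly on <a> (anchor) tags, as anchors can be used for sitemaps and might be used to nest images to create image links.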
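
On the JavaScript idea mentioned earlier, here's a rough, untested-in-the-wild sketch of adding the class through the DOM rather than writing it on every span. The function name addTitleClass and the choice to only touch empty spans that carry a title are my own assumptions, not part of the method above. Bear in mind that the words won't show up at all for visitors with scripting turned off.

```javascript
// Hypothetical helper: adds the display class to every empty <span> that
// carries a title attribute, so the class need not be repeated in the markup.
function addTitleClass(root, className = "x") {
  const spans = root.querySelectorAll("span[title]");
  let count = 0;
  for (const span of spans) {
    // Only touch spans that have no text of their own.
    if (span.textContent.trim() === "") {
      span.classList.add(className);
      count += 1;
    }
  }
  return count; // number of spans that received the class
}

// In a browser, run once the markup has been parsed.
if (typeof document !== "undefined") {
  document.addEventListener("DOMContentLoaded", () => addTitleClass(document));
}
```

With this in place the markup could shrink to <span title="Homepage"></span>, with the .x:before rule left exactly as it is.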

While search engines can pull important information out of title attributes, I would hazard a guess that if a tag has no text content then its title is likely skipped. (I cannot, of course, confirm this.)
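
To illustrate the guess above, here's a small sketch of what a very naive crawler would see if it only read the text between tags. The textOnly function is a made-up stand-in, not how any real search engine works; real crawlers may well read title attributes or even render the CSS.

```javascript
// Rough stand-in for a crawler that drops every tag and keeps only the text.
function textOnly(html) {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Navigation written with the title/:before method.
const nav = `<ul>
  <li><a href="#"><span class="x" title="Homepage"></span></a></li>
  <li><a href="#"><span class="x" title="Contact"></span></a></li>
</ul>`;

// The same kind of link written the ordinary way.
const plain = `<ul><li><a href="#">Homepage</a></li></ul>`;

console.log(JSON.stringify(textOnly(nav)));   // prints ""
console.log(JSON.stringify(textOnly(plain))); // prints "Homepage"
```

The words only exist inside attributes, so a text-only reader comes away with nothing to count, which is the whole point of the method.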

If any search engine developers out there spot this post/thread, I'd suggest they consider making their search engine behave how I've described above, as it would make sense to have some form of easily applied standard like this that doesn't get mistreated as "an attempt to inflate keywords through hidden variables."

Hopefully this will prove over time to be an effective way of blocking particular words from crawlers.
Please feel free to add any comments, suggestions or additions.

If you find this helpful please feel free to syndicate this page or spread a mention of www.scivillage.com
#3
stryder Offline
A few hours on and I'm still busy with this. Here's another way of doing it if you're building a page from scratch.
Rather than taking existing tags like <span> or <strong> and adding a class and title to them, it's possible to create your own prototype (custom) tag.

Code:
<html>
<head>
<!-- the script below is for older browsers that won't style an unknown tag until it has been created once via script -->
  <script>document.createElement("norobo");</script>
<style>
norobo:before {
content:attr(s);
display:inline;
}
</style>
</head>
<body>
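<!-- closed explicitly: in HTML an unknown tag can't self-close, so "/>" would leave it open -->
<norobo s="I'd suggest keeping to plain text strings where possible as it doesn't like any mark-up"></norobo>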

</body>
</html>
The above code has a norobo (no-robot)  tag with an s (string) attribute.

The great thing about it being a prototype is that people can't complain if a validator throws out an error. After all, it's my prototype, so their validator's wrong ;)

Using a prototype tag cuts back on the markup that would otherwise be needed to keep writing class and title everywhere.

To be honest, you probably don't even need the attr CSS trick with a prototype tag, as spiders might well ignore its content anyway since they won't recognise the tag.
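
If you're generating pages from a script, something along these lines could build the norobo tags for you and keep the string safe inside the attribute. This is only a sketch: the helper names escapeAttr and norobo are mine, and I'm assuming the s attribute is always written in double quotes.

```javascript
// Escape the characters that would break out of a double-quoted attribute.
function escapeAttr(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: build a norobo tag whose text lives only in the s attribute.
function norobo(text) {
  return `<norobo s="${escapeAttr(text)}"></norobo>`;
}

console.log(norobo('Posts & "Replies"'));
// prints <norobo s="Posts &amp; &quot;Replies&quot;"></norobo>
```

The browser decodes the entities before attr(s) reads the value, so visitors still see the plain characters.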