Scivillage.com Casual Discussion Science Forum

Full Version: Search Engine Algorithms (or "Search Engines, the proverbial Lemonade Stand")
While I would gather a majority of people aren't particularly interested in getting the top spot in a search engine, I have currently been falling headlong into the rabbit burrow trying to fathom how to do some rather interesting tasks.  I'm not going to spoil what it is I'm up to, at least not yet (I've still a way to go before I reach any significant milestones).

In any event, that doesn't mean I can't talk about some of the things I've been working "with" while working "on" things.

Search Engine Algorithms are pretty much the next-generation version of dissertation statistics, such as the "Density of Keywords".
According to the rather bland article Keyword Density (wikipedia.org), the mathematics employed is, for the most part:

WordDensity = (WordFrequency / TotalWords) * 100
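
As a quick illustration, here's a minimal Python sketch of that formula; the sample sentence and the naive whitespace tokenizer are my own, not anything the Wikipedia article specifies:

def keyword_density(keyword: str, text: str) -> float:
    # Frequency of the keyword as a percentage of all words in the text.
    words = text.lower().split()
    if not words:
        return 0.0
    return (words.count(keyword.lower()) / len(words)) * 100

print(keyword_density("lemonade", "lemonade stands sell lemonade cheaply"))  # 40.0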

From what I've currently been toying with, however, I don't think this would equate to an actual value that on its own could reflect a keyword within a body of text, especially in relation to forum discussions and the like.  The main problem is whether you test against the true total of words, or whether a dictionary of extremely common words gets omitted first.
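
To make that denominator question concrete, here's the same sketch with a stopword list bolted on; the tiny STOPWORDS set is purely illustrative (real lists run to hundreds of words), and keyword_density is the function from the sketch above:

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def density_no_stopwords(keyword: str, text: str) -> float:
    # Same formula, but common "dictionary" words are omitted from the total.
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    if not words:
        return 0.0
    return (words.count(keyword.lower()) / len(words)) * 100

text = "the density of the keyword in the text"
print(keyword_density("density", text))       # 12.5  (against all 8 words)
print(density_no_stopwords("density", text))  # 33.3  (against the 3 content words)

Same keyword, same text, and the value nearly triples depending on which total you divide by.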

The problem then is "Key Terms", where it's not just a single keyword but a number of words that string together.  This means actually identifying those substrings and treating them with a combined weight (not just an individual one).
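
One hedged way to do that identification is to count every short run of consecutive words as a candidate term; the max_len cutoff and the count-times-length weight below are arbitrary choices of mine for illustration, not a known ranking formula:

from collections import Counter

def term_counts(text: str, max_len: int = 3) -> Counter:
    # Count every run of 1 to max_len consecutive words as one "key term".
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

counts = term_counts("search engine algorithms beat search engine guesswork")
print(counts["search engine"])      # 2: the two-word term scored as one unit
print(counts["search engine"] * 2)  # 4: one possible combined weight (count x term length)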

There is then the problem of actually "scraping" a page of information and then "sorting" it so that you aren't clustering all the heavy-density words together, where they end up completely disenfranchised by having all the common dictionary words removed.
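
Here's a rough sketch of the scraping half using only Python's standard-library html.parser; the position-tagging at the end is just my reading of the clustering problem (keep each word's original index, so stripping stopwords later can't shove the surviving heavy words into false adjacency), not any search engine's documented approach:

from html.parser import HTMLParser

class TextScraper(HTMLParser):
    # Collect visible text from a page, skipping script/style blocks.
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = False
    def handle_starttag(self, tag, attrs):
        self._skip = tag in ("script", "style")
    def handle_endtag(self, tag):
        self._skip = False
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

scraper = TextScraper()
scraper.feed("<p>Search <b>engines</b> rank pages.</p><script>var x;</script>")
# Pair each word with its original position before any filtering happens.
positioned = list(enumerate(" ".join(scraper.chunks).split()))
print(positioned)  # [(0, 'Search'), (1, 'engines'), (2, 'rank'), (3, 'pages.')]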

On top of those fair points is the fact that the written word (at least the electronic one) is time-sensitive in nature, especially with regard to social network standards.  I mean, you could take a copy of all the graffiti from Hadrian's Wall (wikipedia.org) from nineteen hundred years ago and create a keyword density output, but that doesn't mean that what the Romans talked about then is popular now or holds any actual relevance nowadays.
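
If you wanted to fold that time-sensitivity into a score, one simple (entirely illustrative) option is an exponential decay on a term's weight; the 180-day half-life below is an arbitrary number of mine, not anything a real engine publishes:

import math

def recency_weight(age_days: float, half_life_days: float = 180.0) -> float:
    # Halve a term's weight every half_life_days.
    return math.exp(-math.log(2) * age_days / half_life_days)

print(round(recency_weight(1), 3))  # 0.996: yesterday's post keeps nearly all its weight
print(recency_weight(1900 * 365))   # 0.0: Hadrian's Wall graffiti decays to nothing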

It's one of the main reasons why the large search engine giants have spiders frequenting people's websites to keep an updated picture of what exists and where it exists, so they can constantly tweak their algorithms.  Especially so when you consider that an algorithm to them isn't some puzzle to solve or something to understand; it literally is a source of revenue through those sites that do use advertisement, and they want to make sure they squeeze every penny out of what they can.