Thoughts on Latent Dirichlet Allocation (LDA) and Search

First off, kudos to the SEOMoz team, and specifically Ben Hendrickson, for their stunning study and continued focus on building a data- and research-driven approach to SEO. That being said, I feel like some grains of salt need to be taken with this recent study regarding the relationship between LDA (topic modeling) and search rankings.

To begin, let’s make it clear that it is generally accepted that a good portion of the search algorithm is textual relevance. Similarly, it would not be unreasonable to believe that Google uses just a few sophisticated mathematical tools to accomplish this relevance measurement, which would make this part of the algorithm both the easiest to discern and the most singularly shocking when found. Whenever someone struck upon the meat of this part of the algorithm, it was going to be clear, obvious, and striking – and that may have just happened. However, something more is going on in this situation that I think is worth taking note of.
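For readers who want a concrete picture of what topic-model relevance scoring looks like, here is a minimal sketch. It assumes nothing about Google’s actual implementation; the corpus, the topic count, and the cosine-similarity scoring are all hypothetical choices, built on scikit-learn’s LatentDirichletAllocation.

```python
# Illustrative sketch only: scoring documents against a query by LDA
# topic similarity. Nothing here reflects Google's actual pipeline;
# the corpus, topic count, and scoring rule are invented for the demo.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "college football defense recruiting linemen",
    "search engine ranking factors and inbound links",
    "topic models measure textual relevance of pages",
]
query = "how search engines measure page relevance"

# Fit a small LDA model over the toy corpus
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic mixtures
query_topics = lda.transform(vectorizer.transform([query]))

# Rank documents by cosine similarity of topic mixtures to the query
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(query_topics[0], d) for d in doc_topics]
for doc, score in sorted(zip(docs, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {doc}")
```

The point of the sketch is only that a handful of well-understood mathematical steps, once identified, produce a clear and testable relevance score, which is exactly why striking upon this part of the algorithm would be so noticeable.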

The Scarcity of Links

Let’s say that you are a coach of the UNC football team and you are now tasked with replacing a huge subsection of your defense. **sigh** You receive applicants from across the school but are only looking at two variables, Height and Weight. You receive 100 different applications. However, you find that of the 100 applicants, only 5 mention their Height, while everyone mentions their Weight. Your original intent was simply to rank people based on a combination of their Height and Weight, but now you are forced to rely upon Weight alone.

This is the case for many mid-tail and long-tail keywords and many non-commercial head terms. While EVERY page in the index has a relevance score, most have no external links, and certainly no external links with the right anchor text, upon which Google can rely. There are not enough relevant pages with links for Google to use link qualities to differentiate between competing pages. This scarcity forces Google to rely upon a different set of measurements – on-site and on-page – in the majority of situations. In the same way that the football coach must fall back on a cruder method of ranking (Weight alone) to fill a full roster, so must Google to fill its top 10 results. Relevance becomes the primary measurement when popularity can’t be determined.

Now, imagine that 1000 football programs went through the same ordeal. Half the time they got enough applicants who gave both Height and Weight; the other half of the time, they did not. Thus, 50% of the time the coaches relied upon both Height and Weight, and the remaining 50% of the time they relied on Weight alone. When you average these cases together, Height is going to look like a less important factor than it really is.
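A quick simulation makes the dilution concrete. All of the numbers below are invented for illustration: half of the simulated programs never see Height, so Height’s correlation with making the team, averaged across every program, lands at roughly half of the correlation observed where Height actually informs selection.

```python
# Hypothetical simulation of the dilution effect described above.
# In half of the programs Height is unavailable, so selection there is
# driven by Weight alone; averaging correlations across all programs
# understates how much Height matters where it IS observed.
import numpy as np

rng = np.random.default_rng(0)
z = lambda x: (x - x.mean()) / x.std()
corr_with, corr_without = [], []

for program in range(1000):
    height = rng.normal(72, 3, 100)    # inches
    weight = rng.normal(220, 25, 100)  # pounds
    if program % 2 == 0:
        score = z(height) + z(weight)  # both factors observed
        bucket = corr_with
    else:
        score = z(weight)              # Height never reported
        bucket = corr_without
    made_team = (score >= np.quantile(score, 0.9)).astype(float)
    bucket.append(np.corrcoef(height, made_team)[0, 1])

print(f"corr(Height, made team) when Height is used:   {np.mean(corr_with):.2f}")
print(f"corr(Height, made team) when Height is absent: {np.mean(corr_without):.2f}")
print(f"blended average across all programs:           "
      f"{np.mean(corr_with + corr_without):.2f}")
```

The same averaging artifact would understate the importance of links across a keyword set where, for many queries, links are too scarce for Google to use.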

This would explain why Search Engine Optimizers like myself perceive a world where links reign supreme – because in the majority of cases in which our services are solicited, the keywords are competitive enough that inbound linking is common, thus allowing Google to rely upon this preferred ranking factor.

The more pressing question is what the LDA correlation looks like on keywords where each of the top 10 results has external, inbound links with the correct anchor text (i.e., semi-competitive to competitive spaces).

Differentiating Factors

My serious concern with on-page optimization remains the ease of duplication. For any factor to be valuable in a ranking equation, it must offer some differentiating quality. For example, there would be no reason for the football coach above to use “Weight above 1 pound” as a factor in determining his team. No applicant will be under 1 pound, so the factor cannot differentiate between candidates. It is useless.
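Put statistically: a factor that every candidate satisfies has zero variance, and a zero-variance feature cannot correlate with any outcome. A toy check with invented numbers:

```python
# A factor with no variance ("Weight above 1 pound" is true for everyone)
# cannot differentiate candidates: its correlation with any outcome is
# undefined because its standard deviation, the denominator, is zero.
import numpy as np

made_team = np.array([1, 0, 1, 0, 1])
over_one_pound = np.array([1, 1, 1, 1, 1])  # true for every applicant

print(np.std(over_one_pound))                        # 0.0
print(np.corrcoef(over_one_pound, made_team)[0, 1])  # nan (0/0)
```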

Similarly, on-page factors can be so easily manipulated (i.e., I could simply copy the content of your page onto mine with a little manipulation) that in a competitive space they offer no meaningful differentiating quality: every page is already heavily optimized for relevance. While on-page relevance will be sufficient for non-competitive mid-tail and long-tail phrases, where it is unlikely that anyone would spend the time and energy to optimize a page exactly for one phrase or another, it will not be sufficient and, arguably, will be without value in a more competitive environment.

The Negative Link

Finally, a more general concern with these correlation methods is that Google’s algorithm, by most accounts, can and does penalize excess. Penalties or devaluations triggered by links are difficult to tease out of the general rankings, meaning that the most-link-optimized page may bias the results. Returning to our football analogy, let’s say that one applicant claims to be 11 feet tall. The applicant is clearly lying, so he is placed at the bottom of the list. This lone outlier will pull down the correlation between Height and making the team in the same way that a penalized, over-optimized page will pull down the correlation between links and ranking.
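A small worked example with invented numbers shows how much one penalized outlier can drag a correlation down:

```python
# Invented numbers: among honest applicants, Height tracks the coach's
# ranking perfectly. One applicant lies ("11 feet tall") and is placed
# at the bottom; that single penalized outlier flips the measured
# correlation between Height and making the team.
import numpy as np

heights = np.array([70, 72, 74, 76, 78])  # honest applicants, in inches
standing = np.array([1, 2, 3, 4, 5])      # higher = closer to making the team
print(np.corrcoef(heights, standing)[0, 1])  # 1.0

heights = np.append(heights, 132)   # the "11-foot" liar
standing = np.append(standing, 0)   # penalized to the bottom of the list
print(np.corrcoef(heights, standing)[0, 1])  # roughly -0.56
```

One lying applicant takes a perfect positive correlation negative, which is exactly the kind of noise a penalized, over-optimized page would inject into a links-versus-rankings correlation study.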

Conclusions and Questions

I do think that LDA could be a primary method by which Google determines the relevance of a page to a particular query. I do think that on-page relevance to a particular query is an important ranking factor (although not an essential one; pages consisting only of images can rank based on anchor text alone). I do not think that it is a more important ranking factor than quality inbound links.

1. Is it possible to perform an additional study where we limit the universe of keywords to those where links are common enough to be a differentiating factor?
2. Could SEOMoz please stop doing interesting stuff that makes me think?
