Latent Dirichlet Allocation (LDA) Correlations Clarified
When SEOMoz announced a relationship between LDA Cosine values and Google search rankings, I immediately had reservations about how many people in the community were reading the results. Admittedly, Rand and Ben have been careful to take some of these observations with a grain of salt, making it clear that LDA by no means represents the majority of Google’s ranking algorithm.
That being said, I took special interest because, like many other SEOs who work in competitive spaces, I have long regarded on-page factors as valuable only for long-tail searches. My first and primary concern was that, because SEOMoz’s team was looking at a large keyword set without regard to competitiveness, their data would be skewed by non-competitive spaces. I covered this in greater depth in my previous post, but I will go over it briefly here.
Let’s say you have 100 applicants for 10 new developer positions and you consider a college degree the most important factor. Unfortunately, no applicants have college degrees, so you can’t use degrees to determine who gets the job. Instead, you may have to depend on who has the most experience. If I then run a study on who you chose to hire, my numbers would say that college degrees barely matter at all and that experience is clearly more important. In reality, if 10 applicants had held degrees, you would have hired all of them.
Similarly, if we analyze a long-tail search where none of the relevant pages have any backlinks, it will appear that links don’t matter and that on-page factors, like LDA, are highly correlated with rankings. In reality, Google is forced to rely on relevance in these cases. Later, when we try to compare LDA with other factors, like inbound links, it will appear stronger not because it actually is, but because in a certain percentage of cases there were no link measurements to consider at all. My analysis appears to bear this out.
Although I do not pretend to have the scientific or mathematical background to back this up, I am handy with PHP and Excel.
I pushed 100 competitive keywords and 100 long-tail keywords through Google to get their top 10 rankings. I then pushed their content through SEOMoz’s LDA tool. Then, with some minor data scrubbing, I averaged out the LDA scores for each position 1 through 10 and aggregated them based on whether the term was or was not competitive.
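For anyone who wants to repeat the averaging step, here is a minimal sketch in Python rather than the PHP and Excel used for this post. It assumes you have already collected one record per ranking result as a `(group, position, lda_score)` tuple; the function name, the group labels, and the rule of skipping missing scores are illustrative assumptions, not the exact scrubbing that was done here.

```python
from collections import defaultdict
from statistics import mean


def average_lda_by_position(rows):
    """Average LDA scores per ranking position, split by keyword group.

    rows: iterable of (group, position, lda_score) tuples, where group is a
    label such as "competitive" or "long_tail" (hypothetical labels),
    position is 1..10, and lda_score is a float or None if the tool
    returned nothing for that page.
    """
    buckets = defaultdict(list)
    for group, position, score in rows:
        if score is None:
            # Assumed scrubbing rule: drop results with no LDA score.
            continue
        buckets[(group, position)].append(score)

    averages = defaultdict(dict)
    for (group, position), scores in buckets.items():
        averages[group][position] = mean(scores)

    # Return plain dicts with positions in ascending order.
    return {g: dict(sorted(by_pos.items())) for g, by_pos in averages.items()}
```

Calling it on the full result set gives one average per position for each group, which is the table the comparison between competitive and long-tail terms is built from.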
The end result? For non-competitive, long-tail keywords, there is a very strong relationship between LDA and rankings. For competitive, short-tail keywords, there is little relationship between LDA and rankings. Most importantly, when you aggregate the data, the slope from the long-tail terms overwhelms the lack of a trend in competitive terms…
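To see why pooling the two groups can hide the flat competitive line, here is a small Python sketch that fits a least-squares slope to the per-position averages for each group and for the two combined. The helper names and the idea of pooling by simply averaging the two groups at each position are assumptions made for illustration; this is not the Excel calculation behind the results above.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def pooled_slope(positions, group_a, group_b):
    """Slope after averaging two equally sized groups at each position."""
    pooled = [(a + b) / 2 for a, b in zip(group_a, group_b)]
    return slope(positions, pooled)


positions = list(range(1, 11))

# Made-up numbers purely to illustrate the effect, NOT measured data:
# long-tail averages fall steadily with position, competitive ones are flat.
long_tail = [0.80 - 0.03 * (p - 1) for p in positions]
competitive = [0.60 for _ in positions]

print("long-tail slope:  ", slope(positions, long_tail))
print("competitive slope:", slope(positions, competitive))
print("pooled slope:     ", pooled_slope(positions, long_tail, competitive))
```

With these invented inputs the pooled line still slopes downward even though one of the two groups has no trend at all, which is the same pattern described above: the aggregate inherits its trend entirely from the long-tail terms.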
What does this mean?
- Does this mean LDA doesn’t matter?: Were you listening? Relevance is still a key factor; it is just not nearly as important once popularity measurements can be considered. Don’t spend all your time trying to tweak your LDA score. Once your site is suitably relevant, focus on external factors.
- Does this mean SEOMoz was wrong?: Actually, in my opinion, this data indicates that SEOMoz has discovered the primary tool by which Google determines page relevance. It is hugely important.
- What should be done about it?: Hopefully, the good folks at SEOMoz will run some tests to help us figure out how much it matters in more competitive spaces. My quick Excel-Fu (admittedly with the assistance of Jeff Staub, our COO) hardly meets scientific standards.