W3C HTML Validation and Search Engine Optimization
It has been a while since I posted some of Virante’s research to the blog, and a good friend and former COO, Bob Misita, called me out on it. I figured I would release some of the data from a recent study we did on the relationship between W3C HTML validation and web page rankings. Because validation is quite complex, we chose to take a macro look rather than follow our traditional methodology of getting individual sites into the SERPs via sitemaps and then tweaking individual independent variables.
In particular, we looked at the W3C validation results for the top-ranking pages on approximately 100 separate keywords in Google, Yahoo, MSN Live and Ask. For each keyword, we extracted the top 10 ranking sites, measured the number of errors reported by a W3C validation check, and used multiple statistical models to determine whether the individual rankings of the sites could be associated with the number of validation errors.
The more rudimentary statistics were all we needed to dismiss, fairly easily, the assumption that validated content will perform better in the search engines, that is, in Google, Yahoo, MSN Live or Ask.
The erratic relationship between the average number of validation errors and ranking position is fairly evident from the graph above. But rather than assume that the averages across all 100 keyword searches were accurate, we decided to compute the least squares regression for each and every keyword on each engine (400 different result sets).
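For anyone who wants to try this on their own result sets, here is a minimal sketch of the per-keyword regression, assuming ranking position is modeled as a linear function of validation error count. The function name and the sample numbers are made up for illustration; they are not data from our study.

```python
from statistics import mean


def least_squares_slope(errors, ranks):
    """Slope of the least squares line predicting rank position from error count.

    A positive slope means more validation errors go with a worse (higher) rank.
    """
    if len(errors) != len(ranks) or len(errors) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_bar, y_bar = mean(errors), mean(ranks)
    # Sum of squared deviations of the error counts.
    sxx = sum((x - x_bar) ** 2 for x in errors)
    if sxx == 0:
        raise ValueError("all error counts are identical; slope is undefined")
    # Sum of cross-deviations between error counts and rank positions.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(errors, ranks))
    return sxy / sxx


# Hypothetical top-10 result set for one keyword on one engine.
ranks = list(range(1, 11))
errors = [12, 87, 3, 45, 0, 130, 22, 9, 64, 31]
slope = least_squares_slope(errors, ranks)
print(f"slope: {slope:.4f} rank positions per validation error")
```

Running this once per keyword per engine gives one slope per result set, which can then be summarized by engine.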
As you can see, the slope of the Least Squares Regression Line is barely positive, the largest being Yahoo’s at 3/1000 (0.003 ranking positions per validation error). If the confidence levels were high, you could assume that for every 333 validation errors removed from your page, your ranking would rise by one position. However, the confidence levels were not sufficient and, perhaps most glaring, fewer than 2% of the sites tested had more than 333 validation errors (meaning the vast majority of sites could not benefit from such a change).
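The 333 figure is just the reciprocal of the slope. A quick sketch of that arithmetic, with a hypothetical helper name:

```python
def errors_per_position(slope):
    """Number of validation errors that correspond to one ranking position."""
    if slope <= 0:
        raise ValueError("slope must be positive for this interpretation")
    return 1.0 / slope


yahoo_slope = 3 / 1000  # largest slope reported above
print(round(errors_per_position(yahoo_slope)))  # prints 333
```

The smaller the slope, the more errors you would have to remove to move a single position, which is why the other three engines look even less promising.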
Even though validating sites appear to do better in Live and Ask than in Google and Yahoo, the aforementioned regression slopes quickly counter this. It is possible that W3C validation may play a role in being indexed (although I think this is unlikely). Importantly, we saw similar variation in the sites the four search engines allowed to rank, meaning that there appears to be no threshold score required to rank in any of these search engines.
So, there you have it. One less thing to worry about. While I still think HTML validation is a worthy cause in and of itself, one would be hard-pressed to prove that it is directly, positively correlated with one’s search rankings, much less a cause of them.