The Google Cache: Search Engine Marketing, SEO & PPC

Google “Black Owned Business” Attribute in Google

admin — Wed, 05 Aug 2020 15:15:54 +0000

Google recently added a “Black-Owned” business attribute to Google My Business, stirring controversy among some members of the SEO community. I wanted to take the time to be absolutely clear where I fall on the issue. Not many people know this, but my double major in college was Political Science and African American Studies, so this issue is close to my heart.

Let’s Start with a Story

Imagine inviting some neighbors over to play Monopoly. Three players are white, one is black. This is the 1960s, so the white players create unfair rules: the black player can only buy property on the first row, gets randomly thrown into jail and collects only half the usual amount when passing Go. We don’t know who would win the game, but we know who would lose: the black player. As far as analogies go, this is fairly generous towards U.S. history as it ignores slavery altogether, but I think just considering Monopoly with Jim Crow-like laws is sufficient to make my point.

Now, imagine the next morning the kids of those same players show up and decide to play Monopoly as well. The white kids know that racism is wrong and say “we aren’t going to do that anymore, from now on the rules are completely equal.” The black kid then reaches to the bank to start the game over but the others exclaim, “wait, no, we start where our parents left off.”

This is American capitalism. It is relatively uncontested that parental education influences a child’s future quality of life. We also know that wealth in general influences scores on cognitive skill tests in early childhood, behavioral problems in schools, high school completion, teen pregnancy, and most importantly, future wealth. I am 38 years old. My parents went to segregated schools in the South. That means that many if not most of my Black peers were born into a system rigged against them, just like the Black Monopoly player described in my analogy. This is because the largest unequal value transfer in capitalist economies is the transfer of wealth and knowledge from parents to their children. Capitalism claims that it is a moral distribution of wealth insofar as people succeed based on how much value they add to the economy. However, we know this is false. Money and knowledge are transferred both annually in the form of expenditures on children (~$12,800 per US family) and then in a lump sum via inheritance later in life. Baby boomers are poised to transfer $30,000,000,000,000 to their progeny for no more than winning the genetic lottery. It is just like starting a game of Monopoly where the previous players left off. Thus, the Monopoly analogy is a double-edged sword…

Either (1) the system is very deterministic (playing the game well and having resources causes one to win), which means that generational play will reinforce the outcomes of the early game-play’s unfair rules, or (2) the system is non-deterministic and playing the game well and having resources have nothing to do with success; it is just totally arbitrary. Neither of these is morally acceptable.

Modeling Interventions

Well, since Monopoly is a fairly straightforward game, we can create simulations and play out what happens in the scenario described above. Several years ago I programmed a simple Monopoly simulation (there are tons of better open-source ones now on GitHub if you want to recreate the experiment). I set parameters such that players used the same strategies but, at the same rate, made random choices outside the strategy. This allowed me to control for ability (adopting the principle that two individuals who make the exact same decisions, averaged over time, ought to have similar outcomes). I first ran several generations of the game where one player was unable to own property and was thrown into jail whenever she/he landed on an owned property. Of course, this is an under-representation of how bad it actually was in the antebellum South because I let the players start with the standard amount of money distributed among players, rather than the nothing that was given to slaves upon entering the United States. After several generations, I changed the rules to Jim Crow restrictions, where the one player could buy some property and could earn some money, but at a significantly reduced pace. And then, after multiple generations, I equalized the rules for all players. Finally, I ran a Monte Carlo simulation (running the same program hundreds of thousands of times, modifying variables to compare outcomes) to determine what, on average, the outcomes were under various interventions.

So, what was the outcome? Well, in the scenarios where I only equalized the rules, the discriminated-against players never caught up. Let me make this clear because it is probably the most important finding. In a sufficiently deterministic economic system with generational wealth transfer, it is impossible for a class of individuals to recover from meaningful economic discrimination, even thousands of generations later, without counter-balanced interventions. Burn this into your head. “Leveling the playing field”, so to speak, will not undo previous discrimination. Now that we know this truth and can demonstrate it mathematically, if you consider the outcome of previous discrimination something worth fixing, we must move on to other interventions. You can’t hide behind “I don’t see color” or “everyone should be treated equally” if you also wish to undo the racial outcomes of our past.

Briefly, I want to address a common response that does have some merit. There is, to some degree, an unfairness in that we should have to remedy the sins of our fathers. Why is this our moral burden? This is where one must decide whether they intend to do only that which they are obligated to do, or will rise to do that which they are able to do. Will I do what I must, or what I can?

So What Does This Mean

This is why the dichotomy isn’t “racist” vs “non-racist”, but rather “racist” vs “anti-racist”. No one in their right mind would voluntarily play Monopoly from a position of such great disadvantage, yet society demands it of minorities and calls it “reverse-racism” when they point out ways to address those disadvantages. However, with the computational evidence in place, we know that we actually have to do more than equalize the rules in order to undo the unfair outcomes. I tested multiple methodologies in order to address the disparities in the Monte Carlo simulation. These included (1) the formerly discriminated-against player making fewer mistakes to simulate educational improvements, (2) the formerly discriminated-against player paying lower rates for rent when landing on properties and (3) various forms of cash outlays to the formerly discriminated-against players. Efficacy in bringing outcomes into line followed that same order: education helped a little, rent assistance helped a little more, but cash outlays worked the best. Ultimately, the last method worked the best because it solved all problems… extra money meant a player who made mistakes wasn’t ruined, a player who was unlucky and landed on a wealthy property wasn’t bankrupted, and when one of the few opportunities struck to buy properties to increase their earning potential, they could. This is the case for reparations.
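The pattern described above does not depend on the details of Monopoly. The following is a minimal sketch in Python, not the simulation described in this post: a toy model in which wealth earns a noisy return each round, every player collects a fixed salary, and children inherit their parents' final wealth. All names and numbers (a 5% return, the halved rate and salary under unfair rules, the generation counts) are illustrative assumptions.

```python
import random

def play_generation(wealth, rate, salary, rounds, rng):
    """One 'game': each round wealth earns a noisy return plus a fixed salary."""
    for _ in range(rounds):
        wealth = wealth * (1 + rate * rng.uniform(0.5, 1.5)) + salary
    return wealth

def run_history(rng, unfair_gens=3, fair_gens=10, rounds=20, cash_transfer=False):
    """Return the disadvantaged lineage's final wealth as a fraction of the advantaged one's."""
    rate, salary = 0.05, 1.0              # assumed values, chosen only for illustration
    advantaged = disadvantaged = 10.0     # both lineages start with the same money
    for _ in range(unfair_gens):          # unfair rules: half the return, half the salary
        advantaged = play_generation(advantaged, rate, salary, rounds, rng)
        disadvantaged = play_generation(disadvantaged, rate * 0.5, salary * 0.5, rounds, rng)
    if cash_transfer and disadvantaged < advantaged:
        disadvantaged = advantaged        # one-time cash outlay closing the inherited gap
    for _ in range(fair_gens):            # equal rules, but each child starts where the parent left off
        advantaged = play_generation(advantaged, rate, salary, rounds, rng)
        disadvantaged = play_generation(disadvantaged, rate, salary, rounds, rng)
    return disadvantaged / advantaged

def monte_carlo(trials=500, seed=0, **kwargs):
    """Average the final wealth ratio over many independent histories."""
    rng = random.Random(seed)
    return sum(run_history(rng, **kwargs) for _ in range(trials)) / trials

if __name__ == "__main__":
    print("equal rules only:   ", round(monte_carlo(), 2))
    print("equal rules, longer:", round(monte_carlo(fair_gens=30), 2))
    print("with cash transfer: ", round(monte_carlo(cash_transfer=True), 2))
```

In this toy model the ratio stays well below 1 however many equal-rule generations are added, because both lineages compound at the same rate from unequal starting points. Only the transfer brings the average ratio back to roughly 1.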

What Does This Have To Do With Google

The political will in this country to address historic racism and its obvious impact is still far too little to expect action at a federal level. Consequently, voluntary attempts to address these problems are our only choice for now. Google’s action to give business owners the ability to label their business as Black-Owned (or hopefully in the future all sorts of historically oppressed groups) is a voluntary opportunity to counter the deterministic tide of capitalism which keeps the symptoms of our racist history alive, whether or not you believe contemporary U.S. culture is racist.

So, in a word, Yes. I support Google’s addition of the Black Owned business attribute to GMB.

On Mathematics, Experimentation and Value

admin — Thu, 16 Jul 2020 18:55:43 +0000

Jeff Ferguson of Amplitude Digital recently authored a piece entitled “Do We Have the Math to Truly Decode Google’s Algorithms?” on the venerable Search Engine Journal with substantial assistance from Data Analytics Consultant Jennifer (Fields) Hood. Please read the article before continuing with mine. While the article has much to commend itself, I believe that it presents faulty logic, mischaracterizations, and ultimately misleading conclusions.

The answer to the question “Do we have the math to truly decode Google’s algorithms” is an emphatic No. However, the primary culprit behind this unfortunate reality is not the incompetency of industry practitioners chiseling away at the algorithm, rather the algorithm itself now employs sufficient black-box machine learning techniques as to render it impossible for anyone without a priori knowledge of the inputs to “decode” the algorithm. Nevertheless, it is essential that we don’t forget the maxim that “perfect is the enemy of good” in our search for knowledge about how to help our clients’ sites perform better in the search engines.

The Central Thesis

The central thesis of Ferguson’s piece seems to go something like this…

Studies conducted by “Gentlemen Scientists” without “any form of testing or certification”, “with rare exceptions” are incapable of analyzing the “complex systems found in search engine algorithms and the information they organize.”

So, how does he proceed to defend this claim? The argument appears to rest on the following premises:

Given the complexity of Google’s algorithm, studying that algorithm requires a person of a certain degree of formal mathematics education.
Specific criticisms of a handful of study models
Weak correlations are not worth our consideration.
The standard for publishing should be “proof”.
Research which leads to the creation of a product or service is unethical.
Sampling is biased.
There is no peer review in our industry.
Epistemic positivism regarding truth and knowledge.
A false equivalency between certainty and usefulness.

Whew. Ferguson has certainly given us a lot to work with in his piece, so let’s go to work. Before we dive into each individual critique, let me start with a broader consideration.

Building a Dam

Ferguson begins his piece with an analogy. He recalls the story of William Mulholland, a self-taught civil engineer who, despite several innovations and successes, ended his career in tragedy when a dam he inspected and approved failed, causing the deaths of hundreds. While no doubt this is a tragedy, I think it is wholly disanalogous to the work of an SEO studying the search engines. To be clear, my concern with the analogy is not the risk involved (to which Ferguson alluded at the end of his post), but rather with the projects themselves. The correct analogy would be if an untrained SEO were attempting to study the SERPs for the purpose of creating their own search engine, or if Mulholland were not attempting to build a dam, but rather to find a way over, under, or through that dam (consequences be damned, pun fully intended). An SEO does not need the requisite knowledge to build a search engine in order to poke holes in the algo, nor does a layman need a formal degree in civil engineering to poke holes in a dam. This analogy is one giant false equivalency.

But I think there is another issue at stake here. There is no evidence presented that Mulholland’s failure would have been prevented had he received a formal education. Mulholland consulted on several dams that still stand today, and several dams constructed by engineers with formal education have failed. In fact, Mulholland himself raised concerns about the “perilous nature of the face of schist on the eastern side of the canyon in his annual report to the Board of Public Works in 1911” that was dismissed by construction manager Stanley Dunham. I do not mean this to be an indictment of formal education, rather that the failure of the St Francis Dam isn’t concrete evidence (pun fully intended again) of Ferguson’s central thesis.

Addressing Critiques

I will now respond briefly to each of the critiques leveled by Ferguson in his writings. I do not intend to, in any way, hand-wave away concerns about the quality of studies presented by SEOs. We can improve on our studies and I have personally written on that subject and spearheaded with a group of technical SEOs a peer-reviewed research contest which has always included at least one statistician as a judge. My position is simply that we can gain valuable insights from studies of varying degrees of sophistication and confidence. With that being said, let me begin.

Claim 1: Studying Complexity Requires Formal Education

Aside from the obvious response that “having knowledge” is the key to success rather than how one attained it, I don’t think it is necessary for someone to have received formal statistical education, much less received a degree in mathematics, to perform valuable research. Take for example the Designer Jewelry Case Study on Amplitude Digital’s site. They claim to have employed techniques like generating “high quality backlinks” at a “healthy pace” which “send powerful credibility signals that Google… can’t ignore.” These claims are not backed up by any experimental design at all, much less one that is scientifically rigorous. Yet, Amplitude Digital makes truth claims about not only what factors matter in Google’s algorithm, but that their usage of those directly affected search traffic. Maybe it did, or maybe their competitors all screwed up their own sites, accounting for the organic growth of their client. I happen to think that Amplitude Digital’s use of case studies is completely justified, is valuable to potential clients and to the community as confirmatory evidence of search engine optimization, and in no way needs statistical validation to provide those modest contributions. I wonder if Ferguson feels the same way? Or will he say that “it wasn’t meant to be scientific”, a retort he finds eminently objectionable.

Claim 2: Specific Criticisms of a Handful of Study Models

Ferguson begins by taking aim at Rob Ousbey’s 2019 presentation which included, among many other findings, a relationship between user engagement and rankings. He leans on the expertise of Jen Hood who indicates concerns with the correlation model used stating “the easy test would be: if you can rank on Page 1, especially the top of the page, without previously having any engagement, then the engagement is most likely driven by placement, not the other way around.”

Unfortunately, this would not suffice as an effective experimental model for determining whether engagement or links are more important. Imagine an algorithm made up of only three features: relevant content, links, and engagement. Now, imagine that you attempt to rank a page only with links, only with content and only with engagement. If you rank the page only with links, does that mean engagement isn’t also a factor? If you rank only with relevant content, does that mean links and engagement aren’t factors? What if you set up two sites and used links on one and engagement on the other and the links won, would that mean links matter more? The answer to all those questions is No. In fact, the absolute weight of each of these factors is practically unknowable; at best we can learn their relative weights, expressed in terms of independent metrics from which we could derive an equivalency (an X increase in links is equivalent to a Y increase in CTR).

Now, we could run experiments that could give us some knowledge. For example, we could start with a random set of keywords, choose a random link in the top 10 for each of those, and mimic various engagement behaviors (click through, pogo sticking, etc) and measure ranking changes for the cohort which received the engagement vs the others which did not.
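One way to analyze an experiment like this is a permutation test on the ranking changes of the two cohorts. The sketch below is only an illustration of that idea under stated assumptions: the function names are hypothetical, the rank changes are made-up numbers (positions gained), and a real study would need far more keywords than shown here.

```python
import random

def assign_cohorts(keywords, rng):
    """Randomly split keywords into a treated cohort and a control cohort."""
    shuffled = keywords[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(treated, control, rng, n_permutations=5000):
    """One-sided test: how often does a random relabeling of the cohorts
    show an improvement at least as large as the one observed?"""
    observed = mean(treated) - mean(control)
    pooled = treated + control            # new list, so the inputs are not modified
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(treated)]) - mean(pooled[len(treated):])
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_permutations + 1)

if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical rank changes after the test period (positive = moved up).
    treated_changes = [3, 2, 4, 3, 5, 2, 4, 3]
    control_changes = [0, 1, -1, 0, 1, 0, -1, 0]
    effect, p = permutation_p_value(treated_changes, control_changes, rng)
    print(f"mean improvement: {effect:.2f} positions, p = {p:.4f}")
```

Because the keywords are assigned to cohorts at random, a small p-value here points to the engagement itself rather than to some pre-existing difference between the pages.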

Correlation studies are not and never have been intended to show causal relationships, but rather give us hints about how the algorithm may work and thus spur on further investigation. Often that investigation is not an individual study but instead the combined behaviors of hundreds of SEO practitioners testing the interventions on their own sites.

Weak Correlations are not Worth Our Time

I find this objection particularly frustrating. We know that no single potential ranking factor is going to explain the majority of the algorithm. We could imagine a simple ranking formula that involves 10 metrics, each weighted the same. A correlation study would show that each individual factor has a fairly low correlation coefficient. If we discovered all 10, we could potentially build a model that explains 100% of the ranking, despite each individual ranking factor having a small coefficient. Weak correlations are going to be part of any complex system. You can try this on your own computer right now with Excel. Create 10 columns x 10 rows and fill each cell with =RAND(). Create an 11th column that is the sum of the previous 10. Now choose any column and the 11th column, go to Data > Data Analysis > Correlation. I just ran this test and the Pearson correlation coefficient was .16.
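The same experiment can be run outside of Excel. The following Python sketch (function names are my own, for illustration) builds the ten random columns and the sum column, then reports the Pearson correlation of each column with the sum. For ten independent, equally weighted factors the expected correlation is 1/√10, or about 0.32. With only 10 rows a single run can land well above or below that, which is consistent with the .16 reported above.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def factor_correlations(n_factors=10, n_rows=10, seed=None):
    """Correlate each random 'ranking factor' with a score that is their plain sum."""
    rng = random.Random(seed)
    columns = [[rng.random() for _ in range(n_rows)] for _ in range(n_factors)]
    score = [sum(col[i] for col in columns) for i in range(n_rows)]
    return [pearson(col, score) for col in columns]

if __name__ == "__main__":
    print("10 rows:     ", [round(c, 2) for c in factor_correlations(n_rows=10)])
    print("100,000 rows:", [round(c, 2) for c in factor_correlations(n_rows=100_000)])
```

With many rows the squared correlations add up to approximately 1: ten individually weak factors that together explain the entire score.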

The Standard for Publishing Should Be Proof

The words “prove” and “proven” are thrown around quite a bit. I think we need to be very careful here with our terminology. Inferential analysis by its very nature cannot produce “proof” or “certainty”. It can only produce likelihood under the assumption that the future will behave the same as the past. Correlation studies in particular are not intended to prove anything, rather to provide evidence towards some conclusion. Typically, SEOs implement changes on their own and their customers’ sites, building anecdotal evidence (case studies), look to correlation studies for validation or invalidation of their findings at scale, poll the community through interactions on social media, blogs and forums, and, if it is very important, might go so far as to perform a controlled experimental study. Would it be nice to have a peer-reviewed journal and dedicated researchers with proper credentialing regularly submitting? Sure, many of us have discussed forming one and, for a while, SEMJ attempted this very thing. However, the question is whether such sophistication is required to produce valuable information. I think not.

Research which leads to the creation of a product or service is unethical.

Let me quote the exact text to which I am referring…

When I (Ferguson) mentioned to Jen Hood how many of the studies she reviewed have spawned new guiding metrics or entirely new products, she was surprised anyone takes those metrics or products seriously.

“Anyone claiming that they have a metric which mimics Google is asserting that they’ve established many cause-effect relationships that lead to a specific ranking on Google,” Jen wrote, referring to Moz’s Domain Authority.

No. It is difficult for me to take Ferguson seriously at this point (although I give Jen the benefit of the doubt as she is not part of the industry and does not have key background information on Domain Authority).

Moz, with regard to Domain Authority, makes no causal claims. Humorously, the masthead on my Twitter account for two years was “DOMAIN AUTHORITY IS NOT A RANKING FACTOR” because so many people made incorrect assumptions about it. In fact, even though I now work for System1, my pinned tweet is still Google Does Not Use Moz’s Domain Authority as a Ranking Factor.

First, Moz doesn’t claim to have a metric which mimics Google. Moz claims to have a metric which predicts with some degree of accuracy the likelihood a site will rank based solely on domain-level link metrics. Domain Authority is a machine-learned metric trained on SERPs. We make no claim that there is a cause-effect relationship between increasing Domain Authority and increasing rankings, or increasing any of the constituent features of Domain Authority and increasing rankings. Perhaps Jen would like to read up on some of my articles on Domain Authority:

So, what is the value of Domain Authority if it doesn’t purport to play a causal role? Well, I give a handful of scenarios in my piece “In Defense of Domain Authority”. Perhaps the most obvious usage is to compare your Domain Authority with your competitors’ in order to help determine whether ranking difficulties are more likely due to links or to poor content. Used as a rule of thumb, DA, like DR and CF/TF, can be very useful.

Perhaps what is most frustrating about this particular claim is that the data scientists, engineers and mathematicians who work on Domain Authority are eminently qualified – they include ex-Google engineers, a Statistics Professor with a PhD in Applied Mathematics, and an Artificial Intelligence expert with a BS and PhD in Applied Mathematics. I was, by all accounts, the least qualified person in the room, but what I did bring was domain knowledge – an incredibly important part of being an effective data scientist which is curiously missing from Ferguson’s piece.

Sampling is Biased

This depends on the study. Traditionally, I have approached sampling from a number of directions, recognizing that there is no perfect solution. At Moz, when studying link graphs, we created an approach to sampling URLs from the web based off of a methodology originated by Google for a similar purpose. When sampling keywords, we would normally use a stratified sample of keywords based on search volume and CPC.

With regard to Jumpshot in particular, data was acquired from Avast and AVG (desktop and android) users which represented a significant proportion of the United States. We were well aware of the biases in the data (no Mac users, for example).

It is important to point out that simply because we can identify a way in which a sample is imperfect does not mean that it necessarily creates inaccurate results. Take, for example, national polls for presidential campaigns. A relatively small number of respondents can give predictions accurate to within a few percentage points. However, we know that sampling is imperfect in polling for a wide array of reasons (types of people who won’t answer calls from unknown numbers, people with unlisted phone numbers, people who don’t have land lines, people who are available at a certain time of day, people who are unwilling to give political opinions over the phone).
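The "few percentage points" figure follows from the standard margin-of-error formula for a proportion. A quick sketch, assuming an idealized simple random sample (which, as noted above, real polls never quite achieve) and a 95% confidence level:

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Half-width of the confidence interval for a proportion from a simple
    random sample. z=1.96 corresponds to a 95% confidence level, and
    proportion=0.5 is the worst case (widest interval)."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

if __name__ == "__main__":
    for n in (500, 1000, 2000):
        print(f"{n} respondents: +/- {100 * margin_of_error(n):.1f} points")
```

With 1,000 respondents the margin is roughly 0.031, or about 3.1 percentage points. It shrinks only with the square root of the sample size, so quadrupling the sample merely halves the margin.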

If one wishes to be contrarian about the outcome of a particular study and believes the cause is poor sampling, then they need to explain the causal chain which converts the sampling issue into a biased outcome. And even if that causal chain exists, the results can be valuable as long as we are aware of the bias.

There is No Peer Review in our Industry

There is no formal peer review, but there is certainly scrutiny. This is not unique to our industry – in fact, it is such a big problem in academia that it has been dubbed the “Replication Crisis”. In 2016, 70% of scientists claimed they had failed to reproduce another scientist’s experiment. While I certainly encourage reproducing studies, if we set that as a necessary standard we pull the rug out from under far more than SEO… from all of modern science.

Epistemic positivism regarding truth and knowledge

Ok, so this is a little esoteric, but it is important to respond. The claim “there is no truth without experimental verification of that truth” is a self-refuting claim. It literally undermines itself because you cannot run an experiment to prove that experiments are the only source of truth… it would presuppose that experiments are the source of truth and thus be circular in its reasoning. Where is the experiment which shows that Ferguson’s article is true?

We can get along making reasonable inferences to the best explanation by considering testimony, correlation, direct experience, and experiments. And we can give greater epistemic warrant to experiments over testimony, for example, but we have to be careful. Ferguson’s entire argument is based on the testimony of one analyst. And is he qualified to understand her? And how would he know if he was or was not?

This is all nonsense. This degree of skepticism devolves into meaninglessness.

A false equivalency between certainty and usefulness

I think this is the most important of all the critiques. We don’t have to be certain that a tactic works or a ranking factor is real in order for it to be useful – we merely need to be right more often than our competitors. That’s it.

Concluding Thoughts

If the article followed the title, I would have no concerns with its contents. Of course we do not have the mathematics to decode Google’s algorithm. Such mathematics do not exist. But this article quickly moved away from the question of unraveling Google’s algorithm to a much broader question: do we have the mathematics to learn about the algorithm and optimize accordingly? To that question, the answer is an emphatic Yes. It is borne out every day by the successes of our fellow SEOs. I should hope that Jeff Ferguson believes that; otherwise, what is he selling?

How to Send High Five on Peloton

admin — Sat, 18 Apr 2020 15:31:05 +0000

The search results were not specific enough so I thought I would write this out. In order to send a high five to a person on Peloton you need to tap directly on their face. Tap anywhere else and it won’t work. Apply directly to the forehead. Apply directly to the forehead. Apply directly to the forehead.

One Way You Can Help: COVID-19 Testing Sites

admin — Sun, 22 Mar 2020 01:50:25 +0000

This pandemic has caused so much hurt, but it has also brought out so much heroism, so much love and beauty and care and compassion. Everyone seems to be looking for a way to help. Well, here is just one more way.

It has been hard to cut through all the news, there has been so much, to find good information. Luckily there have been some major efforts behind reporting COVID-19 cases. However, one thing lacking in the United States has been a comprehensive listing of testing sites. If you have time, I would really appreciate your help.

I have put together a site called COVID Testing Sites that intends to build as comprehensive a list as possible of COVID-19 testing sites in the United States. I have already funded the collection of the first 200+ testing sites, making it the most comprehensive list online so far, but I know it represents only a fraction of the sites online. You can see the latest results here.

There is a simple form on the homepage which allows you to put in an address of a location. If you put in the address, I or one of my volunteers or paid assistants will verify the location and then add it to the list. It is that simple. So, how do you find testing sites? Well, if you click on any of the states below, you can find the latest search results for COVID-19 testing sites in your state either in Google or Google News. Just reading through one or two articles will often allow you to find the announcements of multiple sites. If we just had one person in each state do this each day, we could create a complete list. Thank you in advance for your help! Be safe.

Click on one of the links below
Visit the first few results
Copy and paste any addresses into the form at COVID Testing Sites

Google	Google News
Alabama	Alabama
Alaska	Alaska
Arizona	Arizona
Arkansas	Arkansas
California	California
Colorado	Colorado
Connecticut	Connecticut
Delaware	Delaware
Florida	Florida
Georgia	Georgia
Hawaii	Hawaii
Idaho	Idaho
Illinois	Illinois
Indiana	Indiana
Iowa	Iowa
Kansas	Kansas
Kentucky	Kentucky
Louisiana	Louisiana
Maine	Maine
Maryland	Maryland
Massachusetts	Massachusetts
Michigan	Michigan
Minnesota	Minnesota
Mississippi	Mississippi
Missouri	Missouri
Montana	Montana
Nebraska	Nebraska
Nevada	Nevada
New Hampshire	New Hampshire
New Jersey	New Jersey
New Mexico	New Mexico
New York	New York
North Carolina	North Carolina
North Dakota	North Dakota
Ohio	Ohio
Oklahoma	Oklahoma
Oregon	Oregon
Pennsylvania	Pennsylvania
Rhode Island	Rhode Island
South Carolina	South Carolina
South Dakota	South Dakota
Tennessee	Tennessee
Texas	Texas
Utah	Utah
Vermont	Vermont
Virginia	Virginia
Washington	Washington
West Virginia	West Virginia
Wisconsin	Wisconsin
Wyoming	Wyoming
Washington DC	Washington DC

A New Journey Awaits – Moving from Moz to System1

admin — Tue, 25 Feb 2020 16:49:28 +0000

It is with crazy mixed emotions that I let folks know that today is my last full-time day at Moz. I joined Moz back in 2015 after working 10 years with Virante (now Hive Digital). As Principal Search Scientist, I saw a bright future of research, proof of concept development, and evangelization of both Moz and SEO in general. I was able to accomplish much of that with the help of some amazing people at Moz – from Rand and Sarah to my bosses Adam, Rob and Rob – from my fellow SMEs Britney, Dr. Pete and Miriam, to engineering leadership and staff (Shawn, Scott, Kshitij, Ben, Chas, Brian, Neil, David, Evan and Tony and so many many more) – from Felicia and Brittani and Mallari and Christina and Rebecca the list just goes on and on. And I hope to continue to do that with Moz via a consultative relationship.

There is one thing that I want to make clear. Moz cares. For all the mistakes and errors and complaints we can think about, there is one thing that is undeniable – they love and care for their employees.

However, coming up on my 5 year anniversary, I realized I just wasn’t having the impact that I wanted. I recognize that it is completely a “1st world problem”. Moz has been good to me in so many ways. But I want to have a profound impact on the web so I think I need a different platform.

What is in store for me?

I am heading to System1. System1 is unique in that they have an incredible portfolio of powerful sites, many of which date back to Internet 1.0, most of which have little to no search engine optimization, much less content optimization. I have worked with the leadership of the company in the past on multiple occasions on non-SEO projects.

I am very much looking forward to getting my hands dirty in the SERPs once again. Don’t read anything into that @JohnMu, it’s just an expression.

See you in the SERPs!

Bad Faith: GOP, Impeachment and Simple Statistics

admin — Thu, 14 Nov 2019 18:52:20 +0000

Ostensibly, the purpose of public impeachment hearings is to collect testimony from witnesses to determine whether the President has committed impeachable offenses. Unfortunately, these types of public affairs tend to become political theater, as individuals from both parties try to grandstand rather than conduct a proper investigation. Once again, legislators have come under attack for grandstanding rather than properly questioning the witnesses. But is this a fair characterization of the Republican representatives? Well, luckily we can use some fairly simple statistics to compare the behavior of the Democratic and Republican representatives in order to determine whether one is acting in bad faith.

If the goal of the hearings is to extract testimony from witnesses, then we have at our hands a few simple metrics which describe how effective each side is at eliciting testimony.

Question Density: To what degree did a political party’s representatives ask questions vs. make statements.
Asker-Answerer Ratio: To what degree did a political party’s representatives spend their time speaking vs. their witnesses.
Answer Length: To what degree did a political party’s representatives allow the witnesses to give expository answers vs. yes/no leading questions.

Thanks to the good folks over at rev.com, we have a handwritten transcript of the Taylor/Kent testimony on Wednesday, making this simple analysis possible.
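The three metrics above can be computed from any transcript that has been split into speaker turns. The sketch below is only an illustration: the post does not give the exact definitions behind its figures, so the input format (a list of role and text pairs) and the normalizations (questions per 100 questioner words, witness words per questioner word, mean words per answer) are assumptions.

```python
def word_count(text):
    return len(text.split())

def hearing_metrics(turns):
    """Compute the three metrics for one party's questioning.

    turns: list of (role, text) pairs in speaking order, where role is
    'questioner' (a representative) or 'witness'. This format is hypothetical.
    """
    questioner_words = sum(word_count(t) for role, t in turns if role == "questioner")
    witness_words = sum(word_count(t) for role, t in turns if role == "witness")
    # Count question marks as a simple proxy for the number of questions asked.
    questions = sum(t.count("?") for role, t in turns if role == "questioner")
    answer_lengths = [word_count(t) for role, t in turns if role == "witness"]
    return {
        "question_density": 100 * questions / questioner_words,
        "answer_to_asker_ratio": witness_words / questioner_words,
        "mean_answer_length": sum(answer_lengths) / len(answer_lengths),
    }

if __name__ == "__main__":
    # Made-up turns, only to show the input shape.
    sample = [
        ("questioner", "When did you arrive in Kyiv?"),
        ("witness", "I arrived in June of that year."),
        ("questioner", "That is my statement. Do you agree?"),
        ("witness", "No."),
    ]
    for name, value in hearing_metrics(sample).items():
        print(f"{name}: {value:.2f}")
```

Running the function once on the turns from each party's questioning time gives directly comparable numbers for the two sides.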

Question Density

How often did each party’s representatives use their time to ask questions of the witnesses? How often did they use their time to make statements? The higher the question density, the more engagement with the witness.

The question density for Democrats was 3.01 vs Republicans at 2.28 (measured as number of questions asked vs words used). Thus, for our first metric, it is clear that Democrats were more intent on hearing what the witnesses had to say than the Republicans.

Asker-Answerer Ratio

The next metric we can use to determine whether the investigators are grandstanding rather than soliciting testimony is simply to compare the amount of time they spend talking vs. their witnesses. Expressed as a function of number of words used, this also gets at whether the questioner is asking leading questions.

As we can see, the witnesses gave 60% longer responses relative to the length of the Democrats’ questions than relative to the Republicans’ questions, even though Republicans spent a roughly identical length of time “questioning”. This indicates that the Republicans were either using their time to make statements or to ask leading questions that only deliver a yes/no answer.

Answer Length

Finally, we can look at the simple average length of answers to questions by Democrats and Republicans respectively.

Answers to Democrat questions were 20.54 in length on average, while answers to Republican questions were 16.41. While that might not seem like a lot, over the course of several hours of testimony, it adds up.

Concluding Thoughts

In all three simple areas of analysis, Republicans spent much more time making statements than actually trying to solicit testimony from witnesses. This is indicative of a hearing where one group is interested in bringing out testimony while the other is interested in suppressing testimony. The Republicans in the House are acting in bad faith. If they are right, the truth shall set them free. But we can’t know the truth without asking questions, something Republicans seem very hesitant to do.

Google Cache Banking Accounts

admin — Wed, 13 Nov 2019 19:08:33 +0000

It appears that Google has decided to jump into the banking account game. I am guessing that I will receive some form of communication from Google to hand over this site given its name. The Google Cache started off as a protest site regarding the use of a “Cache” in search results. To this day, I still hold to the principle that “Caching” should be opt-in while “Indexing” should be opt-out. Nevertheless, over the years I have transitioned this into a place of discussion regarding the SEO industry and my research. Rather than offer commentary at this point, I will just point to a number of news articles on the subject matter as details unfold:

Nov 13, 2019

SEMRush Employee Doubles Down on Bizarre, Misleading Link Data

admin — Sun, 15 Sep 2019 21:45:40 +0000

I’d normally let this kind of thing slide, but my integrity has been impugned (Olga Andrienko has accused me of “yellow press type headlines and incorrect data” in her comment on the original post) so I feel obligated to respond. First, let me start with why I used the words “bizarre” and “misleading” and why I stand by them. To be clear, my original post said explicitly that I did not think SEMRush was falsifying data and “I do not think SEMRush is intentionally inflating their numbers”.

What makes SEMRush’s Data Bizarre?

In SEMRush’s IP reports, they will display all the IP addresses that link to you…

However, when you click on the domain number (in this case [1]), you land on a page with an advanced filter which shows no domains associated with that IP.

This is Bizarre. You can’t have an IP address without a domain in a link index. You just can’t.

What makes SEMRush’s Data Misleading?

SEMRush entered Matthew Woodward’s Best Backlink Checker contest which determined the size of indexes based not on root linking domains but IPs and Subnets. Unlike Moz, Majestic, Ahrefs, Webmeup and, to my knowledge, any other link index on the web, SEMRush stores every IP address they encounter for a domain over a 6 month period and reports them independently. Because sites change IP addresses or use multiple IP addresses as a load-balancing method, a crawler will sometimes encounter more IPs than referring domains. This can cause their index to show more IPs than referring domains.

Because of SEMRush’s different collection and counting method, they destroyed the competition – not just Moz. Importantly, if I had not discovered and corrected this difference, SEMRush would still be enjoying the false narrative that their index is the largest by a landslide, and SEMRush would never correct it. But there are two possibilities. Either…

SEMRush did not know their collection methods were different
SEMRush did know their collection methods were different

As an aside, I knew that Moz did not collect IPs or Subnets, so I wrote a conservative model to predict these numbers and fully disclosed the limitations of our data to Matthew in advance. I wrote to Matthew, and he was able to include it in his original post, “The [my] calculations should, therefore, be taken with a grain of salt.” This is the kind of proactive disclosure we should all make during research.

Given the two options above, it seems to me that if SEMRush did know they collect the data differently (and given their responses where they claim their method is better), then they had a responsibility to explain that essential difference and how it would make an apples-to-oranges comparison between their index and their competitors.

If they didn’t know their collection methods were different, then while this was a mistake, it calls into question what other assumptions they have made about their data when comparing their index to others and when marketing their index as the biggest or the best.

Responding to Olga Andrienko

First, let me acknowledge that she was not speaking in an official capacity on behalf of SEMRush. Her opinions are in italics and my responses immediately follow.

“The SEO tools industry had always been known for the respect among the members of our vast community”
I agree, and I still think it is. This is why I twice indicated that I believed SEMRush’s data was not maliciously constructed.
“yellow press type headlines and incorrect data published about us”
I believe I have defended the headline adequately above. Moreover, I published no incorrect data.
“You should have used the Referring IPs report”
I used both the Referring Domains and the Referring IPs report. The first showed the discrepancy in counts, and the second showed you had IPs that were not associated with a root linking domain (ie: “Nothing Found”). Both reports were highlighted.
“each domain can have more than one unique IP address”
Yes, and I made this abundantly clear in my article when I mentioned Round Robin DNS as an example.
“we have no discrepancy in our reports… Referring IPs report, we show all domains per unique IP.”
Yes, you did and still do. In the Referring IPs report, I can click on an IP and it will show me no associated root linking domain (ie: “Nothing Found”). This is a discrepancy.
“It would be incorrect to download the info from the Referring Domains report and filter it”
On the contrary, if you did this with any other tool, you would find the exact same number. I couldn’t rely on the referring IPs report because IPs were not and still are not adequately associated with domains in that report.
“we think the actual numbers are, in fact, a lot less misleading than prediction-based formulas and estimates”
I agree. If I provide a model, I am clear to the data partner and, as Matthew’s post indicates, I told him to take the data with a grain of salt. The important thing is knowing your index and disclosing any idiosyncrasies.
“We are very transparent about what we show on the dashboards and have a lot of tooltips with explanations”
If I had not discovered this difference in calculation, would SEMRush have proactively reached out to Matthew for a correction?
“so if you knew SEMrush as your employer’s competitor before”
I never used SEMRush’s UI for links because, frankly, until recently it wasn’t a contender. And, honestly, I don’t use most sites’ UI/UX. I just interact with their API.

Concluding Thoughts

Don’t give data to a contest without proper disclosures.
If you can’t give a proper disclosure because you are ignorant of your data’s differences, that is your problem.
Don’t accuse me of “yellow headlines”.

And a Question to Olga Andrienko:

If I had not found this difference between the way SEMRush reports IPs and everyone else in the industry, would SEMRush ever have sent in a correction to Matthew?

SEMRush IP Link Data Bizarre, Misleading

admin — Wed, 11 Sep 2019 14:49:33 +0000

Disclaimer: I am Russ Jones and I work for Moz, which is a competitor of SEMRush. These are my opinions and do not represent those of Moz. That being said, the data speaks for itself.

A response from a SEMRush employee, although not speaking officially, is below.

I must admit that I was taken aback when Matthew Woodward’s recent Best Backlink Checker analysis came back so heavily in favor of SEMRush. I knew Matthew did good work and took this project seriously, vetting each provider to the best of his and his team’s ability given the data they were provided, but it just didn’t mesh with the comparisons I run daily and weekly against Moz’s competitors. (I have spoken with Matthew about this issue and he is currently investigating independently).

Matthew’s study compared the number of unique linking IPs and Subnets reported by various link indexes across a whopping 1,000,000 domains. Unfortunately, Moz does not collect IP data, so I had to construct a conservative model to predict IP numbers. I wasn’t expecting Moz to perform well given this state of affairs. However, I certainly didn’t expect SEMRush to have grown so rapidly and dramatically in such a short time. It would have taken a technological miracle in my eyes. So I took a closer look.

Methodology

Rather than rely on the numbers reported in SEMRush’s API or tool, I collected the data myself using their own exports of root linking domains. The process was rather straightforward.

Download list of all root linking domains
Count Unique IPs in export.
Compare to number in UI.
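
The counting step can be sketched in a few lines. This is only an illustration under stated assumptions: the export is treated as a CSV with a header row, and the column name "ip" is hypothetical, since the actual layout of the export is not described here.

```python
import csv
import io


def count_unique_ips(csv_text: str, ip_column: str = "ip") -> int:
    """Count distinct, non-empty IP values in a CSV export.

    The column name is a hypothetical placeholder; adjust it to whatever
    header the real export uses.
    """
    unique_ips = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        ip = (row.get(ip_column) or "").strip()
        if ip:  # skip rows with no IP recorded
            unique_ips.add(ip)
    return len(unique_ips)
```

The result is the number to hold up against the unique IP count shown in the UI for the same site.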

Well, here is where things get really dubious for SEMRush. [Download Raw Data]

First site: thegooglecache.com
Number of Reported Unique IPs: 859
Actual Based on Root Linking Domains: 636
Discrepancy: +26%

Second site: matthewwoodward.co.uk
Number of Reported Unique IPs: 7700
Actual Based on Root Linking Domains: 5669
Discrepancy: +26%

Third site: grepwords.com
Number of Reported Unique IPs: 551
Actual Based on Root Linking Domains: 436
Discrepancy: +21%
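
For anyone who wants to check the arithmetic, here is a small sketch using the three pairs of numbers above. How the discrepancy percentages were derived is not stated, so the formula is an assumption: taking the gap as a share of the reported count lands within about a point of each listed figure, whereas dividing by the export count instead would give larger values (roughly 35%, 36% and 26%).

```python
def discrepancy_pct(reported: int, actual: int) -> float:
    """Gap between reported and exported IP counts, as a percent of reported.

    Assumption: the denominator is the reported count; the post does not
    say which denominator was used.
    """
    return 100 * (reported - actual) / reported


# (reported unique IPs, unique IPs counted from the root linking domain export)
sites = {
    "thegooglecache.com": (859, 636),
    "matthewwoodward.co.uk": (7700, 5669),
    "grepwords.com": (551, 436),
}

for site, (reported, actual) in sites.items():
    print(f"{site}: {discrepancy_pct(reported, actual):.1f}%")
```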

These were just the first 3 sites that I tested. Obviously this was disconcerting, so I decided to dig deeper just within the app. The picture is not pretty.

First, SEMRush shows IP addresses which are reported as linking to a domain in the IP address tab that are not in fact tied to a domain in their system. That is to say, SEMRush reports that a domain has a link from a certain IP, but when you click on the IP to find the domain it is assigned to, SEMRush reports “Not Found”.

Second, there are thousands of instances where SEMRush reports more unique IPs than domains.

In fact, a test of ~3600 random domains yielded approximately 70% with more IPs than domains. The complete [csv can be downloaded here]

So, what is going on here? Is SEMRush outright falsifying data?

I don’t think so. What I think is going on is an issue of how SEMRush stores IPs at a link level rather than a domain level. This means that if they visit a site that uses a round robin DNS to load balance, they could get multiple IP addresses for the same domain. While services like Majestic and Ahrefs likely store a single canonical IP address per domain, SEMRush seems to store per link, which accounts for why there would be more IPs than referring domains in some cases. I do not think SEMRush is intentionally inflating their numbers. I think they are storing the data in a different way than competitors, which results in a number that is higher and potentially misleading, but not due to ill intent.
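
To make the hypothesis concrete, here is a toy simulation of the two storage strategies. It is a sketch of my guess above, not a description of how any provider actually stores data: the crawl function, the domain names and the IP pools are all made up for illustration, and round-robin DNS is modeled simply as cycling through a fixed pool on each fetch.

```python
from itertools import cycle


def crawl(links, dns_pools):
    """Simulate fetching each linking page; every fetch resolves the
    domain to the next IP in its round-robin pool (hypothetical model)."""
    resolvers = {domain: cycle(ips) for domain, ips in dns_pools.items()}
    return [(domain, next(resolvers[domain])) for domain in links]


def unique_ips_per_link(observations):
    """Per-link storage: every IP seen on any fetch is counted."""
    return len({ip for _, ip in observations})


def unique_ips_per_domain(observations):
    """Per-domain storage: only the first IP seen for each domain is kept."""
    canonical = {}
    for domain, ip in observations:
        canonical.setdefault(domain, ip)
    return len(set(canonical.values()))


# One load-balanced site linking from five pages, one single-IP site linking from two.
dns_pools = {
    "big.example": ["198.51.100.1", "198.51.100.2", "198.51.100.3"],
    "small.example": ["203.0.113.9"],
}
links = ["big.example"] * 5 + ["small.example"] * 2
observations = crawl(links, dns_pools)
```

With this input there are two referring domains. Per-domain storage yields two unique IPs, while per-link storage yields four, which is the "more IPs than domains" pattern described above.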

What does this mean for users?

If my intuition is correct, the number of unique IP addresses is not a safe metric to use in SEMRush. If your site has a sitewide link from a website that has a round-robin DNS, you could accumulate dozens if not hundreds of unique IPs based on the crawl of that site and the number of IPs across which the site is load balanced. By contrast, another site with a similar sitewide link from a website with no round-robin DNS would report only 1 IP address. Furthermore, comparing IP and Subnet counts between providers (Moz, Ahrefs, Majestic, SEMRush, Spyglass, etc.) is not an apples-to-apples measurement unless and until SEMRush changes their collection or reporting methods.

Important Takeaways

I have repeated this many times before, but it is worth repeating again. Ask your data providers how they collect, store, and report on metrics. Teasing out the differences between providers is essential to making fair comparisons and determining what might be best for you and your team. This IP address issue is just one of many difficult questions that companies which crawl the web must answer, and the way they answer these questions can dramatically change reporting.

Also, I want to point out the importance of research like that done by Matthew Woodward. If he had not run this extensive analysis, you and I and the rest of the community may never have noticed this important distinction. And, in missing this distinction, we would be potentially misled by our assumptions on what particular metrics mean. We need you, the independent SEOs of this world, to keep our feet to the fire (Moz included).

SEMRush vs Moz Link Index Re-verified, Data Provided

admin — Mon, 24 Jun 2019 13:30:11 +0000

Mea Culpa: It looks like I screwed up the exported CSVs from my code below. Thank you to Malte Landwehr of SearchMetrics for finding the bizarre issues. Having an unbiased 3rd party (especially of high reputation like Malte) review is hugely helpful. Malte also identified a high % of .jobs domains in the random data set. Because some crawlers have difficulty with the new TLDs (I am not sure if this is true of SEMRush), I limited the Domains and URLs to .org, .net, and .com. There were no meaningful changes in the outcomes of the reports EXCEPT for Total Referring Backlinks to URL, in which Moz wins 2x rather than 6x.

Comparing Domains [XLSX]
Comparing URLs [XLSX]

I often do not post data alongside these mini research projects, such as my most recent brief comparison of SEMRush’s link index and Moz’s, for a number of reasons. The primary reason is that publishing data often comes with risk such as data rights (publishing raw data from competitors is equivalent to giving that data out for free to users). However, if I am doing a comparison piece and one of the compared providers requests the publication of the data, that is no longer a concern.

Well, it seems that Oleg Shchegolev, CEO and Founder of SEMRush, was not confident in the results of the study and asked that I publish the data. I was out of town over the weekend, but I will happily oblige.

I will make one quick retraction. I can’t say with certainty that they remain “well behind the big 3”. Rather, all I can say given this research is they remain well behind Moz.

The first step I took was to run the experiment all over again. If the results aren’t repeatable, then they aren’t valid. As expected, the results turned out to be nearly identical to the previous test.

Before I show the updated graphs and the raw data, I want to make sure everyone understands the methodology. A more complete writeup is here which explains in depth how we go about getting a random sample of unique domains and URLs from the web. This is actually quite a cumbersome task and is the backbone of any successful link index comparison. Assuming you have taken the time to read the process we use to select random URLs, the second most important part of understanding the methodology is the usage of “adversarial metrics”. What I mean by an “adversarial metric” is that the scores are derived from comparing how each index performs on a particular URL or Domain one at a time. We then repeat the exercise over and over again and tally the number of wins, losses and ties between indexes. The reason why I use this methodology is as important as the methodology itself. SEOs have no use for the descriptive statistics of the index as a whole. SEOs need the most and best data they can get at a URL and domain level. It is perfectly possible to build an insanely large link index which shows no backlinks to any relevant domains or URLs if you are crawling the wrong pages. So, when you look at Moz’s index size (36 trillion links) vs a competitor’s, that number may be utterly meaningless to users if the index doesn’t contain their domains, their URLs and their backlinks. SEOs want to know which index is going to give them the most data about their domains and URLs.

So, in constructing an adversarial metric, we randomly select domains and URLs from the web and then determine which link index provides the most data for each of the URLs and domains, one by one. We then tally wins, losses, and ties, to identify which link index is most likely to be useful to an SEO.
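
The tallying step can be sketched as follows. This is a simplified illustration rather than the code used for this study: the function name, the dictionary inputs and the choice to treat a target missing from an index as a count of zero are assumptions.

```python
from collections import Counter


def adversarial_tally(targets, index_a, index_b):
    """Compare two indexes one target at a time and tally the outcomes.

    targets: the randomly sampled domains or URLs.
    index_a, index_b: mappings from target to a reported count
    (for example, referring domains). A target absent from an index
    is treated as 0, which is an assumption of this sketch.
    """
    tally = Counter()
    for target in targets:
        a = index_a.get(target, 0)
        b = index_b.get(target, 0)
        if a > b:
            tally["a_wins"] += 1
        elif b > a:
            tally["b_wins"] += 1
        else:
            tally["ties"] += 1
    return tally
```

Each target contributes exactly one win or tie regardless of how large its counts are, so a handful of enormous sites cannot dominate the result the way they would in a sum or an average.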

The Results

Total Backlinks to Domain

Total Referring Domains to Domain

Total Backlinks to URL

Total Referring Domains to URL

The Data

~~Alright, so here is the relevant data…~~

~~Data from 1000 Randomly Selected URLs [CSV]~~
~~Data from 1000 Randomly Selected Domains [CSV]~~

Note: Something is wonky about these CSVs. So, I re-ran the test AGAIN and here you go…

Comparing Domains [XLSX]
Comparing URLs [XLSX]

Concluding Thoughts

One easy test you can run is to perform descriptive statistics on the individual raw data columns in the CSVs so you can better understand the reasoning for the methodology. You will see, for example, that in Moz’s worst performing category (Referring Domains to URLs), we only average about 20% more referring domains to URLs. That doesn’t seem so significant, but when that difference is consistent across the network, which an adversarial metric will expose, it means that you are 4x more likely to get more data from Moz than SEMRush.
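
A toy example shows how a modest average gap and a lopsided win rate can coexist. The counts below are invented purely for illustration and are not taken from the published data sets; the variable names are hypothetical.

```python
from statistics import mean

# Invented per-URL referring domain counts for two indexes (not real data).
index_a = [12, 24, 60, 120, 4]
index_b = [10, 20, 50, 100, 5]

# Descriptive view: ratio of the column averages.
mean_ratio = mean(index_a) / mean(index_b)

# Adversarial view: who has more data, URL by URL.
wins_a = sum(a > b for a, b in zip(index_a, index_b))
wins_b = sum(b > a for a, b in zip(index_a, index_b))

print(f"A averages {100 * (mean_ratio - 1):.0f}% more than B")
print(f"A wins {wins_a} URLs, B wins {wins_b}")
```

In this made-up sample the averages differ by less than 20%, yet the first index provides more data on four of the five URLs, because the small advantage shows up consistently rather than being concentrated in a few rows.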