Microsoft scientist says 'Bing It On' is no lie; Ayres experiment “wildly uncontrolled”

Yesterday, we published an editorial questioning whether Microsoft’s “Bing It On” campaign and its claims are a sham or fair play.

Yale law professor Ian Ayres conducted a study in which 1,000 people were asked to take the “Bing It On” challenge and report their results. The outcome of Ayres’ experiment was nowhere near Microsoft’s claim that people prefer Bing 2 to 1, causing a media storm of accusations and negative press.

We spoke with Bing behavioral scientist Matt Wallaert to help clear up the situation.

Why the change

The first issue is that Ayres is challenging Microsoft’s older “2 to 1” study. If you visit the campaign’s website today, you will notice that Microsoft has changed its headline to “people prefer Bing over Google for the web’s top searches.” Wallaert explained that Microsoft started with a study in which users could pick any search query they wished – this study is the basis of the “2 to 1” claim and it was reported back in September of 2013.

Microsoft then performed a new study using Google’s top queries instead of user-chosen ones. You might expect Bing not to perform as well, as these are Google’s top and most handled searches. The results were surprising: while Google did gain some edge, Bing still handled Google’s top searches better.

Wallaert states that there was a “significant time between the two studies” and that “both Bing and Google change their algorithms almost constantly, in a race to make search better and better”. Despite not reaching the 2 to 1 claim with Google’s own queries, the Bing team was happy.

“Theoretically, [Google] should have been really good at [their own queries]”, says Wallaert, “it just turns out they weren’t as good as we were.”

The Ayres study and unconscious decision making

So what about Professor Ayres’ study – why did Bing and Google perform so similarly without the large gap Microsoft had reported?

Wallaert stated that “it is hard to know for sure”, but sampling errors and test conditions may have played a role. In terms of sampling errors, Wallaert told us that Mechanical Turk (the source of the people Ayres used for his study) may simply attract a specific type of person:

“Ayres’ used Mechanical Turk to recruit subjects, a site that is known to very few people on the web. While he measured things like gender, age, and race, and showed that his sample was representative of the internet-using population, one strong possibility is that those aren’t the relevant variables along which people pick search. For example, it may be that the more likely you are to use Mechanical Turk, the more technology-inclined you are, and that being technology-inclined is correlated with a preference for Google results over Bing results.”

In addition, Wallaert stated that Microsoft does know that “most people prefer Google’s brand”:

“...imagine that you prefer Google to Bing and then you get told in a study to go to this super heavily branded Bing site. Is it possible that maybe, just maybe, you might have a little bit of a twinge of “I don’t like Bing” and be looking for the Google results and picking them, at an unconscious level? I have no idea if this actually happened, but there is abundant psych evidence to suggest that it could at least be a possibility.”

The conclusion that Wallaert wants us to draw is that Ayres’ test recruited from a website that may attract a particular type of subject, and that keeping Bing branding everywhere on the page may evoke a negative reaction and have users subconsciously looking for Google results.

I personally agree with Matt Wallaert’s summary. As a technology journalist who writes about Microsoft, I use Bing every day to conduct my searches and get the information I need. I decided to try out the Bing Challenge myself when it was first introduced and was surprised at how “non-geeks” reacted when they chose Bing – in short, they were disappointed. When I jokingly quoted the commercial and said “you’re a Bing man, man!” they were upset, and most had a distaste for the search engine before they had even tried it.

In Wallaert’s original blog post disputing Ayres’ claims, he stated that “I have no idea if he is right… we don’t track the results from the “Bing It On” challenge.” I asked him to clarify, and he stated that he was “responding to [Ayres’] claim that using the Bing suggested queries results in people being more likely to select Bing” – not to whether or not Microsoft had accurate data.

Wallaert wanted us to remember that Microsoft had hired a third party to perform the experiments and that “[Microsoft’s] claims are reviewed by our lawyers”. “Believe me”, Wallaert stated, “[our lawyers] are very strict about what they let us say… it is also important to note that Ayres’ so far has undergone zero validation.”

Experimental Design

Matt Wallaert referred to the idea of experimental design to back up his claims:

“When two results disagree, that’s the thing to do: look at which experiment was a better test of the thesis. The thesis here is about comparing web results. So let’s look at experimental design. Which is a more representative sample about a technological issue: Mechanical Turk users or people selected by a 3rd party research firm specialized in representative sampling? Advantage, Microsoft test. Which has greater statistical power: 1,000 people or 333 people? Advantage, Microsoft test. Which is better: telling people “do exactly these five selected queries” selected from a list of 25 or letting them choose a query from a set of five that they can refresh at any time from a list of around 500? Advantage, Microsoft test.”
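To put the sample-size point in concrete terms, here is a rough sketch of how the margin of error on a measured preference share shrinks as the sample grows. It takes the two sample sizes in the quote at face value and assumes simple random sampling, a true preference share of 50% (the worst case), and a 95% confidence level; the function name is our own, and none of these figures come from Microsoft's or Ayres' data.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p measured on n people.

    Assumptions (ours, for illustration): simple random sampling and the
    normal approximation; p=0.5 is the worst case for the width of the interval.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Compare the two sample sizes mentioned in the quote.
for n in (333, 1000):
    print(f"n={n}: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

Under those assumptions, a sample of 333 leaves a margin of roughly five percentage points, while 1,000 narrows it to about three. A larger sample gives a tighter estimate, but it does nothing to correct for who was recruited or how the test was presented.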

In the end, Microsoft still can’t release any of the data collected by the third-party research company, so we aren’t able to see the full picture. After speaking with Wallaert, we can agree that Microsoft’s experimental design was carried out with a higher degree of precision than Ayres’ experiment.

As I spoke with Matt Wallaert, there was no doubt about his passion for working on the Bing team, and he is personally upset that such a “wildly uncontrolled” experiment could be taken so seriously.

“People forget that companies are made up of people and let themselves get distracted by how they feel about Microsoft the brand. But I’m here; I’ve spent valuable time refuting Ayres that I could have spent building Bing for Schools. So if people want to call me a liar, then there just isn’t much I can do about that.”

Wallaert certainly makes some convincing arguments, but will it be enough to turn back the anti-Bing tide? We'll have to wait and find out.
