A/B Testing is Harder Than You Think
Published: May 3, 2013
Author: Calvin Vu
A dirty secret of the online marketing world is that A/B testing can be a sloppy process. A cautious SEM who tests two ads in each ad group, rotates the ads evenly, and tests for statistical significance still likely has holes in his or her test methodology. Let’s first establish the elements of a good test before explaining potential pitfalls.
What is the perfect ad test?
A perfect ad test should demonstrate:
1) A plain-language hypothesis
2) A “clean” dataset
3) Statistically significant results
4) Consistent performance across test elements
#1 and #3 are the easy parts of an ad test. Design an experiment that changes one variable, then split traffic equally. The larger your data set, the faster you can accept or reject your hypothesis. But every test is only as good as the conditions under which it is run, and stretching yourself too thin will damage the integrity of your results.
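For readers who want to see what a significance check involves, here is a minimal sketch of one common approach: a two-sided, two-proportion z-test on click-through rate. The function name and the click and impression counts are made up for illustration, and this is not necessarily the calculation your own significance calculator uses.

```python
import math


def two_proportion_z_test(clicks_a, impressions_a, clicks_b, impressions_b):
    """Return (z, p_value) for the difference in CTR between two ads.

    Uses a pooled two-proportion z-test with a two-sided p-value.
    """
    ctr_a = clicks_a / impressions_a
    ctr_b = clicks_b / impressions_b

    # Pooled CTR under the null hypothesis that both ads perform the same.
    pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b)
    )
    if std_err == 0:
        raise ValueError("No variation in the data; the test is undefined.")

    z = (ctr_a - ctr_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z_test(120, 10_000, 90, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value (0.05 is a common threshold) suggests the CTR gap is unlikely to be chance alone. It says nothing about whether the data behind it is clean or consistent, which is where the harder parts of a test come in.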
Lesson 1: Loose or inconsistent ad group and campaign theming can throw a wrench into seemingly good test results.
Consider the example above. The results seem clear at first glance: a significance calculator shows virtual certainty that Ad 1 > Ad 2. However, a deeper look reveals that Ad 1 < Ad 2 in two of the five highest-traffic ad groups, including the single highest-traffic ad group. Ad group-level placement data later revealed that each ad group contained dramatically different placements. Thus, scaling your test across too many disparate elements can create inconsistencies in your data.
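A consistency check like this is easy to automate. The sketch below compares the aggregate winner against the winner inside each ad group and lists the ad groups that disagree. The function name, the ad group names, and all of the numbers are hypothetical; they are not the data from our test.

```python
def find_inconsistent_ad_groups(results):
    """Compare the aggregate CTR winner with each ad group's winner.

    `results` maps ad group name -> (clicks_1, impressions_1,
    clicks_2, impressions_2). Returns (overall_winner, disagreeing_groups).
    """

    def winner(clicks_1, impressions_1, clicks_2, impressions_2):
        ctr_1 = clicks_1 / impressions_1
        ctr_2 = clicks_2 / impressions_2
        if ctr_1 > ctr_2:
            return "ad_1"
        if ctr_2 > ctr_1:
            return "ad_2"
        return "tie"

    # Sum clicks and impressions across all ad groups for the aggregate view.
    totals = [0, 0, 0, 0]
    for stats in results.values():
        for i, value in enumerate(stats):
            totals[i] += value

    overall = winner(*totals)
    # Ad groups whose local winner contradicts the aggregate winner.
    disagreeing = [
        name
        for name, stats in results.items()
        if winner(*stats) not in (overall, "tie")
    ]
    return overall, disagreeing


# Hypothetical ad groups, for illustration only.
example = {
    "group_a": (30, 2000, 50, 2000),
    "group_b": (80, 1500, 30, 1500),
    "group_c": (60, 1000, 20, 1000),
}
overall, disagreeing = find_inconsistent_ad_groups(example)
print(overall, disagreeing)
```

If the list of disagreeing ad groups includes your highest-traffic ones, treat the aggregate result with suspicion even when it is statistically significant.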
Failing to achieve data consistency (#4) forced us to decide whether to optimize ad group by ad group or to treat our aggregate data as sufficient to pause the losing ad across the campaign.
Failing to achieve a “clean” dataset (#2) forced us to question the integrity of our results. Google’s ad rotation settings make clean A/B testing very hard. Rarely does “even rotation” actually mean a 50/50 split in impression traffic. An alternative to even rotation is to “rotate indefinitely,” but this setting sometimes stifles traffic.
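One quick way to see how far your rotation drifted from 50/50 is to compute each ad's share of impressions before trusting the results. This is a rough sketch: the function name is made up, and the five-point tolerance is an arbitrary assumption, not a Google guideline.

```python
def impression_split(impressions_a, impressions_b, tolerance=0.05):
    """Return (share_a, is_even) for two ads in the same test.

    `share_a` is ad A's fraction of total impressions. `is_even` is True
    when that share is within `tolerance` of a 50/50 split. The default
    tolerance is an assumed, arbitrary threshold.
    """
    total = impressions_a + impressions_b
    if total == 0:
        raise ValueError("No impressions recorded for either ad.")

    share_a = impressions_a / total
    is_even = abs(share_a - 0.5) <= tolerance
    return share_a, is_even


# Hypothetical impression counts, for illustration only.
share_a, is_even = impression_split(6500, 3500)
print(f"Ad A share: {share_a:.0%}, even split: {is_even}")
```

A lopsided split does not automatically invalidate a test, but it is a signal to check whether the two ads were really shown under comparable conditions.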
Lesson 2: A clean, logically-structured account is the key to executing a clean ad test.
In a perfect world, ad testing belongs at the query and placement level and can be scaled up as long as the elements remain similar. This means testing your ads on tightly themed ad groups and testing on exact match keywords when you can.
Go ahead and test your “online dating” ads across dating ad groups and campaigns, but think twice about adding them to your online chat ad groups. If done correctly, your data aggregates will look as clean as your granular ad group-level data. A test that fulfills the criteria of significance, consistency, and cleanliness appears below:
A logical account structure will also allow you to run a different ad test across each of your campaigns on a regular schedule. After all, a clean account makes future ad tests easier to implement, and ending a test on time means that you can start a new one right away.
– Calvin Vu, Account Manager – Search