
What Tools You Need When You Start a Testing Program

One of the great truths that people ignore when it comes to optimization is that you can fail with any tool. It’s only when you are trying to succeed that differences in tools really matter.

Once you’ve established the right mindset for a successful program, you are still going to need a number of tools to enable you to test and to get value from your actions. I’ve helped set up hundreds of programs, and there is a huge difference in how tools get implemented and in whether people treat their tools as the entire purpose of the integration or as a means to an end.

As with all things, it is not a matter of falling for Maslow’s hammer; it is about evaluating your tools by how they enable you and how they allow you to do things you have not done before.


I will present a few of the key tools that you will need in order to get a program underway. Instead of talking about specific tools, it is better to think in terms of what you need and why, as well as what you don’t need and what to avoid. You can have all the best tools and get no value, just as you can have the cheapest tools and get great value. The key is in how you use the tools you have. This mindset will allow you to evaluate any tool, make it fit within your organization, and customize it to your needs instead of trying to find a perfect turnkey solution that likely does not exist.

The Biggest Tool – Your Testing Solution

The first thing most people think about is their testing solution. It’s hard to have a program if you can’t run a test. That being said, it is actually one of the least important parts of the program, as there is surprisingly little differentiating most tools.

Vendors try to make it sound like their tool is the greatest thing since sliced bread. The important thing to understand is that despite all the promises and bells and whistles, most of what is talked about has no bearing on achieving success. There is no perfect tool – they all have some good and some bad. The key is to figure out what best enables you to do the right things while also helping you avoid the wrong actions. Here are some of the key factors to look for and key items to ignore when you choose your tools.

Speed

When thinking about speed, there are two different factors that matter. The first is, after your initial deployment and set-up, how fast can you get a test from concept to live? The goal for any program is to get most tests, with at least 5 experiences, from concept to live in 30 minutes or less. That is a hard target to meet, but it does express just how important it is to prioritize speed in testing, and it implies the operator needs a general knowledge of CSS, HTML, and JavaScript.

This means that items such as templatized rules, easy site interaction, an easy interface, and easy navigation are vital. It also means that tools that require complex URL targeting rules, constant IT deployment, or manual jQuery set-up should be lowered on your priority scale.

An example of a tool that for the most part gets this right is Optimizely, while an example of one that doesn’t is SiteSpect.

The other dimension of speed is how much load time the tool adds to your site, in both page weight and execution. The general rule of thumb is that the human eye notices changes in the 200-300ms range, and noticing a change alters people’s actions. This means you need a tool that does not overly weigh down the page and does not cause changes after the page becomes interactive. Tools that rely on heavy jQuery can seem amazingly easy to use, but they can also cause page flicker or interaction effects. Learn how to reduce the flicker effect here.

Tools that allow for multiple ways to interact with a page and/or allow you to control items as the page loads have major advantages here.

Note: While lots of tools provide easy-to-use visual interfaces to create treatments, the automagically created jQuery code is often terrible – which results in cross-browser compatibility issues and slowness. Always have a developer check and improve the code. Or learn jQuery – it will really help. 
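One common mitigation is an anti-flicker snippet: hide the page until the tool has applied its changes, with a timeout fallback so a slow or failed load never leaves the page blank. Here is a minimal sketch in TypeScript; the `testing-tool-ready` event is a hypothetical stand-in for whatever ready signal your vendor actually exposes:

```ts
// Minimal anti-flicker sketch. Hide the page until the testing tool has
// applied its treatment, but never for more than MAX_WAIT_MS.
const HIDE_CLASS = 'async-hide';
const MAX_WAIT_MS = 1000; // fail open: a slow tool should not blank the page

// Inject the hiding rule as early as possible (ideally inline in <head>).
const style = document.createElement('style');
style.textContent = `.${HIDE_CLASS} { opacity: 0 !important; }`;
document.head.appendChild(style);
document.documentElement.classList.add(HIDE_CLASS);

function reveal(): void {
  document.documentElement.classList.remove(HIDE_CLASS);
}

// Reveal when the tool signals readiness (hypothetical event name) or when
// the timeout fires, whichever comes first.
const timer = window.setTimeout(reveal, MAX_WAIT_MS);
document.addEventListener('testing-tool-ready', () => {
  window.clearTimeout(timer);
  reveal();
});
```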

Consistency

Another key factor is to avoid any tool that looks at people on a session basis, or that allows people to be dropped from an experience while it is active. While most people are used to thinking in terms of sessions, the truth is that it doesn’t matter if something happens today, tomorrow, or 3 sessions from now.

What matters is being able to influence a behavior for the better. It is vital that you use a true visitor-based metric system, that visitors stay in a consistent experience for the duration of the test, and that you look at behavior over time. Tools that do not do this explicitly, and even tools that allow for this view and behavior but do not do so out of the box, can cause havoc.

Adobe Target is a tool that does a great job of leveraging a visitor-based metric system, while Google Content Experiments is visit-based.
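To make the distinction concrete, here is a minimal sketch (with a hypothetical event-log shape) showing how visitor-based and session-based conversion rates diverge. A visitor who converts in their third session still counts as one converting visitor:

```ts
// Hypothetical event-log shape: one row per session, keyed by visitor.
interface LogEvent {
  visitorId: string;
  sessionId: string;
  converted: boolean;
}

function conversionRates(events: LogEvent[]) {
  const sessions = new Map<string, boolean>();
  const visitors = new Map<string, boolean>();

  for (const e of events) {
    // A session or visitor counts as converted if any of its events did.
    sessions.set(e.sessionId, (sessions.get(e.sessionId) ?? false) || e.converted);
    visitors.set(e.visitorId, (visitors.get(e.visitorId) ?? false) || e.converted);
  }

  const rate = (m: Map<string, boolean>) =>
    [...m.values()].filter(Boolean).length / m.size;

  // Session-based rates undercount people who convert on a later visit.
  return { perSession: rate(sessions), perVisitor: rate(visitors) };
}
```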

Segmentation

While it is possible to use other data sources as your system of record for analysis, this approach is not without its drawbacks. It is vital that you focus not on targeting capabilities, but instead on the ability to look at many different views of the same users. Tools that do not have easy-to-use and robust segmentation abilities, or that focus solely on targeting instead of segmentation, can limit the value of your program. Even worse, these tools can reinforce bad behaviors and let groups think they have accomplished something when they have not.

You should be able to create many different custom segments based on a series of factors. Tools that limit the number of active segments for analysis, or that make it difficult or cost-prohibitive to use certain types of segments, can be problematic.

Even more problematic are ones that encourage the use of segments so small that they can never produce a mathematically meaningful read. You always have to do your own thinking (e.g., check the sample size of your segments).
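As a sanity check, a rough sample-size estimate tells you whether a segment is even large enough to detect the lift you care about. A minimal sketch using the standard normal-approximation formula for comparing two proportions (95% two-tailed confidence and 80% power are assumed and hard-coded):

```ts
// Rough per-variant sample size needed to detect a relative lift on a
// baseline conversion rate.
function requiredSampleSize(baseRate: number, relativeLift: number): number {
  const zAlpha = 1.96; // two-tailed, 95% confidence
  const zBeta = 0.84;  // 80% power
  const p1 = baseRate;
  const p2 = baseRate * (1 + relativeLift);
  const pBar = (p1 + p2) / 2;
  const numerator =
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  return Math.ceil((numerator ** 2) / ((p2 - p1) ** 2));
}

// A segment with a 3% base conversion rate needs roughly 14,000 visitors
// per variant just to detect a 20% relative lift.
console.log(requiredSampleSize(0.03, 0.2));
```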

No tool does a better job on segmentation than Adobe Target, while segmentation is a problem for Optimizely (which is why you might want to use Google Analytics to do post-test segmentation and analysis).

Data

Data is a big part of the picture. Most tools enable you to track many different things; the key is both what they encourage and how easy it is to track what you need to track. Avoid tools that focus on tracking pointless behaviors like engagement, clicks, or time on site. Instead, you want tools that allow for consistent and meaningful tracking of your single success metric (e.g. transactions, revenue per visitor).

Tools that focus on pointless things like heat maps or click tracking, or that highlight the ability to track a large number of metrics at the same time, are telling you that they are deeply invested in making you feel successful even when you are not getting value from your tests.

VWO can lead you astray with heatmap data, while Google Content Experiments, for all its other limitations, does a good job of focusing on one key metric.

A second dimension of data is how the tool purports to tell you when you can act on a test. To put this simply, ignore all of it: no tool out there does a good job, and all of them rely on rough statistical measures to express confidence. There are minor improvements going from single-tailed to two-tailed to Bayesian or Monte Carlo types of evaluation, but even in the best case they are functionally not enough to ever tell you when a test is done. The fact that vendors make this as much of a focus as they do highlights how large the gap is between what they present as a successful test and what is actually a viable outcome.
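For context, most tools’ “confidence” numbers boil down to some variant of a two-proportion z-test. A minimal sketch follows, useful mainly for seeing what such a measure does not account for: natural variance, peeking, or visitors who have not finished their journey.

```ts
// Two-proportion z-test of the kind most "confidence" readouts are built on.
function zTestConfidence(
  controlVisitors: number, controlConversions: number,
  variantVisitors: number, variantConversions: number,
): number {
  const p1 = controlConversions / controlVisitors;
  const p2 = variantConversions / variantVisitors;
  const pPool = (controlConversions + variantConversions) /
                (controlVisitors + variantVisitors);
  const se = Math.sqrt(pPool * (1 - pPool) *
                       (1 / controlVisitors + 1 / variantVisitors));
  const z = Math.abs(p2 - p1) / se;
  return 2 * normalCdf(z) - 1; // i.e. 1 minus the two-tailed p-value
}

// Phi(x) via the Abramowitz & Stegun 7.1.26 erf approximation.
function normalCdf(x: number): number {
  const z = Math.abs(x) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * z);
  const erf = 1 - (((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
      - 0.284496736) * t + 0.254829592) * t) * Math.exp(-z * z);
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}
```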

Another dimension that is often overlooked in choosing a tool is the graphical representation of the data. Any tool that makes it hard, or does not present the ability, to look at a cumulative data graph in a meaningful way makes it nearly impossible to make good decisions.

While I would strongly suggest that you do all tracking in an outside medium such as Excel, you can often tell far more from the graph than from a basic summary screen. Too many people judge their tools by the summary screen or basic reporting and spend too little time thinking about the readability of the graph or the ability to look at as many comparative data segments as possible, which means they miss a huge amount of the value a tool should be providing.
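If your tool won’t plot it, you can build the cumulative series yourself. A small sketch, assuming you can export daily visitor and conversion counts per experience; the useful signal is whether the lines settle down or keep crossing:

```ts
// One experience's daily totals, exported from your testing tool.
interface Day { visitors: number; conversions: number; }

// Running conversion rate after each day: this is the line worth graphing
// for every experience. Lines that keep crossing have not normalized yet.
function cumulativeRates(days: Day[]): number[] {
  let visitors = 0;
  let conversions = 0;
  return days.map((d) => {
    visitors += d.visitors;
    conversions += d.conversions;
    return conversions / visitors;
  });
}
```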

Integrations

Having the ability to easily integrate with other tools or your other reporting systems can be a good thing, but it can also lead to really bad behavior. Remember that the analysis of causal data is very different from normal correlative analytics or even qualitative data. How you think about it, how you report it, and what the data means are completely different.

While it is possible to enrich causal data with those sources, it is even easier to get sidetracked or fall back on more comfortable disciplines when you have access to those tools. As such, you should think before just moving forward with any integration. Making it easier to have poor discipline is not a value add but a value detractor, so always keep discipline in mind when looking at how you are going to integrate your testing tool with your other existing tools. While a blind squirrel might find a nut every now and then, the odds are not in his favor in the long run.

That being said, the ability to easily integrate with another tool or two should be as simple as a plugin or a simple switch. If it involves a lot of back-end support or an API call, it is not the end of the world, but it can lead to maintenance costs and speed problems in the future. Having the ability to tie into your CMS can be of high value, but not if it drags test creation speed to an absolute halt or if you are unable to test layouts and real estate because you are beholden to a template.

Having the ability to pass useful information about users in a runtime environment can help you get a lot more value from segments, as long as the focus is on the discovery of value and not just blind targeting. Anything that is not done on initial page load is going to dramatically shrink the population it can be leveraged on, so keep that in mind when looking at the value of different data imports.
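As an illustration, attribute passing usually amounts to populating some vendor-provided object or queue before the tool’s snippet evaluates its rules. Everything in this sketch (`testingTool`, `setAttributes`) is a hypothetical stand-in, not any specific vendor’s API:

```ts
// Hypothetical vendor surface: an object the tool's snippet reads when it
// evaluates targeting and segmentation rules on initial page load.
interface TestingToolApi {
  setAttributes(attrs: Record<string, string>): void;
}

const tool = (window as { testingTool?: TestingToolApi }).testingTool;

// Must run before the snippet executes, or these visitors won't be
// segmentable on the attributes for this page view.
tool?.setAttributes({
  loyaltyTier: 'gold',        // from your own first-party data
  landingType: 'paid-search', // derived server-side before render
});
```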

Optimizely is extremely easy to integrate with a number of tools, while Adobe Target has the capability but is very resource-intensive to set up.

Implementation

There is always going to be an upfront implementation cost to set up any tool, no matter how easy some one-tag solutions purport to be. Going into the set-up of your tool with this in mind, and with the goal of setting things up in a way that is universal from day one, will save you a lot of headaches down the road. Too often things get deployed on an as-needed basis, and this leads to a lot of slowdown and misconceptions about the speed and value of a tool.

A lot of tools have exploited this gap by focusing on how easy they are to set up and get going. This is true in some cases, but ultimately the first few days or weeks of tool use should not be a deciding factor in which tool you leverage. Always plan the right time and resources to set up a tool so that you can meet that 30-minute rule in the future and so that the needed information is there from day one. Don’t go with a tool for the sole reason that you can add a single tag at the top of the page. Also do not choose a tool just because your group does not have to do the work; asking your network operations team to do the heavy lifting, as SiteSpect does, is not a good reason to justify a tool.

Ultimately, once you are done with implementation, your tool should become ubiquitous. You should rarely have to change your set-up, and you should be able to treat the tool as an afterthought. It is just the thing that you leverage; it should not be the be-all and end-all of your program.

Knowledge of how to set up a test or run a tool is required, but it is so low on the list of priorities that determine success that, once you get past the initial pain, it has to be a background hum instead of a constant beat. If it is not, then you need to really look at whether you have the right tool or whether you set things up correctly. A small bit of pain now can save you from massive headaches later.

Natural Variance

No matter what you do, if you compare the same experience to copies of itself, the results won’t match. This is known as variance, and it is very important to know what the range is for your tool and site, as it is not accounted for in population error rates, confidence, and other measures. It will be a big factor in how much you can act on small lifts and what you can test, so it is not something to take lightly. This is the number one most overlooked and misunderstood fact for most tools and programs.

Every tool has a degree of natural variance just because of the nature of data collection. That being said, the data system and the way it is set up on your site can have a major impact on just how much variance exists. If you can run a beta of your tool before you make a full purchase, that is great, but even if you can’t, you need to know your variance before you can do just about anything. Tools that focus on a visit-based metric system, and sites with lower populations or limited product catalogs, are going to have higher variance.

Many people mistakenly think running just an A/A test is enough to get a read on variance, but the truth is that variance has a range and normalization pattern, just like every other point of data. This means you need to do a full study, and you need to know how to look at that data. I normally suggest you run 5-6 identical experiences under the same conditions as most tests on your site. 6 experiences gives you 30 pairwise comparisons, which means you will have a much deeper view into averages and normalization, as well as the beta at given times in the test cycle. You can do this as little as once a year, though I suggest more often than that.
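A minimal sketch of the arithmetic, with made-up conversion rates for six identical experiences; the spread of the pairwise differences, not the average, is your variance band:

```ts
// With N identical experiences, compute the relative difference for every
// ordered pair (6 experiences yields 30 comparisons).
function pairwiseLifts(conversionRates: number[]): number[] {
  const lifts: number[] = [];
  for (let i = 0; i < conversionRates.length; i++) {
    for (let j = 0; j < conversionRates.length; j++) {
      if (i === j) continue;
      lifts.push((conversionRates[i] - conversionRates[j]) / conversionRates[j]);
    }
  }
  return lifts;
}

// Example: six copies of the same page (illustrative numbers). The largest
// absolute "lift" between identical experiences — here about 5.6% — is the
// band inside which results must be treated as noise.
const lifts = pairwiseLifts([0.0312, 0.0305, 0.0318, 0.0309, 0.0301, 0.0315]);
console.log(Math.max(...lifts.map(Math.abs)));
```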

Natural variance can shape your entire use of other parts of the tool, such as confidence, and it can have a major influence on what you test and what you can call a winner. Most sites, on sitewide metrics over time, end up around a 2-3% variance range.

I have worked with top-50 websites that have had average variances as low as 1%, and with smaller lead-based sites that had variance as high as 6-7% after two weeks. What is important to note is that whatever that range is, you have to treat all results within plus or minus of it as neutral or impossible to get a read on, even if you get 100% confidence (which happens surprisingly often).

If you are not able to call winners at a 3% lift, then you are going to have a lot fewer winners and will have to focus on things with higher betas, like real estate, over smaller content or cosmetic changes.
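Put together, the decision rule is simple enough to write down. A sketch, where the 95% threshold and the example band are illustrative, not recommendations:

```ts
// Treat any result inside the measured variance band as neutral,
// regardless of what confidence the tool reports.
function callResult(lift: number, varianceBand: number, confidence: number): string {
  if (Math.abs(lift) <= varianceBand) return 'neutral: inside variance band';
  if (confidence < 0.95) return 'inconclusive: keep running';
  return lift > 0 ? 'winner' : 'loser';
}

// A 2.4% "winner" at 100% confidence on a site with 3% natural variance:
console.log(callResult(0.024, 0.03, 1.0)); // neutral: inside variance band
```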

Other Tools

Test Tracking Sheet

I do all analysis in a simple Excel spreadsheet that I share via Google Drive. I have leveraged other mediums, and I have used other sharing tools, but ultimately the goal is that you have clean tracking and have it accessible to other parties. The tool needs to show performance, estimated impact, and, most importantly, the cumulative graphical outcome. Another key value of using an external spreadsheet is that you can use it to combine distinct actions into cumulative revenue if that is required for your organization.

You should keep a record (screenshot) of what each variant looks like as part of this document. I also survey my common test group on what they think the outcomes will be, and keep that as a tab in this document so that we have a record of the total votes and can see how often we are right, which is almost never.

Program Tracking Resource

You should have an easy repository of all tests that have run, as well as your future roadmap, in a commonly accessible location. I currently use a wiki for this, but have also used a shared spreadsheet and/or a project management tool to accomplish it.

The key here is that people should never have to ask what was run, where, or what the impact was. They should be able to see all past experiences and get a good idea of when something ran, as well as when things might run in the future (all roadmaps have to be extremely flexible, as you build them around available resources instead of getting resources to meet your roadmap).

A dedicated tool for test documentation is in private beta right now, but should be launched soon.

Rules of Action

It is important to have an agreed-upon and accessible document that expresses how you are going to act on a test. Having people agree on this outside the context of a specific test, and making it an easy-to-access resource, greatly reduces headaches down the line. This doesn’t have to be complicated, but it should cover: what is necessary for a test to be called, what will happen when it is called, how and when you will kill lower-performing experiences, the results of your variance studies, and what is required for a test to be run in the first place.

One of the great benefits of creating this document is that it makes people aware of these things and helps them know what you will be doing, even if they are not directly involved. Even if this is just another page on your wiki, having it as a separate item is of great value and is often overlooked.

Here is what it can look like; I have seen it done at least 10 different ways.
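One bare-bones sketch, covering the points above (every number here is purely illustrative, not a recommendation):

- A test is called only after two full business cycles and once every experience has cleared our minimum sample size.
- Results inside our variance band (±2.5% per our last study) are treated as neutral, regardless of reported confidence.
- A lower-performing experience is killed early only if it sits below the variance band with high confidence after a full cycle.
- When a test is called, the winner goes to 100% of traffic and is re-measured against the recorded baseline.
- No test launches without a documented hypothesis, a single success metric, and screenshots of every experience.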

Living Knowledge Base

By far the most valuable thing you will get from a successful program is the lessons you have learned from your testing: what does matter, what doesn’t, where you were right, and, most importantly, where you were wrong. While it is very easy to take too much away from a single test, things you see consistently are a gold mine, as they will allow you to avoid fruitless conversations and focus resources on things that do matter going forward. Because of this, it is vital that you have a constantly shifting, living resource that everyone has access to, not just your normal testing group.

You can either use the same medium as your other tracking or create another, more accessible medium. Wikis again work great for this, especially if they are part of your larger organization’s intranet. Making this information available to everyone, and constantly reminding or pointing people to it, will also greatly improve the visibility and reach of your program. Nothing gets people to buy into testing like consistent and meaningful results.

Conclusion

There are a lot of things that go into a successful program. Having the right mindset is the most important; second are tools that reinforce good behaviors and make your life easier. While individual tools are ultimately fungible, the needed outcomes are not. Too many programs think that just having a testing tool is the same as having an optimization program, and few put all the other pieces in place to allow everything to run at full speed.

Just like with a test, success and failure are determined before you launch, not after. Doing the necessary groundwork to have the tools you need, and leveraging them in the right way, can greatly improve a program. Skipping that work or taking the easy way out can hinder a program much later down the line. It requires discipline to suffer the small amount of pain upfront and get all your ducks in a row before you get too caught up in specific tests.

“Men have become the tools of their tools.” – Henry David Thoreau


