Why You Should Combine KPIs in Experiments


Imagine a company that sells a line of products and services. This company will likely have multiple goals for its website:

  1. to sell its products and services online
  2. to collect user information for sales prospects
  3. to drive brand awareness and loyalty
  4. to provide online support for existing customers

Let’s say the company has identified 20 KPIs (Key Performance Indicators) that measure success against these four goals, and it is committed to improving them by running many experiments. Should the company launch a treatment if some KPIs perform better than the control (i.e., the original site) but others perform worse?

You could take a scorecard approach – launch the treatment if it does more good than harm. But what if you have four organizational teams, each with a stake in the success of a different goal, and the treatment is favorable for one team but unfavorable for another? Deciding whether to launch the treatment could be contentious.

Ron Kohavi, the founder of the Experimentation Platform at Microsoft, and his colleagues have published articles in Data Mining and Knowledge Discovery arguing that companies should combine multiple KPIs into a single metric, called the Overall Evaluation Criterion, when analyzing experimental results. In this example, the Overall Evaluation Criterion would be a combination of the 20 KPIs.
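
As a rough illustration of what "a combination of KPIs" might look like, here is a minimal Python sketch that takes a weighted average of KPI scores. It is not the calculation from the post linked at the end of this article. The KPI names, the weights, the scores, and the assumption that each KPI has already been rescaled to 0-100 are all hypothetical and chosen only for the example.

```python
def overall_evaluation_criterion(scores, weights):
    """Weighted average of KPI scores (an illustrative stand-in for an OEC).

    Assumes every score is already on a 0-100 scale, so the result
    also falls between 0 and 100. Weights need not sum to 1.
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same KPIs")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[k] * weights[k] for k in scores) / total_weight


# Hypothetical weights: one KPI per business goal, most important first.
weights = {"online_sales": 4, "lead_capture": 3, "brand_loyalty": 2, "support": 1}

# Hypothetical 0-100 scores for each arm of the experiment.
control = {"online_sales": 50.0, "lead_capture": 60.0, "brand_loyalty": 70.0, "support": 80.0}
treatment = {"online_sales": 60.0, "lead_capture": 55.0, "brand_loyalty": 70.0, "support": 75.0}

print(overall_evaluation_criterion(control, weights))    # 60.0
print(overall_evaluation_criterion(treatment, weights))  # 62.0
```

In this made-up case the treatment helps one KPI and hurts two others, yet the single score still gives a clear comparison because the weights were agreed on in advance.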

There are many advantages to analyzing just one Overall Evaluation Criterion as opposed to multiple KPIs.

  • It aligns the business behind a clear, consistent objective.
  • It forces tradeoffs to be made once, no matter how many experiments you run.
  • It simplifies decision making.
  • It takes the relative importance of each KPI into account (not all KPIs may be equally important).
  • It’s easy to interpret; its values range from 0 to 100, with higher scores meaning better performance.
  • It reduces the chances of incorrectly concluding that the treatment made a difference. If you compared the treatment vs. the control by running a significance test on each of the 20 KPIs at the conventional 95% confidence level, you should expect, on average, one KPI (20 × 0.05 = 1) to appear statistically significant (p < .05) purely by chance, even if the treatment had no impact.
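
The arithmetic behind the last point can be checked with a short Python sketch. The function names are made up for this example, and the second calculation assumes the 20 tests are independent, which real KPIs often are not, so treat the result as a rough guide rather than an exact figure.

```python
def expected_false_positives(num_tests, alpha=0.05):
    """Average number of tests that look significant when nothing changed."""
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one test looks significant by luck alone.

    Assumes the tests are independent of one another.
    """
    return 1 - (1 - alpha) ** num_tests


print(expected_false_positives(20))                    # 1.0
print(round(prob_at_least_one_false_positive(20), 2))  # 0.64
print(round(prob_at_least_one_false_positive(1), 2))   # 0.05
```

Under these assumptions, testing 20 KPIs separately gives roughly a 64% chance of at least one false alarm, compared with 5% when you test a single combined metric.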

Wondering how to calculate the Overall Evaluation Criterion? Read my blog post "5 Steps to Compare Multiple KPIs."