Top Metrics Growth Marketers Need to Know

As a follow-up to our article about what Growth Marketing is, we next take a look at the metrics that every growth marketer must measure. (Depending on your exact industry, there will be other metrics as well, which we will cover in later articles.) As a refresher, here is how we defined Growth Marketing:

“Growth marketing drives increased user engagement, by extending the boundaries of the product into marketing channels.”

Growth Marketers Are ALL About the Metrics

We’ll focus on the first half of the definition that states “Growth marketing drives increased user engagement”. This statement is all about measurable results. In order to measure the results, we must first understand what KPIs and metrics we will use to gauge success. To say that growth marketers are numbers oriented is an understatement. Growth marketers are obsessed with metrics — they must look at deltas across time and cohorts to show growth in customer acquisition, customer retention, and customer win-backs.

How are we trending? What is our engagement looking like? How many users are we churning? How long does it take us to be profitable for each new user/customer? Am I really growing the business?

The metrics in this article will build the foundation for any Growth Marketer to be able to answer these questions. To give them some order, we will categorize the metrics by a simple series of lifecycle stages:

Activation – turning new prospects into active users/customers
Retention – driving incremental engagement and revenue from existing users/customers
Win-back – bringing churned users/customers back

Activation Metrics for the Growth Marketer

Activation is a stage reached when a user completes an action that’s indicative of getting value out of a product. What constitutes activation might be different for different services; e.g. a social app like Twitter might consider a user activated when they follow a certain number of other users within a given time-period; an e-commerce company might consider a user to be activated when they make their first purchase, or on a rolling basis, consider someone to be active if they have made a purchase in the last 6 months.

1-day & 7-day activation rates:
This metric gives a quick leading indication of how activation rates from a channel are trending. Marketers know that activation can often take months after acquiring a user, but they want a quick indicator of activation for new or recent sources of traffic. 1- and 7-day activation rates, coupled with simple data science models, can help forecast long-term activation for a given cohort of users, and can be used to quickly estimate time-to-payback.
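As a minimal sketch of how this could be computed (the cohort, dates, and counts below are all hypothetical), 1-day and 7-day activation rates can be derived from each user's signup and activation timestamps:

```python
from datetime import date

# Hypothetical cohort: (signup_date, activation_date or None) for each user.
cohort = [
    (date(2024, 1, 1), date(2024, 1, 1)),   # activated the same day
    (date(2024, 1, 1), date(2024, 1, 5)),   # activated on day 4
    (date(2024, 1, 1), date(2024, 1, 20)),  # activated after the first week
    (date(2024, 1, 1), None),               # never activated
]

def activation_rate(cohort, window_days):
    """Share of users who activated within `window_days` of signing up."""
    activated = sum(
        1 for signup, activated_on in cohort
        if activated_on is not None and (activated_on - signup).days <= window_days
    )
    return activated / len(cohort)

one_day = activation_rate(cohort, 1)    # 0.25
seven_day = activation_rate(cohort, 7)  # 0.5
```

Tracking these two numbers per acquisition channel, per cohort, gives the early trend line the forecasting models can extrapolate from.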

Time to payback by channel:
The amount of time it takes to recoup the cost of customer acquisition (CAC), through profits from customers. This is a measure not only of the efficacy of activation, but also of retention & monetization efforts.
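A sketch of the payback calculation (CAC and the monthly profit stream here are invented numbers): accumulate per-user profit month by month until it covers the acquisition cost.

```python
def months_to_payback(cac, monthly_profit_per_user):
    """Months until cumulative per-user profit covers the acquisition cost (CAC)."""
    cumulative = 0.0
    for month, profit in enumerate(monthly_profit_per_user, start=1):
        cumulative += profit
        if cumulative >= cac:
            return month
    return None  # not paid back within the observed horizon

# Hypothetical channel: $50 CAC, growing monthly contribution per user.
payback = months_to_payback(50, [10, 15, 20, 25])  # 4 months
```

Comparing this number across channels shows which acquisition sources are actually efficient once retention and monetization are factored in.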

Abandonment rate:
The percentage of customers who fail to complete a “conversion” event inside a single session.

Abandoner retargeting conversion rate:
The percentage of abandoners who are successfully converted based on retargeting efforts across multiple channels. Typically measured within a well-defined window of time, like 7 or 30 days.
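These two funnel rates fit together as follows (all counts below are invented for illustration); note that the retargeting conversion rate is measured against abandoners, not against all sessions:

```python
# Hypothetical monthly numbers for an e-commerce checkout funnel.
sessions = 1000
in_session_conversions = 120

abandoners = sessions - in_session_conversions  # 880 abandoned sessions
abandonment_rate = abandoners / sessions        # 0.88

# Of the abandoners, how many converted via retargeting within a 30-day window?
retargeted_conversions = 44
retargeting_conversion_rate = retargeted_conversions / abandoners  # 0.05
```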

Retention Metrics for the Growth Marketer

User retention is about continuing to engage activated customers so that they stay active. Customer engagement is the most important area that Growth Marketers must focus on when users/customers are in this stage. Always provide value. How do you know if you are providing value or that your users see value in what you provide? A savvy marketer will start with these two metrics:

Churn rate:
The annual percentage rate at which customers stop being active.

Stickiness:
Typically measured as the ratio of DAU/MAU (daily active users to monthly active users), this measure is most used in categories like gaming that truly depend on daily, frequent engagement. Stickiness is a good indicator of whether customers are returning frequently.
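Both metrics reduce to simple ratios once you can identify active users per day (the user sets and counts below are made up; a real month would have ~30 daily entries):

```python
# Hypothetical sets of daily active users.
daily_actives = [
    {"ana", "bo", "cy"},
    {"ana", "bo"},
    {"ana", "dee"},
    {"ana"},
]

mau = len(set().union(*daily_actives))  # monthly actives: anyone active at all
avg_dau = sum(len(d) for d in daily_actives) / len(daily_actives)
stickiness = avg_dau / mau              # DAU/MAU ratio

# Annual churn rate: share of the year's starting customers who went inactive.
customers_at_start = 200
churned_during_year = 30
annual_churn_rate = churned_during_year / customers_at_start  # 0.15
```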

Win-Back Metrics for the Growth Marketer

Users who were once active, but have since lapsed, can be won back to become active customers again. This is one of the hardest ways to gain active users, since these users likely lapsed because the product lost some relevance for them. It may feel like raising the dead, but it must be part of a Growth Marketer’s strategy. To measure the success of win-back campaigns, there is one metric in particular to focus on:

Re-activation rate:
The percentage of previously lapsed customers who become active again within a given time period.

The Tip of the Iceberg…what next?

This is not an exhaustive list. It is meant to give Growth Marketers the metrics they need to identify success and address areas of improvement. These metrics are the foundation of success as a growth marketer, and owning them for your organization gives you tremendous insight into the health of your business, marketing strategy, and customer base.

Watch out for more posts about growth marketing, and check out our comprehensive guide here for everything you need to know about the subject.



How do you know if your model is going to work? Part 3: Out of sample procedures

Continuing our guest post series!

Authors: John Mount (more articles) and Nina Zumel (more articles).

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 3 of our four part mini-series “How do you know if your model is going to work?” we develop out of sample procedures.

Previously we worked on:

  • Part 1: Defining the scoring problem
  • Part 2: In-training set measures

Out of sample procedures

Let’s try working “out of sample” or with data not seen during training or construction of our model. The attraction of these procedures is they represent a principled attempt at simulating the arrival of new data in the future.

Hold-out tests

Hold-out tests are a staple for data scientists. You reserve a fraction of your data (say 10%) for evaluation and don’t use that data in any way during model construction and calibration. There is the issue that the test data is often used to choose between models, but in practice that should not cause too much data leakage. However, there are procedures to systematically abuse easy access to test performance in contests such as Kaggle (see Blum, Hardt, “The Ladder: A Reliable Leaderboard for Machine Learning Competitions”).
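The reservation step can be sketched in a few lines (a minimal stdlib-Python illustration of the idea; real projects would typically use a library routine, and the fraction and seed here are arbitrary):

```python
import random

def train_test_split(rows, test_fraction=0.1, seed=42):
    """Reserve a fraction of rows for evaluation; never use them for training."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]  # (train, test)

train, test = train_test_split(range(100), test_fraction=0.1)
```

The crucial discipline is not the split itself but that the held-out rows are never consulted during model construction or calibration.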

Notional train/test split (first 4 rows are training set, last 2 rows are the test set).

The results of a test/train split produce graphs like the following:



The training panels are the same as we have seen before. We have now added the upper test panels. These are where the models are evaluated on data not used during construction.

Notice that on the test graphs random forest is the worst (for this data set, with this set of columns, and this set of random forest parameters) of the non-trivial machine learning algorithms. Since the test data is the best simulation of future data we have seen so far, we should not select random forest as our one true model in this case; instead, consider GAM logistic regression.

We have definitely learned something about how these models will perform on future data, but why should we settle for a mere point estimate? Let’s get some estimates of the likely distribution of future model behavior.

Read more.



How do you know if your model is going to work? Part 2: In-training set measures

Continuing our guest post series!

Authors: John Mount (more articles) and Nina Zumel (more articles).

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 2 of our four part mini-series “How do you know if your model is going to work?” we develop in-training set measures.

Previously we worked on:

  • Part 1: Defining the scoring problem

In-training set measures

The most tempting procedure is to score your model on the data used to train it. The attraction is this avoids the statistical inefficiency of denying some of your data to the training procedure.

Run it once procedure

A common way to assess score quality is to run your scoring function on the data used to build your model. We might try comparing several models scored by AUC or deviance (normalized to factor out sample size) on their own training data as shown below.



What we have done is take five popular machine learning techniques (random forest, logistic regression, gbm, GAM logistic regression, and elastic net logistic regression) and plotted their performance in terms of AUC and normalized deviance on their own training data. For AUC larger numbers are better, and for deviance smaller numbers are better. Because we have evaluated multiple models we are starting to get a sense of scale. We should suspect an AUC of 0.7 on training data is good (though random forest achieved an AUC on training of almost 1.0), and we should be acutely aware that evaluating models on their own training data has an upward bias (the model has seen the training data, so it has a good chance of doing well on it; or training data is not exchangeable with future data for the purpose of estimating model performance).
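For reference, AUC has a simple rank-based definition that can be sketched directly: it is the probability that a randomly chosen positive example outscores a randomly chosen negative one (the four scores below are invented; real projects would use a library implementation).

```python
def auc(scores, labels):
    """AUC as the probability that a random positive example outscores
    a random negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores on four rows: two positives and two negatives.
example_auc = auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])  # 0.75
```

This view makes the upward bias on training data easy to picture: a flexible model can memorize training rows and rank them nearly perfectly, which is how random forest reaches an AUC near 1.0 on its own training set.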

There are two more Gedankenexperiment models that any data scientist should always have in mind:

  1. The null model (on the graph as “null model”). This is the performance of the best constant model (a model that returns the same answer for all datums). In this case it is a model that scores each and every row as having an identical 7% chance of churning. This is an important model that you want to do better than. It is also a model you are often competing against as a data scientist, as it is the “what if we treat everything in this group the same” option (often the business process you are trying to replace). The data scientist should always compare their work to the null model on deviance (null model AUC is trivially 0.5), and packages like logistic regression routinely report this statistic.
  2. The best single variable model (on the graph as “best single variable model”). This is the best model built using only one variable or column (in this case using GAM logistic regression as the modeling method). This is another model the data scientist wants to outperform, as it represents the “maybe one of the columns is already the answer” case (if so, that would be very good for the business, as they could get good predictions without modeling infrastructure). The data scientist should definitely compare their model to the best single variable model. Until you significantly outperform the best single variable model, you have not outperformed what an analyst can find with a single pivot table.
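The null-model benchmark on deviance can be made concrete with a small sketch. The 7% churn rate comes from the example above; the “fitted” model scores are invented purely to show a model beating the benchmark.

```python
import math

def normalized_deviance(probs, labels):
    """-2 times the mean log-likelihood; smaller is better, sample size factored out."""
    ll = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
             for p, y in zip(probs, labels))
    return -2 * ll / len(labels)

labels = [1] * 7 + [0] * 93     # a 7% churn rate, as in the example above
p0 = sum(labels) / len(labels)  # the best constant prediction: the base rate
null_deviance = normalized_deviance([p0] * len(labels), labels)

# An invented well-separated model's scores, for comparison only:
model_deviance = normalized_deviance([0.6] * 7 + [0.02] * 93, labels)
```

Any candidate model whose normalized deviance is not clearly below `null_deviance` has not demonstrated value over treating every row the same.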

At this point it would be tempting to pick the random forest model as the winner as it performed best on the training data. There are at least two things wrong with this idea:

Read more.



How do you know if your model is going to work? Part 1: The problem

This month we have a guest post series from our dear friend and advisor, John Mount, on building reliable predictive models. We are honored to share his hard won learnings with the world.

Authors: John Mount (more articles) and Nina Zumel (more articles) of Win-Vector LLC.

“Essentially, all models are wrong, but some are useful.”
George Box

Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.

We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?

Geocentric illustration Bartolomeu Velho, 1568 (Bibliothèque Nationale, Paris)


Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.

In this latest “Statistics as it should be” series, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner, meaning we are going to consider scoring system utility in terms of service to a negotiable business goal (one of the many ways data science differs from pure machine learning).

To organize the ideas into digestible chunks, we are presenting this article as a four part series (to be finished over the next 3 Tuesdays). This part (part 1) sets up the specific problem.

Read more.

The A/B Testing Paradox

Has this ever happened to you:

  • You ran 30 A/B tests over 6 months.
  • 20 of them showed an average lift of 2%, and you promoted the winning variants to your entire traffic base. Yay!
  • 10 of the tests did not show a lift over the base configuration, and you killed the variants you were trying out. No big deal, of course.
  • 6 months later, you would expect that your business is up ~40% from the 20 winning variants.
  • Yet, overall revenue has hardly improved.

Maybe seasonality is the issue (i.e., revenue improved over the base, but this time of the year, seasonality leads to a 40% decline anyway)? Or maybe there were other factors that would have driven the business down 40%, and your a/b tests neutralized that?

While possibilities like these could be true, an often overlooked fact is that our understanding of how to measure “lift” through A/B testing may be flawed. Here are some ways in which A/B testing might flatter only to deceive, some of which can be solved by improving your testing methodology:

    1. You never bothered to understand the extent of random fluctuations in metrics: Consider an experiment where you toss a coin 100 times, twice. Through sheer randomness, you are likely to get anything but exactly 50 heads both times; you might get 49 heads once, and 51 heads the other time. A null test, which randomly slices users at all times and measures the variance between the 2 slices, gives you the variance in metrics that’s purely random. Without accounting for this random variance, you would never be able to measure the true lift caused by your experiment’s variations. It’s a good idea to create a dummy test (where the two variants are entirely the same) and run it at all times. When you run other A/B tests, make sure that they show a bigger observed deviation than the null test.
    2. The tests were not independent: One of the assumptions in an A/B testing system is that you can run multiple tests at the same time, and that those tests are independent of each other. This assumption often breaks down when you run tests that apply only to a sub-segment of users. For example, consider a test X (with variants XA and XB) that triggers only on a segment of users who have clicked the “buy” button on an ecommerce site; consider that it’s running alongside a test Y (with variants YA and YB), where YB users show a higher propensity to click on the buy button. As a result, the population that sees experiment X is now composed of more YB users, breaking the independence. This is a really hard pitfall to avoid in A/B testing, but it can be resolved through multivariate testing. One way to minimize the impact of this issue in A/B tests is to run only a few tests in parallel, and to be careful when running “filtered” tests (tests that only get triggered on a small segment of qualifying users).
    3. You didn’t run the test long enough before you declared victory: A test needs a sufficient amount of observations before you know for sure if the measurements are statistically significant. So how long do you need to run the tests to be able to be confident in the results? The answer is that it depends on the following:
        1. The current value of the metric you’re looking to optimize: e.g. if the metric is conversion rate and its current value is 5%, you will need to run the test for longer than if it were 3%. The reason is that the random variance of a conversion rate grows with the baseline (it is proportional to p(1-p) for a baseline rate p), so a 1% improvement has a higher chance of being “noise” when the baseline is around 5% than when it is around 3%.
        2. The minimum difference you hope to detect: If you want to detect a statistically significant lift of 0.1%, you will need to run the test for much longer than if you wanted to detect a statistically significant lift of 1%
        3. How much you want to minimize the risk of false positives and false negatives: If you hope to reduce false positives, or in other words, increase the statistical significance, you would need to run the test for longer. Similarly, to reduce the risk of false negatives, or to increase the statistical power, you will need to run the tests for longer.

This sample size calculator makes it easy to figure out how many observations you need.
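The three factors above can be tied together in a rough formula, a simplified version of what such calculators compute (this assumes a two-sided, two-proportion test under the normal approximation; the baselines and lifts below are arbitrary examples):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p_base, min_lift, alpha=0.05, power=0.8):
    """Rough per-variant sample size to detect an absolute lift in a rate,
    using the normal approximation for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # power threshold
    p_test = p_base + min_lift
    variance = p_base * (1 - p_base) + p_test * (1 - p_test)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / min_lift ** 2)

# Smaller detectable lifts need far more observations...
small_lift = sample_size_per_variant(0.05, 0.001)
big_lift = sample_size_per_variant(0.05, 0.01)
# ...and a higher baseline rate needs more observations for the same lift.
low_base = sample_size_per_variant(0.03, 0.01)
```

Raising the required significance (smaller alpha) or power has the same effect: the z-values grow, and the required sample size grows with their square.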

As you internalize some of these observations to your testing practices, you will find fewer tests “succeeding”, but the ones that will succeed will produce sustainable results. Happy testing!

Taming your customer lifecycle metrics

Taming your Customer Lifecycle Metrics

In recent years, the idea of “pirate metrics” has gained wide adoption. Pirate metrics stand for AARRR: Acquisition, Activation, Retention, Referral and Revenue. The precise definition of the metrics may differ based on your business. For example, for some e-commerce businesses, acquisition could mean getting a visitor to sign up to your newsletter, activation (or reactivation) could be measured as the first purchase in the last 6 months or since joining, retention as repeat purchase, revenue as total sales, and referral as the number of friends invited by a user who then signed up to the newsletter or made a purchase; for a subscription commerce business, retention could be measured as the churn rate (the percentage of customers who cancel their subscription).

The work of lifecycle marketing or CRM typically begins after the initial acquisition, and is about optimizing the activation, retention, repeat revenue and referral rates. How can you achieve marketing success and improve these metrics?

Here is a 5-step guide to taming these metrics:

  1. Construct a “state diagram” of the lifecycle stages for your business: The pirate metrics map to changes in the lifecycle state of the user: e.g. the activation rate metric calculates the change between the new user state and the active customer state. As a first step to optimizing the lifecycle, draw the state diagram of how users can transition between these states. Define an active customer based on activity within a time window (e.g. at least one purchase in the last 3 months). Customize the diagram to include the states that make sense for your business, such as additional states like “core users”, who might not only be active, but also making frequent purchases.

    A sample state diagram


  2. Calculate the percentages along the “edges”: Every month, look at the new/core/active/lapsed users from last month, and understand what new states they have transitioned to. Calculate the percentages of these transitions. The following 2 tables illustrate this.



  3. Assess opportunities by benchmarking and monitoring over time: By looking at the percentages along the edges, you discover where your opportunities and challenges lie. For example, you may discover that only 60% of previous month’s active users stay active in the current month, and that might be a good metric to try and improve through a targeted effort.
  4. Construct targeted experiments for each step: Once you have assessed the opportunities, you can create experiments that might improve the metrics. For instance, in a subscription commerce environment, you might have a hypothesis that you could increase the retention rate by focusing on the edge between following 2 states: customers who have subscribed for 3 or more months, and lapsed customers. In order to improve this metric, you might come up with multiple experiments; an example of an experiment could be to give the users who have subscribed for 3 months a heavy discount to sign on to an annual plan. You could communicate this discount over email, and measure if the email improved the metrics on the relevant edge.
  5. Measure, and iterate: Once you start experimenting, you need to measure how well the experiments are working, and iterate. Successful experiments can then be rolled out to your entire user base.
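Step 2 above, calculating the percentages along the edges, can be sketched with a small transition count, assuming you can label each user with a lifecycle state per month (the user IDs and states below are made up):

```python
from collections import Counter

# Hypothetical lifecycle state of each user, last month vs. this month.
last_month = {"u1": "new", "u2": "active", "u3": "active", "u4": "lapsed"}
this_month = {"u1": "active", "u2": "active", "u3": "lapsed", "u4": "active"}

transitions = Counter((last_month[u], this_month[u]) for u in last_month)
totals = Counter(last_month.values())

# Percentage along each edge of the state diagram:
edges = {(src, dst): count / totals[src]
         for (src, dst), count in transitions.items()}
# e.g. half of last month's active users stayed active this month
```

Each edge percentage is a candidate metric to benchmark, monitor over time, and target with experiments.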

How does this approach compare with the analytical approach known as cohort analysis? Cohort analysis is a great tool for a couple of analytical use cases:

  • Calculating the lifetime value of a user
  • Understanding if more recent cohorts are performing better than older cohorts

However, while cohort analysis is a great analytical tool, it does not by itself provide the actionable insights you need to improve your lifecycle metrics. The key difference is that cohort analysis primarily classifies users by when they first signed up, rather than by their current activity level. In our state transition model outlined above, we group all active users together, irrespective of when they joined, making it easier to see just a few metrics of interest that line up well against experiments you can create.

Once you get started on this approach, there is no limit to the number of experiments you can run to optimize the metrics along the edges in your state diagram, other than the limitations imposed by the lack of the right tools for measurement and for running the experiments. Creating an experiment might involve some form of messaging that includes stitching together content and offers, delivering these messages, and measuring the impact.

At Blueshift, we are building tools that will enable marketers to monitor user states and create the right experiments easily. Stay tuned for updates from us, but in the meanwhile, we would love to hear how you think about driving your lifecycle metrics.