Monday 11 September 2017

The Easiest Guide to Cohort Analysis

A cohort is a group of users who experience a common event within the same time period.

An oft-repeated but very relevant example of a cohort is a group of students joining in the same year. So the class of 2017 is a cohort, and so is the class of 2018, and so on.

What is cohort analysis?

Cohort analysis is an analytical technique used to study a cohort’s characteristics over a period of time and the elements that influence change in those characteristics. It traces its roots to medical research, where cohort studies are done to identify the cause of a disease.

“In a prospective cohort study, researchers first raise a research question, forming a hypothesis about the potential causes of a disease. The researchers then observe a group of people, the cohort, over a period of time (often several years), collecting data that may be relevant to the disease. This allows the researchers to detect any changes in health in relation to the potential risk factors they have identified.” via Medical News Study

So, to identify the cause of lung cancer, doctors would form a hypothesis that it is caused by smoking. Then they would take two groups- smokers and non-smokers.

Thereafter, both groups would be studied to identify the influence of smoking on a person’s likelihood of getting lung cancer.

How do we employ this in business analytics?

In business applications, we compare cohorts (users sharing a common experience in a given time frame) or analyze the behavior of a single cohort to identify a pattern that supports a growth hypothesis. That hypothesis could be anything.

For instance, we may form a hypothesis that users acquired via display ads have a higher LTV than the ones acquired via Facebook. To test the hypothesis we would do a cohort analysis.

Likewise, let’s suppose we want to identify the cause of an aggregate dip in retention. We could form a hypothesis that retention is correlated with the customer’s first purchase.

To establish the relation, we would cohortize users on the basis of their first purchase and plot their retention % (say, monthly).

Cohort analysis example

From the graph above it is apparent that the users who purchased marshmallows the first time displayed higher retention than the others. This is despite the fact that the overall retention of the product has declined. Naturally, the intent of the business now would be to get more users to purchase marshmallows post acquisition.

Important- That’s not to say that marshmallows are the cause of retention. Our analysis simply tells us that there is a correlation between buying marshmallows first and retention. Correlation doesn’t imply causation. So we have to test whether marshmallows really lead to higher retention or not.

Cohort analysis gives us insight into the trend and a basis for testing. Not the cause.

Cohorts and Segments are not the same

Most folks use ‘Cohort’ and ‘Segment’ interchangeably, which is not correct.

For two users to be part of the same cohort, they have to be bound by a common event and time period. E.g. 2017 graduates, men born in 1990.

However, to create a segment you could use almost any condition as a basis, and it need not be time and event based. E.g. graduates, men.

A cohort is a special kind of segment. So ‘new users this week’ is a cohort, and it is equally a segment.

Now that we have understood the fundamentals of cohorts, let’s look at some business use-cases.

Some powerful use-cases of Cohort Analysis

To follow the use-cases, start with the Google Sheets workbook (linked below), which has a cohort chart for every use-case.

Cohort Analysis | Worksheet

1. Understanding customer retention

But before we do that, a quick refresher on how to read a cohort chart. We are skipping the data crunching part and jumping right into the presentation.

How to read a cohort chart?

Table 1

Link- Cohort by Active users- Sheet 1 | Excel

Let’s go through the rows and columns one by one. Each row corresponds to an acquisition month, and each column to the month in which those customers were active.


So, B4 represents the number of new customers we acquired in the month of Jan. C4 tells us the number of customers who were acquired in Jan and returned in Feb. Likewise:

D4- number of customers acquired in Jan who returned in March.
E4- the ones who returned in April

And so on and so forth.

Basically, as we move along Jan’s row, we see how the retention of new customers acquired in Jan fluctuated until Dec.


Each column shows the new and returning customers for one month. D4 represents the number of customers acquired in Jan who returned in March. D5- the number of customers acquired in Feb who returned in March. D6 is the number of new customers acquired in March.

The same pattern repeats as we move along the row.
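If you are curious where the numbers in a table like this come from, here is a minimal sketch in Python. It assumes a hypothetical activity log of (user_id, month) pairs; the names `events` and `cohort_table` and the sample data are made up for illustration and are not taken from the worksheet.

```python
from collections import defaultdict

# Hypothetical activity log: (user_id, month in which the user was active).
# Months are numbered 1 = Jan, 2 = Feb, 3 = Mar.
events = [
    ("u1", 1), ("u1", 2), ("u1", 3),
    ("u2", 1), ("u2", 3),
    ("u3", 2), ("u3", 3),
    ("u4", 3),
]

def cohort_table(events):
    """Count distinct active users per (acquisition month, activity month)."""
    # Assumption: a user's acquisition month is the first month they appear in.
    first_month = {}
    for user, month in events:
        first_month[user] = min(month, first_month.get(user, month))

    # Collect the distinct users behind every cell of the table.
    cells = defaultdict(set)
    for user, month in events:
        cells[(first_month[user], month)].add(user)
    return {cell: len(users) for cell, users in cells.items()}

table = cohort_table(events)
```

Here `table[(1, 3)]` plays the role of D4 above: users acquired in month 1 who were active again in month 3. Cells below the diagonal, such as `(2, 1)`, never appear, which is what gives the table its triangular shape.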

Table 2

Now, let’s understand how each cohort behaves, retention wise, over a period of time.

To do that, we would slightly pivot the above table. We would change the columns from the actual month to the ‘# of months since acquisition’, i.e. from Jan, Feb, Mar to 0, 1, 2, which pulls all the row data to the left.

You may notice that the table changed from a right-aligned triangle to a left-aligned one.

So, in the first row, as we move along, we would know how many customers acquired in Jan returned in the succeeding months.
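The shift from calendar months to ‘months since acquisition’ is a simple re-keying of each row. A rough sketch, assuming the Table 1 counts are held in a nested dict (the numbers and the function name `to_months_since_acquisition` are hypothetical):

```python
# Hypothetical Table 1 counts: {acquisition month: {activity month: customers}}.
# Months are numbered 1 = Jan, 2 = Feb, 3 = Mar.
calendar_table = {
    1: {1: 100, 2: 40, 3: 30},
    2: {2: 120, 3: 50},
    3: {3: 90},
}

def to_months_since_acquisition(calendar_table):
    """Re-key every row by offset (0 = acquisition month) instead of calendar month."""
    return {
        cohort: {month - cohort: count for month, count in row.items()}
        for cohort, row in calendar_table.items()
    }

aligned = to_months_since_acquisition(calendar_table)
```

Every row of `aligned` now starts at offset 0, so the cohorts can be compared column by column.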

Table 3

In this table, we changed the numbers into percentages to get a better view of the data.

Now looking at each row we may get the retention curve of the corresponding month. However, what if we want to understand how the retention has been over the past 12 months?

So, in the final row, we have calculated the aggregate. The aggregate gives us the retention curve of the past 12 months.
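The conversion to percentages and the aggregate row can be sketched as below. The worksheet's exact aggregate formula is not spelled out here, so this sketch assumes a weighted aggregate: for each offset, customers retained across all cohorts old enough to have that offset, divided by the customers originally acquired in those same cohorts. All numbers are hypothetical.

```python
# Hypothetical left-aligned counts: index 0 is the acquisition month.
aligned = {
    "Jan": [100, 40, 30],
    "Feb": [120, 50],
    "Mar": [90],
}

def retention_percentages(aligned):
    """Express every cell as a percentage of its cohort's starting size."""
    return {
        cohort: [round(100 * n / row[0], 1) for n in row]
        for cohort, row in aligned.items()
    }

def aggregate_curve(aligned):
    """Weighted aggregate (an assumption): pool only cohorts that reached each offset."""
    longest = max(len(row) for row in aligned.values())
    curve = []
    for offset in range(longest):
        rows = [row for row in aligned.values() if len(row) > offset]
        retained = sum(row[offset] for row in rows)
        acquired = sum(row[0] for row in rows)
        curve.append(round(100 * retained / acquired, 1))
    return curve

percentages = retention_percentages(aligned)
aggregate = aggregate_curve(aligned)
```

A plain average of the per-cohort percentages is another reasonable choice; the weighted version simply keeps a small cohort from swinging the curve.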

2. Correlation between category and retention

A friend of mine had worked on the cohort analysis of one of the world’s largest retailers. He told me that one of the conclusions from their analysis was that users who purchased baby products on their first visit showed a higher propensity to visit again. This prompted the retailer to promote their baby section more aggressively.

One can create a hypothesis that there are some categories which trigger maximum stickiness among users when they are the first purchase.

To determine that category, let’s cohortize users on the basis of the category of their first purchase and plot their retention.

Link- Cohorts by Category- Sheet 2

From the chart, one can draw the following conclusions:

Users who bought Sportswear in their first purchase showed higher retention than the rest.
Users who bought Jewellery in their first purchase showed the lowest retention rate.
The 5th month is critical, as the churn seems to increase beyond that.

Some possible inferences are that the marketing expense for Sportswear needs to be increased. Likewise, the retention strategy for Jewellery purchasers needs to be revisited. The retention strategy for users entering their 5th month since acquisition has to be evaluated.
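The grouping behind a chart like this could look roughly like the sketch below. It assumes a hypothetical purchase log of (user_id, months since acquisition, category) and that every user's first purchase happens in month 0; the function name and the data are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical purchase log: (user_id, months since acquisition, category).
purchases = [
    ("u1", 0, "Sportswear"), ("u1", 1, "Books"), ("u1", 2, "Sportswear"),
    ("u2", 0, "Sportswear"), ("u2", 1, "Jewellery"),
    ("u3", 0, "Jewellery"),
    ("u4", 0, "Jewellery"), ("u4", 2, "Books"),
]

def retention_by_first_category(purchases):
    """Return {first-purchase category: {month offset: % of cohort active}}."""
    # The cohort key is the category of each user's earliest purchase.
    first_category = {}
    for user, month, category in sorted(purchases, key=lambda p: p[1]):
        first_category.setdefault(user, category)
    cohort_size = Counter(first_category.values())

    # Distinct users active in each (cohort, month) cell, whatever they bought.
    active = defaultdict(set)
    for user, month, _ in purchases:
        active[(first_category[user], month)].add(user)

    result = defaultdict(dict)
    for (category, month), users in active.items():
        result[category][month] = round(100 * len(users) / cohort_size[category], 1)
    return dict(result)

retention = retention_by_first_category(purchases)
```

A month with no entry for a cohort means nobody from that cohort came back that month, i.e. 0% retention.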

3. What features correspond to maximum retention

A report by Quettra shows that an average app loses 77% of its DAUs within 3 days post install. Now, if your product itself isn’t deserving, then nothing can prevent the uninstall. However, if it is, then the first three days are critical and determine the user’s retention.

3 days is the average trend, and your own critical number could vary.

You could determine your own critical number through the method that we discussed in #1.

Let’s suppose it is x days for the time being. Then you have to do something within the first x days post install to hook users.

How cohort analysis comes into picture

Let’s create a hypothesis that there are some features in the app which, when used, increase stickiness among users.

Create an aggregate retention curve of the last 12 months like we did in #1.

Note- Unlike that of a web-app, the retention curve of a mobile app is going to decrease steadily. A web-app doesn’t need to be installed on a device, so a user can log in any time they wish. With a mobile app, once it is uninstalled you potentially lose the user forever.

Now, screen the users who were retained and jot down the features they used on the first day. Suppose you are analysing an e-commerce app and want to find the traits common among all retained users.

Let’s say “push notification clicked” and “added to wishlist” are the two most common actions.

Now we would narrow our analysis to these two events and compare them.

The result

Cohort Analysis | Cohort by Features

Visit the above sheet and change the value for each feature from the drop down to see how the graph changes.

From the above chart, it would be clear that users who added to wishlist display a higher propensity to be retained than the rest. The ones who clicked a push notification perform even worse than the average.

Again, this graph gives us the correlation, not the cause of retention.
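A comparison like this can be sketched in a few lines. The event names below come from the example above, but the user records, the field names and the `retention_on_day` helper are hypothetical and only meant to show the shape of the calculation.

```python
# Hypothetical per-user records: actions taken on the first day, and the
# days (counted from install) on which the user was active.
users = {
    "u1": {"first_day": {"added_to_wishlist"}, "active_days": {0, 1, 7}},
    "u2": {"first_day": {"added_to_wishlist", "push_notification_clicked"},
           "active_days": {0, 7}},
    "u3": {"first_day": {"push_notification_clicked"}, "active_days": {0}},
    "u4": {"first_day": set(), "active_days": {0, 1, 7}},
}

def retention_on_day(users, day, feature=None):
    """% of the cohort active on `day`; feature=None means all users."""
    cohort = [
        u for u in users.values()
        if feature is None or feature in u["first_day"]
    ]
    if not cohort:
        return 0.0
    retained = sum(1 for u in cohort if day in u["active_days"])
    return round(100 * retained / len(cohort), 1)

average = retention_on_day(users, 7)
wishlist = retention_on_day(users, 7, "added_to_wishlist")
push = retention_on_day(users, 7, "push_notification_clicked")
```

With this toy data the wishlist cohort sits above the average and the push notification cohort below it, mirroring the pattern described above. It is still only a correlation.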

P.S. This is a very interesting method and is extensively used by consumer businesses. I have only discussed the basic framework; there are various refinements that can lead you to a more definite conclusion.

4. How customers react to a new feature release

Conversely, the above cohort analysis could also be used to figure out which obsolete features need some rework.

For instance, the cohort curve of users who clicked on a push notification fares worse than the average retention curve. Push notifications are obviously meant to complement your retention, so the above chart prompts us to rethink our strategy.

Creating cohorts in Mixpanel, Amplitude, Adobe- First event and Returning event

If you are using Amplitude or Mixpanel, or any similar product, to do your cohort analysis, these are the two fields that you have to specify to create a cohort chart:

First event
Returning event

Let’s see some examples





First event is the primary criterion for building the cohort- the ‘experience’ element in creating a cohort that we discussed at the very beginning.

Returning event is the baseline that you want to track for your users. In the above charts, retention has been the baseline of our analysis. In analytics, retention could be defined as ‘any event performed by the user’ on your platform.
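To make the two fields concrete, here is a tool-agnostic sketch of what they do. This is not the API of Amplitude, Mixpanel or Adobe; the `retention` function, the event names and the log are all hypothetical.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, day, event_name).
log = [
    ("u1", 0, "signup"), ("u1", 1, "purchase"), ("u1", 3, "purchase"),
    ("u2", 0, "signup"), ("u2", 2, "search"),
    ("u3", 1, "signup"), ("u3", 2, "purchase"),
]

def retention(log, first_event, returning_event=None):
    """Return {days since first event: number of users who came back}.

    first_event puts a user into the cohort; returning_event is what counts
    as coming back. returning_event=None treats any event as a return.
    """
    # Day on which each user first performed the first event.
    start = {}
    for user, day, name in sorted(log, key=lambda e: e[1]):
        if name == first_event:
            start.setdefault(user, day)

    returned = defaultdict(set)
    for user, day, name in log:
        if user not in start or day <= start[user]:
            continue  # not in the cohort, or not after the first event
        if returning_event is None or name == returning_event:
            returned[day - start[user]].add(user)
    return {offset: len(found) for offset, found in sorted(returned.items())}

purchase_returns = retention(log, "signup", "purchase")
any_returns = retention(log, "signup")
```

Changing only the returning event changes the question being asked: `purchase_returns` counts users who came back to buy, while `any_returns` counts anyone who came back at all.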

So, if we were to create a cohort in Amplitude, it would look somewhat like this


Cohort analysis is a respite from vanity metrics.

At any time, momentary growth can be bought, which may give you temporary pleasure, but cohort analysis allows you to be cynical. It gives a very critical view of churn and doesn’t let it get masked by growth.

For instance, if you are investing in acquisition there can be an instant surge in MAU, but a high MAU is not an indicator of growth. A cohort analysis will tell you how many of those acquired users are actually sticking with you.
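A small worked example shows how this masking happens. The numbers are entirely hypothetical: acquisition grows every month, so MAU climbs, while the share of each cohort that returns the following month falls.

```python
# Hypothetical cohort counts: {acquisition month: {activity month: customers}}.
months = ["Jan", "Feb", "Mar"]
calendar_table = {
    "Jan": {"Jan": 1000, "Feb": 400, "Mar": 300},
    "Feb": {"Feb": 1500, "Mar": 450},
    "Mar": {"Mar": 3000},
}

def monthly_active_users(table, month):
    """MAU is the column total: new plus returning customers in that month."""
    return sum(row.get(month, 0) for row in table.values())

def month_one_retention(table, months, cohort):
    """% of a cohort that was active in the month right after acquisition."""
    next_month = months[months.index(cohort) + 1]
    return round(100 * table[cohort][next_month] / table[cohort][cohort], 1)

mau = [monthly_active_users(calendar_table, m) for m in months]
jan_retention = month_one_retention(calendar_table, months, "Jan")
feb_retention = month_one_retention(calendar_table, months, "Feb")
```

In this toy data MAU goes from 1000 to 1900 to 3750, yet month-one retention slips from 40% for the Jan cohort to 30% for the Feb cohort. The headline metric looks healthy while the cohorts tell a different story.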

Similarly, a particular channel might account for the highest acquisition. But a cohort analysis will tell you which channels contribute the most profit.

Whatever your key metrics may be, you would be able to see how they evolve over the customer lifecycle or product lifecycle.