GeoLift by Recast

REBA Model Technical Documentation

The REBA model is a Bayesian synthetic control model.

Given a target time series and a set of donor time series, REBA constructs a twin (a “synthetic control”) of the target time series from the donors using historical data, then measures the difference between the target and the synthetic control in another time period (the “test period”) during which there is some intervention in the target. Since the donors did not receive the intervention, the difference between the synthetic control and the target during the test period quantifies the effect of the intervention.

In this document we’ll discuss exactly how REBA constructs the synthetic control and how it calculates the intervention effect estimate, but first we need to give an overview of the data we use.

The data

In a geographic lift test, some regions receive an intervention and others do not. Examples of common interventions would be things like doubling spend in a marketing channel or turning off marketing across a swath of channels. Some KPI of interest, like revenue, is tracked in all regions.

Suppose there are three test regions that received the intervention (it could be any number of regions). If we call their revenue time series R1(t), R2(t), and R3(t), then we can define R(t) to be the average of their time series:

R(t) = ( R1(t) + R2(t) + R3(t) ) / 3

We use this R(t) as the target for which we construct the synthetic control.

We will also have a number of donor regions that did not receive the intervention. Call their revenue time series X1(t), X2(t), all the way up to XN(t). Sometimes we will refer to these as Xi(t), where i could be any number from 1 to N.

Finally, there is some time step t = T at which the intervention starts. The time after that point, t >= T, contains the intervention and any cooldown period. The time before that point, t < T, is the period we will use to construct a synthetic control from the donors Xi(t) to match R(t).

Generally we like to use around one year of historical data leading up to the test to construct the synthetic control.
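The data setup above can be sketched in a few lines of numpy. All names and numbers here are illustrative, not part of REBA itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily revenue for three test regions (any number works)
n_days = 400          # ~1 year of history plus a test period
T = 365               # intervention start index

R1, R2, R3 = (100 + rng.normal(0, 5, n_days) for _ in range(3))

# Target: the average of the test regions' series
R = (R1 + R2 + R3) / 3

# Pre-intervention period (used to fit the weights) and test period
R_pre, R_test = R[:T], R[T:]
```

Donor series X1(t), ..., XN(t) would be split at the same index T.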

The model and the effect estimate

REBA finds a combination of fixed weights W1, W2, …, WN (as many weights as there are donor regions) such that

W1*X1(t) + W2*X2(t) + ... + WN*XN(t) ≈ R(t)  for t < T

Note we used the approximately equals sign “≈”. The closest match will generally still not be a perfect fit.

We place some restrictions on the weights Wi:

  1. Every weight must be positive or zero: Wi >= 0

  2. The weights must sum to 1: W1 + W2 + ... + WN = 1

These restrictions mean the synthetic control is a weighted average (a convex combination) of the donors: each donor contributes a scaled-down copy of itself, and at every time step the synthetic control lies between the smallest and largest donor values.
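To build intuition for the constrained fit, here is a frequentist stand-in: a constrained least-squares fit of the weights using scipy, on synthetic data with known true weights. This is a sketch for illustration only, not REBA's actual (Bayesian) fitting procedure, and all names are made up:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Synthetic pre-period data: 100 timesteps, 4 donors, known true weights
n_pre, n_donors = 100, 4
X = rng.normal(100, 10, size=(n_pre, n_donors))
true_w = np.array([0.5, 0.3, 0.2, 0.0])
R = X @ true_w + rng.normal(0, 0.1, n_pre)

def sse(w):
    """Squared error between the weighted donor sum and the target."""
    return np.sum((X @ w - R) ** 2)

res = minimize(
    sse,
    x0=np.full(n_donors, 1 / n_donors),
    method="SLSQP",
    bounds=[(0, 1)] * n_donors,                                # Wi >= 0
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1},  # sum to 1
)
W = res.x
```

With clean data like this, the recovered W lands very close to true_w, including the zero weight on the unused donor.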

Supposing we’ve found a good combination of weights, define the synthetic control S(t) to be that weighted sum of donor regions:

S(t) = W1*X1(t) + W2*X2(t) + ... + WN*XN(t)

The quantified effect of the intervention is then the total difference between the target R(t) and the synthetic control S(t) from the test start at t = T until the end of the test and any included cooldown period. We also need to multiply that by the number of test regions to account for the fact that R(t) was an average of the actual test regions. So, in the end,

effect = (number of test regions) * sum_{t >= T} ( R(t) - S(t) )

When calculating this, we account for the imperfect match between the target and synthetic control by adding simulated errors to S(t).
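The effect calculation can be sketched as follows. The fitted weights W and error scale sigma would come from the model fit; here they are hard-coded illustrative values, and the test data is simulated with a known per-step lift so we can see the estimate land near the truth:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative fitted quantities (in practice these come from the model fit)
n_test, n_donors, n_test_regions = 50, 4, 3
W = np.array([0.5, 0.3, 0.2, 0.0])    # fitted donor weights
sigma = 2.0                           # fitted residual scale

X_test = rng.normal(100, 10, size=(n_test, n_donors))
lift_per_step = 5.0
R_test = X_test @ W + lift_per_step   # target with a true intervention effect

# Synthetic control in the test window, plus simulated errors at the
# pre-period residual scale
S_test = X_test @ W
S_test_sim = rng.normal(S_test, sigma)

effect = n_test_regions * np.sum(R_test - S_test_sim)
```

The true total effect here is n_test_regions * n_test * lift_per_step = 750; the estimate scatters around that value because of the simulated errors.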

Implementation as a Bayesian model

We place a flat prior on the weights Wi (a Dirichlet distribution with every concentration parameter equal to 1) and use a Gaussian likelihood to match R(t) and S(t).

The Stan code for the first part of the Bayesian model roughly looks like this:

Stan
data {
  int n_timesteps;  // pre-intervention timesteps
  int n_donor_regions;
  vector[n_timesteps] R;
  matrix[n_timesteps, n_donor_regions] X;
}

parameters {
  simplex[n_donor_regions] W;
  real<lower=0> sigma;
}

transformed parameters {
  // the synthetic control
  vector[n_timesteps] S = X * W;
}

model {
  // priors
  W ~ dirichlet(rep_vector(1, n_donor_regions));
  sigma ~ exponential(...);

  // likelihood
  R ~ normal(S, sigma);
}

Note that we have a parameter sigma that measures the error between the intervened time series R and the synthetic control S over these pre-test time steps. This parameter gets a weak exponential prior that depends on the scale of the data.
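One conventional way to tie the exponential rate to the scale of the data (an assumption for illustration, not necessarily REBA's exact rule) is to center the prior's mean on the standard deviation of the pre-period target:

```python
import numpy as np

# Toy pre-period target series (illustrative values)
R_pre = np.array([102.0, 98.5, 101.2, 99.8, 100.4, 97.9, 103.1])

# Exponential(rate) has mean 1/rate; setting rate = 1/std(R) centers the
# prior for sigma on the scale of the data while remaining weak.
rate = 1.0 / np.std(R_pre)
```

Because the prior mean grows with the spread of the data, the same model specification works whether revenue is measured in hundreds or millions.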

Then, to calculate the effect estimate, we pass in the during- and post-intervention data, and simulate using the parameters we estimated in the previous step:

Stan
data {
  int n_test_timesteps;   // test timesteps
  int n_exposed_regions;  // number of regions with the intervention
  int n_donor_regions;
  vector[n_test_timesteps] R_test;
  matrix[n_test_timesteps, n_donor_regions] X_test;
}

parameters {
  // same declarations as the fitted model; values are read back
  // from the posterior draws of the previous step
  simplex[n_donor_regions] W;
  real<lower=0> sigma;
}

generated quantities {
  // build synthetic control in the test timesteps
  vector[n_test_timesteps] S_test = X_test * W;

  // add simulated errors to the synthetic control
  vector[n_test_timesteps] S_test_sim = to_vector(normal_rng(S_test, sigma));

  // calc the measured effect
  real effect = n_exposed_regions * sum(R_test - S_test_sim);
}

The quantity effect is the model’s estimate for the effect of the intervention.
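Because the generated quantities block runs once per posterior draw, effect comes back as a distribution of draws rather than a single number. A sketch of summarizing those draws (the draws here are simulated stand-ins, not real model output):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative posterior draws of `effect` (one per MCMC draw)
effect_draws = rng.normal(750.0, 40.0, size=4000)

point_estimate = np.mean(effect_draws)              # posterior mean
lo, hi = np.percentile(effect_draws, [2.5, 97.5])   # 95% credible interval
prob_positive = np.mean(effect_draws > 0)           # P(effect > 0)
```

Reporting the interval and P(effect > 0) alongside the point estimate conveys how confident the model is that the intervention had any effect at all.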