Tests¶
mmm-eval provides a wide-ranging suite of validation tests to evaluate MMM performance. This guide explains each test and how to interpret the results.
Overview¶
mmm-eval includes six validation tests, grouped into four main categories:
- Accuracy Tests: Measure how well the model fits the data (holdout and in-sample)
- Cross-Validated Holdout Accuracy Test: Assesses model generalization
- Refresh Stability Test: Evaluates model stability over time as new data is added
- Robustness Tests: Evaluate model sensitivity to data changes (perturbation and placebo)
Accuracy Tests¶
Accuracy tests evaluate how well the model fits the data using different validation approaches.
Accuracy is a necessary but not sufficient indicator of a good model: a model can perform well on accuracy tests and still get the causal relationships in the data wrong. It is, however, very effective at flagging poor models, since weak in-sample and/or out-of-sample performance almost always means the model is failing to capture the causal structure of the problem at hand.
Holdout Accuracy Test¶
The holdout accuracy test evaluates model performance by splitting data into train/test sets and calculating metrics on the test set.
Process¶
- Data Split: Data is split into training and test sets
- Model Training: Model is fitted on training data
- Out-of-Sample Prediction: Predictions made on held-out test data
- Performance Metrics: Calculated on out-of-sample predictions
Metrics¶
- MAPE (Mean Absolute Percentage Error): Average percentage error
- SMAPE (Symmetric Mean Absolute Percentage Error): Symmetric version of MAPE
- R-squared: Proportion of variance explained by the model
Interpretation¶
- Lower MAPE: Better model performance
- Lower SMAPE: Better performance on the symmetric error measure
- Higher R-squared: Better model fit (0-1 scale)
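The metric definitions above are standard; the sketch below shows how they can be computed on held-out predictions. This is illustrative Python only, not mmm-eval's internal implementation, and the example numbers are made up.

import numpy as np

def mape(actual, predicted):
    # Mean absolute percentage error, in percent
    return 100 * np.mean(np.abs((actual - predicted) / actual))

def smape(actual, predicted):
    # Symmetric MAPE, in percent
    return 100 * np.mean(2 * np.abs(predicted - actual) / (np.abs(actual) + np.abs(predicted)))

def r_squared(actual, predicted):
    # Proportion of variance explained
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return 1 - ss_res / ss_tot

# Score held-out predictions from a chronological train/test split
actual = np.array([100.0, 110.0, 95.0, 120.0])
predicted = np.array([98.0, 115.0, 90.0, 118.0])
print(mape(actual, predicted), smape(actual, predicted), r_squared(actual, predicted))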
In-Sample Accuracy Test¶
The in-sample accuracy test evaluates model performance by fitting the model on the full dataset and calculating metrics on the training data.
Process¶
- Full Dataset Training: Model is fitted on the complete dataset
- In-Sample Prediction: Predictions made on the same data used for training
- Performance Metrics: Calculated on in-sample predictions
Metrics¶
- MAPE (Mean Absolute Percentage Error): Average percentage error
- SMAPE (Symmetric Mean Absolute Percentage Error): Symmetric version of MAPE
- R-squared: Proportion of variance explained by the model
Interpretation¶
- Lower MAPE: Better model fit to training data
- Lower SMAPE: Better fit on the symmetric error measure
- Higher R-squared: Better explanatory power
- Comparison with holdout: Helps identify overfitting (much better in-sample than holdout performance)
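A quick way to apply the comparison above is to put the in-sample and holdout values of the same metric side by side. The numbers and the factor-of-two rule below are purely hypothetical, for illustration only.

# Hypothetical metric values, for illustration only
in_sample_mape = 6.0   # percent
holdout_mape = 22.0    # percent

# A much larger out-of-sample error is a classic sign of overfitting
if holdout_mape > 2 * in_sample_mape:
    print("Holdout error is far worse than in-sample error: possible overfitting")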
Cross-Validated Holdout Accuracy Test¶
A cross-validated version of the holdout accuracy test. The generalization performance of the model is tested more rigorously by splitting the data into multiple train/test "folds" and averaging over the results.
We use the leave-future-out (LFO) cross-validation strategy, which is widely used for out-of-sample testing of time series models. For a dataset with time indices 0, ..., T, this involves fitting on [0, ..., T-X] and testing on [T-X+1, ..., T-X+k], then shrinking X to grow the training set while keeping the test set size k fixed. (N.B. X and k must be strictly positive integers.)
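A minimal sketch of how leave-future-out folds can be generated and averaged follows. It is illustrative only; the fold count and window sizes are placeholders, not mmm-eval's defaults.

import numpy as np

def lfo_folds(n_obs, n_folds, test_size):
    # Yield (train_index, test_index) pairs with an expanding training window
    # and a fixed-size test window that rolls forward through time
    for i in range(n_folds):
        train_end = n_obs - (n_folds - i) * test_size
        yield np.arange(0, train_end), np.arange(train_end, train_end + test_size)

# Example: 104 weekly observations, 3 folds, 8-week test windows
fold_scores = []
for train_idx, test_idx in lfo_folds(n_obs=104, n_folds=3, test_size=8):
    # Fit the model on train_idx, predict on test_idx, then score the fold;
    # the placeholder below stands in for e.g. the fold's MAPE
    fold_scores.append(float(len(test_idx)))
print(np.mean(fold_scores))  # metrics are averaged over the folds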
Process¶
- Time Series Split: Data is split chronologically
- Expanding Window: Model is trained on a progressively longer training window
- Out-of-Sample Prediction: Predictions made on held-out data
- Performance Metrics: Calculated on out-of-sample predictions
Metrics¶
- MAPE: Out-of-sample prediction accuracy
- SMAPE: Out-of-sample symmetric prediction accuracy
- R-squared: Out-of-sample explanatory power
Interpretation¶
- Consistent performance: Similar in-sample and out-of-sample metrics
- Overfitting: Much better in-sample than out-of-sample performance
- Underfitting: Poor performance on both in-sample and out-of-sample data
Refresh Stability Tests¶
The refresh stability test evaluates how much media ROI estimates change as more data is added to the model.
NOTE: we define ROI as 100 * (R/S - 1), where R is estimated revenue and S is paid media spend for a particular media channel. Under this convention, an ROI of 0% means $1 of spend yields a $1 return, an ROI of 100% means $1 of spend yields a $2 return, and so on.
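A worked example of this convention (illustrative numbers only):

# ROI convention: 100 * (R / S - 1)
estimated_revenue = 150_000.0  # R: revenue attributed to the channel
media_spend = 100_000.0        # S: paid media spend for the channel

roi = 100 * (estimated_revenue / media_spend - 1)
print(roi)  # 50.0 -> every $1 of spend yields $1.50 of return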
Process¶
- Baseline Model: Train on initial dataset
- Incremental Updates: Add new data periods
- Parameter Comparison: Compare parameter estimates
- Stability Metrics: Calculate change percentages
Metrics¶
- Mean Percentage Change: Average change in parameter estimates
- Channel Stability: Stability of media channel parameters
- Intercept Stability: Stability of baseline parameters
- Seasonality Stability: Stability of seasonal components
Interpretation¶
- Low percentage changes: Stable model parameters
- High percentage changes: Unstable model (may need more data)
- Channel-specific stability: Some channels more stable than others
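A minimal sketch of the stability comparison is shown below. The channel names and ROI values are made up for illustration; this is not mmm-eval's internal code.

import numpy as np

# ROI estimates (%) from the baseline fit and from a refresh with extra data
baseline_roi = {"tv": 80.0, "search": 45.0, "social": 20.0}
refreshed_roi = {"tv": 85.0, "search": 60.0, "social": 19.0}

# Absolute percentage change per channel, relative to the baseline estimate
pct_change = {
    ch: 100 * abs(refreshed_roi[ch] - baseline_roi[ch]) / abs(baseline_roi[ch])
    for ch in baseline_roi
}
print(pct_change)                          # channel-level stability
print(np.mean(list(pct_change.values())))  # mean percentage change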
Robustness Tests¶
The robustness test evaluates how sensitive the model is to changes in the input data.
Perturbation Test¶
The perturbation test evaluates how sensitive the model is to noise in the input data by adding Gaussian noise to media spend data and measuring the change in ROI estimates.
Process¶
- Baseline Model: Train on original data
- Noise Addition: Add Gaussian noise to primary regressor columns (usually spend or impressions, depending on the model spec)
- Retrain Model: Fit model on noisy data
- Compare estimated impacts: Compare ROI estimates across the two models
- Sensitivity Metrics: Calculate percentage changes
Metrics¶
- Percentage Change: Change in ROI estimates for each channel
- Channel Sensitivity: Which channels are most sensitive to noise
Interpretation¶
- Low percentage changes: Robust model (good)
- High percentage changes: Sensitive model (may need more data or regularization)
- Channel-specific sensitivity: Some channels are more sensitive to noise than others
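A minimal sketch of the noise-addition step, assuming a pandas DataFrame with spend columns, follows. The column names and the 5% noise scale are placeholders, not mmm-eval's actual settings.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

df = pd.DataFrame({
    "tv_spend": [100.0, 120.0, 90.0, 110.0],
    "search_spend": [50.0, 55.0, 60.0, 45.0],
})

# Add zero-mean Gaussian noise to each primary regressor column,
# scaled relative to that column's variability
noisy = df.copy()
for col in ["tv_spend", "search_spend"]:
    noise = rng.normal(loc=0.0, scale=0.05 * df[col].std(), size=len(df))
    noisy[col] = (df[col] + noise).clip(lower=0)  # spend cannot be negative

# Refit the model on `noisy` and compare channel ROI estimates to the baseline fit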
Placebo Test¶
The placebo test (also known as a falsifiability test) evaluates whether the model can detect spurious correlations by introducing a randomly shuffled media channel and checking if the model assigns a low ROI to this spurious feature.
Process¶
- Channel Selection: Randomly select an existing media channel
- Data Shuffling: Randomly permute the rows of the selected channel's data to break time correlation with the target variable
- Model Training: Fit the model with the shuffled channel added
- ROI Assessment: Record the estimated ROI for the shuffled channel
- Validation: Check if the shuffled channel ROI is appropriately low
Metrics¶
- Shuffled Channel ROI: Estimated ROI for the spurious channel
Interpretation¶
- Low ROI (≤ -50%): Model correctly identifies the spurious correlation (good)
- ROI above the threshold (> -50%): Model may be overfitting or picking up spurious patterns (concerning)
- Test Skipped: The test is skipped for models using the reach and frequency regressor type, which is not supported
Purpose¶
This test helps validate that the model is not simply memorizing patterns in the data or detecting spurious correlations. A well-performing model should assign a low ROI to a channel that has no meaningful relationship with the target variable.
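A minimal sketch of the shuffling step, assuming a pandas DataFrame of channel spend and the target variable (column names are placeholders, not mmm-eval's schema):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

df = pd.DataFrame({
    "tv_spend": [100.0, 120.0, 90.0, 110.0, 95.0],
    "revenue": [1000.0, 1150.0, 900.0, 1080.0, 960.0],
})

# Create a placebo channel by randomly permuting an existing channel's rows,
# breaking any temporal alignment with the target variable
df["placebo_spend"] = rng.permutation(df["tv_spend"].to_numpy())

# Refit the model with `placebo_spend` included; a well-behaved model should
# assign it a very low ROI (<= -50% under the convention above)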
Running Tests¶
All Tests (Default)¶
mmm-eval --input-data-path data.csv --framework pymc-marketing --config-path config.json --output-path results/
Specific Tests¶
mmm-eval --input-data-path data.csv --framework pymc-marketing --config-path config.json --output-path results/ --test-names holdout_accuracy in_sample_accuracy cross_validation
Available Test Names¶
- holdout_accuracy: Holdout accuracy tests only
- in_sample_accuracy: In-sample accuracy tests only
- cross_validation: Cross-validation tests only
- refresh_stability: Refresh stability tests only
- perturbation: Perturbation tests only
- placebo: Placebo tests only
Test Configuration¶
If you'd like to modify the test pass/fail thresholds, you can fork the repository and edit the thresholds in mmm_eval/metrics/threshold_constants.py.
Interpreting Results¶
Good Model Indicators¶
- Holdout Accuracy: MAPE < 15%, SMAPE < 15%, R-squared > 0.8
- In-Sample Accuracy: MAPE < 10%, SMAPE < 10%, R-squared > 0.9
- Cross-Validation: Out-of-sample MAPE/SMAPE similar to in-sample
- Refresh Stability: Parameter changes < 10%
- Perturbation: ROI changes < 5%
- Placebo: Shuffled channel ROI ≤ -50%
Warning Signs¶
- Poor Performance: High MAPE/SMAPE or low R-squared
- Overfitting: Much better in-sample than holdout performance
- Unstable Model: Large parameter changes
- Data Issues: Missing values or extreme outliers
Best Practices¶
Test Selection¶
- Start with holdout accuracy: Always run holdout accuracy tests first
- Add in-sample accuracy: To assess model fit and identify overfitting
- Include cross-validation: For generalization assessment
- Add stability tests: For production models
- Include robustness tests: To evaluate model sensitivity to data changes
Result Analysis¶
- Compare frameworks: Run same tests on different frameworks
- Track over time: Monitor performance as data grows
- Set thresholds: Define acceptable performance levels
- Document decisions: Record test choices and rationale
Troubleshooting¶
Common Issues¶
- Slow tests: Reduce data size or simplify model
- Memory errors: Use smaller datasets or more efficient settings
- Convergence issues: Check model configuration
- Inconsistent results: Verify random seed settings
Getting Help¶
- Check Configuration for test settings
- Review Examples for similar cases
- Join Discussions for support