Tests¶
mmm-eval provides a comprehensive suite of validation tests to evaluate MMM performance. This guide explains each test and how to interpret the results.
Overview¶
mmm-eval includes four main types of validation tests:
- Accuracy Tests: Measure how well the model fits the data
- Cross-Validation Tests: Assess model generalization
- Refresh Stability Tests: Evaluate model stability over time
- Performance Tests: Measure computational efficiency
Accuracy Tests¶
Accuracy tests evaluate how well the model fits the training data.
Metrics¶
- MAPE (Mean Absolute Percentage Error): Average absolute percentage error between predictions and actuals
- RMSE (Root Mean Square Error): Square root of the mean squared prediction error
- R-squared: Proportion of variance explained by the model
- MAE (Mean Absolute Error): Average absolute prediction error
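For reference, these metrics can be computed from actual and predicted values as in the sketch below. This shows the standard definitions only; it is not mmm-eval's internal implementation, and the function name is hypothetical.

```python
import numpy as np

def accuracy_metrics(actual: np.ndarray, predicted: np.ndarray) -> dict:
    """Standard definitions of the accuracy metrics listed above (illustrative only)."""
    errors = actual - predicted
    mape = np.mean(np.abs(errors / actual)) * 100      # mean absolute percentage error
    rmse = np.sqrt(np.mean(errors ** 2))               # root mean square error
    mae = np.mean(np.abs(errors))                      # mean absolute error
    ss_res = np.sum(errors ** 2)                       # residual sum of squares
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)   # total sum of squares
    return {"mape": mape, "rmse": rmse, "mae": mae, "r_squared": 1 - ss_res / ss_tot}
```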
Interpretation¶
- Lower MAPE/RMSE/MAE: Better model performance
- Higher R-squared: Better model fit (0-1 scale)
Cross-Validation Tests¶
Cross-validation tests assess how well the model generalizes to unseen data.
Process¶
- Time Series Split: Data is split chronologically
- Rolling Window: Model is trained on expanding windows
- Out-of-Sample Prediction: Predictions made on held-out data
- Performance Metrics: Calculated on out-of-sample predictions
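mmm-eval performs this splitting internally. The sketch below only illustrates the expanding-window idea, using scikit-learn's TimeSeriesSplit on a hypothetical two years of weekly data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 104 weeks of observations, ordered chronologically
weeks = np.arange(104)

# Each fold trains on an expanding window and predicts the following held-out period
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(weeks)):
    print(f"fold {fold}: train weeks {train_idx[0]}-{train_idx[-1]}, "
          f"test weeks {test_idx[0]}-{test_idx[-1]}")
```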
Metrics¶
- MAPE: Out-of-sample prediction accuracy
- RMSE: Out-of-sample error magnitude
- R-squared: Out-of-sample explanatory power
- MAE: Out-of-sample absolute error
Interpretation¶
- Consistent performance: Similar in-sample and out-of-sample metrics
- Overfitting: Much better in-sample than out-of-sample performance
- Underfitting: Poor performance on both in-sample and out-of-sample data
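As a rough illustration of how to read the gap between in-sample and out-of-sample error (the 1.5x ratio and 30% cutoff below are arbitrary rules of thumb, not mmm-eval thresholds):

```python
def interpret_gap(in_sample_mape: float, out_of_sample_mape: float) -> str:
    """Heuristic reading of the in-sample vs. out-of-sample MAPE gap (illustrative only)."""
    if out_of_sample_mape > 1.5 * in_sample_mape:
        return "possible overfitting: out-of-sample error is much worse than in-sample"
    if in_sample_mape > 30 and out_of_sample_mape > 30:
        return "possible underfitting: error is high both in- and out-of-sample"
    return "consistent performance across in-sample and out-of-sample data"

print(interpret_gap(in_sample_mape=5.0, out_of_sample_mape=22.0))  # flags overfitting
```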
Refresh Stability Tests¶
Refresh stability tests evaluate how model parameters change when new data is added.
Process¶
- Baseline Model: Train on initial dataset
- Incremental Updates: Add new data periods
- Parameter Comparison: Compare parameter estimates
- Stability Metrics: Calculate change percentages
Metrics¶
- Mean Percentage Change: Average change in parameter estimates
- Channel Stability: Stability of media channel parameters
- Intercept Stability: Stability of baseline parameters
- Seasonality Stability: Stability of seasonal components
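Conceptually, these stability metrics boil down to percentage changes in parameter estimates between the baseline and refreshed fits, as in the simplified sketch below (the parameter names are hypothetical, and the exact aggregation used by mmm-eval may differ):

```python
import numpy as np

def mean_percentage_change(baseline: dict, refreshed: dict) -> float:
    """Average absolute percentage change in parameter estimates after a data refresh."""
    changes = [
        abs(refreshed[name] - value) / abs(value) * 100
        for name, value in baseline.items()
    ]
    return float(np.mean(changes))

# Hypothetical channel coefficients before and after adding a new data period
baseline = {"tv": 0.42, "search": 0.31, "social": 0.18}
refreshed = {"tv": 0.45, "search": 0.30, "social": 0.21}
print(f"{mean_percentage_change(baseline, refreshed):.1f}%")  # ~9% average change
```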
Interpretation¶
- Low percentage changes: Stable model parameters
- High percentage changes: Unstable model (may need more data)
- Channel-specific stability: Some channels more stable than others
Performance Tests¶
Performance tests measure computational efficiency and resource usage.
Metrics¶
- Training Time: Time to fit the model
- Memory Usage: Peak memory consumption
- Prediction Time: Time to generate predictions
- Convergence: Number of iterations to convergence
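For intuition, training time and peak memory can be measured in plain Python as sketched below. mmm-eval collects these figures itself when the performance test runs, so this is illustrative only.

```python
import time
import tracemalloc

def profile_fit(fit_fn, *args, **kwargs):
    """Time a model-fitting call and record its peak memory usage."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fit_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak_bytes / 1e6  # result, seconds, megabytes

# Stand-in workload in place of an actual MMM fit
_, seconds, peak_mb = profile_fit(lambda: sum([x * x for x in range(10**6)]))
print(f"training time: {seconds:.2f}s, peak memory: {peak_mb:.1f} MB")
```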
Interpretation¶
- Faster training: More efficient model
- Lower memory: Better resource utilization
- Faster predictions: Better for real-time applications
- Fewer iterations: Better convergence properties
Running Tests¶
All Tests (Default)¶
```bash
mmm-eval --input-data-path data.csv --framework pymc-marketing --config-path config.json --output-path results/
```
Specific Tests¶
```bash
mmm-eval --input-data-path data.csv --framework pymc-marketing --config-path config.json --output-path results/ --test-names accuracy cross_validation
```
Available Test Names¶
- accuracy: Accuracy tests only
- cross_validation: Cross-validation tests only
- refresh_stability: Refresh stability tests only
- performance: Performance tests only
Test Configuration¶
If you'd like to modify the test pass/fail thresholds, you can fork the repository and modify the thresholds in mmm_eval/metrics/threshold_constants.py.
Interpreting Results¶
Good Model Indicators¶
- Accuracy: MAPE < 15%, R-squared > 0.8
- Cross-Validation: Out-of-sample MAPE similar to in-sample
- Stability: Parameter changes < 10%
- Performance: Reasonable training times
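One simple way to apply these rules of thumb is to check a results summary against them programmatically, as in the sketch below (the dictionary keys are hypothetical, not mmm-eval's output schema):

```python
# Hypothetical summary of one evaluation run; keys are illustrative only
results = {"mape": 12.3, "r_squared": 0.86, "oos_mape": 14.1, "param_change_pct": 7.5}

checks = {
    "accuracy: MAPE < 15%": results["mape"] < 15,
    "accuracy: R-squared > 0.8": results["r_squared"] > 0.8,
    "cross-validation: out-of-sample MAPE close to in-sample": results["oos_mape"] < 1.5 * results["mape"],
    "stability: parameter changes < 10%": results["param_change_pct"] < 10,
}

for rule, passed in checks.items():
    print(f"{'PASS' if passed else 'WARN'}  {rule}")
```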
Warning Signs¶
- Overfitting: Much better in-sample than out-of-sample performance
- Instability: Large parameter changes with new data
- Poor Performance: High MAPE or low R-squared
- Slow Training: Excessive computation time
Best Practices¶
Test Selection¶
- Start with accuracy: Always run accuracy tests first
- Add cross-validation: For generalization assessment
- Include stability: For production models
- Monitor performance: For computational constraints
Result Analysis¶
- Compare frameworks: Run same tests on different frameworks
- Track over time: Monitor performance as data grows
- Set thresholds: Define acceptable performance levels
- Document decisions: Record test choices and rationale
Troubleshooting¶
Common Issues¶
- Slow tests: Reduce data size or simplify model
- Memory errors: Use smaller datasets or more efficient settings
- Convergence issues: Check model configuration
- Inconsistent results: Verify random seed settings
Getting Help¶
- Check Configuration for test settings
- Review Examples for similar cases
- Join Discussions for support