Evaluation is an important topic in every machine learning project. Offline metrics are computed on historical data and are supposed to indicate how our model will perform on real data. However, we often see a discrepancy between offline and online performance.
In Predictive model performance: Offline and online evaluations, the authors investigate offline vs online model performance on advertisement data from the Bing search engine.
- Evaluation metrics are important since they guide what the model optimizes for. Tuning on the wrong metrics might give misleading offline results that surprise us in production.
- AUC is a reliable metric for assessing a model's classification performance.
- Offline evaluation using AUC doesn’t always correlate with online evaluation via A/B tests.
- The authors propose the usage of a simulation metric that simulates user behavior based on historical logs, which works better in the online evaluation.
What is an offline evaluation?
- In typical ML projects, we split our dataset into train/test sets.
- Models are trained on the train set.
- Evaluation is done on the test set.
What is an online evaluation?
When our model is in production, we perform an A/B test. Typically, it has 2 variants:
- control group with our existing model
- test group with the new model

Live traffic is split between the two groups, and metrics like conversion rate, revenue per visitor, etc. are measured. If the new model’s improvement is statistically significant, it’s selected for launch.
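As an illustration of the "statistically significant" check, here is a minimal sketch using a two-proportion z-test on conversion rates. The test choice, the function name `two_proportion_z_test`, and the traffic numbers are assumptions for illustration; the paper does not prescribe a particular significance test.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and visitors in the control group.
    conv_b / n_b: conversions and visitors in the test group.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up numbers: 5.0% conversion in control vs 5.7% in test.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=570, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

With a conventional 0.05 threshold, a p-value below it would count as significant; the threshold itself is a choice the experimenter makes up front.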
Offline performance doesn’t always correlate to online performance due to the dynamic nature of the latter.
- The paper focuses on metrics used for click prediction problems that most search engines like Google, Bing, etc. face.
- A click prediction problem estimates the CTRs (click-through rates) of ads given a query.
- This is treated as a binary classification problem.
We’ll review some important evaluation metrics for the use case.
AUC (Area under curve)
- Let’s say that we have a binary classifier that predicts a probability p for an event to occur. Then, 1-p is the probability that the event doesn’t occur. We need a threshold to determine the class membership. AUC provides a single score that tells us how good a model is across all possible thresholds.
- AUC is computed from a ROC (Receiver Operating Characteristic) curve.
- ROC curve = a graphical representation of TPR (true positive rate) as a function of FPR (false positive rate) of a binary classifier across different thresholds.
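To make the definition concrete, here is a minimal sketch that builds the ROC points by sweeping the threshold over every distinct score and then integrates with the trapezoidal rule. The function names and the toy labels/scores are made up for illustration.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the threshold downwards."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve through the points, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

labels = [1, 0, 1, 1, 0, 0]               # 1 = click, 0 = no click
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # predicted click probabilities
toy_auc = auc(roc_points(labels, scores))
print(round(toy_auc, 4))
```

Here 7 of the 9 (positive, negative) pairs are ordered correctly, so the area comes out to 7/9.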
RIG (Relative Information Gain)
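$$\text{RIG} = 1 - \frac{\text{log\_loss}}{\text{entropy}(y)}$$

where

$$\text{log\_loss} = -\left[\, c \log p + (1 - c) \log(1 - p) \,\right]$$

$$\text{entropy}(y) = -\left[\, y \log y + (1 - y) \log(1 - y) \,\right]$$

- c = observed click
- p = probability of a click
- y = CTR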
Higher is better.
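A minimal sketch of the RIG computation follows. It assumes the log loss is averaged over all impressions and that y is the empirical CTR of the same data; the function names, the clipping constant `eps`, and the toy data are illustrative choices.

```python
import math

def log_loss(clicks, probs, eps=1e-12):
    """Average log loss over impressions (averaging is an assumption here)."""
    total = 0.0
    for c, p in zip(clicks, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(c * math.log(p) + (1 - c) * math.log(1 - p))
    return total / len(clicks)

def entropy(ctr):
    """Entropy of a Bernoulli variable with rate ctr (requires 0 < ctr < 1)."""
    return -(ctr * math.log(ctr) + (1 - ctr) * math.log(1 - ctr))

def rig(clicks, probs):
    ctr = sum(clicks) / len(clicks)
    return 1 - log_loss(clicks, probs) / entropy(ctr)

clicks = [1, 0, 0, 1, 0, 0, 0, 0]                      # empirical CTR = 0.25
informed = [0.8, 0.1, 0.1, 0.7, 0.1, 0.2, 0.1, 0.1]    # model with real signal
constant = [0.25] * len(clicks)                        # always predicts the CTR

rig_informed = rig(clicks, informed)
rig_constant = rig(clicks, constant)
print(round(rig_informed, 3), round(rig_constant, 3))
```

Always predicting the CTR yields a RIG of 0, so RIG can be read as the improvement over that trivial baseline.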
Prediction Error (PE)
PE = avg(p)/y - 1
- PE = 0 when avg(p) exactly matches the click-through rate.
- It could also be 0 when over-estimates and under-estimates of the CTR cancel out, as long as the average still matches the CTR.
- For this reason, it’s not a reliable metric on its own.
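The cancellation effect is easy to see on a toy example. The sketch below uses made-up data and a hypothetical helper `prediction_error`; both prediction sets have a PE of 0, although the second one is badly wrong on every impression.

```python
def prediction_error(probs, clicks):
    """PE = avg(p) / y - 1, where y is the empirical CTR."""
    ctr = sum(clicks) / len(clicks)
    return (sum(probs) / len(probs)) / ctr - 1

clicks = [1, 0, 0, 0]                 # CTR = 0.25

calibrated = [0.25, 0.25, 0.25, 0.25]
cancelling = [0.0, 0.5, 0.5, 0.0]     # under-estimates the click, over-estimates two non-clicks

pe_calibrated = prediction_error(calibrated, clicks)
pe_cancelling = prediction_error(cancelling, clicks)
print(pe_calibrated, pe_cancelling)
```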
This part is really important. It teaches us a way to simulate different model performances offline without having to run expensive A/B tests.
- A/B tests run with a fixed set of model parameters.
- It could be expensive to run multiple experiments with different model parameters.
- It could also hurt the user experience and lose revenue if the new model underperforms.
- The paper proposes simulating user behavior offline, also known as auction simulation.
- Auction simulation reruns ad auctions offline for a given query and selects a set of ads based on the new model prediction scores.
- User clicks are estimated in the following way:
  - if the (user, ad) pair is found in the logs:
    - if it’s in the same position in history as in the simulation, use the historic CTR directly as the expected CTR
    - if it’s not in the same position, the expected CTR is calibrated based on the position.
  - if the (user, ad) pair is not found, the average CTR is used as the expected CTR.
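The click-estimation rules above can be sketched roughly as follows. This is an illustration under several assumptions, not the paper's implementation: `history`, `position_factor`, and `average_ctr` are hypothetical inputs, the position calibration is assumed to be a simple ratio of per-position factors, and ads are ranked by model score alone (a real auction would also involve bids).

```python
def expected_ctr(user, ad, sim_position, history, position_factor, average_ctr):
    """Estimate the CTR of showing `ad` to `user` at `sim_position`.

    history: dict mapping (user, ad) -> (historic position, historic CTR).
    position_factor: dict mapping position -> relative click propensity (assumed).
    """
    record = history.get((user, ad))
    if record is None:
        return average_ctr                    # pair never seen: fall back to the average
    hist_position, hist_ctr = record
    if hist_position == sim_position:
        return hist_ctr                       # same position: use the historic CTR directly
    # Different position: rescale by the ratio of position factors (assumed calibration).
    return hist_ctr * position_factor[sim_position] / position_factor[hist_position]

def simulate_auction(user, scores, k, history, position_factor, average_ctr):
    """Pick the top-k ads by the new model's score and sum their expected CTRs."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(
        expected_ctr(user, ad, pos, history, position_factor, average_ctr)
        for pos, ad in enumerate(ranked, start=1)
    )

history = {("u1", "adA"): (1, 0.10), ("u1", "adB"): (2, 0.04)}
position_factor = {1: 1.0, 2: 0.5}
scores = {"adA": 0.3, "adB": 0.6, "adC": 0.1}   # new model's predictions for one query

expected_clicks = simulate_auction("u1", scores, 2, history, position_factor, 0.02)
print(round(expected_clicks, 4))
```

In this toy run the new model swaps the two logged ads, so both historic CTRs get recalibrated for their new positions before being summed.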
- AUC ignores the predicted probability values. It depends only on how the scores rank the examples, not on the scores themselves, and different rankings can produce similar AUC scores.
- AUC summarizes the test performance over the entire range of the ROC space, even regions where one would rarely operate. A higher AUC doesn’t necessarily mean a better ranking.
- AUC weights false positives and false negatives equally. In real life, the cost of not showing a relevant ad (false negative) is way more than showing a sub-optimal ad (false positive).
- AUC is highly dependent on the underlying data distribution.

- RIG is highly sensitive to the underlying data distribution.
- We can’t judge a model using RIG alone.
- We can only compare the relative performance of different models trained/tested on the same data.
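To illustrate the difference between the two metrics, the sketch below scores the same toy data twice: once with the original predictions and once with every prediction inflated by a monotonic transformation. The ranking is unchanged, so AUC is identical, while RIG drops sharply. The data, the helper names, and the averaging of log loss over impressions are assumptions for illustration.

```python
import math

def pairwise_auc(clicks, probs):
    """AUC as the fraction of (click, no-click) pairs ranked correctly (ties count half)."""
    pos = [p for c, p in zip(clicks, probs) if c == 1]
    neg = [p for c, p in zip(clicks, probs) if c == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0 for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

def rig(clicks, probs):
    """RIG = 1 - log_loss / entropy(CTR), with log loss averaged over impressions."""
    ctr = sum(clicks) / len(clicks)
    loss = -sum(c * math.log(p) + (1 - c) * math.log(1 - p)
                for c, p in zip(clicks, probs)) / len(clicks)
    ent = -(ctr * math.log(ctr) + (1 - ctr) * math.log(1 - ctr))
    return 1 - loss / ent

clicks = [1, 0, 1, 0, 0, 1, 0, 0]
probs = [0.9, 0.8, 0.7, 0.2, 0.4, 0.6, 0.1, 0.2]
inflated = [0.5 + p / 2 for p in probs]   # same ordering, every score over-estimated

auc_original, auc_inflated = pairwise_auc(clicks, probs), pairwise_auc(clicks, inflated)
rig_original, rig_inflated = rig(clicks, probs), rig(clicks, inflated)
print(auc_original, auc_inflated)
print(round(rig_original, 3), round(rig_inflated, 3))
```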
Offline vs Online discrepancy
The authors compare 2 models
- model 1 (baseline): tuned on offline metrics like AUC & RIG
- model 2 (test): tuned on the simulation metric
The finding: a model can perform well on offline metrics like AUC & RIG and still show a significant dip on online metrics.
Why do we see this?
- A model tuned on offline metrics like AUC/RIG tends to over-estimate the probability scores at the lower end of the score range.
- Over-estimation at the higher end of the score range doesn’t matter much, since those ads will be selected by either model.
- Over-estimation at the lower end of the score range is bad since irrelevant ads are more likely to be shown in that case.
- Offline metrics like AUC/RIG provide an overall score based on the entire range of probability scores - they’re not able to capture the intended effect.
- Tuning a model on the simulation metric correlates better with online performance as measured by A/B tests.