One task I've often faced is converting a 3-dimensional ndarray to a pandas DataFrame. In this post, I'll share my preferred technique.

For the purpose of this exercise, I'll generate dummy sales data for a retail company. The dimensions include products, locations, and sales.

```
import pandas as pd
import numpy as np
```

Let's start with 1d data. What if we only had sales info for all products and locations?

```
arr_1d = np.random.randint(
    low=1,
    high=10,
    size=3,
)
print(arr_1d)
```

```
[9 3 6]
```

That's easy. Ideally, 1-d information should be represented as a Series, but a single-column DataFrame works too.

```
df_1d = pd.DataFrame(arr_1d, columns=["sales"])
print(df_1d)
```

```
   sales
0      9
1      3
2      6
```
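For comparison, here is a minimal sketch of the Series route mentioned above. The values are hard-coded stand-ins for the random draw, so they are an assumption rather than output from the original run.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for the random draw above.
arr_1d = np.array([9, 3, 6])

# A Series holds one labelled column of values plus its index.
sales = pd.Series(arr_1d, name="sales")

# to_frame() promotes the Series to a single-column DataFrame when needed.
df_1d = sales.to_frame()
print(df_1d)
```

Either form carries the same data; the Series is just the lighter container when there is only one column.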

Let's move on to 2 dimensions. Now, we have data corresponding to different products.

```
arr_2d = np.random.randint(
    low=1,
    high=10,
    size=(3, 2),
)
print(arr_2d)
```

```
[[4 6]
 [8 1]
 [2 7]]
```

Pandas DataFrame can handle 2-D ndarrays out of the box.

```
df_2d = pd.DataFrame(arr_2d, columns=["product", "sales"]).set_index("product")
print(df_2d)
```

```
         sales
product
4            6
8            1
2            7
```

Now, what if we have an ndarray corresponding to all products for several locations?

```
# a 3-d array: 5 locations x 3 products x 1 sales value
arr_3d = np.random.randint(
    low=1,
    high=10,
    size=(5, 3, 1),
)
print(arr_3d)
```

```
[[[9]
  [6]
  [2]]

 [[1]
  [4]
  [4]]

 [[2]
  [5]
  [6]]

 [[9]
  [6]
  [5]]

 [[1]
  [6]
  [1]]]
```

```
# the following raises ValueError
# pandas DataFrame expects a 2-d input
df_3d = pd.DataFrame(arr_3d, columns=["location", "product", "sales"])
```

pandas won't work out of the box. It cannot handle more than 2 dimensions, so it raises a `ValueError`.

```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/jq/ksxbjg7d58g9v9rrcl0f38380000gn/T/ipykernel_12628/1531564731.py in <module>
1 # the following raises ValueError
2 # pandas DataFrame expects a 2-d input
----> 3 df_3d = pd.DataFrame(arr_3d, columns=["location", "product", "sales"])
.
.
.
ValueError: Must pass 2-d input. shape=(5, 3, 1)
```

The solution? `MultiIndex`.

Assuming that the ndarray is ordered by location/products, we could prepare a multi-index, flatten our ndarray and let Pandas reshape it according to the provided index.

Sweet!

```
index = pd.MultiIndex.from_product(
    [range(dim) for dim in arr_3d.shape[:-1]],
    names=["location", "product"],
)
df_3d = pd.DataFrame(arr_3d.flatten(), index=index, columns=["sales"])
print(df_3d)
```

```
                  sales
location product
0        0            9
         1            6
         2            2
1        0            1
         1            4
         2            4
2        0            2
         1            5
         2            6
3        0            9
         1            6
         2            5
4        0            1
         1            6
         2            1
```
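The approach relies on `flatten()` walking the array in the same order that `MultiIndex.from_product` generates its labels (last axis varying fastest). Here is a small sanity check of that assumption, using a seeded generator of my own choosing so the run is repeatable:

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption for reproducibility, not part of the post).
rng = np.random.default_rng(0)
arr_3d = rng.integers(low=1, high=10, size=(5, 3, 1))

index = pd.MultiIndex.from_product(
    [range(dim) for dim in arr_3d.shape[:-1]],
    names=["location", "product"],
)
df_3d = pd.DataFrame(arr_3d.flatten(), index=index, columns=["sales"])

# Every (location, product) label should point at the matching array cell.
for location, product in index:
    assert df_3d.loc[(location, product), "sales"] == arr_3d[location, product, 0]
```

If the array were stored in a different axis order, the labels would silently point at the wrong cells, so a check like this is cheap insurance.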

So far, we have a single sales value for each location and product. What if the final `sales` dimension includes sales for yesterday/today (or for every month, every week, etc.)?

```
arr_3d = np.random.randint(
    low=1,
    high=10,
    size=(5, 3, 2),
)
print(arr_3d)
index = pd.MultiIndex.from_product(
    [range(dim) for dim in arr_3d.shape],
    names=["location", "product", "sales"],
)
```

```
[[[1 9]
  [8 6]
  [9 4]]

 [[4 9]
  [3 9]
  [1 8]]

 [[5 2]
  [9 9]
  [1 9]]

 [[4 5]
  [7 4]
  [7 7]]

 [[6 9]
  [4 2]
  [7 1]]]
```

No major changes. Pandas should handle it just like before. Just unstack the sales dimension and rename the columns for readability.

```
df_3d = pd.DataFrame(
    arr_3d.flatten(),
    index=index,
    columns=["sales"],
)
df_3d = df_3d.unstack(-1).rename(
    columns={0: "yesterday", 1: "today"},
)
print(df_3d)
```

```
                      sales
sales             yesterday  today
location product
0        0                1      9
         1                8      6
         2                9      4
1        0                4      9
         1                3      9
         2                1      8
2        0                5      2
         1                9      9
         2                1      9
3        0                4      5
         1                7      4
         2                7      7
4        0                6      9
         1                4      2
         2                7      1
```
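Going the other way is mostly a reshape. This is a sketch under the assumption that the frame was built exactly as above, so rows are ordered by (location, product) and columns follow the last axis. The seeded generator is my own addition for repeatability.

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption for reproducibility, not part of the post).
rng = np.random.default_rng(0)
arr_3d = rng.integers(low=1, high=10, size=(5, 3, 2))

index = pd.MultiIndex.from_product(
    [range(dim) for dim in arr_3d.shape],
    names=["location", "product", "sales"],
)
df_3d = pd.DataFrame(arr_3d.flatten(), index=index, columns=["sales"]).unstack(-1)

# One row per (location, product), one column per entry of the last axis.
n_locations = df_3d.index.get_level_values("location").nunique()
n_products = df_3d.index.get_level_values("product").nunique()
restored = df_3d.to_numpy().reshape(n_locations, n_products, -1)

assert np.array_equal(restored, arr_3d)
```

The round trip only works if the frame is complete and sorted; missing (location, product) pairs would break the reshape.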

Do you know of other ways to switch between ndarray and DataFrame? Comment below :)

“When George Lucas was a teenager, he almost died in a car accident. He decided “every day now is an extra day,” dedicated himself to film, and went on to direct Star Wars.”

“I realized I was going to die,” he says. “And when that gets into your mind . . . it utterly changed me . . . I thought, I’m not going to sit here and wait for things to happen, I’m going to make them happen, and if people think I’m an idiot I don’t care.”

“To all viewers but yourself, what matters is the product: the finished artwork. To you, and you alone, what matters is the process: the experience of shaping the artwork.”

“By letting go of our egos and sharing our process, we allow for the possibility of people having an ongoing connection with us and our work, which helps us move more of our product”

“We’re not all artists or astronauts. A lot of us go about our work and feel like we have nothing to show for it at the end of the day.”

“whatever the nature of your work, there is an art to what you do, and there are people who would be interested in that art, if only you presented it to them in the right way.”

“sharing your process might actually be most valuable if the products of your work aren’t easily shared, if you’re still in the apprentice stage of your work, if you can’t just slap up a portfolio and call it a day, or if your process doesn’t necessarily lead to tangible finished products.”

“No one is going to give a damn about your résumé; they want to see what you have made with your own little fingers.”

“Once a day, after you’ve done your day’s work, go back to your documentation and find one little piece of your process that you can share”

“Where you are in your process will determine what that piece is. If you’re in the very early stages, share your influences and what’s inspiring you. If you’re in the middle of executing a project, write about your methods or share works in progress. If you’ve just completed a project, show the final product, share scraps from the cutting-room floor, or write about what you learned”

“The act of sharing is one of generosity—you’re putting something out there because you think it might be helpful or entertaining to someone on the other side of the screen.”

“Build a good name. Keep your name clean. Don’t make compromises. Don’t worry about making a bunch of money or being successful. Be concerned with doing good work . . . and if you can build a good name, eventually that name will be its own currency.”

“All it takes to uncover hidden gems is a clear eye, an open mind, and a willingness to search for inspiration in places other people aren’t willing or able to go.”

“We all love things that other people think are garbage. You have to have the courage to keep loving your garbage, because what makes us unique is the diversity and breadth of our influences, the unique ways in which we mix up the parts of culture others have deemed “high” and the “low.””

“When you find things you genuinely enjoy, don’t let anyone else make you feel bad about it. Don’t feel guilty about the pleasure you take in the things you enjoy. Celebrate them. When you share your taste and your influences, have the guts to own all of it. Don’t give in to the pressure to self-edit too much.”

“A character wants something, goes after it despite opposition (perhaps including his own doubts), and so arrives at a win, lose, or draw.”

“You get a great idea, you go through the hard work of executing the idea, and then you release the idea out into the world, coming to a win, lose, or draw. Sometimes the idea succeeds, sometimes it fails, and more often than not, it does nothing at all.”

“A good pitch is set up in three acts: The first act is the past, the second act is the present, and the third act is the future.” “The first act is where you’ve been—what you want, how you came to want it, and what you’ve done so far to get it.” “The second act is where you are now in your work and how you’ve worked hard and used up most of your resources. The third act is where you’re going, and how exactly the person you’re pitching can help you get there. ”

“Think about what you can share from your process that would inform the people you’re trying to reach. Have you learned a craft? What are your techniques? Are you skilled at using certain tools and materials? What kind of knowledge comes along with your job?”

“The minute you learn something, turn around and teach it to others. Share your reading list. Point to helpful reference materials. Create some tutorials and post them online. Use pictures, words, and video. Take people step-by-step through part of your process. As blogger Kathy Sierra says, “Make people better at something they want to be better at.””

“Teaching people doesn’t subtract value from what you do, it actually adds to it. When you teach someone how to do your work, you are, in effect, generating more interest in your work. People feel closer to your work because you’re letting them in on what you know.”

“Don’t talk to people you don’t want to talk to, and don’t talk about stuff you don’t want to talk about.”

“life is all about “who you know.” But who you know is largely dependent on who you are and what you do, and the people you know can’t do anything for you if you’re not doing good work.”

“Make stuff you love and talk about stuff you love and you’ll attract people who love that kind of stuff. It’s that simple.”

“Try new things. If an opportunity comes along that will allow you to do more of the kind of work you want to do, say Yes. If an opportunity comes along that would mean more money, but less of the kind of work you want to do, say No.”

“You avoid stalling out in your career by never losing momentum.”

“Instead of taking a break in between projects, waiting for feedback, and worrying about what’s next, use the end of one project to light up the next one. ”

“Just do the work that’s in front of you, and when it’s finished, ask yourself what you missed, what you could’ve done better, or what you couldn’t get to, and jump right into the next project.”

“When you throw out old work, what you’re really doing is making room for new work.”

“Look for something new to learn, and when you find it, dedicate yourself to learning it out in the open. Document your progress and share as you go so that others can learn along with you. Show your work, and when the right people show up, pay close attention to them, because they’ll have a lot to show you.”

Once you've got a hit, suddenly all the locked doors open wide. People love the hit so much that it seems to promote itself. Instead of trying to create demand, you're managing the huge demand.

Success comes from persistently improving and inventing, not from persistently doing what's not working.

We all have lots of ideas, creations, and projects. When you present one to the world, and it's not a hit, don't keep pushing it as-is. Instead, get back to improving and inventing.

Present each new idea or improvement to the world. If multiple people are saying, “Wow! Yes! I need this! I'd be happy to pay you to do this!” then you should probably do it. But if the response is anything less, don't pursue it.

Don't waste years fighting uphill battles against locked doors. Improve or invent until you get that huge response.

No “yes.” Either “HELL YEAH!” or “no.”

If you're not saying “HELL YEAH!” about something, say “no.”

When deciding whether to do something, if you feel anything less than “Wow! That would be amazing! Absolutely! Hell yeah!”—then say “no.”

We're all busy. We've all taken on too much. Saying yes to less is the way out.

Start now. No funding needed.

If you want to be useful, you can always start now, with only 1 percent of what you have in your grand vision. It'll be a humble prototype version of your grand vision, but you'll be in the game. You'll be ahead of the rest, because you actually started, while others are waiting for the finish line to magically appear at the starting line.

Starting small puts 100 percent of your energy on actually solving real problems for real people. It gives you a stronger foundation to grow from. It eliminates the friction of big infrastructure and gets right to the point. And it will let you change your plan in an instant, as you're working closely with those first customers telling you what they really need.

So no, your idea doesn't need funding to start. (You also don't need an MBA, a particular big client, a certain person's endorsement, a lucky break, or any other common excuse not to start.)

You need to confidently exclude people, and proudly say what you're not. By doing so, you will win the hearts of the people you want.

Are you helping people? Are they happy? Are you happy? Are you profitable? Isn't that enough? How do you grade yourself?

How do you grade yourself? It's important to know in advance, to make sure you're staying focused on what's honestly important to you, instead of doing what others think you should.

People fall in love with people who won't give them the time of day.

If you set up your business like you don't need the money, people are happier to pay you.

When someone's doing something for love, being generous instead of stingy, trusting instead of fearful, it triggers this law: We want to give to those who give.

It's another Tao of business: Set up your business like you don't need the money, and it'll likely come your way.

But no matter what business you're in, it's good to prepare for what would happen if business doubled.

You might get bigger faster and make millions if you outsourced everything to the experts. But what's the point of getting bigger and making millions? To be happy, right?

In the end, it's about what you want to be, not what you want to have.

To have something (a finished recording, a business, or millions of dollars) is the means, not the end. To be something (a good singer, a skilled entrepreneur, or just plain happy) is the real point.

When you sign up to run a marathon, you don't want a taxi to take you to the finish line.

To be a true business owner, make sure you could leave for a year, and when you came back, your business would be doing better than when you left.

Never forget that you can make your role anything you want it to be.

Anything you hate to do, someone else loves. So find that person and let him do it.

For me, I loved sitting alone and programming, writing, planning, and inventing. Thinking of ideas and making them happen. This makes me happy, not business deals or management. So I found someone who liked doing business deals and put him in charge of all that.

Delegate, but don't abdicate.

Just pay close attention to what excites you and what drains you. Pay close attention to when you're being the real you and when you're trying to impress an invisible jury.

Quicksort has 3 steps:

1. **Choose** a pivot element.
2. **Partition**: Put all elements smaller than the pivot in a smaller array (say, smaller_subarray). Put all elements greater than the pivot in another array (say, greater_subarray).
3. **Recurse & Merge**: Recursively sort the smaller and greater sub-arrays. Merge the sorted arrays with the pivot.

Let's start with the high-level function `quicksort()`:

```
import random
from typing import List


def quicksort(array: List[int]) -> List[int]:
    """Recursive implementation of quicksort.

    Args:
        array (List[int]): a list of integers

    Returns:
        List[int]: sorted array
    """
    # Base case
    if len(array) <= 1:
        return array
    # Step 1: choose a pivot
    pivot_idx = random.choice(range(len(array)))
    # Step 2: partition
    smaller_subarray, greater_subarray = partition(array, pivot_idx)
    # Step 3: Recurse on smaller and greater subarrays.
    return (
        quicksort(smaller_subarray)
        + [array[pivot_idx]]
        + quicksort(greater_subarray)
    )
```

Let's go line by line.

```
if len(array) <= 1:
    return array
```

This is the base case, and every recursive function must have one. If we have an empty or single-item array, there's nothing to sort. We simply return the array.

`pivot_idx = random.choice(range(len(array)))`

The choice of pivot is the difference between having an `O(N log N)` vs `O(N^2)` time complexity.

How so? Let's take an example.

With a naive pivot choice, the worst case is an already sorted array.

`array = [1,2,3,4,5]`

The image below shows recursion trees using two different pivot selection strategies:

**Legend**

- Blue lines => the greater subarray.
- Orange lines => the smaller subarray.
- Red lines => the pivot.

- Left shows the case where we always choose the first/last item as the pivot. This leads to a higher recursion depth since the subproblem size reduces by only 1 at each level. We have roughly `O(N)` levels, and since partitioning and joining the sub-arrays with the pivot is `O(N)` per level, this leads to `O(N^2)` time complexity.
- Right shows the ideal case where we choose the median as the pivot at every level. This leads to balanced sub-problems. The recursion depth is lower, roughly `O(log N)`. This leads to `O(N log N)` time complexity.
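To make the depth difference concrete, here is a small sketch that measures the recursion depth for a sorted input under two pivot rules. The helper `recursion_depth()` and the fixed seed are my own illustrative additions, not part of the implementation in this post.

```python
import random
from typing import Callable, List


def recursion_depth(array: List[int], choose_pivot: Callable[[List[int]], int]) -> int:
    """Depth of the quicksort recursion tree for a given pivot rule."""
    if len(array) <= 1:
        return 0
    pivot_idx = choose_pivot(array)
    pivot = array[pivot_idx]
    rest = array[:pivot_idx] + array[pivot_idx + 1:]
    smaller = [item for item in rest if item <= pivot]
    greater = [item for item in rest if item > pivot]
    return 1 + max(
        recursion_depth(smaller, choose_pivot),
        recursion_depth(greater, choose_pivot),
    )


random.seed(0)  # fixed seed so the run is repeatable
sorted_array = list(range(200))

# Always picking the first item peels off one element per level.
first_depth = recursion_depth(sorted_array, lambda a: 0)
# A random pivot tends to split the array far more evenly.
random_depth = recursion_depth(sorted_array, lambda a: random.randrange(len(a)))

print(first_depth, random_depth)
```

With the first item as pivot, the depth is exactly `N - 1` on sorted input. The random pivot gives a much shallower tree in practice.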

Choosing a random pivot leads to an *expected* time complexity of `O(N log N)`. We use that in this implementation.

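Wait! There's still one case where `quicksort()` could be `O(N^2)`, even after choosing a randomized/median pivot. This is the case when every item in the array is identical: every item is `<=` the pivot, so everything lands in the smaller sub-array and the problem size shrinks by only 1 per level. We could add a check to avoid this case as well.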
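One possible way to handle the all-identical case is a three-way partition, which groups items equal to the pivot so they are never recursed on. This is a sketch of that alternative, not the implementation used in this post, and `quicksort_3way()` is a name I made up for it.

```python
import random
from typing import List


def quicksort_3way(array: List[int]) -> List[int]:
    """Quicksort variant that groups items equal to the pivot."""
    if len(array) <= 1:
        return array
    pivot = random.choice(array)
    smaller = [item for item in array if item < pivot]
    equal = [item for item in array if item == pivot]
    greater = [item for item in array if item > pivot]
    # Items equal to the pivot are already in place, so only the
    # strictly smaller and strictly greater groups are sorted further.
    return quicksort_3way(smaller) + equal + quicksort_3way(greater)


print(quicksort_3way([3, 1, 2, 3, 1]))
```

For an array of identical items, both `smaller` and `greater` are empty, so the recursion stops after a single level.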

According to this source:

The probability that quicksort will use a quadratic number of compares when sorting a large array on your computer is much less than the probability that your computer will be struck by lightning!

We write the high-level function assuming that we have a magic function `partition()` which returns two arrays:

- `smaller_subarray`: all elements `<=` the pivot
- `greater_subarray`: all elements `>` the pivot

The implementation of `partition()` is quite simple. We compare each item in the array with the pivot. If it's `<=` pivot, we add it to the smaller sub-array, else we add it to the greater sub-array.

```
from typing import List, Tuple


def partition(array: List[int], pivot_idx: int) -> Tuple[List[int], List[int]]:
    """Partition array into subarrays smaller and greater than the pivot.

    Args:
        array (List[int]): input array
        pivot_idx (int): index of the pivot

    Returns:
        Tuple[List[int], List[int]]: smaller subarray, greater subarray
    """
    smaller_subarray, greater_subarray = [], []
    for idx, item in enumerate(array):
        # we don't want to add pivot to any of the sub-arrays
        if idx == pivot_idx:
            continue
        if item <= array[pivot_idx]:
            smaller_subarray.append(item)
        else:
            greater_subarray.append(item)
    return smaller_subarray, greater_subarray
```

```
return (
    quicksort(smaller_subarray)
    + [array[pivot_idx]]
    + quicksort(greater_subarray)
)
```

This step is quite exquisite. We recurse on the smaller and greater sub-arrays and place the pivot in between them. The beauty of a recursive implementation is that we could re-use our function with a smaller size of the input and trust that it would work.

Let's unroll it a bit.

- `quicksort(smaller_subarray)` => returns a sorted version of the `smaller_subarray`
- `[array[pivot_idx]]` => our pivot in a list. It's needed for concatenating it with the other two lists.
- `quicksort(greater_subarray)` => returns a sorted version of the `greater_subarray`

Now, we join the 3 lists including the pivot, and return the final sorted list.

**Expected** run-time of `O(N log N)` since it's a randomized implementation.

We do `O(N)` work at each level of recursion. The major components are `partition()` and the merge step, and both are `O(N)`.

For the randomized implementation, we expect `O(log N)` levels of recursion.

So, expected time complexity ~ `number of levels` * `work done per level` ~ `O(log N) * O(N)` ~ `O(N log N)`

The space complexity is `O(N)` since we use extra space for storing the smaller and greater sub-arrays. The recursion stack also uses `O(log N)` space in expectation. Overall, this implementation uses `O(N)` space.

An in-place implementation without auxiliary lists would lead to an `O(log N)` space complexity.
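For reference, here is a minimal sketch of what an in-place variant could look like. It assumes a Lomuto-style partition with a random pivot, and it recurses only into the smaller side so the stack stays logarithmic. `quicksort_inplace()` is a hypothetical name, not the implementation linked in this post.

```python
import random
from typing import List, Optional


def quicksort_inplace(array: List[int], low: int = 0, high: Optional[int] = None) -> None:
    """Sort array[low..high] in place without auxiliary lists."""
    if high is None:
        high = len(array) - 1
    while low < high:
        # Move a randomly chosen pivot to the end of the current range.
        pivot_idx = random.randint(low, high)
        array[pivot_idx], array[high] = array[high], array[pivot_idx]
        pivot = array[high]

        # Lomuto partition: items <= pivot are swapped to the front.
        boundary = low
        for idx in range(low, high):
            if array[idx] <= pivot:
                array[idx], array[boundary] = array[boundary], array[idx]
                boundary += 1
        array[boundary], array[high] = array[high], array[boundary]

        # Recurse into the smaller side and loop on the larger one,
        # which keeps the recursion stack at O(log N).
        if boundary - low < high - boundary:
            quicksort_inplace(array, low, boundary - 1)
            low = boundary + 1
        else:
            quicksort_inplace(array, boundary + 1, high)
            high = boundary - 1


data = [5, 2, 9, 1, 5, 6]
quicksort_inplace(data)
print(data)
```

The trade-off is readability: the list-based version above mirrors the three steps directly, while this one saves memory at the cost of index bookkeeping.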

The implementation, including `partition()` and tests, is here.

YouTube has 100m+ daily active users who consume more than a billion hours' worth of content every day. 100s of hours of videos are uploaded every second. At that scale, recommending personalized videos is a colossal task.

I've always wondered how YouTube manages to come up with relevant recommendations that keep me hooked! I found a very interesting paper, Deep Neural Networks for YouTube Recommendations. In this post, I will summarise the key ideas.

Coming up with relevant & personalized recommendations for every user is a hard problem because of:

- **scale**: billions of users, billions of videos.
- **freshness**: a massive volume of videos is uploaded every day. It's an explore-exploit trade-off between popular vs new content.
- **noise**: only sparse implicit user feedback is available for modeling.

In this paper, the authors demonstrate the usage of deep learning techniques for improving recommendations as opposed to matrix-factorization techniques used earlier.

The problem of recommendations at scale is divided into 2 subproblems:

- **Candidate Generation**: selects a small subset from the overall corpus which might be relevant to the user.
- **Ranking**: ranks the candidate videos based on their relative importance.

For Candidate Generation, the objective is to predict the next video watch. User search/watch history, demographics, etc. are fed as embeddings to a simple feed-forward network, and the embeddings are jointly learned during training.

For Ranking, the objective is to model an expected watch time. A score is assigned based on the expected watch time and videos are sorted accordingly.

A similar neural network architecture is used for both procedures.

Offline metrics like precision, recall, ranking loss, etc. are used during development. A/B tests are used to determine the final effectiveness of the model. We've already explored the discrepancies between offline vs online evaluation in a different post.

**Problem Formulation:** Recommendation is formulated as an *extreme multi-class classification*:

`P(w_t = i | U, C) = softmax(v_i . u)`

where `w_t` = video `v_i` watched at time t, `U` = user, `C` = context, `v_i` = dense video embeddings, and `u` = dense user embeddings.

- The task of the deep neural network is to learn the embeddings *u* as a function of user history and context.
- A user completing a video watch is a positive example.
- Candidate sampling & importance weighting is used to sample negative examples.
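To make the formulation concrete, here is a toy sketch of the softmax over dot products. The embedding sizes and random values are placeholders of my own, not anything from the paper, and the embeddings here are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
num_videos, dim = 5, 4  # toy sizes, chosen only for illustration

video_embeddings = rng.normal(size=(num_videos, dim))  # one row per v_i
user_embedding = rng.normal(size=dim)                  # u

# Dot product of the user embedding with every video embedding.
logits = video_embeddings @ user_embedding

# Softmax turns the scores into P(w_t = i | U, C) over all videos.
shifted = logits - logits.max()  # subtract the max for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum()

print(probs)
```

Each class is a video, which is why the classification is "extreme": in production the number of rows would be in the millions.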
Embeddings describing the watch history, user query, demographics, etc are fed to a simple feed-forward neural network having a final softmax layer to learn the class probabilities.

- **watch history**: a dense vector representation of each watched video is learned from a sequence of video-ids (just like word2vec).
- **user query**: n-gram representations
- **demographics**: geographic region, device, age, etc. are used as numerical/categorical features

- The embeddings are jointly learned while training the model via gradient descent back-propagation.
- Age of the video is used to model the time-dependent nature of popular videos. Otherwise, the good old popular videos would be selected most of the time.
- What's interesting is that a lot of features were "engineered", as opposed to the promise of deep learning to reduce feature engineering.

**Training data & label:**

- There's an inherent sequence of video consumption. Hence, using random held-out data will be cheating since future information will leak into the training process. The model will overfit! Think about time-series forecasting. A random train-test split won't work since the future data is not available in production during serving time.
- The authors propose a model of predicting the *user's next watch* instead of a randomly held-out watch. This makes sense, as we consume videos in a sequence. For example, if you're watching a series with several episodes, recommending a random episode from that series doesn't make sense.

**Serving:**

- To score millions of videos with a latency of tens of milliseconds, a nearest neighbor-based search algorithm is used. Exact probability values of softmax() are not required. Hence, a dot product of user and video embeddings could be used to figure out the propensity score of a user *u* for a particular video *v_i*. A nearest neighbor search algorithm could be used to figure out the top K candidate videos based on the score.
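Here is a rough sketch of the serving idea: since softmax preserves the order of the scores, ranking by raw dot product gives the same top K. The brute-force scan below is only a stand-in for a real approximate nearest neighbor index, and `top_k_videos()` along with the toy sizes are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
video_embeddings = rng.normal(size=(1000, 16))  # toy corpus of 1000 videos
user_embedding = rng.normal(size=16)


def top_k_videos(user: np.ndarray, videos: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k videos with the highest dot-product score."""
    scores = videos @ user
    # argpartition finds the k best candidates without sorting everything.
    candidates = np.argpartition(-scores, k - 1)[:k]
    # Sort only those k candidates, best first.
    return candidates[np.argsort(-scores[candidates])]


top_10 = top_k_videos(user_embedding, video_embeddings, k=10)
print(top_10)
```

A production system would replace the full scan with an approximate index to stay within the latency budget, trading a little recall for speed.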

- Candidate generation selects a few hundred out of millions of videos. The ranking procedure could make use of more video features as well as the user's interactions with those videos in order to figure out an order of recommendation.
- The model architecture is similar to the candidate generation procedure. We assign a score to the videos using weighted logistic regression.
- The objective to optimize is a function of expected watch time per impression.
- Why not click-through rate? Well, that would promote clickbait videos instead of quality content. Watch time is a better signal that captures engagement.
**Modeling Expected Watch Time**

- **Objective**: Predict expected watch time for a given video.
- **Model**: Weighted Logistic Regression, since the class distributions are imbalanced.
- **Positive example**: the video was watched.
- **Negative example**: the video was not clicked.
- **What are the weights in "weighted" logistic regression?**
  - Positive examples are weighted by the watch time.
  - Negative examples are given a weight of 1.
- **Loss**: cross-entropy
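The weighting scheme above can be sketched as a weighted cross-entropy. This is a minimal illustration, assuming the loss is normalized by the total weight (my choice, not something stated in the post), and `weighted_log_loss()` is a hypothetical helper.

```python
import numpy as np


def weighted_log_loss(labels, predicted, watch_times) -> float:
    """Cross-entropy with positives weighted by watch time and negatives by 1."""
    labels = np.asarray(labels, dtype=float)
    # Clip to avoid log(0) for extreme predictions.
    predicted = np.clip(np.asarray(predicted, dtype=float), 1e-12, 1 - 1e-12)
    weights = np.where(labels == 1, np.asarray(watch_times, dtype=float), 1.0)
    losses = -(labels * np.log(predicted) + (1 - labels) * np.log(1 - predicted))
    # Assumption: normalize by the total weight so the scale stays comparable.
    return float(np.sum(weights * losses) / np.sum(weights))


# One watched video (3 time units) and one impression that was not clicked.
print(weighted_log_loss(labels=[1, 0], predicted=[0.5, 0.5], watch_times=[3, 0]))
```

The longer a positive example was watched, the more its error counts, which nudges the model towards videos that hold attention rather than ones that merely get clicked.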

Evaluation is an important topic in every machine learning project. There are offline metrics that we compute on the historical data. They're supposed to give us an indication of our model's performance on real data. However, we often see a discrepancy in offline vs online performance.

In Predictive model performance: Offline and online evaluations, the authors investigate the offline/online model performance on advertisement data from Bing Search Engine.

- Evaluation metrics are important since they guide what the model optimizes for. Tuning on incorrect metrics might provide misleading results in offline settings that might surprise us in production.
- AUC is a really good metric to determine model classification efficiency.
- Offline evaluation using AUC doesn't correlate to the online evaluation via A/B tests.
- The authors propose the usage of a simulation metric that simulates user behavior based on historical logs, which works better in the online evaluation.

**What is an offline evaluation?**

- In typical ML projects, we split our dataset into train/test sets.
- Models are trained on the train set.
- Evaluation is done on the test set.

**What is an online evaluation?**

When our model is in production, we perform an A/B test. Typically, it has 2 variants.

- *control* group with our existing model
- *test* group with the new model

Live traffic is split into the two groups & metrics like conversion rate, revenue per visitor, etc. are measured. If the new model's improvement is statistically significant, it's selected for launch.

**The issue?**

Offline performance doesn't always correlate to online performance due to the dynamic nature of the latter.

- The paper focuses on metrics used for click prediction problems that most search engines like Google, Bing, etc. face.
- Click prediction problems estimate the CTRs of ads given a query.
- This is treated as a binary classification problem.

We'll review some important evaluation metrics for the use case.

- Let's say that we have a binary classifier that predicts a probability *p* for an event to occur. Then, *1-p* is the probability that the event doesn't occur. We need a threshold to determine the class membership. AUC provides a single score that tells us how good a model is across all possible thresholds.
- AUC is computed from a ROC (Receiver Operating Characteristic) curve.
- ROC curve = a graphical representation of TPR (true positive rate) as a function of FPR (false positive rate) of a binary classifier across different thresholds.
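One way to build intuition for AUC: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it by brute-force pairwise comparison. `auc_score()` is a hypothetical helper, and the quadratic loop is only meant for small toy inputs.

```python
import numpy as np


def auc_score(labels, scores) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    positives = scores[labels == 1]
    negatives = scores[labels == 0]
    # Compare every positive score with every negative score.
    diffs = positives[:, None] - negatives[None, :]
    # A tie counts as half a correctly ranked pair.
    wins = (diffs > 0).sum() + 0.5 * (diffs == 0).sum()
    return float(wins / (len(positives) * len(negatives)))


print(auc_score(labels=[0, 0, 1, 1], scores=[0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Notice that only the order of the scores matters here; rescaling them without changing their order leaves the result untouched.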

```
RIG = 1 - log_loss/entropy(y)
where,
    log_loss   = - [ c*log(p) + (1-c)*log(1-p) ]
    entropy(y) = - [ y*log(y) + (1-y)*log(1-y) ]
    c = observed click
    p = probability of a click
    y = CTR
```

Higher is better.
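As a rough sketch, RIG can be computed as below. I'm assuming the log loss is averaged over impressions and that the CTR is strictly between 0 and 1, and `relative_information_gain()` is a name I picked for illustration.

```python
import numpy as np


def relative_information_gain(clicks, predicted) -> float:
    """RIG = 1 - log_loss / entropy(CTR), with log loss averaged over impressions."""
    clicks = np.asarray(clicks, dtype=float)
    # Clip to avoid log(0) for extreme predictions.
    predicted = np.clip(np.asarray(predicted, dtype=float), 1e-12, 1 - 1e-12)
    ctr = clicks.mean()  # assumed to be strictly between 0 and 1
    log_loss = -np.mean(
        clicks * np.log(predicted) + (1 - clicks) * np.log(1 - predicted)
    )
    entropy = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))
    return float(1 - log_loss / entropy)


clicks = [1, 0, 0, 0]
# Predicting the overall CTR for every impression carries no extra information.
print(relative_information_gain(clicks, [0.25, 0.25, 0.25, 0.25]))
# Confident, correct predictions push RIG towards 1.
print(relative_information_gain(clicks, [0.9, 0.1, 0.1, 0.1]))
```

A model that always predicts the average CTR scores a RIG of 0, which makes it a handy baseline to compare against.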

`PE = avg(p)/y - 1`

- PE = 0 when average(p) exactly matches the click-through rate.
- It could also be 0 if there's a mix of over-estimation/under-estimation of the CTR, as long as the average is close to the CTR.
- It's not a reliable metric.
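A tiny made-up example shows why. Two of the four predictions below are badly wrong, yet the errors cancel out and PE comes out at zero. `prediction_error()` is a hypothetical helper and the numbers are invented for illustration.

```python
import numpy as np


def prediction_error(clicks, predicted) -> float:
    """PE = avg(p) / CTR - 1."""
    return float(np.mean(predicted) / np.mean(clicks) - 1)


clicks = [1, 0, 1, 0]
# The last two predictions are far off, but in opposite directions.
predicted = [0.9, 0.1, 0.2, 0.8]
print(prediction_error(clicks, predicted))
```

Because the average hides per-impression mistakes, PE is better treated as a calibration sanity check than as a measure of model quality.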

This part is really important. It teaches us a way to simulate different model performances offline without having to run expensive A/B tests.

- A/B tests run with a fixed set of model parameters.
- It could be expensive to run multiple experiments with different model parameters.
- It could also ruin the user experience and lose revenue if the new model underperforms.

- The paper proposes a simulation of the user behavior offline aka auction simulation.
- Auction simulation reruns ad auctions offline for a given query and selects a set of ads based on the new model prediction scores.
- User clicks are estimated in the following way:
  - if the (user, ad) pair is found in the logs
    - if it's in the same position in history as in the simulation, use the historic CTR directly as the expected CTR
    - if it's not in the same position, the expected CTR is calibrated based on the position.
  - if the (user, ad) pair is not found, the average CTR is used as the expected CTR.
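The click-estimation rules above can be sketched as a simple lookup. Everything concrete here is an assumption made for illustration: the log format, the `position_factor` table, and the ratio-based calibration are mine, since the post doesn't spell out how the position calibration is done.

```python
from typing import Dict, Tuple


def expected_ctr(
    user: str,
    ad: str,
    position: int,
    logs: Dict[Tuple[str, str], Tuple[int, float]],
    position_factor: Dict[int, float],
    average_ctr: float,
) -> float:
    """Estimate the CTR of showing `ad` to `user` at `position` in a simulated auction."""
    key = (user, ad)
    if key not in logs:
        # Pair never seen in the logs: fall back to the average CTR.
        return average_ctr
    logged_position, logged_ctr = logs[key]
    if logged_position == position:
        # Same position as in history: reuse the historic CTR directly.
        return logged_ctr
    # Assumed calibration: rescale by the ratio of hypothetical position factors.
    return logged_ctr * position_factor[position] / position_factor[logged_position]


# Hypothetical log: (user, ad) -> (position shown, historic CTR).
logs = {("u1", "ad1"): (1, 0.10)}
position_factor = {1: 1.0, 2: 0.5}

print(expected_ctr("u1", "ad1", 1, logs, position_factor, average_ctr=0.02))
print(expected_ctr("u1", "ad1", 2, logs, position_factor, average_ctr=0.02))
print(expected_ctr("u2", "ad1", 1, logs, position_factor, average_ctr=0.02))
```

Summing these expected CTRs over the ads a new model would have selected gives an offline estimate of its clicks, without exposing real users to it.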

- AUC ignores predicted probability values: it depends only on how the scores rank the examples, so it's insensitive to how well calibrated the probabilities are. It's also possible to have different rankings with similar AUC scores.
- It summarizes the test performance over the entire ROC space, even regions where one would rarely operate. A higher AUC doesn't mean a better ranking.
- It weights false positives and false negatives equally. In real life, the cost of not showing a relevant ad (false negative) is way more than showing a sub-optimal ad (false positive).
- It's highly dependent on the underlying data distribution.

- RIG is highly sensitive to the underlying data distribution.
- We can't judge a model by just using RIG alone.
- We could compare the relative performance of different models trained/tested on the same data.

The authors compare 2 models

- model 1 (baseline): tuned on offline metrics like AUC & RIG
- model 2 (test): tuned on the simulation metric

The finding: a model tuned on offline metrics can perform well offline but show a significant dip in online metrics.

**Why do we see this?**

- Tuning a model on offline metrics like AUC/RIG over-estimates the probability scores at the lower end of the score range.
- Over-estimation of the probability score at the higher end of the score range doesn't matter much since they'll be selected by either model.
- Over-estimation at the lower end of the score range is bad since irrelevant ads are more likely to be shown in that case.
- Offline metrics like AUC/RIG provide an overall score based on the entire range of probability scores - they're not able to capture the intended effect.
- Tuning a model based on the simulation metric correlates better with online performance tests via A/B tests.

Predictive Model Performance: Offline and Online Evaluations
