# Building pandas DataFrames from ndarrays

## MultiIndexing to the rescue...

At work, I switch between NumPy and pandas depending on the computational needs. NumPy is usually faster for raw numerical work. pandas is easier to work with when the data needs labels.

One task I face often is converting a 3-dimensional ndarray to a pandas DataFrame. In this post, I'll share my preferred technique.

For the purpose of this exercise, I'll generate dummy sales data for a retail company. The dimensions include products, locations, and sales.

# Convert ndarray to pandas dataframe

```
import pandas as pd
import numpy as np
```

## 1d : dimension = sales

Let's start with 1d data. What if we only had sales info for all products and locations?

```
arr_1d = np.random.randint(
    low=1,
    high=10,
    size=3,
)
print(arr_1d)
```

```
[9 3 6]
```

That's easy. Ideally, 1-d information would be represented as a Series, but a single-column DataFrame works too and keeps the examples in this post consistent.

```
df_1d = pd.DataFrame(arr_1d, columns=["sales"])
print(df_1d)
```

```
   sales
0      9
1      3
2      6
```
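For comparison, here is a minimal sketch of the Series route mentioned above. The array values are hard-coded (copied from the printed output) so the result is reproducible, and the variable names `sales` and `df_from_series` are my own, not part of the original example.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones, so the output is reproducible.
arr_1d = np.array([9, 3, 6])

# A Series holds one labelled column; `name` acts as the column header.
sales = pd.Series(arr_1d, name="sales")
print(sales)

# to_frame() promotes the Series to a one-column DataFrame when needed.
df_from_series = sales.to_frame()
print(df_from_series.columns.tolist())
```

Either form holds the same values. The Series is the lighter container, while the DataFrame is handy if more columns will be added later.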

## 2d : dimension = product x sales

Let's move on to 2 dimensions. Now, we have data corresponding to different products.

```
arr_2d = np.random.randint(
    low=1,
    high=10,
    size=(3, 2),
)
print(arr_2d)
```

```
[[4 6]
 [8 1]
 [2 7]]
```

A pandas DataFrame can handle 2-d ndarrays out of the box. Here, the first column is treated as the product ID and the second as its sales.

```
df_2d = pd.DataFrame(arr_2d, columns=["product", "sales"]).set_index("product")
print(df_2d)
```

```
         sales
product
4            6
8            1
2            7
```
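Going the other way is just as short. The sketch below, which again uses hard-coded values and a variable name of my own (`round_trip`), checks that the DataFrame converts back to the original 2-d ndarray.

```python
import numpy as np
import pandas as pd

# Fixed values (copied from the printed array) for reproducibility.
arr_2d = np.array([[4, 6], [8, 1], [2, 7]])
df_2d = pd.DataFrame(arr_2d, columns=["product", "sales"]).set_index("product")

# reset_index() turns the "product" index back into a regular column,
# and to_numpy() returns the values as a 2-d ndarray.
round_trip = df_2d.reset_index().to_numpy()
print(np.array_equal(round_trip, arr_2d))
```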

## 3d : dimension = location x product x sales

Now, what if we have an ndarray corresponding to all products for several locations?

```
# 5 locations x 3 products x 1 sales figure
arr_3d = np.random.randint(
    low=1,
    high=10,
    size=(5, 3, 1),
)
print(arr_3d)
```

```
[[[9]
  [6]
  [2]]

 [[1]
  [4]
  [4]]

 [[2]
  [5]
  [6]]

 [[9]
  [6]
  [5]]

 [[1]
  [6]
  [1]]]
```

```
# the following raises ValueError
# pandas DataFrame expects a 2-d input
df_3d = pd.DataFrame(arr_3d, columns=["location", "product", "sales"])
```

pandas won't work out of the box. It cannot handle more than 2 dimensions, so it raises a `ValueError`.

```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/jq/ksxbjg7d58g9v9rrcl0f38380000gn/T/ipykernel_12628/1531564731.py in <module>
1 # the following raises ValueError
2 # pandas DataFrame expects a 2-d input
----> 3 df_3d = pd.DataFrame(arr_3d, columns=["location", "product", "sales"])
.
.
.
ValueError: Must pass 2-d input. shape=(5, 3, 1)
```
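As a small illustration of the constraint, the sketch below catches the error and then collapses the leading axes so pandas accepts the data. It uses `np.arange` instead of random values, and the names `raised`, `arr_flat`, and `df_flat` are my own. Note that this naive reshape drops the location and product labels.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random 5 x 3 x 1 array.
arr_3d = np.arange(15).reshape(5, 3, 1)

try:
    pd.DataFrame(arr_3d, columns=["location", "product", "sales"])
    raised = False
except ValueError as exc:
    raised = True
    print(f"{arr_3d.ndim}-d input rejected: {exc}")

# Collapsing the leading axes gives a 2-d array that pandas accepts,
# but the location and product labels are lost in the process.
arr_flat = arr_3d.reshape(-1, arr_3d.shape[-1])
df_flat = pd.DataFrame(arr_flat, columns=["sales"])
print(df_flat.shape)
```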

The solution? `MultiIndex`.

Assuming that the ndarray is ordered by location first and then by product, we can prepare a MultiIndex, flatten the ndarray, and let pandas label each value according to the provided index.

Sweet!
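The ordering assumption is worth a quick check. The sketch below uses a tiny 2 x 3 x 1 array of my own (`small`) to show that `MultiIndex.from_product` generates labels in the same order that `flatten()` walks the array, with the last axis varying fastest.

```python
import numpy as np
import pandas as pd

# A tiny 2 x 3 x 1 array with distinct values so positions are easy to track.
small = np.arange(6).reshape(2, 3, 1)

# from_product varies the last level fastest: (0, 0), (0, 1), (0, 2), (1, 0), ...
labels = list(pd.MultiIndex.from_product([range(2), range(3)]))

# flatten() uses row-major (C) order by default, so it walks the array the same way.
flat = small.flatten()

# Print each label, the flattened value, and the value found by direct indexing.
for label, value in zip(labels, flat):
    print(label, value, small[label][0])
```

If the array were stored with a different axis order, the levels passed to `from_product` would need to be listed in that same order.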

```
index = pd.MultiIndex.from_product(
    [range(dim) for dim in arr_3d.shape[:-1]],
    names=["location", "product"],
)
df_3d = pd.DataFrame(arr_3d.flatten(), index=index, columns=["sales"])
print(df_3d)
```

```
                  sales
location product
0        0            9
         1            6
         2            2
1        0            1
         1            4
         2            4
2        0            2
         1            5
         2            6
3        0            9
         1            6
         2            5
4        0            1
         1            6
         2            1
```
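Once the labels are in place, slicing by location or product becomes straightforward. Here is a minimal sketch, assuming a deterministic `np.arange` array in place of the random one; `at_location_2` and `product_1` are names I made up for illustration.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in: the value at (location, product) is 3 * location + product.
arr_3d = np.arange(15).reshape(5, 3, 1)
index = pd.MultiIndex.from_product(
    [range(dim) for dim in arr_3d.shape[:-1]],
    names=["location", "product"],
)
df_3d = pd.DataFrame(arr_3d.flatten(), index=index, columns=["sales"])

# All products at location 2 (the outer level is dropped from the result).
at_location_2 = df_3d.loc[2]

# Product 1 across every location, selected by level name.
product_1 = df_3d.xs(1, level="product")

print(at_location_2)
print(product_1)
```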

So far, we have a single sales figure for each location and product. What if the final `sales` dimension includes sales for yesterday and today (or for every month, every week, etc.)?

## 3d : dimension = location x product x sales (multi)

```
arr_3d = np.random.randint(
    low=1,
    high=10,
    size=(5, 3, 2),
)
print(arr_3d)
index = pd.MultiIndex.from_product(
    [range(dim) for dim in arr_3d.shape],
    names=["location", "product", "sales"],
)
```

```
[[[1 9]
  [8 6]
  [9 4]]

 [[4 9]
  [3 9]
  [1 8]]

 [[5 2]
  [9 9]
  [1 9]]

 [[4 5]
  [7 4]
  [7 7]]

 [[6 9]
  [4 2]
  [7 1]]]
```

No major changes. pandas handles it just like before. Just unstack the last index level (`sales`) and rename the columns for readability.

```
df_3d = pd.DataFrame(
    arr_3d.flatten(),
    index=index,
    columns=["sales"],
)
df_3d = df_3d.unstack(-1).rename(
    columns={0: "yesterday", 1: "today"},
)
print(df_3d)
```

```
                      sales
sales             yesterday  today
location product
0        0                1      9
         1                8      6
         2                9      4
1        0                4      9
         1                3      9
         2                1      8
2        0                5      2
         1                9      9
         2                1      9
3        0                4      5
         1                7      4
         2                7      7
4        0                6      9
         1                4      2
         2                7      1
```
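To close the loop, the wide DataFrame can be turned back into the original 3-d ndarray. The sketch below assumes a deterministic `np.arange` array and rows that are still sorted by location and product. `df_wide` and `restored` are names I introduced for illustration.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random 5 x 3 x 2 array.
arr_3d = np.arange(30).reshape(5, 3, 2)
index = pd.MultiIndex.from_product(
    [range(dim) for dim in arr_3d.shape],
    names=["location", "product", "sales"],
)
df_3d = pd.DataFrame(arr_3d.flatten(), index=index, columns=["sales"])
df_wide = df_3d.unstack(-1).rename(columns={0: "yesterday", 1: "today"})

# to_numpy() yields a (15, 2) array: one row per (location, product) pair.
# Reshaping to the original shape restores the location and product axes.
restored = df_wide.to_numpy().reshape(arr_3d.shape)
print(np.array_equal(restored, arr_3d))
```

This only works if the row order has not been changed in between. After sorting or filtering the DataFrame, the reshape would no longer line up with the original axes.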

Do you know of other ways to switch between ndarray and DataFrame? Comment below :)