Saikat Kumar Dey
Saikat Kumar Dey

Saikat Kumar Dey

Building pandas DataFrames from ndarrays

MultiIndexing to the rescue...

Saikat Kumar Dey's photo
Saikat Kumar Dey

Published on Sep 27, 2021

3 min read

At work, I have to switch between numpy & pandas depending on the computational needs. Numpy is faster. Pandas is easier to work with.

One of the tasks I've faced often was to convert a 3-dimensional ndarray to a pandas dataframe. I will share my preferred technique in this post today.

For the purpose of this exercise, I'll generate dummy sales data for a retail company. The dimensions include products, locations, and sales.

Convert ndarray to pandas dataframe

import pandas as pd
import numpy as np

1d : dimension = sales

Let's start with 1d data. What if we only had sales info for all products and locations?

arr_1d = np.random.randint(
    low=1,
    high=10,
    size=3,
)
print(arr_1d)
[9 3 6]

That's easy. Ideally, 1-d information should be represented as a Series.

df_1d = pd.DataFrame(arr_1d, columns=["sales"])
print(df_1d)
   sales
0      9
1      3
2      6

2d: dimension = sales * product

Let's move on to 2 dimensions. Now, we have data corresponding to different products.

arr_2d = np.random.randint(
    low=1,
    high=10,
    size=(3, 2),
)
print(arr_2d)
[[4 6]
 [8 1]
 [2 7]]

Pandas DataFrame can handle 2-D ndarrays out of the box.

df_2d = pd.DataFrame(arr_2d, columns=["product", "sales"]).set_index("product")
print(df_2d)
         sales
product       
4            6
8            1
2            7

3d : dimension = location x product x sales

Now, what if we have a ndarray corresponding to all products for several locations?

# failure
arr_3d = np.random.randint(
    low=1,
    high=10,
    size=(5, 3, 1),
)
print(arr_3d)
[[[9]
  [6]
  [2]]

 [[1]
  [4]
  [4]]

 [[2]
  [5]
  [6]]

 [[9]
  [6]
  [5]]

 [[1]
  [6]
  [1]]]
# the following raises ValueError
# pandas DataFrame expects a 2-d input
df_3d = pd.DataFrame(arr_3d, columns=["location", "product", "sales"])

pandas won't work out of the box. It cannot handle more than 2 dimensions. So, it raises a ValueError.

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

/var/folders/jq/ksxbjg7d58g9v9rrcl0f38380000gn/T/ipykernel_12628/1531564731.py in <module>
      1 # the following raises ValueError
      2 # pandas DataFrame expects a 2-d input
----> 3 df_3d = pd.DataFrame(arr_3d, columns=["location", "product", "sales"])
.
.
.
ValueError: Must pass 2-d input. shape=(5, 3, 1)

The solution?

MultiIndex.

Assuming that the ndarray is ordered by location/products, we could prepare a multi-index, flatten our ndarray and let Pandas reshape it according to the provided index.

Sweet!

index = pd.MultiIndex.from_product(
    [range(dim) for dim in arr_3d.shape[:-1]],
    names=["location", "product"],
)

df_3d = pd.DataFrame(arr_3d.flatten(), index=index, columns=["sales"])
print(df_3d)
                  sales
location product       
0        0            9
         1            6
         2            2
1        0            1
         1            4
         2            4
2        0            2
         1            5
         2            6
3        0            9
         1            6
         2            5
4        0            1
         1            6
         2            1

We just have sales corresponding to each location and product. What if the final sales dimension includes sales for yesterday/today (or for every month, every week, etc.) ?

3d : dimension = location x product x sales (multi)

arr_3d = np.random.randint(
    low=1,
    high=10,
    size=(5, 3, 2),
)
print(arr_3d)
index = pd.MultiIndex.from_product(
    [range(dim) for dim in arr_3d.shape],
    names=["location", "product", "sales"],
)
[[[1 9]
  [8 6]
  [9 4]]

 [[4 9]
  [3 9]
  [1 8]]

 [[5 2]
  [9 9]
  [1 9]]

 [[4 5]
  [7 4]
  [7 7]]

 [[6 9]
  [4 2]
  [7 1]]]

No major changes. Pandas should handle it just like before. Just unstack the sales dimension and rename the columns for readability.

df_3d = pd.DataFrame(
    arr_3d.flatten(),
    index=index,
    columns=["sales"],
)
df_3d = df_3d.unstack(-1).rename(
    columns={0: "yesterday", 1: "today"},
)
print(df_3d)
                     sales      
sales            yesterday today
location product                
0        0               1     9
         1               8     6
         2               9     4
1        0               4     9
         1               3     9
         2               1     8
2        0               5     2
         1               9     9
         2               1     9
3        0               4     5
         1               7     4
         2               7     7
4        0               6     9
         1               4     2
         2               7     1

Do you know of other ways to switch between ndarray and DataFrame? Comment below :)

 
Share this