
I have time series data in long format, which looks as follows:

+======+==========+======+======+
| Name |   Date   | Val1 | Val2 |
+======+==========+======+======+
| A    | 1/1/2018 |    1 |    2 |
+------+----------+------+------+
| B    | 1/1/2018 |    2 |    3 |
+------+----------+------+------+
| C    | 1/1/2018 |    3 |    4 |
+------+----------+------+------+
| D    | 1/4/2018 |    4 |    5 |
+------+----------+------+------+
| A    | 1/4/2018 |    5 |    6 |
+------+----------+------+------+
| B    | 1/4/2018 |    6 |    7 |
+------+----------+------+------+
| C    | 1/4/2018 |    7 |    8 |
+------+----------+------+------+

I need to convert the above data into wide format, which looks as follows:

+---+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+
|   | Val1.1/1/2018 | Val2.1/1/2018 | Val1.1/2/2018 | Val2.1/2/2018 | Val1.1/3/2018 | Val2.1/3/2018 | Val1.1/4/2018 | Val2.1/4/2018 |
+---+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+
| A | 1             | 2             | NULL          | NULL          | NULL          | NULL          |             5 |             6 |
| B | 2             | 3             | NULL          | NULL          | NULL          | NULL          |             6 |             7 |
| C | 3             | 4             | NULL          | NULL          | NULL          | NULL          |             7 |             8 |
| D | NULL          | NULL          | NULL          | NULL          | NULL          | NULL          |             4 |             5 |
+---+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+

To achieve that I followed these steps.

First, I converted the Date column of my initial data set to datetime format and added the dates ranging from 01/01/2018 to 01/04/2018 in long format. Since I am dealing with time series data, I want the dates 01/02/2018 and 01/03/2018 to be included in the wide-format table, even though those columns would contain only NaNs.

To achieve this I used the following code:

df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# build the full (Name, date) grid so missing dates appear after reindexing
idx = pd.MultiIndex.from_product([df.Name.unique(),
                                  pd.date_range(df.Date.min(), df.Date.max())])

df = df.set_index(['Name', 'Date']).reindex(idx).reset_index().rename(
    columns={'level_0': 'Name', 'level_1': 'Date'})

df.Date = df.Date.dt.strftime('%m/%d/%Y')
# pivot's arguments are keyword-only in recent pandas versions
new_df = df.pivot(index='Name', columns='Date', values=['Val1', 'Val2'])
new_df.columns = new_df.columns.map('.'.join)

I think the above code is not optimized to deal with a larger data set (1.2 million rows). How could I go about optimizing this code?

A similar task done in R with the following code takes much less time:

library(dplyr)
library(tidyr) #complete
library(data.table) #dcast and setDT
df %>% mutate(Date=as.Date(Date,'%m/%d/%Y')) %>% 
       complete(Name, nesting(Date=full_seq(Date,1))) %>%
       setDT(.) %>% dcast(Name ~ Date, value.var=c('Val2','Val1'))

Credits: Python code mentioned in this post is taken from here. R code mentioned in this post is taken from here.


2 Answers


I have time series data in long format which looks as follows

No, it doesn't. It's CSV, so it looks something like

Name,     Date, Val1, Val2
   A, 1/1/2018,    1,    2
   B, 1/1/2018,    2,    3
   C, 1/1/2018,    3,    4
   D, 1/4/2018,    4,    5
   A, 1/4/2018,    5,    6
   B, 1/4/2018,    6,    7
   C, 1/4/2018,    7,    8

You call to_datetime and then undo that with a strftime; why? Don't do either. Pass date_format to read_csv for processing, or if you really want the (poor, non-ISO8601) date format in the output, then don't parse the date at all and leave it as a string.

Don't call from_product. Just pass both columns as index_col to read_csv. Don't reindex, don't reset_index and don't rename your columns.

You didn't make it clear why you're producing the "wide format". For most purposes, I would do the following, which does produce a wide format with the same data but in a different column order and without concatenate-degrading the multi-level column index:

import pandas as pd

df = pd.read_csv(
    '214261.csv',
    skipinitialspace=True,
    index_col=['Name', 'Date'],
    parse_dates=['Date'],
    date_format='%m/%d/%Y',
)

print(df.unstack('Date'))
           Val1                  Val2           
Date 2018-01-01 2018-01-04 2018-01-01 2018-01-04
Name                                            
A           1.0        5.0        2.0        6.0
B           2.0        6.0        3.0        7.0
C           3.0        7.0        4.0        8.0
D           NaN        4.0        NaN        5.0
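If you do still want the full calendar range the question asks for (with 01/02 and 01/03 as all-NaN columns), you can reindex the column MultiIndex after the unstack instead of building the grid up front. A minimal sketch, using a hypothetical inline copy of the question's data:

```python
import pandas as pd
from io import StringIO

# toy data shaped like the question's CSV
csv = StringIO("""Name,Date,Val1,Val2
A,1/1/2018,1,2
B,1/1/2018,2,3
C,1/1/2018,3,4
D,1/4/2018,4,5
A,1/4/2018,5,6
B,1/4/2018,6,7
C,1/4/2018,7,8
""")

df = pd.read_csv(csv, index_col=['Name', 'Date'],
                 parse_dates=['Date'], date_format='%m/%d/%Y')
wide = df.unstack('Date')

# Reindex the (value, Date) column MultiIndex against the full daily
# range, so 01/02 and 01/03 appear as all-NaN columns.
full_cols = pd.MultiIndex.from_product(
    [['Val1', 'Val2'], pd.date_range('2018-01-01', '2018-01-04')],
    names=[None, 'Date'])
wide = wide.reindex(columns=full_cols)
```

This keeps the reshaping work inside a single indexed frame rather than round-tripping through reset_index.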

If you really want Date first and Val second, then call reorder_levels.
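A sketch of that reordering, assuming a wide frame with a (value, Date) column MultiIndex like the one above (the sort afterwards groups each date's Val1/Val2 pair together):

```python
import pandas as pd

# hypothetical small frame shaped like the question's data
df = pd.DataFrame(
    {'Name': ['A', 'A', 'B'],
     'Date': pd.to_datetime(['1/1/2018', '1/4/2018', '1/1/2018'],
                            format='%m/%d/%Y'),
     'Val1': [1, 5, 2], 'Val2': [2, 6, 3]}
).set_index(['Name', 'Date'])

wide = df.unstack('Date')

# put Date first and the value name second, then sort the columns
wide = wide.reorder_levels([1, 0], axis=1).sort_index(axis=1)
```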


Solution in R

In your last code snippet, you're mixing code from tidyverse and data.table packages. I don't consider this to be completely wrong, but I would rather avoid it to increase readability and consistency.

library(magrittr)
library(data.table)
library(bench)

# data copied from OP
dat <- structure(list(Name = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
                                       .Label = c("A", "B", "C", "D"),
                                       class = "factor"),
                      Date = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
                                       .Label = c("1/1/2018", "1/4/2018"),
                                       class = "factor"), 
                      Val1 = 1:7,
                      Val2 = 2:8),
                 class = "data.frame", row.names = 1:7)

dat
#>   Name     Date Val1 Val2
#> 1    A 1/1/2018    1    2
#> 2    B 1/1/2018    2    3
#> 3    C 1/1/2018    3    4
#> 4    D 1/4/2018    4    5
#> 5    A 1/4/2018    5    6
#> 6    B 1/4/2018    6    7
#> 7    C 1/4/2018    7    8
str(dat)
#> 'data.frame':    7 obs. of  4 variables:
#>  $ Name: Factor w/ 4 levels "A","B","C","D": 1 2 3 4 1 2 3
#>  $ Date: Factor w/ 2 levels "1/1/2018","1/4/2018": 1 1 1 2 2 2 2
#>  $ Val1: int  1 2 3 4 5 6 7
#>  $ Val2: int  2 3 4 5 6 7 8

Tidyverse Solution

tidyr::gather(dat, key = "key", value = "value", -Date, -Name) %>% 
    tidyr::unite("id", key, Date, sep = ".") %>% 
    tidyr::spread(id, value)
#>   Name Val1.1/1/2018 Val1.1/4/2018 Val2.1/1/2018 Val2.1/4/2018
#> 1    A             1             5             2             6
#> 2    B             2             6             3             7
#> 3    C             3             7             4             8
#> 4    D            NA             4            NA             5

data.table Solution

dt <- data.table(dat)
dt_long <- melt(dt, id.vars = c("Name", "Date"))

dcast(dt_long, Name ~ variable + Date)
#>    Name Val1_1/1/2018 Val1_1/4/2018 Val2_1/1/2018 Val2_1/4/2018
#> 1:    A             1             5             2             6
#> 2:    B             2             6             3             7
#> 3:    C             3             7             4             8
#> 4:    D            NA             4            NA             5
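Since the comments on this answer note that the OP actually wants Python, it is worth mentioning that this melt-then-cast pattern has a direct pandas analogue: melt to long, then pivot with a list of columns (supported since pandas 1.1). A sketch on the same toy data:

```python
import pandas as pd

dat = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'A', 'B', 'C'],
    'Date': ['1/1/2018'] * 3 + ['1/4/2018'] * 4,
    'Val1': [1, 2, 3, 4, 5, 6, 7],
    'Val2': [2, 3, 4, 5, 6, 7, 8],
})

# melt to long format (data.table's melt), then cast back to wide
# (data.table's dcast) with a (variable, Date) column MultiIndex
long = dat.melt(id_vars=['Name', 'Date'], var_name='variable')
wide = long.pivot(index='Name', columns=['variable', 'Date'], values='value')
```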

Benchmark

As you can see, data.table is already much faster with 12,000 rows.

nrows <- 1.2e4
# nrows <- 1.2e6
dat2 <- expand.grid(Name = LETTERS[1:4],
                    Date = seq(as.Date("2018-01-01"), by = "days", length.out = nrows/4))
dat2$Val1 <- sample(1:8, nrow(dat2), TRUE)
dat2$Val2 <- sample(1:8, nrow(dat2), TRUE)

f1 <- function(dat) {
    tidyr::gather(dat, key = "key", value = "value", -Date, -Name) %>% 
        tidyr::unite("id", key, Date, sep = ".") %>% 
        tidyr::spread(id, value)
}

f2 <- function(dat) {
    dt <- data.table(dat)
    dt_long <- melt(dt, id.vars = c("Name", "Date"))
    dt_wide <- dcast(dt_long, Name ~ variable + Date)
}

mark(tidyverse = f1(dat2),
     datatable = f2(dat2),
     check = function(x, y) all.equal(x, y, check.attributes = FALSE))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 10
#>   expression     min    mean  median     max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:t> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl>
#> 1 tidyverse  184.4ms 189.7ms 187.9ms 196.7ms      5.27   15.73MB     5
#> 2 datatable   43.1ms  45.9ms  45.4ms  51.7ms     21.8     5.36MB     2
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>

Created on 2019-02-26 by the reprex package (v0.2.1)

  • "I appreciate your response, but I am looking for an optimized version in Python." Commented Feb 26, 2019 at 9:14
  • "Alright. Then you should mention that in your question. Consider adding a bold one-liner stating what your question actually is." Commented Feb 26, 2019 at 10:13
  • "I suppose it's clearly mentioned in the post, as there is only one sentence followed by a question mark. I've bolded that line in my edit." Commented Feb 26, 2019 at 10:24
