
I have time series data in long format, which looks as follows:

+======+==========+======+======+
| Name |   Date   | Val1 | Val2 |
+======+==========+======+======+
| A    | 1/1/2018 |    1 |    2 |
+------+----------+------+------+
| B    | 1/1/2018 |    2 |    3 |
+------+----------+------+------+
| C    | 1/1/2018 |    3 |    4 |
+------+----------+------+------+
| D    | 1/4/2018 |    4 |    5 |
+------+----------+------+------+
| A    | 1/4/2018 |    5 |    6 |
+------+----------+------+------+
| B    | 1/4/2018 |    6 |    7 |
+------+----------+------+------+
| C    | 1/4/2018 |    7 |    8 |
+------+----------+------+------+

I need to convert the above data into wide format, which looks as follows:

+---+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+
|   | Val1.1/1/2018 | Val2.1/1/2018 | Val1.1/2/2018 | Val2.1/2/2018 | Val1.1/3/2018 | Val2.1/3/2018 | Val1.1/4/2018 | Val2.1/4/2018 |
+---+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+
| A | 1             | 2             | NULL          | NULL          | NULL          | NULL          |             5 |             6 |
| B | 2             | 3             | NULL          | NULL          | NULL          | NULL          |             6 |             7 |
| C | 3             | 4             | NULL          | NULL          | NULL          | NULL          |             7 |             8 |
| D | NULL          | NULL          | NULL          | NULL          | NULL          | NULL          |             4 |             5 |
+---+---------------+---------------+---------------+---------------+---------------+---------------+---------------+---------------+

To achieve that I followed these steps.

First, I converted the Date column of my initial data set to datetime format and added the dates ranging from 01/01/2018 to 01/04/2018 in long format. Since I am dealing with time series data, I want the dates 01/02/2018 and 01/03/2018 to be included in the wide-format table, even though those columns would contain only NaNs.

To achieve this I used the following code:

df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# build the full (Name, date) grid so missing dates appear after reindexing
idx = pd.MultiIndex.from_product([df.Name.unique(),
                                  pd.date_range(df.Date.min(), df.Date.max())])

df = df.set_index(['Name', 'Date']).reindex(idx).reset_index().rename(
    columns={'level_0': 'Name', 'level_1': 'Date'})

df.Date = df.Date.dt.strftime('%m/%d/%Y')
# pivot's arguments are keyword-only in recent pandas versions
new_df = df.pivot(index='Name', columns='Date', values=['Val1', 'Val2'])
new_df.columns = new_df.columns.map('.'.join)

I think the above code is not optimized to deal with a larger data set (1.2 million rows). How could I go about optimizing this code?

A similar task done in R with the following code takes much less time:

library(dplyr)
library(tidyr) #complete
library(data.table) #dcast and setDT
df %>% mutate(Date=as.Date(Date,'%m/%d/%Y')) %>% 
       complete(Name, nesting(Date=full_seq(Date,1))) %>%
       setDT(.) %>% dcast(Name ~ Date, value.var=c('Val2','Val1'))

Credits: Python code mentioned in this post is taken from here. R code mentioned in this post is taken from here.


2 Answers


I have time series data in long format which looks as follows

No, it doesn't. It's CSV, so it looks something like

Name,     Date, Val1, Val2
   A, 1/1/2018,    1,    2
   B, 1/1/2018,    2,    3
   C, 1/1/2018,    3,    4
   D, 1/4/2018,    4,    5
   A, 1/4/2018,    5,    6
   B, 1/4/2018,    6,    7
   C, 1/4/2018,    7,    8

You call to_datetime and then undo that with a strftime; why? Don't do either. Pass date_format to read_csv for processing, or if you really want the (poor, non-ISO8601) date format in the output, then don't parse the date at all and leave it as a string.

Don't call from_product. Just pass both columns as index_col to read_csv. Don't reindex, don't reset_index and don't rename your columns.

You didn't make it clear why you're producing the "wide format". For most purposes, I would do the following, which does produce a wide format with the same data but in a different column order and without concatenate-degrading the multi-level column index:

import pandas as pd

df = pd.read_csv(
    '214261.csv',
    skipinitialspace=True,
    index_col=['Name', 'Date'],
    parse_dates=['Date'],
    date_format='%m/%d/%Y',
)

print(df.unstack('Date'))
           Val1                  Val2           
Date 2018-01-01 2018-01-04 2018-01-01 2018-01-04
Name                                            
A           1.0        5.0        2.0        6.0
B           2.0        6.0        3.0        7.0
C           3.0        7.0        4.0        8.0
D           NaN        4.0        NaN        5.0
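If you do still want the full calendar range the question asks for (with 01/02 and 01/03 as all-NaN columns), you can reindex the column MultiIndex after the unstack instead of building the grid up front. A minimal sketch, using a hypothetical inline copy of the question's data:

```python
import pandas as pd
from io import StringIO

# toy data shaped like the question's CSV
csv = StringIO("""Name,Date,Val1,Val2
A,1/1/2018,1,2
B,1/1/2018,2,3
C,1/1/2018,3,4
D,1/4/2018,4,5
A,1/4/2018,5,6
B,1/4/2018,6,7
C,1/4/2018,7,8
""")

df = pd.read_csv(csv, index_col=['Name', 'Date'],
                 parse_dates=['Date'], date_format='%m/%d/%Y')
wide = df.unstack('Date')

# Reindex the (value, Date) column MultiIndex against the full daily
# range, so 01/02 and 01/03 appear as all-NaN columns.
full_cols = pd.MultiIndex.from_product(
    [['Val1', 'Val2'], pd.date_range('2018-01-01', '2018-01-04')],
    names=[None, 'Date'])
wide = wide.reindex(columns=full_cols)
```

This keeps the reshaping work inside a single indexed frame rather than round-tripping through reset_index.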

If you really want Date first and Val second, then call reorder_levels.
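A sketch of that reordering, assuming a wide frame with a (value, Date) column MultiIndex like the one above (the sort afterwards groups each date's Val1/Val2 pair together):

```python
import pandas as pd

# hypothetical small frame shaped like the question's data
df = pd.DataFrame(
    {'Name': ['A', 'A', 'B'],
     'Date': pd.to_datetime(['1/1/2018', '1/4/2018', '1/1/2018'],
                            format='%m/%d/%Y'),
     'Val1': [1, 5, 2], 'Val2': [2, 6, 3]}
).set_index(['Name', 'Date'])

wide = df.unstack('Date')

# put Date first and the value name second, then sort the columns
wide = wide.reorder_levels([1, 0], axis=1).sort_index(axis=1)
```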


Solution in R

In your last code snippet, you're mixing code from tidyverse and data.table packages. I don't consider this to be completely wrong, but I would rather avoid it to increase readability and consistency.

library(magrittr)
library(data.table)
library(bench)

# data copied from OP
dat <- structure(list(Name = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
                                       .Label = c("A", "B", "C", "D"),
                                       class = "factor"),
                      Date = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
                                       .Label = c("1/1/2018", "1/4/2018"),
                                       class = "factor"), 
                      Val1 = 1:7,
                      Val2 = 2:8),
                 class = "data.frame", row.names = 1:7)

dat
#>   Name     Date Val1 Val2
#> 1    A 1/1/2018    1    2
#> 2    B 1/1/2018    2    3
#> 3    C 1/1/2018    3    4
#> 4    D 1/4/2018    4    5
#> 5    A 1/4/2018    5    6
#> 6    B 1/4/2018    6    7
#> 7    C 1/4/2018    7    8
str(dat)
#> 'data.frame':    7 obs. of  4 variables:
#>  $ Name: Factor w/ 4 levels "A","B","C","D": 1 2 3 4 1 2 3
#>  $ Date: Factor w/ 2 levels "1/1/2018","1/4/2018": 1 1 1 2 2 2 2
#>  $ Val1: int  1 2 3 4 5 6 7
#>  $ Val2: int  2 3 4 5 6 7 8

Tidyverse Solution

tidyr::gather(dat, key = "key", value = "value", -Date, -Name) %>% 
    tidyr::unite("id", key, Date, sep = ".") %>% 
    tidyr::spread(id, value)
#>   Name Val1.1/1/2018 Val1.1/4/2018 Val2.1/1/2018 Val2.1/4/2018
#> 1    A             1             5             2             6
#> 2    B             2             6             3             7
#> 3    C             3             7             4             8
#> 4    D            NA             4            NA             5

data.table Solution

dt <- data.table(dat)
dt_long <- melt(dt, id.vars = c("Name", "Date"))

dcast(dt_long, Name ~ variable + Date)
#>    Name Val1_1/1/2018 Val1_1/4/2018 Val2_1/1/2018 Val2_1/4/2018
#> 1:    A             1             5             2             6
#> 2:    B             2             6             3             7
#> 3:    C             3             7             4             8
#> 4:    D            NA             4            NA             5
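Since the comments on this answer note that the OP actually wants Python, it is worth mentioning that this melt-then-cast pattern has a direct pandas analogue: melt to long, then pivot with a list of columns (supported since pandas 1.1). A sketch on the same toy data:

```python
import pandas as pd

dat = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'A', 'B', 'C'],
    'Date': ['1/1/2018'] * 3 + ['1/4/2018'] * 4,
    'Val1': [1, 2, 3, 4, 5, 6, 7],
    'Val2': [2, 3, 4, 5, 6, 7, 8],
})

# melt to long format (data.table's melt), then cast back to wide
# (data.table's dcast) with a (variable, Date) column MultiIndex
long = dat.melt(id_vars=['Name', 'Date'], var_name='variable')
wide = long.pivot(index='Name', columns=['variable', 'Date'], values='value')
```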

Benchmark

As you can see, data.table is already much faster with 12,000 rows.

nrows <- 1.2e4
# nrows <- 1.2e6
dat2 <- expand.grid(Name = LETTERS[1:4],
                    Date = seq(as.Date("2018-01-01"), by = "days", length.out = nrows/4))
dat2$Val1 <- sample(1:8, nrow(dat2), TRUE)
dat2$Val2 <- sample(1:8, nrow(dat2), TRUE)

f1 <- function(dat) {
    tidyr::gather(dat, key = "key", value = "value", -Date, -Name) %>% 
        tidyr::unite("id", key, Date, sep = ".") %>% 
        tidyr::spread(id, value)
}

f2 <- function(dat) {
    dt <- data.table(dat)
    dt_long <- melt(dt, id.vars = c("Name", "Date"))
    dt_wide <- dcast(dt_long, Name ~ variable + Date)
}

mark(tidyverse = f1(dat2),
     datatable = f2(dat2),
     check = function(x, y) all.equal(x, y, check.attributes = FALSE))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 10
#>   expression     min    mean  median     max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:t> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl>
#> 1 tidyverse  184.4ms 189.7ms 187.9ms 196.7ms      5.27   15.73MB     5
#> 2 datatable   43.1ms  45.9ms  45.4ms  51.7ms     21.8     5.36MB     2
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>

Created on 2019-02-26 by the reprex package (v0.2.1)

  • "I appreciate your response, but I am looking for an optimized version in Python." Commented Feb 26, 2019 at 9:14
  • "Alright. Then you should mention that in your question. Consider adding a bold one-liner stating what your question actually is." Commented Feb 26, 2019 at 10:13
  • "I suppose it's clearly mentioned in the post, as there is only one sentence followed by a question mark. I've bolded that line in my edit." Commented Feb 26, 2019 at 10:24
