Split dataframe column containing iterable

Question

I have a DataFrame in which one column contains sequential data as a list or tuple (always the same length). My aim is to split this column into several new columns, ideally overwriting one of the existing columns in the process.

Here is a minimal example:

from pandas import DataFrame, concat

data = DataFrame({"label": [a for a in "abcde"], "x": range(5)})
print(data)

  label  x
0     a  0
1     b  1
2     c  2
3     d  3
4     e  4

An idealized version, using a nonexistent method splittuple, would look something like this:

data[["x", "x2"]] = data["x"].apply(lambda x: (x, x*2)).splittuple(expand = True)

resulting in

  label  x  x2
0     a  0   0
1     b  1   2
2     c  2   4
3     d  3   6
4     e  4   8

Of course I can do it like this, though the solution is a bit clunky:

newdata = DataFrame(data["x"].apply(lambda x: (x, x*2)).tolist(), columns = ["x", "x2"])
data.drop("x", axis = 1, inplace = True)
data = concat((data, newdata), axis = 1)
print(data)

  label  x  x2
0     a  0   0
1     b  1   2
2     c  2   4
3     d  3   6
4     e  4   8

An alternative, even uglier solution:

data[["x", "x2"]] = (
  data["x"].apply(lambda x: "{} {}".format(x, x*2)).str.split(expand = True).astype(int)
)

Could you suggest a more elegant way to do this type of transformation?

jezrael · Accepted Answer · 2018-01-18 15:43:22Z

This is possible with apply and pd.Series, but it is slow:

import pandas as pd
tup = data["x"].apply(lambda x: (x, x*2))
data[["x", "x2"]] = tup.apply(pd.Series)

print (data)
  label  x  x2
0     a  0   0
1     b  1   2
2     c  2   4
3     d  3   6
4     e  4   8

Using the DataFrame constructor is faster:

tup = data["x"].apply(lambda x: (x, x*2))
data[["x", "x2"]] = pd.DataFrame(tup.values.tolist())
print (data)
  label  x  x2
0     a  0   0
1     b  1   2
2     c  2   4
3     d  3   6
4     e  4   8
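One point the constructor approach above leaves implicit: pd.DataFrame(tup.values.tolist()) gets a fresh 0..n-1 index, so it lines up with data only when data also has a default index. A minimal sketch, assuming data might carry some other index (the index values 10..14 and the name expanded are made up for illustration), passes index=data.index and the column names explicitly so the rows stay matched:

```python
import pandas as pd

# Illustrative frame with a non-default index (10..14 is an assumption for this sketch).
data = pd.DataFrame(
    {"label": list("abcde"), "x": range(5)},
    index=[10, 11, 12, 13, 14],
)

# Column of tuples, as in the question.
tup = data["x"].apply(lambda x: (x, x * 2))

# Build the expanded frame on the same index as `data`, so the
# assignment below matches rows by label rather than by position.
expanded = pd.DataFrame(tup.tolist(), index=data.index, columns=["x", "x2"])
data[["x", "x2"]] = expanded

print(data)
```

With a default RangeIndex, as in the original example, the extra index argument changes nothing, so it is mainly a safeguard for frames that were filtered or re-indexed earlier.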

Timings:

data = pd.DataFrame({"label": [a for a in "abcde"], "x": range(5)})
data = pd.concat([data]*1000).reset_index(drop=True)
tup = data["x"].apply(lambda x: (x, x*2))


data[["x", "x2"]] = tup.apply(pd.Series)
data[["y", "y2"]] = pd.DataFrame(tup.values.tolist())
print (data)

In [266]: %timeit data[["x", "x2"]] = tup.apply(pd.Series)
1 loop, best of 3: 836 ms per loop

In [267]: %timeit data[["y", "y2"]] = pd.DataFrame(tup.values.tolist())
100 loops, best of 3: 3.1 ms per loop
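The %timeit lines above need IPython. As a rough sketch for reproducing the comparison in a plain script, the standard timeit module can be used instead; the helper names via_apply_series and via_constructor are made up here, and the absolute numbers will differ by machine and pandas version:

```python
import timeit

import pandas as pd

# Same 5000-row setup as in the timings above.
data = pd.DataFrame({"label": list("abcde"), "x": range(5)})
data = pd.concat([data] * 1000).reset_index(drop=True)
tup = data["x"].apply(lambda x: (x, x * 2))


def via_apply_series():
    # Builds one Series per row, which is the slow path.
    return tup.apply(pd.Series)


def via_constructor():
    # Hands the whole list of tuples to the DataFrame constructor at once.
    return pd.DataFrame(tup.tolist())


# A single run each is enough for a coarse comparison.
t_apply = timeit.timeit(via_apply_series, number=1)
t_ctor = timeit.timeit(via_constructor, number=1)
print("apply(pd.Series):   {:.4f} s".format(t_apply))
print("DataFrame(tolist):  {:.4f} s".format(t_ctor))
```

Both helpers return a 5000 x 2 frame with the same values, so the comparison only measures how the frame gets built.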

edited Jan 18, 2018 at 15:43

answered Jan 18, 2018 at 15:37

jezrael


1 Comment

Pyd Over a year ago

hi @jezrael , can you pls check this stackoverflow.com/questions/48335265/…
