12

I have a pandas dataframe as follows:

ticker    account      value         date
aa       assets       100,200       20121231, 20131231
bb       liabilities  50, 150       20141231, 20131231

I would like to split df['value'] and df['date'] so that the dataframe looks like this:

ticker    account      value         date
aa       assets       100           20121231
aa       assets       200           20131231 
bb       liabilities  50            20141231
bb       liabilities  150           20131231

Would greatly appreciate any help.

2

5 Answers 5

16

You can first split columns, create Series by stack and remove whitespaces by strip:

s1 = df.value.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
s2 = df.date.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)

Then concat both Series to df1:

df1 = pd.concat([s1,s2], axis=1, keys=['value','date'])

Remove old columns value and date and join:

print (df.drop(['value','date'], axis=1).join(df1).reset_index(drop=True))
  ticker      account value      date
0     aa       assets   100  20121231
1     aa       assets   200  20131231
2     bb  liabilities    50  20141231
3     bb  liabilities   150  20131231
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you jezrael and piRSquared for both answers!! jezrael, your approach worked efficiently.
9

I'm noticing this question a lot. That is, how do I split this column that has a list into multiple rows? I've seen it called exploding. Here are some links:

So I wrote a function that will do it.

def explode(df, columns):
    idx = np.repeat(df.index, df[columns[0]].str.len())
    a = df.T.reindex_axis(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)

But before we can use it, we need lists (or iterable) in a column.

Setup

df = pd.DataFrame([['aa', 'assets',      '100,200', '20121231,20131231'],
                   ['bb', 'liabilities', '50,50',   '20141231,20131231']],
                  columns=['ticker', 'account', 'value', 'date'])

df

enter image description here

split value and date columns:

df.value = df.value.str.split(',')
df.date = df.date.str.split(',')

df

enter image description here

Now we could explode on either column or both, one after the other.

Solution

explode(df, ['value','date'])

enter image description here


Timing

I removed strip from @jezrael's timing because I could not effectively add it to mine. This is a necessary step for this question as OP has spaces in strings after commas. I was aiming at providing a generic way to explode a column given it already has iterables in it and I think I've accomplished that.

code

def get_df(n=1):
    return pd.DataFrame([['aa', 'assets',      '100,200,200', '20121231,20131231,20131231'],
                         ['bb', 'liabilities', '50,50',   '20141231,20131231']] * n,
                        columns=['ticker', 'account', 'value', 'date'])

small 2 row sample

enter image description here

medium 200 row sample

enter image description here

large 2,000,000 row sample

enter image description here

3 Comments

I am very curious for timings ;) iterritems is slow, but on the other hand there is a lot operation like stack, concat and join, so maybe this can be comparable.
I see difference in solution - I use strip. Can you add it to your solution too and then try timings again? I think you forget for it.
@jezrael updated. Notice what I wrote just below ###Timing
2

I wrote explode function based on previous answers. It might be useful for anyone who want to grab and use it quickly.

def explode(df, cols, split_on=','):
    """
    Explode dataframe on the given column, split on given delimeter
    """
    cols_sep = list(set(df.columns) - set(cols))
    df_cols = df[cols_sep]
    explode_len = df[cols[0]].str.split(split_on).map(len)
    repeat_list = []
    for r, e in zip(df_cols.as_matrix(), explode_len):
        repeat_list.extend([list(r)]*e)
    df_repeat = pd.DataFrame(repeat_list, columns=cols_sep)
    df_explode = pd.concat([df[col].str.split(split_on, expand=True).stack().str.strip().reset_index(drop=True)
                            for col in cols], axis=1)
    df_explode.columns = cols
    return pd.concat((df_repeat, df_explode), axis=1)

example given from @piRSquared:

df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
                   ['bb', 'liabilities', '50,50', '20141231,20131231']],
                  columns=['ticker', 'account', 'value', 'date'])
explode(df, ['value', 'date'])

output

+-----------+------+-----+--------+
|    account|ticker|value|    date|
+-----------+------+-----+--------+
|     assets|    aa|  100|20121231|
|     assets|    aa|  200|20131231|
|liabilities|    bb|   50|20141231|
|liabilities|    bb|   50|20131231|
+-----------+------+-----+--------+

Comments

2

Pandas >= 0.25

df.value = df.value.str.split(',')
df.date = df.date.str.split(',')
df = df.explode('value').explode("date").reset_index(drop=True)

df:

    ticker  account      value  date
0   aa      assets       100    20121231
1   aa      assets       100    20131231
2   aa      assets       200    20121231
3   aa      assets       200    20131231
4   bb      liabilities  50     20141231
5   bb      liabilities  50     20131231
6   bb      liabilities  50     20141231
7   bb      liabilities  50     20131231

Comments

0

Because I'm too new, I'm not allowed to write a comment, so I write an "answer".

@titipata your answer worked really good, but in my opinion there is a small "mistake" in your code I'm not able to find for my self.

I work with the example from this question and changed just the values.

df = pd.DataFrame([['title1', 'publisher1', '1.1,1.2', '1'],
               ['title2', 'publisher2', '2', '2.1,2.2']],
              columns=['titel', 'publisher', 'print', 'electronic'])

explode(df, ['print', 'electronic'])

    publisher   titel   print   electronic
0   publisher1  title1  1.1     1
1   publisher1  title1  1.2     2.1
2   publisher2  title2  2       2.2

As you see, in the column 'electronic' should be in row '1' the value '1' and not '2.1'.

Because of that, the hole DataSet would change. I hope someone could help me to find a solution for this.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.