57

I have a DataFrame df with a column containing labels for each row (in addition to some relevant data for each row). I have a dictionary labeldict with keys equal to the possible labels and values equal to 2-tuples of information related to that label. I'd like to tack two new columns onto my frame, one for each part of the 2-tuple corresponding to the label for each row.

Here is the setup:

import pandas as pd
import numpy as np

np.random.seed(1)
n = 10

labels = list('abcdef')
colors = ['red', 'green', 'blue']
sizes = ['small', 'medium', 'large']

labeldict = {c: (np.random.choice(colors), np.random.choice(sizes)) for c in labels}

df = pd.DataFrame({'label': np.random.choice(labels, n), 
                   'somedata': np.random.randn(n)})

I can get what I want by running:

df['color'], df['size'] = zip(*df['label'].map(labeldict))
print df

  label  somedata  color    size
0     b  0.196643    red  medium
1     c -1.545214  green   small
2     a -0.088104  green   small
3     c  0.852239  green   small
4     b  0.677234    red  medium
5     c -0.106878  green   small
6     a  0.725274  green   small
7     d  0.934889    red  medium
8     a  1.118297  green   small
9     c  0.055613  green   small

But how can I do this if I don't want to manually type out the two columns on the left side of the assignment? I.e. how can I create multiple new columns on the fly. For example, if I had 10-tuples in labeldict instead of 2-tuples, this would be a real pain as currently written. Here are a couple things that don't work:

# set up attrlist for later use
attrlist = ['color', 'size']

# non-working idea 1)
df[attrlist] = zip(*df['label'].map(labeldict))

# non-working idea 2)
df.loc[:, attrlist] = zip(*df['label'].map(labeldict))

This does work, but seems like a hack:

for a in attrlist:
    df[a] = 0
df[attrlist] = zip(*df['label'].map(labeldict))

Better solutions?

3
  • 2
    FYI your last method will work in 0.13 (without initially creating the column) Commented Dec 29, 2013 at 21:55
  • Thanks Jeff. I have a similar question inspired by your comment here. I am having a similar problem to the one stated here Commented Sep 17, 2014 at 16:58
  • Seems like the answer is pythonically simple: stackoverflow.com/a/34074894/3494126 Commented Oct 8, 2019 at 9:57

6 Answers 6

35

Just use result_type='expand' in pandas apply

df
Out[78]: 
   a  b
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

df[['mean', 'std', 'max']]=df[['a','b']].apply(mathOperationsTuple, axis=1, result_type='expand')

df
Out[80]: 
   a  b  mean  std  max
0  0  1   0.5  0.5  1.0
1  2  3   2.5  0.5  3.0
2  4  5   4.5  0.5  5.0
3  6  7   6.5  0.5  7.0
4  8  9   8.5  0.5  9.0

and here some copy paste code

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5,2), columns=['a','b'])
print('df',df, sep='\n')
print()
def mathOperationsTuple(arr):
    return np.mean(arr), np.std(arr), np.amax(arr)

df[['mean', 'std', 'max']]=df[['a','b']].apply(mathOperationsTuple, axis=1, result_type='expand')
print('df',df, sep='\n')
Sign up to request clarification or add additional context in comments.

2 Comments

Use of apply was what I wanted but although this answer was helpful it made the assignment and function interdependent based on the order of the columns used as input and output. I ended up using the approach from @EricNess instead to capture this mapping entirely within the helper function.
13

You can use merge instead:

>>> ld = pd.DataFrame(labeldict).T
>>> ld.columns = ['color', 'size']
>>> ld.index.name = 'label'
>>> df.merge(ld.reset_index(), on='label')
  label  somedata  color    size
0     b  1.462108    red  medium
1     c -2.060141  green   small
2     c  1.133769  green   small
3     c  0.042214  green   small
4     e -0.322417    red  medium
5     e -1.099891    red  medium
6     e -0.877858    red  medium
7     e  0.582815    red  medium
8     f -0.384054    red   large
9     d -0.172428    red  medium

2 Comments

you can get away without the reset_index using merge's left_index and right_on.
Note that merge generates a new dataframe instead of adding columns to an existing one. Also if there is no index column in the dataframe of your problem, you can't use merge.
8

Instead of doing what you're doing with labeldict, you could make that information into a DataFrame and then join it with your original one:

>>> labeldf = pandas.DataFrame([(np.random.choice(colors), np.random.choice(sizes)) for c in labels], columns=['color', 'size'], index=labels)
>>> df.join(labeldf, on='label')
  label  somedata  color    size
0     a -1.709973    red  medium
1     b  0.099109   blue  medium
2     a -0.427323    red  medium
3     b  0.474995   blue  medium
4     b -2.819208   blue  medium
5     d -0.998888    red   small
6     b  0.713357   blue  medium
7     d  0.331989    red   small
8     e -0.906240  green   large
9     c -0.501916   blue   large

Comments

3

If you want to add multiple columns to a DataFrame as part of a method chain, you can use apply. The first step is to create a function that will transform a row represented as a Series into the form you want. Then you can call apply to use this function on each row.

def append_label_attributes(row: pd.Series, labelmap: dict) -> pd.Series:
    result = row.copy()
    result['color'] = labelmap[result['label']][0]
    result['size'] = labelmap[result['label']][1]
    return result

df = (
    pd.DataFrame(
    {
        'label': np.random.choice(labels, n),
        'somedata': np.random.randn(n)}
    )
    .apply(append_label_attributes, axis='columns', labelmap=labeldict)
)

Comments

2

This should work:

df[['color','size']] = list(df['label'].apply(labeldict))

Comments

0

pandas apply method support this out of the box

code

demo.py

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['a', 'b'])
df2 = df[['a', 'b']].apply(lambda ser: pd.Series({
    'mean': ser.mean(),
    'std': ser.std(ddof=0),
    'max': np.amax(ser.to_numpy())
}), axis=1)
df2 = pd.concat((df, df2), axis=1)

output

>>> from demo import df, df2
>>> df
   a  b
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> df2
   a  b  mean  std  max
0  0  1   0.5  0.5  1.0
1  2  3   2.5  0.5  3.0
2  4  5   4.5  0.5  5.0
3  6  7   6.5  0.5  7.0
4  8  9   8.5  0.5  9.0

background

A few years after writing my first answer to this question I got a bit better overview of the mechanics in pandas.

The func argument in apply takes a function: pd.Series -> Any whereas from the type of Any the exact processing of the returned value into the DataFrame is deduced.

Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function.

If you pass a lambda function: pd.Series -> pd.Series, you get exactly the required behavior.

Note

apply can also be applied on pd.groupby outputs, achieving a more flexible alternative to .aggregate. You can find an example in the question: Python pandas groupby aggregate on multiple columns, then pivot

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.