I'd like to learn how to create a data frame column that holds a code mapped from multiple columns.

In the partial example below I tried what may be a clumsy approach: get the unique values as a temporary data frame, concatenate a prefix string with the temp row number as a new column, and then join the two data frames.

import pandas as pd

df = pd.DataFrame({'col1' : ['A1', 'A2', 'A1', 'A3'],
                   'col2' : ['B1', 'B2', 'B1', 'B1'],
                   'value' : [100, 200, 300, 400],
                   })

tmp = df[['col1','col2']].drop_duplicates(['col1', 'col2'])


#   col1 col2
# 0   A1   B1
# 1   A2   B2
# 3   A3   B1

The first question is how to get 'temp' row number and its value to a tmp column?

And what is the clever pythonic way to achieve the result below from df?

dfnew = pd.DataFrame({'col1' : ['A1', 'A2', 'A1', 'A3'],
                   'col2' : ['B1', 'B2', 'B1', 'B1'],
                   'code' :  ['CODE0','CODE1', 'CODE0', 'CODE3'],
                   'value' : [100, 200, 300, 400],
                   })

    code col1 col2  value
0  CODE0   A1   B1    100
1  CODE1   A2   B2    200
2  CODE0   A1   B1    300
3  CODE3   A3   B1    400

thanks.

After the answers and just as an exercise I kept working on the non-pythonic version I had in mind with insights I got from great answers, and reached this:

tmp = df[['col1','col2']].drop_duplicates(['col1', 'col2'])

tmp.reset_index(inplace=True)

tmp.drop('index', axis=1, inplace=True)

tmp['code'] = tmp.index.to_series().apply(lambda x: 'code' + format(x, '04d'))

dfnew = pd.merge(df, tmp, on=['col1', 'col2'])

At the time of posting this question, I did not realize that it would be nicer to reset the index and get a fresh sequence instead of the original index numbers.

I tried some variations but could not work out how to chain 'reset_index' and 'drop' in a single command.
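
One way to chain the two steps, as a minimal sketch: `reset_index` accepts `drop=True`, which discards the old index instead of turning it into an 'index' column, so the separate `drop` call is not needed. The `how='left'` argument on the merge is an addition here (not in the code above) so that the rows of df keep their original order.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A1', 'A2', 'A1', 'A3'],
                   'col2': ['B1', 'B2', 'B1', 'B1'],
                   'value': [100, 200, 300, 400]})

# drop=True discards the old index rather than keeping it as an 'index' column
tmp = df[['col1', 'col2']].drop_duplicates().reset_index(drop=True)

# build zero-padded codes from the fresh 0..n-1 index
tmp['code'] = ['code{:04d}'.format(i) for i in tmp.index]

# left merge keeps the rows of df in their original order
dfnew = df.merge(tmp, on=['col1', 'col2'], how='left')
```

With a fresh index the codes are sequential, so the last row is labeled from position 2 rather than from its original index 3.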

I'm starting to enjoy Python. Thank you all.

4 Answers

groupby on df.index with ['col1', 'col2'] using transform('first') and map

df.assign(
    code=df.index.to_series().groupby(
        [df.col1, df.col2]
    ).transform('first').map('CODE{}'.format)
)[['code'] + df.columns.tolist()]

    code col1 col2  value
0  CODE0   A1   B1    100
1  CODE1   A2   B2    200
2  CODE0   A1   B1    300
3  CODE3   A3   B1    400

explanation

# turn index to series so I can perform a groupby on it
idx_series = df.index.to_series()

# groupby col1 and col2 to establish uniqueness
idx_gb = idx_series.groupby([df.col1, df.col2])

# get first index value in each unique group
# and broadcast over entire group with transform
idx_tf = idx_gb.transform('first')

# map a format function to get desired string
code = idx_tf.map('CODE{}'.format)

# use assign to create new column
df.assign(code=code)

1 Comment

Hi @piRSquared, thank you, very fast solution indeed. One more question: how could I make 'groupby([df.col1, df.col2])' work on a variable list of columns/fields, something like 'df.groupby([list of fields])'? I'd like to generalize it into a function that takes the field names used to build the code as parameters.
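
A possible way to generalize this, as a sketch only: build the list of grouping Series from a list of column names. The function name `add_code` and its `prefix` parameter are hypothetical and not part of the answer above.

```python
import pandas as pd

def add_code(df, fields, prefix='CODE'):
    """Hypothetical helper: label each row with prefix + first index of its group."""
    # one grouping Series per requested column name
    keys = [df[f] for f in fields]
    # first index label of each group, broadcast back to every row of the group
    first_idx = df.index.to_series().groupby(keys).transform('first')
    return df.assign(code=first_idx.map(lambda i: prefix + str(i)))

df = pd.DataFrame({'col1': ['A1', 'A2', 'A1', 'A3'],
                   'col2': ['B1', 'B2', 'B1', 'B1'],
                   'value': [100, 200, 300, 400]})

result = add_code(df, ['col1', 'col2'])
by_col2 = add_code(df, ['col2'], prefix='K')
```

Passing a different list, such as `['col2']`, changes which columns define a group without touching the function body.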

You can first use sort_values on columns col1 and col2, then find all duplicates with duplicated:

df = df.sort_values(['col1', 'col2'])
mask = df.duplicated(['col1','col2'])
print (mask)
0    False
2     True
1    False
3    False
dtype: bool

Then use insert, if you need to specify the position of the output column code, together with numpy.where, and forward-fill the missing values with ffill. Finally, use sort_index:

df.insert(0, 'code', np.where(mask, np.nan, 'CODE' + df.index.astype(str)))
df.code = df.code.ffill()
df = df.sort_index()
print (df)
    code col1 col2  value
0  CODE0   A1   B1    100
1  CODE1   A2   B2    200
2  CODE0   A1   B1    300
3  CODE3   A3   B1    400

5 Comments

Sure, btw, I think my code is faster, you can check it. ;)
OK, I'll try to build a comparison test here with the real data frame. Thank you.
Let me know, I am really interested. Because obviously groupby is slow.
Your solution is the second fastest in some tests I did here. I used my data with approx. 150,000 records and it gets very close to piRSquared's solution when I remove the sort at the end. Is that really necessary? The concat solution was the slowest.
No, sort_index at the end is not necessary, it only gives nicer output ;)

How to get 'temp' row number and its value to a tmp column?

The value column is not propagated because you filter it out at the beginning: df[['col1','col2']]. Hence, this is fixed by changing it to tmp = df.drop_duplicates(['col1', 'col2']).

The original row labels are preserved in the index of tmp; if you want to copy them explicitly into a data column, just do tmp['index'] = tmp.index.
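
As a minimal sketch of the difference between the original labels and a fresh row number (the column name 'row' and the `.copy()` call are assumptions added for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A1', 'A2', 'A1', 'A3'],
                   'col2': ['B1', 'B2', 'B1', 'B1'],
                   'value': [100, 200, 300, 400]})

# keep all columns; .copy() gives an independent frame to add columns to
tmp = df.drop_duplicates(['col1', 'col2']).copy()

# original index labels of the first occurrence of each pair
tmp['index'] = tmp.index

# fresh positional row number within tmp
tmp['row'] = range(len(tmp))
```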

What is the clever pythonic way to achieve the result below from df?

I do not know if it is particularly clever or not, as this is subjective, but one way of achieving that is

pd.concat([gr.assign(code='CODE{}'.format(min(gr.index))) for _, gr in df.groupby(['col1', 'col2'])])

Finally, to achieve the result in a form you specified, you can add .sort_index() and [['code', 'col1', 'col2', 'value']] to the above, in order to specify ordering of columns. Giving:

newdf = pd.concat([gr.assign(code='CODE{}'.format(min(gr.index))) for _, gr in df.groupby(['col1', 'col2'])]).sort_index()[['code', 'col1', 'col2', 'value']]

A possible performance bottleneck is the combination of groupby and concat, which may matter if you operate on large data sets.

If you have a DataFrame df like this:

    state       year    population
0   California  2000    33871648
1   California  2010    37253956
2   New York    2000    18976457
3   New York    2010    19378102
4   Texas       2000    20851820
5   Texas       2010    25145561

you can create indexes from state and year columns with:

df2 = df.set_index(['state','year'])

which will give you a DataFrame with a MultiIndex constructed from the columns state and year.

Accessing a MultiIndexed DataFrame

# take the population column as a Series indexed by (state, year)
pop = df2['population']

pop.loc['California', 2000]
Result: 33871648

pop.loc[:, 2010]
Result:
state
California    37253956
New York      19378102
Texas         25145561
Name: population, dtype: int64


pop.loc['California':'New York']
Result:
state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Name: population, dtype: int64
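
A self-contained sketch of these selections, assuming `pop` is the population column of df2 (the answer does not define it). The `xs` and `reset_index` calls are additions for illustration: `xs` selects on a named index level, and `reset_index` turns the MultiIndex back into ordinary columns.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
    'year': [2000, 2010, 2000, 2010, 2000, 2010],
    'population': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
})

# move state and year into a two-level index
df2 = df.set_index(['state', 'year'])
pop = df2['population']

# a single value, addressed by both index levels
one_value = pop.loc[('California', 2000)]

# all states for one year, selected on the 'year' level
year_2010 = pop.xs(2010, level='year')

# label-based slice on the first level (requires a sorted index)
first_two = pop.loc['California':'New York']

# turn the index levels back into regular columns
restored = df2.reset_index()
```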
