How to structure vectorized function with pandas?

Question

I am unsure how to structure a function I want to vectorize in pandas.

I have two DataFrames like this:

contents = pd.DataFrame({
'Items': [1, 2, 3, 1, 1, 2],
})

cats = pd.DataFrame({
'Cat1': ['1|2|4'],
'Cat2': ['3|2|5'],
'Cat3': ['6|9|11'],
})

My goal is to .insert a new column into contents that, per row, is 1 if contents['Items'] is an element of cats['Cat1'] and 0 otherwise. This should be repeated for each cat.

Goal format:

contents = pd.DataFrame({
'Items': [1, 2, 3, 1, 1, 2],
'contains_Cat1': [1, 1, 0, 1, 1, 1],
'contains_Cat2': [0, 1, 1, 0, 0, 1],
'contains_Cat3': [0, 0, 0, 0, 0, 0],
})

As my contents df is big(!), I would like to vectorize this. My approach for each cat is to do something like this:

contents.insert(
    loc=len(contents.columns),
    column='contains_Cat1',
    value=has_content(contents, cats['Cat1'])
)

def has_content(contents: pd.DataFrame, cat: pd.Series) -> pd.Series:
    # Initialization of pd.Series here??
    if contents['Items'] in cat:
        return True
    else:
        return False

My question is: How do I structure my has_content(...)? Especially unclear to me is how I initialize that pd.Series to contain all False values. Do I even need to? After that I know how to check if something is contained in something else. But can I really do it column-wise like above and return immediately without becoming cell-wise?

Anurag Dabas · Accepted Answer · 2021-08-08 17:01:33Z

5

Try str.get_dummies, then reshape with stack and unstack:

out = cats.stack().str.get_dummies().stack()\
          .unstack(level=1).reset_index(level=0,drop=True)\
           .reindex(contents.Items.astype(str))
Out[229]: 
       Cat1  Cat2  Cat3
Items                  
1         1     0     0
2         1     1     0
3         0     1     0
1         1     0     0
1         1     0     0
2         1     1     0

Improvement:

out=cats.stack().str.get_dummies().droplevel(0).T\
        .add_prefix('contains_').reindex(contents['Items'].astype(str)).reset_index()

Out[230]: 

    Items   contains_Cat1   contains_Cat2   contains_Cat3
0   1       1               0               0
1   2       1               1               0
2   3       0               1               0
3   1       1               0               0
4   1       1               0               0
5   2       1               1               0
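Since the expression above returns a new DataFrame rather than modifying contents, one way to attach the flags is sketched below. This is only an illustration under a few assumptions: add_flags is a hypothetical helper name, and the fillna(0) step assumes that an item missing from every category should count as 0 (reindex would otherwise leave NaN in that row).

```python
import pandas as pd

contents = pd.DataFrame({'Items': [1, 2, 3, 1, 1, 2]})
cats = pd.DataFrame({
    'Cat1': ['1|2|4'],
    'Cat2': ['3|2|5'],
    'Cat3': ['6|9|11'],
})


def add_flags(contents: pd.DataFrame, cats: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return contents plus one contains_<cat> column per category."""
    # Lookup table: one row per distinct member, one 0/1 column per category.
    lookup = cats.stack().str.get_dummies().droplevel(0).T.add_prefix('contains_')
    # Pick the lookup row for every item (the lookup index holds strings).
    flags = lookup.reindex(contents['Items'].astype(str))
    # Assumption: items that appear in no category (NaN rows) count as 0.
    flags = flags.fillna(0).astype(int)
    # Restore the original row labels so that join aligns row by row.
    flags.index = contents.index
    return contents.join(flags)


result = add_flags(contents, cats)
print(result)
```

With this shape, neither .insert() nor a per-category function is needed, and the Items column keeps its integer dtype.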

edited Aug 8, 2021 at 17:01

Anurag Dabas

answered Aug 8, 2021 at 16:30

BENY

1 Comment

harmonica141 Over a year ago

Just to clarify: I'm getting a df returned here, right? So I skip the .insert() and the separate function altogether?

Pygirl · Accepted Answer · 2021-08-08 17:52:20Z

Simple method:

contents = (contents.join([pd.Series(contents.Items.astype(str).
                                     str.contains(cats[c][0]).astype(int),
                                     name="contains_"+c) for c in cats]))

contents:

    Items   contains_Cat1   contains_Cat2   contains_Cat3
0   1       1               0               0
1   2       1               1               0
2   3       0               1               0
3   1       1               0               0
4   1       1               0               0
5   2       1               1               0
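One caveat worth checking on real data: str.contains treats '1|2|4' as a regular expression and searches for it anywhere in the string, so a multi-digit item such as 11 would also be flagged for Cat1. The sketch below contrasts that with an exact membership test via Series.isin, which is a different technique from the one used above. The value 11 is a hypothetical extra item, not part of the question's data.

```python
import pandas as pd

items = pd.Series([1, 11, 3], name='Items')  # 11 is a hypothetical extra value
pattern = '1|2|4'                            # same string as cats['Cat1'][0]

# Regex search: '11' contains the character '1', so it is flagged too.
by_regex = items.astype(str).str.contains(pattern).astype(int)

# Exact membership: split the category string and compare whole values.
members = pattern.split('|')
by_membership = items.astype(str).isin(members).astype(int)

print(pd.DataFrame({'Items': items,
                    'by_regex': by_regex,
                    'by_membership': by_membership}))
```

For the single-digit items in the question both variants agree, so the difference only matters once items can be substrings of each other.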

Time comparison:

%%timeit -n 2000
(contents.join([pd.Series(contents.Items.astype(str).
                                     str.contains(cats[c][0]).astype(int),
                                     name="contains_"+c) for c in cats]))

3.01 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)

%%timeit -n 2000
cats.stack().str.get_dummies().stack()\
          .unstack(level=1).reset_index(level=0,drop=True)\
           .reindex(contents.Items.astype(str))

5.13 ms ± 584 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)

%%timeit -n 2000
cats.stack().str.get_dummies().droplevel(0).T\
        .add_prefix('contains_').reindex(contents['Items'].astype(str)).reset_index()

4.58 ms ± 512 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
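These timings were taken on the six-row example, so they may not reflect behaviour on a big contents frame. A rough way to repeat the comparison on larger input is sketched below. The row count, the random item values and the function names via_contains and via_get_dummies are all assumptions made for illustration, and no timing results are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

cats = pd.DataFrame({
    'Cat1': ['1|2|4'],
    'Cat2': ['3|2|5'],
    'Cat3': ['6|9|11'],
})

# Hypothetical larger input. Values are limited to 1..5 so that every item
# is a single digit and belongs to at least one category, which keeps the
# two approaches comparable.
rng = np.random.default_rng(0)
big = pd.DataFrame({'Items': rng.integers(1, 6, size=10_000)})


def via_contains(df: pd.DataFrame) -> pd.DataFrame:
    # Regex search of each item against the category string.
    as_text = df['Items'].astype(str)
    return df.join([
        as_text.str.contains(cats[c].iloc[0]).astype(int).rename('contains_' + c)
        for c in cats
    ])


def via_get_dummies(df: pd.DataFrame) -> pd.DataFrame:
    # Build a member-by-category lookup table, then pick one row per item.
    lookup = cats.stack().str.get_dummies().droplevel(0).T.add_prefix('contains_')
    flags = lookup.reindex(df['Items'].astype(str))
    flags.index = df.index
    return df.join(flags)


for fn in (via_contains, via_get_dummies):
    seconds = timeit.timeit(lambda: fn(big), number=5) / 5
    print(f'{fn.__name__}: {seconds * 1000:.2f} ms per call')
```

Which approach wins is likely to depend on the number of rows and on how many distinct members the categories hold, so it is worth measuring on data shaped like your own.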
