I am unsure how to structure a function I want to vectorize in pandas.

I have two df's like such:

contents = pd.DataFrame({
'Items': [1, 2, 3, 1, 1, 2],
})

cats = pd.DataFrame({
'Cat1': ['1|2|4'],
'Cat2': ['3|2|5'],
'Cat3': ['6|9|11'],
})

My goal is to .insert a new column into contents that, per row, is 1 if contents['Items'] is an element of cats['Cat1'] and 0 otherwise. This is to be repeated for each cat.

Goal format:

contents = pd.DataFrame({
'Items': [1, 2, 3, 1, 1, 2],
'contains_Cat1': [1, 1, 0, 1, 1, 1],
'contains_Cat2': [0, 1, 1, 0, 0, 1],
'contains_Cat3': [0, 0, 0, 0, 0, 0],
})

As my contents df is big(!) I would like to vectorize this. My approach for each cat is to do something like this

contents.insert(
    loc=len(contents.columns),
    column='contains_Cat1',
    value=has_content(contents, cats['Cat1'])
)

def has_content(contents: pd.DataFrame, cat: pd.Series) -> pd.Series:
    # Initialization of pd.Series here??
    if contents['Items'] in cat:
        return True
    else:
        return False

My question is: How do I structure my has_content(...)? Especially unclear to me is how I initialize that pd.Series to contain all False values. Do I even need to? After that I know how to check if something is contained in something else. But can I really do it column-wise like above and return immediately without becoming cell-wise?

2 Answers

Try str.get_dummies, then reshape with stack and unstack:

out = cats.stack().str.get_dummies().stack()\
          .unstack(level=1).reset_index(level=0,drop=True)\
           .reindex(contents.Items.astype(str))
Out[229]: 
       Cat1  Cat2  Cat3
Items                  
1         1     0     0
2         1     1     0
3         0     1     0
1         1     0     0
1         1     0     0
2         1     1     0

Improvement:

out=cats.stack().str.get_dummies().droplevel(0).T\
        .add_prefix('contains_').reindex(contents['Items'].astype(str)).reset_index()

Out[230]: 

    Items   contains_Cat1   contains_Cat2   contains_Cat3
0   1       1               0               0
1   2       1               1               0
2   3       0               1               0
3   1       1               0               0
4   1       1               0               0
5   2       1               1               0
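
The expression above returns a new DataFrame, so neither .insert() nor a separate helper function is required. Note that Items in that result holds strings, because of the astype(str) used for the lookup. Below is a minimal sketch of one way to attach the flags to the original contents while keeping Items as integers. The names lookup, flags and combined are illustrative only, and the fillna(0) step assumes that an item missing from every category should be flagged 0.

```python
import pandas as pd

contents = pd.DataFrame({'Items': [1, 2, 3, 1, 1, 2]})
cats = pd.DataFrame({
    'Cat1': ['1|2|4'],
    'Cat2': ['3|2|5'],
    'Cat3': ['6|9|11'],
})

# One row per distinct category member, one 0/1 column per category.
lookup = (cats.stack().str.get_dummies().droplevel(0).T
              .add_prefix('contains_'))

# Align the lookup to every row of contents. Items that appear in no
# category would come back as NaN, hence fillna(0) (an assumption).
flags = (lookup.reindex(contents['Items'].astype(str))
               .fillna(0).astype(int))

# Replace the string index with the original one so the join lines up.
flags.index = contents.index
combined = contents.join(flags)

print(combined)
```

Here combined keeps the original integer Items column and gains one contains_ column per category.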

1 Comment

Just to clarify: I'm getting a df returned here, right? So I skip the .insert() and the separate function altogether?

Simple method:

contents = (contents.join([pd.Series(contents.Items.astype(str).
                                     str.contains(cats[c][0]).astype(int),
                                     name="contains_"+c) for c in cats]))

contents:

    Items   contains_Cat1   contains_Cat2   contains_Cat3
0   1       1               0               0
1   2       1               1               0
2   3       0               1               0
3   1       1               0               0
4   1       1               0               0
5   2       1               1               0
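
One caveat: str.contains treats '1|2|4' as a regular expression and matches substrings, so an item such as 11 would also be flagged for Cat1. The sample data does not hit this case. Below is a minimal sketch of an exact-match variant that keeps the has_content signature from the question. It assumes cats has a single row and that items compare equal to the category members once converted to strings; result is an illustrative name.

```python
import pandas as pd

contents = pd.DataFrame({'Items': [1, 2, 3, 1, 1, 2]})
cats = pd.DataFrame({
    'Cat1': ['1|2|4'],
    'Cat2': ['3|2|5'],
    'Cat3': ['6|9|11'],
})


def has_content(contents: pd.DataFrame, cat: pd.Series) -> pd.Series:
    # Split the single pipe-delimited cell into members, e.g. ['1', '2', '4'].
    members = cat.iloc[0].split('|')
    # isin compares whole values column-wise, so 1 does not match 11.
    return contents['Items'].astype(str).isin(members).astype(int)


result = contents.copy()
for name in cats.columns:
    result['contains_' + name] = has_content(contents, cats[name])

print(result)
```

No pre-initialized Series of False values is needed: isin already returns a boolean Series with one entry per row, and astype(int) turns it into 0/1.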

Time comparison:

%%timeit -n 2000
(contents.join([pd.Series(contents.Items.astype(str).
                                     str.contains(cats[c][0]).astype(int),
                                     name="contains_"+c) for c in cats]))

3.01 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)


%%timeit -n 2000
cats.stack().str.get_dummies().stack()\
          .unstack(level=1).reset_index(level=0,drop=True)\
           .reindex(contents.Items.astype(str))

5.13 ms ± 584 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)


%%timeit -n 2000
cats.stack().str.get_dummies().droplevel(0).T\
        .add_prefix('contains_').reindex(contents['Items'].astype(str)).reset_index()

4.58 ms ± 512 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
