I am unsure how to structure a function I want to vectorize in pandas.
I have two df's like such:
contents = pd.DataFrame({
'Items': [1, 2, 3, 1, 1, 2],
})
cats = pd.DataFrame({
'Cat1': ['1|2|4'],
'Cat2': ['3|2|5'],
'Cat3': ['6|9|11'],
})
My goal is to .insert a new column to contents that, per row, is either 1 if contents['Items'] is element of cats['cat1'] or 0 otherwise. That is to be repeated per cat.
Goal format:
contents = pd.DataFrame({
'Items': [1, 2, 3, 1, 1, 2],
'contains_Cat1': [1, 1, 0, 1, 1, 1],
'contains_Cat2': [0, 1, 1, 0, 0, 1],
'contains_Cat3': [0, 0, 0, 0, 0, 0],
})
As my contents df is big(!) I would like to vectorize this. My approach for each cat is to do something like this
contents.insert(
loc=len(contents.columns),
column='contains_Cat1',
value=has_content(contents, cats['Cat1'])
def has_content(contents: pd.DataFrame, cat: pd.Series) -> pd.Series:
# Initialization of pd.Series here??
if contents['Items'] in cat:
return True
else:
return False
My question is: How do I structure my has_content(...)? Especially unclear to me is how I initialize that pd.Series to contain all False values. Do I even need to? After that I know how to check if something is contained in something else. But can I really do it column-wise like above and return immediately without becoming cell-wise?