I haven't been able to find an answer here that is specific to my issue, so I'm hoping someone can help.

I have stored Counter objects in a column of my DataFrame, and I also want to add one column per counted element to the same DataFrame.

Beginning data

import re
from collections import Counter

import pandas as pd

data = {
    "words": ["ABC", "BCDB", "CDE", "F"],
    "stuff": ["abc", "bcda", "cde", "f"]
}
df = pd.DataFrame(data)

patternData = {
    "name": ["A", "B", "C", "D", "E", "F"],
    "rex": ["A{1}", "B{1}", "C{1}", "D{1}", "E{1}", "F{1}"]
}
patterns = pd.DataFrame(patternData)

def countFound(ps):
    # For each pattern, count how many times its regex matches in ps.
    result = Counter()
    for index, row in patterns.iterrows():
        findName = row['name']
        findRex = row['rex']
        found = re.findall(findRex, ps)
        if found:
            result.update({findName: len(found)})
    return result

# One Counter per row, stored in a new 'found' column.
df['found'] = df['words'].apply(countFound)

Desired Results

words  stuff  found                     A  B  C  D  E  F
ABC    abc    {'A': 1, 'B': 1, 'C': 1}  1  1  1  0  0  0
BCDB   bcda   {'B': 2, 'C': 1, 'D': 1}  0  2  1  1  0  0
CDE    cde    {'C': 1, 'D': 1, 'E': 1}  0  0  1  1  1  0
F      f      {'F': 1}                  0  0  0  0  0  1

2 Answers

You can use json_normalize:

df.join(pd.json_normalize(df['found']).fillna(0, downcast='infer'))

Output:

  words stuff                     found  A  B  C  D  E  F
0   ABC   abc  {'A': 1, 'B': 1, 'C': 1}  1  1  1  0  0  0
1  BCDB  bcda  {'B': 2, 'C': 1, 'D': 1}  0  2  1  1  0  0
2   CDE   cde  {'C': 1, 'D': 1, 'E': 1}  0  0  1  1  1  0
3     F     f                  {'F': 1}  0  0  0  0  0  1
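Newer pandas releases deprecate the `downcast` argument of `fillna` (since 2.2, as far as I know), so the one-liner above may warn or fail there. Here is a minimal sketch that avoids it, assuming the counts are always whole numbers. The `Counter` values are typed in by hand only to keep the block self-contained; they are not produced by `countFound` here.

```python
from collections import Counter

import pandas as pd

# Hand-written stand-in for the 'found' column built in the question.
df = pd.DataFrame({
    "words": ["ABC", "BCDB", "CDE", "F"],
    "found": [
        Counter(A=1, B=1, C=1),
        Counter(B=2, C=1, D=1),
        Counter(C=1, D=1, E=1),
        Counter(F=1),
    ],
})

# Keys missing from a Counter become NaN; fill them with 0 and
# cast explicitly instead of relying on downcast='infer'.
counts = pd.json_normalize(df["found"]).fillna(0).astype(int)

out = df.join(counts)
print(out)
```

The explicit `astype(int)` does the same job as the downcast, but it would truncate any non-integer values, which is why the whole-number assumption matters.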

You can also get the columns directly, without your custom function. To do this, build a regex with named capturing groups dynamically and pass it to str.extractall:

regex = ('(?P<'+patterns['name']+'>'+patterns['rex']+')').str.cat(sep='|')
# (?P<A>A{1})|(?P<B>B{1})|(?P<C>C{1})|(?P<D>D{1})|(?P<E>E{1})|(?P<F>F{1})

df2 = df.join(
    df['words']
    .str.extractall(regex)
    .groupby(level=0).count()
)

Or a variant without named capturing groups that sets the column names afterwards:

regex = ('('+patterns['rex']+')').str.cat(sep='|')
# (A{1})|(B{1})|(C{1})|(D{1})|(E{1})|(F{1})

print(df.join(
    df['words']
    .str.extractall(regex)
    .set_axis(patterns['name'], axis=1)
    .groupby(level=0).count()
))

Output:

  words stuff  A  B  C  D  E  F
0   ABC   abc  1  1  1  0  0  0
1  BCDB  bcda  0  2  1  1  0  0
2   CDE   cde  0  0  1  1  1  0
3     F     f  0  0  0  0  0  1
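One caveat with the extractall approach: a row with no match at all does not appear in the extractall result, so it would end up with NaN after the join. The sketch below shows the intermediate steps and one possible way to handle that case with `reindex`. The extra "XYZ" row is an assumption added for illustration and is not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    # "XYZ" is a hypothetical extra row that matches none of the patterns.
    "words": ["ABC", "BCDB", "CDE", "F", "XYZ"],
    "stuff": ["abc", "bcda", "cde", "f", "xyz"],
})
patterns = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "rex": ["A{1}", "B{1}", "C{1}", "D{1}", "E{1}", "F{1}"],
})

# One alternation of named groups: (?P<A>A{1})|(?P<B>B{1})|...
regex = ("(?P<" + patterns["name"] + ">" + patterns["rex"] + ")").str.cat(sep="|")

# extractall returns one row per match, indexed by (original row, match number);
# in each of those rows only the group that matched is non-NaN.
matches = df["words"].str.extractall(regex)

# count() tallies the non-NaN cells per original row; reindex restores
# rows that had no match and fills their counts with 0.
counts = (
    matches.groupby(level=0).count()
    .reindex(df.index, fill_value=0)
)

result = df.join(counts)
print(result)
```

Passing `fill_value=0` to `reindex` keeps the count columns as integers, whereas joining first and filling NaN afterwards would turn them into floats.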

1 Comment

Thank you, both your answer and @code-different's did the trick!

A Counter behaves a lot like a dictionary. Calling pd.DataFrame on a list of dictionaries will give you the matrix of counted values:

found = df['words'].apply(countFound).to_list()
pd.concat([
    df.assign(found=found),
    pd.DataFrame(found).fillna(0).astype("int")
], axis=1)
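For reference, here is a self-contained sketch of the same idea that also shows the intermediate matrix before the fill. `count_found` is a hypothetical, slightly condensed stand-in for the question's `countFound`, not the asker's exact code.

```python
import re
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "words": ["ABC", "BCDB", "CDE", "F"],
    "stuff": ["abc", "bcda", "cde", "f"],
})
patterns = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "rex": ["A{1}", "B{1}", "C{1}", "D{1}", "E{1}", "F{1}"],
})

def count_found(text):
    """Hypothetical stand-in for countFound: matches per pattern name."""
    result = Counter()
    for name, rex in zip(patterns["name"], patterns["rex"]):
        n = len(re.findall(rex, text))
        if n:
            result[name] = n
    return result

found = df["words"].apply(count_found).to_list()

# A list of Counters is treated like a list of dicts: one column per key,
# NaN wherever a key is absent from that row's Counter.
raw = pd.DataFrame(found, index=df.index)

# Fill the gaps and cast back to integers.
matrix = raw.fillna(0).astype(int)

combined = pd.concat([df.assign(found=found), matrix], axis=1)
print(combined)
```

Passing `index=df.index` keeps the rows aligned for the concat even if `df` does not use the default integer index.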

2 Comments

@sammywemmy I think you meant fillna(0, downcast='infer') or fillna(0, downcast='int') ;)
Thank you, both your answer and @mozway's did the trick.