How can I add a count of each string present in the target column?

import pandas as pd

data = [{'target': ['Aging', 'Brain', 'Neurons', 'Genetics']},
        {'target': ['Dementia', 'Genetics']},
        {'target': ['Brain', 'Dementia', 'Genetics']}]

df = pd.DataFrame(data)

Dataframe

                              target
0  [Aging, Brain, Neurons, Genetics]
1               [Dementia, Genetics]
2        [Brain, Dementia, Genetics]

Unique labels

target = []
for sublist in df['target'].values:
    tmp_list = [x.strip() for x in sublist]
    target.extend(tmp_list)

target = list(set(target))

# ['Brain', 'Neurons', 'Aging', 'Genetics', 'Dementia']
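
The loop above can also be written with pandas alone. The following is a minimal sketch, assuming every cell of the target column holds a list of strings as in the sample data; the variable name labels is only illustrative.

```python
import pandas as pd

data = [{'target': ['Aging', 'Brain', 'Neurons', 'Genetics']},
        {'target': ['Dementia', 'Genetics']},
        {'target': ['Brain', 'Dementia', 'Genetics']}]
df = pd.DataFrame(data)

# explode() puts every list element on its own row,
# str.strip() removes surrounding whitespace,
# unique() drops duplicates; sorted() makes the order deterministic
labels = sorted(df['target'].explode().str.strip().unique())
print(labels)
```

Sorting is optional, but it avoids the arbitrary ordering that list(set(...)) produces.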

The expected output was given as an image (not reproduced here).

2 Answers

If you need indicator columns (only 0 or 1):

Use MultiLabelBinarizer:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['target']), columns=mlb.classes_)
print(df1)
   Aging  Brain  Dementia  Genetics  Neurons
0      1      1         0         1        1
1      0      0         1         1        0
2      0      1         1         1        0

Or use Series.str.join with Series.str.get_dummies, although it is slower:

df1 = df['target'].str.join('|').str.get_dummies()
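
Since the question asks to add the columns to the existing DataFrame, the indicator frame can be joined back onto it. This is a minimal sketch using the pandas-only variant; it assumes no label contains the '|' separator, and the names indicators and df_with_flags are illustrative.

```python
import pandas as pd

data = [{'target': ['Aging', 'Brain', 'Neurons', 'Genetics']},
        {'target': ['Dementia', 'Genetics']},
        {'target': ['Brain', 'Dementia', 'Genetics']}]
df = pd.DataFrame(data)

# Build 0/1 indicator columns with pandas only (no scikit-learn needed)
indicators = df['target'].str.join('|').str.get_dummies()

# Attach the indicators to the original frame; both share the same index
df_with_flags = df.join(indicators)
print(df_with_flags)
```

The same join works with the frame built from MultiLabelBinarizer, provided it uses the same index as df.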

If you need to count how often each value occurs in each list:

data = [{'target': ['Neurons','Brain', 'Neurons', 'Neurons']}, 
        {'target': ['Dementia', 'Genetics']}, 
        {'target': ['Brain','Brain', 'Genetics']}]

df = pd.DataFrame(data)

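from collections import Counter

# Count each label per row; sort the columns alphabetically
# (recent pandas versions otherwise keep first-seen order)
df = (pd.DataFrame([Counter(x) for x in df['target']])
        .fillna(0).astype(int).sort_index(axis=1))
print(df)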

   Brain  Dementia  Genetics  Neurons
0      1         0         0        3
1      0         1         1        0
2      2         0         1        0
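
The same counts can probably be obtained without Counter. The following is a sketch of one pandas-only alternative, not part of the original answer: it explodes the lists and counts values per original row. The names exploded and counts are illustrative.

```python
import pandas as pd

data = [{'target': ['Neurons', 'Brain', 'Neurons', 'Neurons']},
        {'target': ['Dementia', 'Genetics']},
        {'target': ['Brain', 'Brain', 'Genetics']}]
df = pd.DataFrame(data)

# One row per list element; the original row number is kept in the index
exploded = df['target'].explode()

# Count each label within each original row, then pivot labels into columns;
# labels missing from a row are filled with 0
counts = exploded.groupby(level=0).value_counts().unstack(fill_value=0)
print(counts)
```

This keeps the original row index, so the result can be joined back onto df if needed.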

Maybe this will help

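# Instead of building the target list,
# join each list of strings into one single string.
# (The output below assumes data has a fourth row:
#  {'target': ['Neurons', 'Brain', 'Neurons', 'Neurons']})
list_to_str = [" ".join(tags['target']) for tags in data]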

##
#['Aging Brain Neurons Genetics',
# 'Dementia Genetics',
# 'Brain Dementia Genetics',
# 'Neurons Brain Neurons Neurons'
# ]

# Using CountVectorizer
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
text_data = np.array(list_to_str)

# Create the bag-of-words feature matrix
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)   # sparse matrix; needs to be converted to an array

# Get feature names (get_feature_names() was removed in scikit-learn 1.2)
feature_names = count.get_feature_names_out()

# Create df
df1 = pd.DataFrame(bag_of_words.toarray(), columns=feature_names)

print(df1)

## Output
   aging  brain  dementia  genetics  neurons
0      1      1         0         1        1
1      0      0         1         1        0
2      0      1         1         1        0
3      0      1         0         0        3
