
I would like to transform this DataFrame, where each row holds a list of labels:

pd.DataFrame({"l1": [["fr en","en"]],
              "l2": [["fr en","in","it"]],
              "l3": [["he","es","fi"]],
              "l4": [["es"]]}).T
>> l1  [fr en, en]
   ...
   l4  [es]

into this document-term matrix (DTM), with one 0/1 column per distinct label:

data = [[1,1,0,0,0,0,0], [1,0,1,1,0,0,0], [0,0,0,0,1,1,1], [0,0,0,0,0,1,0]]
pd.DataFrame(index=["l1","l2","l3","l4"], data=data, columns=["fr en","en","in","it","he","es","fi"])
>>      fr en en in it he es fi
    l1  1     1  0  0  0  0  0
    ... ...

My inefficient way of doing this is to chain all possible values and then count-vectorize by hand:

from itertools import chain

langs = sorted(set(chain.from_iterable(df[0])))
pd.DataFrame(data=df[0].apply(lambda x: [1 if lang in x else 0 for lang in langs]).tolist(), columns=langs, index=df.index)

PS: I don't want to " ".join() the lists, because that could lose information: a multi-word label such as "fr en" would be split into two separate tokens.

1 Answer


I think you need MultiLabelBinarizer:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df[0]), columns=mlb.classes_, index=df.index)
print(df)
    en  es  fi  fr en  he  in  it
l1   1   0   0      1   0   0   0
l2   0   0   0      1   0   1   1
l3   0   1   1      0   1   0   0
l4   0   1   0      0   0   0   0
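
For reference, here is a self-contained sketch of the same approach that rebuilds the question's frame from scratch. Note that mlb.classes_ is sorted alphabetically by default, which is why the columns above differ in order from the DTM in the question. Passing classes= to pin the column order is my own addition, assuming the question's order matters; the names column_order and dtm are just illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Rebuild the question's frame: one row per document, label lists in column 0.
df = pd.DataFrame({"l1": [["fr en", "en"]],
                   "l2": [["fr en", "in", "it"]],
                   "l3": [["he", "es", "fi"]],
                   "l4": [["es"]]}).T

# Column order taken from the question (assumption: this order is wanted;
# omit `classes` to get the default alphabetical order instead).
column_order = ["fr en", "en", "in", "it", "he", "es", "fi"]

mlb = MultiLabelBinarizer(classes=column_order)
dtm = pd.DataFrame(mlb.fit_transform(df[0]),
                   columns=mlb.classes_,
                   index=df.index)
print(dtm)
```

Because each list element is treated as one label, "fr en" stays a single column and is never split on the space.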

Alternatively, a slower solution is to join each list with | and call str.get_dummies, provided this separator does not occur in the data:

df = df[0].str.join('|').str.get_dummies()
print(df)
    en  es  fi  fr en  he  in  it
l1   1   0   0      1   0   0   0
l2   0   0   0      1   0   1   1
l3   0   1   1      0   1   0   0
l4   0   1   0      0   0   0   0
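
Since the join approach silently breaks if a label contains the separator, it may be worth checking for that first. The following is a minimal pandas-only sketch under that assumption; the guard and the names sep and dummies are my additions, not part of the original answer.

```python
import pandas as pd

# Rebuild the question's frame: one row per document, label lists in column 0.
df = pd.DataFrame({"l1": [["fr en", "en"]],
                   "l2": [["fr en", "in", "it"]],
                   "l3": [["he", "es", "fi"]],
                   "l4": [["es"]]}).T

sep = "|"

# Guard: the trick is only safe when no label contains the separator,
# otherwise that label would be split into several columns.
if any(sep in label for labels in df[0] for label in labels):
    raise ValueError(f"separator {sep!r} occurs in the data; pick another one")

# Join each list into one string, then split it back into 0/1 indicator columns.
dummies = df[0].str.join(sep).str.get_dummies(sep=sep)
print(dummies)
```

The space inside "fr en" is harmless here because only | is used as the delimiter.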

1 Comment

excellent I did not even think about the MultiLabelBinarizer
