I have a dataframe with lots of data and 1 column that is structured like this:

index    var_1
1        a=3:b=4:c=5:d=6:e=3
2        b=3:a=4:c=5:d=6:e=3
3        e=3:a=4:c=5:d=6
4        c=3:a=4:b=5:d=6:f=3

I am trying to structure the data in that column to look like this:

index    a   b   c   d   e   f
1        3   4   5   6   3   0
2        4   3   5   6   3   0
3        4   0   5   6   3   0
4        4   5   3   6   0   3

I have done the following thus far:

df1 = df['var_1'].str.split(':', expand=True)

I can then loop through the cols of df1 and do another split on '=', but then I'll just have loads of disorganised label cols and value cols.

3 Answers

Use a list comprehension to build one dictionary per row, then pass the list to the DataFrame constructor:

comp = [dict([y.split('=') for y in x.split(':')]) for x in df['var_1']]
df = pd.DataFrame(comp).fillna(0).astype(int)
print (df)
   a  b  c  d  e  f
0  3  4  5  6  3  0
1  4  3  5  6  3  0
2  4  0  5  6  3  0
3  4  5  3  6  0  3
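Note that the comprehension above rebuilds the frame with a fresh 0..3 index. Here is a minimal sketch that keeps the original index and converts the values to integers while parsing. It assumes the sample frame from the question, rebuilt here so the snippet runs on its own; the names `records` and `wide` are only illustrative:

```python
import pandas as pd

# Sample data from the question (assumed layout: one column, index named "index").
df = pd.DataFrame(
    {"var_1": ["a=3:b=4:c=5:d=6:e=3",
               "b=3:a=4:c=5:d=6:e=3",
               "e=3:a=4:c=5:d=6",
               "c=3:a=4:b=5:d=6:f=3"]},
    index=pd.Index([1, 2, 3, 4], name="index"),
)

# One dict per row; values are converted to int while parsing.
records = [
    {key: int(value)
     for key, value in (pair.split("=", 1) for pair in row.split(":"))}
    for row in df["var_1"]
]

# Reuse the original index, fill missing keys with 0, sort columns by name.
wide = (pd.DataFrame(records, index=df.index)
          .fillna(0)
          .astype(int)
          .sort_index(axis=1))
print(wide)
```

Because the index is carried over, `wide` can be joined back onto the rest of the frame with `df.join(wide)`.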

Or use Series.str.split with expand=True to get a DataFrame, reshape it with DataFrame.stack, split again on '=', drop the inner level of the MultiIndex (level=1), append column 0 (the keys) as a new index level, and finally reshape with Series.unstack:

df = (df['var_1'].str.split(':', expand=True)
                 .stack()
                 .str.split('=', expand=True)
                 .reset_index(level=1, drop=True)
                 .set_index(0, append=True)[1]
                 .unstack(fill_value=0)
                 .rename_axis(None, axis=1))
print (df)
   a  b  c  d  e  f
1  3  4  5  6  3  0
2  4  3  5  6  3  0
3  4  0  5  6  3  0
4  4  5  3  6  0  3
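As far as I can tell, the values in the unstacked result above are still strings, with an integer 0 only where a key was missing. Here is a sketch of a variation, not the same chain: it swaps `stack` for `Series.explode` and casts to int before reshaping. The sample frame and the names `pairs` and `wide` are assumptions for illustration:

```python
import pandas as pd

# Sample data from the question (assumed layout).
df = pd.DataFrame(
    {"var_1": ["a=3:b=4:c=5:d=6:e=3",
               "b=3:a=4:c=5:d=6:e=3",
               "e=3:a=4:c=5:d=6",
               "c=3:a=4:b=5:d=6:f=3"]},
    index=pd.Index([1, 2, 3, 4], name="index"),
)

# One "key=value" string per row (the original index is repeated),
# then split each pair into two columns.
pairs = (df["var_1"].str.split(":")
                    .explode()
                    .str.split("=", n=1, expand=True))
pairs.columns = ["key", "value"]

# Cast before reshaping so the result is integer-typed throughout.
wide = (pairs.assign(value=pairs["value"].astype(int))
             .set_index("key", append=True)["value"]
             .unstack(fill_value=0)
             .rename_axis(None, axis=1))
print(wide)
```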

Here's one approach using str.get_dummies:

out = df.var_1.str.get_dummies(sep=':')
out = out * out.columns.str[2:].astype(int).values
out.columns = pd.MultiIndex.from_arrays([out.columns.str[0], out.columns])

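# max(axis=1, level=0) was removed in pandas 2.0; group the transposed frame by the outer column level instead
print(out.T.groupby(level=0).max().T)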

       a  b  c  d  e  f
index                  
1      3  4  5  6  3  0
2      4  3  5  6  3  0
3      4  0  5  6  3  0
4      4  5  3  6  0  3
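The slicing with `str[0]` and `str[2:]` above relies on single-character keys. Here is a hedged sketch of the same get_dummies idea that splits the dummy column labels on '=' instead, so longer keys would also work. It assumes non-negative values, since absent keys contribute 0 to the maximum; the sample frame and the names `dummies`, `keys`, `values`, `weighted` and `wide` are illustrative:

```python
import pandas as pd

# Sample data from the question (assumed layout).
df = pd.DataFrame(
    {"var_1": ["a=3:b=4:c=5:d=6:e=3",
               "b=3:a=4:c=5:d=6:e=3",
               "e=3:a=4:c=5:d=6",
               "c=3:a=4:b=5:d=6:f=3"]},
    index=pd.Index([1, 2, 3, 4], name="index"),
)

# One 0/1 indicator column per distinct "key=value" string.
dummies = df["var_1"].str.get_dummies(sep=":")

# Recover the key and the numeric value from each column label.
parts = dummies.columns.str.split("=")
keys = parts.str[0]
values = parts.str[1].astype(int)

# Scale each indicator by its value, then collapse columns sharing a key.
weighted = dummies * values.to_numpy()
wide = weighted.T.groupby(keys.to_numpy()).max().T
print(wide)
```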

You can apply "extractall" and "pivot".
After "extractall" you get:

             0  1
index match      
1     0      a  3
      1      b  4
      2      c  5
      3      d  6
      4      e  3
2     0      b  3
      1      a  4
      2      c  5
      3      d  6
      4      e  3
3     0      e  3
      1      a  4
      2      c  5
      3      d  6
4     0      c  3
      1      a  4
      2      b  5
      3      d  6
      4      f  3

And in one step:

rslt= df.var_1.str.extractall(r"([a-z])=(\d+)") \
                .reset_index(level="match",drop=True) \
                .pivot(columns=0).fillna(0)                     


       1               
0      a  b  c  d  e  f
index                  
1      3  4  5  6  3  0
2      4  3  5  6  3  0
3      4  0  5  6  3  0
4      4  5  3  6  0  3

# Optional: flatten the two-level column index down to a..f
# rslt.columns = rslt.columns.levels[1].values
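For a result with flat columns and integer values in one go, here is a sketch of the extractall/pivot idea using named groups and an explicit `values=` argument. The group names `key` and `value`, the `[a-z]+` pattern and the sample frame are my assumptions, and I have only considered reasonably recent pandas versions (1.1 or later), where `pivot` accepts a missing `index`:

```python
import pandas as pd

# Sample data from the question (assumed layout).
df = pd.DataFrame(
    {"var_1": ["a=3:b=4:c=5:d=6:e=3",
               "b=3:a=4:c=5:d=6:e=3",
               "e=3:a=4:c=5:d=6",
               "c=3:a=4:b=5:d=6:f=3"]},
    index=pd.Index([1, 2, 3, 4], name="index"),
)

# Named groups become the column names of the extractall result.
matches = df["var_1"].str.extractall(r"(?P<key>[a-z]+)=(?P<value>\d+)")

wide = (matches.reset_index(level="match", drop=True)
               .assign(value=lambda m: m["value"].astype(int))
               .pivot(columns="key", values="value")  # index defaults to the existing one
               .fillna(0)
               .astype(int)
               .rename_axis(None, axis=1))
print(wide)
```

Passing `values="value"` avoids the extra column level, so no flattening step is needed afterwards.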
