I have the following example DataFrame:
| id  | lang | text    |
|-----|------|---------|
| "1" | "en" | "text1" |
| "2" | "ua" | "text2" |
| "1" | "en" | "text3" |
| "2" | "en" | "text4" |
| "3" | "en" | "text5" |
| "4" | "ru" | "text6" |
| "4" | "en" | "text7" |
| "3" | "ua" | "text8" |
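For reproducibility, the same example can be constructed directly (in my real code the data comes from a tab-separated file, as shown further below):

import pandas as pd

my_table = pd.DataFrame({
    "id":   ["1", "2", "1", "2", "3", "4", "4", "3"],
    "lang": ["en", "ua", "en", "en", "en", "ru", "en", "ua"],
    "text": ["text1", "text2", "text3", "text4", "text5", "text6", "text7", "text8"],
}, columns=["id", "lang", "text"])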
I need to group it by ID and language and output the texts as separate lists.
The output from the DataFrame above should be the following:
There should be a list of unique IDs:
[1, 2, 3, 4]
For every language in the lang column, there should be a separate list of texts from the text column, with the same length as the list of unique IDs. If there are multiple texts for one ID, they are concatenated (with a space, for example). Since the example DataFrame has three languages (en, ua, ru), we need three lists:
ids = [1, 2, 3, 4]  # <-- list of IDs for reference
en = ["text1 text3", "text4", "text5", "text7"]
ua = ["", "text2", "text8", ""]
ru = ["", "", "", "text6"]
Each list of texts should be as long as the list of IDs: if there are multiple texts for one ID, they are joined; if there are none, the entry is an empty string.
So far I have this Python solution:
import pandas as pd

my_table = pd.read_csv("my_data.csv", delimiter="\t")

en = list()
ua = list()
ru = list()

# iterate over the unique ids only
for single_id in my_table["id"].unique():
    # for each language, append the space-joined texts for this id
    # (an empty string if there are no matching rows)
    en.append(" ".join(
        my_table[(my_table["id"] == single_id) & (my_table["lang"] == "en")]["text"]
    ))
    ua.append(" ".join(
        my_table[(my_table["id"] == single_id) & (my_table["lang"] == "ua")]["text"]
    ))
    ru.append(" ".join(
        my_table[(my_table["id"] == single_id) & (my_table["lang"] == "ru")]["text"]
    ))
This is rather slow. Is there a way to do the filtering in pandas first and then quickly output the results into separate lists? I need Python lists as the output.
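I suspect something along the lines of groupby plus unstack might be what I want, but this is only a rough, untested sketch and I am not sure it is correct or actually faster:

# concatenate texts per (id, lang), then reshape so that missing
# id/language pairs become empty strings
grouped = my_table.groupby(["id", "lang"])["text"].apply(" ".join)
wide = grouped.unstack(fill_value="")  # index: unique ids, columns: languages

ids = wide.index.tolist()
en = wide["en"].tolist()
ua = wide["ua"].tolist()
ru = wide["ru"].tolist()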
EDIT: this is on Python 2.7