I have the following example DataFrame:

| id   | lang      | text       |
|------|-----------|------------|
| "1"  | "en"      | "text1"    |
| "2"  | "ua"      | "text2"    |
| "1"  | "en"      | "text3"    |
| "2"  | "en"      | "text4"    |
| "3"  | "en"      | "text5"    |
| "4"  | "ru"      | "text6"    |
| "4"  | "en"      | "text7"    |
| "3"  | "ua"      | "text8"    |

I need to group it by ID and language and output the texts as separate lists, one per language.

The output from the DataFrame above should be the following:

There should be a list of unique IDs: [1, 2, 3, 4]

For every language in the lang column, there should be a separate list of texts from the text column, with the same length as the list of unique IDs. If there are multiple texts for one ID, they are concatenated (with a space, for example). Since the example DataFrame has 3 languages (en, ua, ru), we need 3 lists:

ids = [ 1,               2,        3,         4 ]  # <-- list of IDs for reference
en  = ["text1 text3", "text4",  "text5",  "text7"]
ua  = ["",            "text2",  "text8",  ""     ]
ru  = ["",            "",       "",       "text6"]

Each list of texts should be as long as the list of IDs. If there are multiple texts for one ID, they should be joined; if there are none, we write an empty string.

So far I have this Python solution:

import pandas as pd
my_table = pd.read_csv("my_data.csv", delimiter="\t")

en = list()
ua = list()
ru = list()

# iterate over unique ids only
for single_id in list(my_table["id"].unique()):

    # append the space-joined texts for this id and language
    en.append(" ".join(list(
        my_table[(my_table["id"]==unicode(single_id))&(my_table["lang"]==unicode("en"))]["text"]
    )))

    ua.append(" ".join(list(
        my_table[(my_table["id"]==unicode(single_id))&(my_table["lang"]==unicode("ua"))]["text"]
    )))

    ru.append(" ".join(list(
        my_table[(my_table["id"]==unicode(single_id))&(my_table["lang"]==unicode("ru"))]["text"]
    )))

This is rather slow. Is there a way to do the filtering in pandas first and then quickly output the result as separate lists? I need Python lists as output.

EDIT: this is on Python 2.7

1 Answer

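If I understand correctly, you can group by id and lang, join the texts of each group, and unstack the id level into columns: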

#df.groupby(['id','lang']).text.apply(list).unstack(-2)
df.groupby(['id','lang']).text.apply(','.join).unstack(-2)

Out[384]: 
id              1      2      3      4
lang                                  
en    text1,text3  text4  text5  text7
ru           None   None   None  text6
ua           None  text2  text8   None

If you want a dict of lists, one list per language:

df.groupby(['id','lang']).text.apply(','.join).unstack(-2).T.fillna('').to_dict('list')
Out[386]: 
{'en': ['text1,text3', 'text4', 'text5', 'text7'],
 'ru': ['', '', '', 'text6'],
 'ua': ['', 'text2', 'text8', '']}

To get the list of IDs:

df.groupby(['id','lang']).text.apply(','.join).unstack(-2).columns.tolist()
Out[388]: [1, 2, 3, 4]
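
Putting the pieces together, here is a minimal self-contained sketch of the same groupby/unstack idea, not a definitive implementation. It assumes a reasonably recent pandas, rebuilds the example table from the question inline (with IDs as strings, as shown there), joins with a space as the question asks rather than a comma, and unstacks on lang directly so no transpose is needed. The names wide, ids and texts are made up for illustration.

```python
import pandas as pd

# Example data from the question; IDs are kept as strings, as in the question's table.
df = pd.DataFrame({
    "id":   ["1", "2", "1", "2", "3", "4", "4", "3"],
    "lang": ["en", "ua", "en", "en", "en", "ru", "en", "ua"],
    "text": ["text1", "text2", "text3", "text4",
             "text5", "text6", "text7", "text8"],
})

# One row per id, one column per language.
# Texts sharing the same (id, lang) pair are joined with a space;
# (id, lang) pairs with no text become empty strings.
wide = (
    df.groupby(["id", "lang"])["text"]
      .agg(" ".join)
      .unstack("lang")
      .fillna("")
)

ids = wide.index.tolist()      # unique IDs, in sorted order
texts = wide.to_dict("list")   # {language: list of texts aligned with ids}

print(ids)
print(texts["en"])
print(texts["ua"])
print(texts["ru"])
```

Because the rows of wide are the IDs, every list in texts lines up with ids position by position. The orient is spelled out as "list" here because newer pandas releases reject the abbreviated form.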

4 Comments

thanks, how do I get the list of IDs out of there as well?
thanks, last bit, I can't seem to save it. By saving it under out_dict = df.groupby... when I do print(out_dict["en"]), I get KeyError: 'en'. I forgot to mention I am on Python 2.7 if that makes a difference.
@ivan_bilan the dict lookup should look like out_dict.get('en')
my bad, I did T.to_dict('l') instead of just .to_dict('l')
