I have the following example DataFrame:
| id  | lang | text    |
|-----|------|---------|
| "1" | "en" | "text1" |
| "2" | "ua" | "text2" |
| "1" | "en" | "text3" |
| "2" | "en" | "text4" |
| "3" | "en" | "text5" |
| "4" | "ru" | "text6" |
| "4" | "en" | "text7" |
| "3" | "ua" | "text8" |
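For reproducibility, the same example can be constructed directly (in my real code the data comes from a tab-separated file, as shown further below):

import pandas as pd

my_table = pd.DataFrame({
    "id":   ["1", "2", "1", "2", "3", "4", "4", "3"],
    "lang": ["en", "ua", "en", "en", "en", "ru", "en", "ua"],
    "text": ["text1", "text2", "text3", "text4", "text5", "text6", "text7", "text8"],
}, columns=["id", "lang", "text"])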
I need to group it by ID and language and output the texts as separate lists.
The output from the DataFrame above should be the following:
There should be a list of unique IDs:
[1, 2, 3, 4]
For every language in the lang column, there should be a separate list of texts from the text column, with the same length as the list of unique IDs. If there are multiple texts for one ID, they are concatenated (with a space, for example). Since the example DataFrame has three languages (en, ua, ru), we need three lists:
ids = [1, 2, 3, 4]  # <-- list of IDs for reference
en = ["text1 text3", "text4", "text5", "text7"]
ua = ["", "text2", "text8", ""]
ru = ["", "", "", "text6"]
Each list of texts should be as long as the list of IDs: if there are multiple texts for one ID, they are joined; if there are none, the entry is an empty string.
So far I have this Python solution:
import pandas as pd

my_table = pd.read_csv("my_data.csv", delimiter="\t")

en = list()
ua = list()
ru = list()

# iterate over the unique ids only
for single_id in my_table["id"].unique():
    # for each language, append the space-joined texts for this id
    # (an empty string if there are no matching rows)
    en.append(" ".join(
        my_table[(my_table["id"] == single_id) & (my_table["lang"] == "en")]["text"]
    ))
    ua.append(" ".join(
        my_table[(my_table["id"] == single_id) & (my_table["lang"] == "ua")]["text"]
    ))
    ru.append(" ".join(
        my_table[(my_table["id"] == single_id) & (my_table["lang"] == "ru")]["text"]
    ))
This is rather slow. Is there a way to do the filtering in pandas first and then quickly output the results into separate lists? I need Python lists as the output.
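I suspect something along the lines of groupby plus unstack might be what I want, but this is only a rough, untested sketch and I am not sure it is correct or actually faster:

# concatenate texts per (id, lang), then reshape so that missing
# id/language pairs become empty strings
grouped = my_table.groupby(["id", "lang"])["text"].apply(" ".join)
wide = grouped.unstack(fill_value="")  # index: unique ids, columns: languages

ids = wide.index.tolist()
en = wide["en"].tolist()
ua = wide["ua"].tolist()
ru = wide["ru"].tolist()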
EDIT: this is on Python 2.7