0

I have the following 3 data frames:

dfSpa = pd.read_csv(
    "sentences and translations/SpanishSentences.csv", sep=',')
print(dfSpa.head())

dfEng = pd.read_csv(
    'sentences and translations/EngTranslations.csv', sep=',')
print(dfEng.head())

dfIndex = pd.read_csv(
    'sentences and translations/SpaSentencesThatHaveEngTranslations.csv', sep=',')
print(dfIndex.head())

That output the following:

      0    1                               2
0  2482  spa        Tengo que irme a dormir.
1  2487  spa   Ahora, Muiriel tiene 20 años.
2  2493  spa  Simplemente no sé qué decir...
3  2495  spa      Yo estaba en las montañas.
4  2497  spa          No sé si tengo tiempo.
      0    1                               2
0  1277  eng          I have to go to sleep.
1  1282  eng              Muiriel is 20 now.
2  1287  eng     This is never going to end.
3  1288  eng  I just don't know what to say.
4  1290  eng         I was in the mountains.
      0       1
0  2482    1277
1  2487    1282
2  2493    1288
3  2493  693485
4  2495    1290

Colum 0 in dfIndex represents a Spanish sentence in dfSpa and column 1 represents the English translation in dfEng that goes with it. dfSpa has more rows than the other 2 df's so, some sentences do not have english translations. Also, dfIndex is longer than dfEng because there are some duplicate translations with different values such as with 2493, in dfIndex.head(), as shown above.

I am trying to create another data frame that simply has the Spanish sentence in one column and the corresponding English translation in the other column. How could I get this done?

2 Answers 2

1
dfIndex.merge(
    dfSpa[[0,2]], on=0)[[1,2]].rename(columns={2: "Spa"}).merge(
        dfEng, left_on=1, right_on=0).rename(columns={2: "Eng"})[['Spa', 'Eng']]
Sign up to request clarification or add additional context in comments.

Comments

1

You could try:

df_n=pd.DataFrame()
df_n['A'] = [df.iloc[x].values for x in dfSpa.loc[:,0]]
df_n['B'] = [df.iloc[x].values for x in dfEng.loc[:,0]]

and then remove duplicated rows using:

df_n = df_n.drop_duplicates(subset = ['A'])

It would be easier to check if you had sample dfs.

3 Comments

df_n['A'] = [dfIndex.iloc[x].values for x in dfSpa.loc[:, '0']] df_n['B'] = [dfIndex.iloc[x].values for x in dfEng.loc[:, '0']] Is giving the error single positional indexer is out-of-bounds. How can I provide a sample df? Thanks
If @mujjiga's comment above doesn't work, you can upload short csv files?
His comment worked. But, thanks for the help anyway.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.