How to get the subset of dataframe based on another dataframe in pandas python

Question

I just learnt pandas and basically I want to take the some rows of a dataframe based on the ids that are stored in another dataframe. Let me show you the code:

import pandas as pd
from sklearn.model_selection import train_test_split

f_data="data.tsv"
all_data = pd.read_csv(f_data,delimiter='\t',encoding='utf-8',header=None)
x_data = all_data[[0,1,3]]
y_data = all_data[[2]]

# Split train and test sets
x_train,x_test,y_train,y_test = train_test_split(x_data,y_data,test_size=0.1)

all_data have 12 columns in total. I use 3 of the columns in x_data and 1 of them in y_data.

Once I create x_train and x_test, I would like to write these instances into tsv files but while doing that I want to write all of the 12 columns stored in all_data. To be able to do that, I need to match the instances in x_train and x_test with all_data. How could I do that ?

EDIT

Here how my data looks like:

all_data

        0                                                  1                              2    3   ...                                                8                      9     10    11
0       35  Auch in Großbritannien, wo 19 Atomreaktoren in...                       Ausstieg -1.0  ...                                      Sunday Times           Sunday Times   NaN     1

# continues like that

x_train

         0                                                  1    3
939   2074  Die CSU verlangt von der schwarz-gelben Koalit...  1.0

So, what I want to do is to get the rows starting with 939,710,288,854,433 in all_data and write them into a file.

Perhaps using all_data.loc[x_data.index]? You haven't shown us your data though. — John Zwinck
– John Zwinck, Commented Aug 5, 2018 at 10:19

John Zwinck · Accepted Answer · 2018-08-05 10:53:48Z

1

The index of the split data corresponds to the original, and can be used to look up the original data (assuming the index is unique):

all_data.loc[x_train.index]
all_data.loc[x_test.index]

answered Aug 5, 2018 at 10:53

John Zwinck

252k44 gold badges347 silver badges459 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

How to get the subset of dataframe based on another dataframe in pandas python

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related