I just learned pandas, and basically I want to take some rows of a DataFrame based on the IDs stored in another DataFrame. Here is the code:
import pandas as pd
from sklearn.model_selection import train_test_split
f_data = "data.tsv"
all_data = pd.read_csv(f_data, delimiter='\t', encoding='utf-8', header=None)
x_data = all_data[[0, 1, 3]]
y_data = all_data[[2]]
# Split train and test sets
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.1)
all_data has 12 columns in total. I use 3 of them in x_data and 1 in y_data.
Once I have created x_train and x_test, I would like to write these instances to TSV files, but each written row should contain all 12 columns stored in all_data, not just the 3 I selected. To do that, I need to match the instances in x_train and x_test back to the rows of all_data. How can I do that?
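Here is a rough sketch of what I think might work, assuming train_test_split keeps the original row labels of all_data (the output file names are just placeholders), but I am not sure it is the right way:

# Sketch: relies on x_train/x_test keeping the index labels of all_data
train_rows = all_data.loc[x_train.index]  # all 12 columns for the training instances
test_rows = all_data.loc[x_test.index]    # all 12 columns for the test instances

# "train.tsv" / "test.tsv" are placeholder file names
train_rows.to_csv("train.tsv", sep='\t', encoding='utf-8', header=False, index=False)
test_rows.to_csv("test.tsv", sep='\t', encoding='utf-8', header=False, index=False)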
EDIT
Here is how my data looks:
all_data
0 1 2 3 ... 8 9 10 11
0 35 Auch in Großbritannien, wo 19 Atomreaktoren in... Ausstieg -1.0 ... Sunday Times Sunday Times NaN 1
# continues like that
x_train
0 1 3
939 2074 Die CSU verlangt von der schwarz-gelben Koalit... 1.0
So, what I want to do is to get the rows whose indices are 939, 710, 288, 854, 433 from all_data and write them into a file.
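In other words, I expect something like the following (the label list is just these example indices; in practice it would be x_train.index):

wanted = [939, 710, 288, 854, 433]   # example index labels taken from x_train
rows = all_data.loc[wanted]          # those rows of all_data, with all 12 columns
# these rows could then be written out with to_csv as in the sketch above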
all_data.loc[x_data.index]? You haven't shown us your data though.