0

I am trying to match csv entries and extract data but stuck. My csv files are in this format:

df1 looks like this:

type prediction ax ay az

df2 looks like this:

type ax ay az x y z fx fy fz

I would like to first match df1 and df2. For this, I need to match ax, ay and az all together with `df2. Matching only single column can give me wrong dataframe because entries are repeated.

After matching multiple columns with df2, I would like to extract those values and make a dataframe with df1.

Expected dataframe:

type prediction ax ay az x y z

df1 and df2 doesn't have same size. Actually, df2 is a huge file that is why I want to extract only required dataset.

This is my code:

def match_dataset(df1, df2):
    df1_new = pd.DataFrame(columns=['x','y','z','fx','fy','fz','az','ax','ay'])
    df2_new = pd.DataFrame(columns=['x','y','z','fx','fy','fz','az','ax','ay'])
    for i in range(len(df1)):
        for j in range(len(df2)):
            if df1.iloc[i]['az'] == df2.iloc[j]['az'] and df1.iloc[i]['ay'] == df2.iloc[j]['ay'] and df1.iloc[i]['ax'] == df2.iloc[j]['ax']:
                df1_new = df1_new.append(df2.iloc[j], ignore_index=True)
                #df2_new = df2_new.append(df2.iloc[j], ignore_index=True)
    return df1_new

data = match_dataset(df1, df2)

print(data.head())

But my code is stuck in loop. It doesn't give me output.

Can I get some help? Thank you.

1 Answer 1

2

I think you can use df1.merge() to get the desired output. Here, I'm creating two dfs with your columns and some random digits in each column:

import pandas as pd
import numpy as np

df1_cols = ['type', 'prediction', 'ax', 'ay', 'az']
df2_cols = ['type', 'ax', 'ay', 'az', 'x', 'y', 'z', 'fx', 'fy', 'fz']

df1 = pd.DataFrame(np.random.randint(10,size=(4,len(df1_cols))), columns = df1_cols)
df2 = pd.DataFrame(np.random.randint(10,size=(1000,len(df2_cols))), columns = df2_cols)

In this example, df1 came out like this:

   type  prediction  ax  ay  az
0     8           1   8   2   7
1     3           0   5   4   5
2     7           3   0   0   2
3     2           5   3   5   7

Now apply merge:

df1_new = df1.merge(df2, on=['az','ay','ax'], how='left')
print(df1_new)

result:

   type_x  prediction  ax  ay  az  type_y    x    y    z   fx   fy   fz
0       8           1   8   2   7     3.0  0.0  2.0  6.0  7.0  8.0  9.0
1       3           0   5   4   5     NaN  NaN  NaN  NaN  NaN  NaN  NaN
2       7           3   0   0   2     NaN  NaN  NaN  NaN  NaN  NaN  NaN
3       2           5   3   5   7     9.0  8.0  4.0  0.0  3.0  3.0  1.0
4       2           5   3   5   7     9.0  9.0  1.0  3.0  6.0  5.0  9.0

Apparently, in this random example, we found 2 matches for df1.iloc[3], but zero for df1.iloc[1:3], hence the NA values. We can simply drop these from the df with df1_new.dropna(inplace=True). Finally, reset the index: df1_new.reset_index(drop=True, inplace=True):

   type_x  prediction  ax  ay  az  type_y    x    y    z   fx   fy   fz
0       8           1   8   2   7     3.0  0.0  2.0  6.0  7.0  8.0  9.0
1       2           5   3   5   7     9.0  8.0  4.0  0.0  3.0  3.0  1.0
2       2           5   3   5   7     9.0  9.0  1.0  3.0  6.0  5.0  9.0

To select only the columns mentioned in your def, you can use:

df1_new[['x','y','z','fx','fy','fz','az','ax','ay']]
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.