0

I am trying to create a new data frame that compresses pre-existing columns from another data frame.

I am looking to turn something like this:

id | x1  | x2  | x3  | x4
-------------------------- ...
a  | x1a | x2a | x3a | x4a
b  | x1b | x2b | x3b | x4b
c  | x1c | x2c | x3c | x4c

Into this:

id |     z1       |      z2
-------------------------------- ...
a  | f1(x1a, x2a) | f2(x3a, x4a) 
b  | f1(x1b, x2b) | f2(x3b, x4b) 
c  | f1(x1c, x2c) | f2(x3c, x4c) 

My current approach is to append to the new data frame row by row, like so:

for row in rows:
   new_row_map = get_new_row_map(df_in, row)
   df_out = df_out.append(new_row_map, ignore_index=True) 
return df_out

I have been running this code for a couple of hours now and it seems to be very inefficient. Does anyone have a quicker, more efficient approach? Thanks!

4
  • Where is x4a, x4b, x4c? Commented Jul 11, 2022 at 21:35
  • Sorry, second inputs to f2 function Commented Jul 11, 2022 at 21:39
  • You have 2 different functions? Commented Jul 11, 2022 at 21:46
  • Yeah they are different functions Commented Jul 11, 2022 at 21:49

2 Answers

1

You're right, appending row by row to a data frame is very inefficient: every call to append builds a brand-new data frame and copies all the existing rows into it, so the total work grows roughly quadratically with the number of rows. This is why pandas and numpy rely on vectorized operations to alter and access their data. Data types in numpy and pandas are stored with less metadata than built-in Python types, and vectorized operations run the calculation over a whole column in compiled code rather than iterating through each row in Python. See Chapter 4 of Python for Data Analysis for a more thorough explanation (it's free online).

Rather than appending row by row, you need to apply a vectorized function to the whole data frame (meaning it alters the entire data frame at once instead of iterating over the rows). For instance:

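import numpy as np

# examples of what f1 and f2 could be;
# each takes the whole data frame and returns a new column
def f1(df):
    return (df["x1"] * df["x2"] + 4) + np.cos(df["x2"])

def f2(df):
    return df["x3"] - df["x4"] * 9.8

# define the functions first, then compute each new column in one step
df["z1"] = f1(df)
df["z2"] = f2(df)

# you could cut out the original columns like so
# (add "id" to the list if it is a column rather than the index)
df = df[["z1", "z2"]]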
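To make the idea concrete, here is a minimal self-contained sketch that builds the output frame in one step instead of appending. The sample values and the formulas inside f1 and f2 are placeholders chosen for illustration (the question does not say what the real functions do), and the sketch assumes id is an ordinary column.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric input; the question does not show real values.
df_in = pd.DataFrame({
    "id": ["a", "b", "c"],
    "x1": [1.0, 2.0, 3.0],
    "x2": [0.0, 1.0, 2.0],
    "x3": [10.0, 20.0, 30.0],
    "x4": [1.0, 2.0, 3.0],
})

def f1(x1, x2):
    # Placeholder formula; operates on whole columns (Series) at once.
    return x1 * x2 + 4 + np.cos(x2)

def f2(x3, x4):
    # Placeholder formula; also column-wise.
    return x3 - x4 * 9.8

# Build the new frame once from the computed columns.
df_out = pd.DataFrame({
    "id": df_in["id"],
    "z1": f1(df_in["x1"], df_in["x2"]),
    "z2": f2(df_in["x3"], df_in["x4"]),
})
print(df_out)
```

Passing the individual columns rather than the whole frame keeps f1 and f2 independent of the column names, which may or may not suit your real functions.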

See this post about vectorizing a function, and this article


1

You can use apply with axis=1 to call a function on each row:

def f1(row):
    # do stuff here, just return a string for demo
    return f"f({', '.join(row)})"
    
def f2(row):
    # do stuff here, just return a string for demo
    return f"f({', '.join(row)})"

df['z1'] = df[['x1', 'x2']].apply(f1, axis=1)
df['z2'] = df[['x3', 'x4']].apply(f2, axis=1)

Output:

  id   x1   x2   x3   x4           z1           z2
0  a  x1a  x2a  x3a  x4a  f(x1a, x2a)  f(x3a, x4a)
1  b  x1b  x2b  x3b  x4b  f(x1b, x2b)  f(x3b, x4b)
2  c  x1c  x2c  x3c  x4c  f(x1c, x2c)  f(x3c, x4c)
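Note that apply with axis=1 still calls the Python function once per row and builds a Series for each one. If f1 and f2 really are scalar functions that cannot be vectorized, a variant worth trying (a different technique from the apply call above, not a drop-in equivalent) is a list comprehension over zipped columns, with the new frame built once at the end. The two-argument f1 and f2 below are stand-ins that just format strings, mirroring the demo above.

```python
import pandas as pd

# Same demo data as in the question.
df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "x1": ["x1a", "x1b", "x1c"],
    "x2": ["x2a", "x2b", "x2c"],
    "x3": ["x3a", "x3b", "x3c"],
    "x4": ["x4a", "x4b", "x4c"],
})

def f1(x1, x2):
    # Stand-in scalar function; returns a string for demo purposes.
    return f"f1({x1}, {x2})"

def f2(x3, x4):
    # Stand-in scalar function; returns a string for demo purposes.
    return f"f2({x3}, {x4})"

# Compute each output column as a plain list, then build the frame once.
df_out = pd.DataFrame({
    "id": df["id"],
    "z1": [f1(a, b) for a, b in zip(df["x1"], df["x2"])],
    "z2": [f2(a, b) for a, b in zip(df["x3"], df["x4"])],
})
print(df_out)
```

Whether this beats apply on your data depends on what the real functions do, so it is worth timing both on a sample.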
