
This question refers to my previous post.

The solutions proposed there worked very well for a smaller data set. Here I'm working with 7 .txt files totalling about 750 MB, which shouldn't be too big, so I must be doing something wrong in the process.

import pandas as pd

df1  = pd.read_csv('Data1.txt', skiprows=0, delimiter=' ', usecols=[1, 2, 5, 7, 8, 10, 12, 13, 14])
df2  = pd.read_csv('Data2.txt', skiprows=0, delimiter=' ', usecols=[1, 2, 5, 7, 8, 10, 12, 13, 14])
df3  = ...
df4 = ...
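
Since all seven files share the same layout, they could equally be read in a loop. A minimal sketch, assuming the Data<N>.txt naming continues through Data7.txt:

import pandas as pd

cols = [1, 2, 5, 7, 8, 10, 12, 13, 14]
# read Data1.txt .. Data7.txt with the same column selection
df1, df2, df3, df4, df5, df6, df7 = [
    pd.read_csv(f'Data{i}.txt', skiprows=0, delimiter=' ', usecols=cols)
    for i in range(1, 8)
]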

This is what one of my DataFrames (df1) looks like (head):

  name_profile depth           VAR1  ...  year  month  day
0  profile_1   0.6           0.2044  ...  2012     11  26
1  profile_1   0.6           0.2044  ...  2012     11  26
2  profile_1   1.1           0.2044  ...  2012     11  26
3  profile_1   1.2           0.2044  ...  2012     11  26
4  profile_1   1.4           0.2044  ...  2012     11  26
...

And tail:

       name_profile     depth              VAR1  ...  year  month  day
955281  profile_1300   194.600006          0.01460  ...  2015      3  20
955282  profile_1300   195.800003          0.01095  ...  2015      3  20
955283  profile_1300   196.899994          0.01095  ...  2015      3  20
955284  profile_1300   198.100006          0.00730  ...  2015      3  20
955285  profile_1300   199.199997          0.01825  ...  2015      3  20

I followed a suggestion and dropped duplicates:

df1 = df1.drop_duplicates()  # drop_duplicates is not in-place, so assign the result back
...

etc.

Similarly, df2 has VAR2, df3 has VAR3, and so on.

Following one of the answers from the previous post, the aim is to create a new, merged DataFrame with every VARX (one per dfX) as an additional column alongside depth, name_profile, and the other three key columns, so I tried something like this:

dfs = [df.set_index(['depth','name_profile', 'year', 'month', 'day']) for df in [df1, df2, df3, df4, df5, df6, df7]]

df_merged = pd.concat(dfs, axis=1).reset_index()

The current error is:

ValueError: cannot handle a non-unique multi-index!

What am I doing wrong?

  • You don't need Dask for this, the file size is trivial for any modern system. (Apr 12, 2019 at 17:00)
  • reduce is a very intensive process as it nests with each iteration. Use concat instead. (Apr 12, 2019 at 17:01)
  • The problem is here: dfs2 = [dfs1, df3]. dfs1 is, itself, a list of dataframes. You perhaps wanted to extend the list or append to it, not nest it. (Apr 12, 2019 at 17:06)
  • Once again, de-dupe your data on the keys with drop_duplicates(...) or run an aggregation to pick the first pairing with groupby(...).first(). (Apr 12, 2019 at 17:09)
  • Please show your data, attempted code, and errors/undesired results. (Apr 12, 2019 at 17:25)

1 Answer


Consider again horizontal concatenation with pandas.concat. Because you have multiple rows sharing the same profile, depth, year, month, and day, add a running count to the multi-index, calculated with groupby().cumcount():

grp_cols = ['depth', 'name_profile', 'year', 'month', 'day']

# add a per-group running count so the resulting multi-index is unique
dfs = [(df.assign(grp_count=df.groupby(grp_cols).cumcount())
          .set_index(grp_cols + ['grp_count'])
       ) for df in [df1, df2, df3, df4, df5, df6, df7]]

df_merged = pd.concat(dfs, axis=1).reset_index()

print(df_merged)
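
Optionally, once the frames are merged, the key uniqueness can be checked and the helper column dropped. A small follow-up sketch using the df_merged and grp_cols names from the code above:

# sanity check: each (keys + grp_count) combination should now be unique
assert not df_merged.duplicated(subset=grp_cols + ['grp_count']).any()

# the running count was only needed to align rows; it can be dropped afterwards
df_merged = df_merged.drop(columns='grp_count')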

2 Comments

I think this might be it, yes! Thank you so much! I'll take a closer look tomorrow, right now I'm exhausted, but that might be it! What's the point of this cumcount function, by the way?
Please read my opening text: cumcount is there to resolve your repeated profile/depth rows.
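
For illustration, a tiny standalone example of what groupby().cumcount() does (toy data, not from the question):

import pandas as pd

toy = pd.DataFrame({'name_profile': ['p1', 'p1', 'p2'],
                    'depth': [0.6, 0.6, 1.1]})
# cumcount numbers repeated key combinations 0, 1, 2, ... within each group,
# which is what makes the multi-index unique again
toy['grp_count'] = toy.groupby(['name_profile', 'depth']).cumcount()
print(toy)
#   name_profile  depth  grp_count
# 0           p1    0.6          0
# 1           p1    0.6          1
# 2           p2    1.1          0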
