
I have a dataframe as such

      ID       NAME  group_id
0     205292   A     183144058824253894513539088231878865676
1     475121   B     183144058824253894513539088231878865676
2     475129   C     183144058824253894513539088231878865676

I want to transform it such that row 0 is linked to the other rows in the following way:

   LinkedBy  By_Id    LinkedTo  To_Id   group_id
1  A         205292   B         475121  183144058824253894513539088231878865676
2  A         205292   C         475129  183144058824253894513539088231878865676

Basically, I am compressing the first dataframe by linking the row at index 0 against all the others, so that an n-row df gives me an (n-1)-row df. I can accomplish this without the group id (which is of type long and stays constant) with the following code:

pd.DataFrame({"LinkedBy": df['NAME'].iloc[0],"By_Id": df['ID'].iloc[0],"LinkedTo":df['NAME'].iloc[1:],"To_Id":df['ID'].iloc[1:]})

But I am facing problems when adding the group id. When I do the following

pd.DataFrame({"LinkedBy": df['NAME'].iloc[0],"By_Id": df['ID'].iloc[0],"LinkedTo":df['NAME'].iloc[1:],"To_Id":df['ID'].iloc[1:],"GroupId":df['group_id'].iloc[0]})

I get OverflowError: long too big to convert

How do I add the group_id of type long to my new df?

  • Can it just be a str instead? That should work: basically, cast the dtype using .astype(str). Commented Sep 14, 2016 at 20:45
  • I suspect this is because you are passing arrays of different sizes. This introduces NaNs, which forces a cast to float, and a float cannot hold integers that big. If it is not crucial, I agree that str would be a better choice. Commented Sep 14, 2016 at 20:51
  • Ideally, I would like to keep them as long. I would like to know what happens in the background while creating this df that gives the error, and whether there is another way to skin the cat, so to speak. Commented Sep 14, 2016 at 20:52
  • The problem is in the dict. "LinkedBy": df['NAME'].iloc[0] has only one entry, but "LinkedTo": df['NAME'].iloc[1:] has two. Instead of [A] you need to pass [A, A], for example with 2 * [df['NAME'].iloc[0]]. Commented Sep 14, 2016 at 20:58
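Putting the comments above together, here is a minimal sketch of the repeat-the-scalar workaround. The sample frame is reconstructed from the question (the group_id value exceeds int64, so pandas keeps it in an object-dtype column as a plain Python int); casting to str with .astype(str) would work the same way.

```python
import pandas as pd

# Sample frame reconstructed from the question.
df = pd.DataFrame({
    "ID": [205292, 475121, 475129],
    "NAME": ["A", "B", "C"],
    "group_id": [183144058824253894513539088231878865676] * 3,
})

big = df["group_id"].iloc[0]

# Repeating the scalar so its length matches the sliced Series avoids
# the scalar-broadcast conversion that raised the OverflowError:
out = pd.DataFrame({
    "LinkedBy": df["NAME"].iloc[0],
    "By_Id": df["ID"].iloc[0],
    "LinkedTo": df["NAME"].iloc[1:],
    "To_Id": df["ID"].iloc[1:],
    "GroupId": [big] * (len(df) - 1),
})
```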

2 Answers


Since group_id appears to be the same in all rows, you could try this:

res = pd.merge(left=df.iloc[:1], right=df.iloc[1:], how='right', on=['group_id'])
res.columns = ['By_Id', 'LinkedBy', 'group_id', 'To_Id', 'LinkedTo']

Note that this will only work when group_id can be used as your join key.
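For reference, a self-contained version of this approach using the sample data from the question. Note the left side is sliced with df.iloc[:1] so it stays a one-row DataFrame; passing the plain row Series df.iloc[0] to merge would fail.

```python
import pandas as pd

# Sample frame reconstructed from the question.
df = pd.DataFrame({
    "ID": [205292, 475121, 475129],
    "NAME": ["A", "B", "C"],
    "group_id": [183144058824253894513539088231878865676] * 3,
})

# Right join on group_id links the first row to every other row;
# merge suffixes the overlapping ID/NAME columns as _x and _y.
res = pd.merge(left=df.iloc[:1], right=df.iloc[1:], how="right", on=["group_id"])
res.columns = ["By_Id", "LinkedBy", "group_id", "To_Id", "LinkedTo"]
```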



  • groupby everything, then apply a custom function
  • cond1 makes sure 'group_id' matches
  • cond2 makes sure 'NAME' does not match
  • subset df inside the apply function
  • rename and drop columns
  • more renaming, dropping, and index resetting

def find_grp(x):
    # x.name holds this single-row group's (ID, NAME, group_id) key tuple
    cond1 = df.group_id == x.name[2]  # same group_id
    cond2 = df.NAME != x.name[1]      # different NAME
    temp = df[cond1 & cond2]
    rnm = dict(ID='To_ID', NAME='LinkedTo')
    return temp.drop('group_id', axis=1).rename(columns=rnm)


cols = ['ID', 'NAME', 'group_id']
df1 = df.groupby(cols).apply(find_grp)
df1.index = df1.index.droplevel(-1)
df1.rename_axis(['By_ID', 'LinkedBy', 'group_id']).reset_index()



OR

df1 = df.merge(df, on='group_id', suffixes=('_By', '_To'))
df1 = df1[df1.NAME_By != df1.NAME_To]

rnm = dict(ID_By='By_ID', ID_To='To_ID', NAME_To='LinkedTo', NAME_By='LinkedBy')

df1.rename(columns=rnm)
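A runnable version of the self-merge, again using the sample data from the question. Keep in mind that a self-merge on group_id produces every ordered pair within the group in both directions, not only the links from the first row.

```python
import pandas as pd

# Sample frame reconstructed from the question.
df = pd.DataFrame({
    "ID": [205292, 475121, 475129],
    "NAME": ["A", "B", "C"],
    "group_id": [183144058824253894513539088231878865676] * 3,
})

# Self-merge on group_id gives the cross product within each group;
# filtering NAME_By != NAME_To drops the self-links.
df1 = df.merge(df, on="group_id", suffixes=("_By", "_To"))
df1 = df1[df1.NAME_By != df1.NAME_To]

rnm = dict(ID_By="By_ID", ID_To="To_ID", NAME_To="LinkedTo", NAME_By="LinkedBy")
df1 = df1.rename(columns=rnm).reset_index(drop=True)
```

If you want only the links from the first row, as in the question's expected output, filter afterwards, e.g. df1[df1.LinkedBy == "A"].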


