Comparing pandas DataFrames where column values are lists

Question

I have some chemical data that I'm trying to process using Pandas. I have two dataframes:

C_atoms_all.head()

   id_all  index_all label_all species_all                   position
0    217          1         C           C    [6.609, 6.6024, 19.3301]
1    218          2         C           C  [4.8792, 11.9845, 14.6312]
2    219          3         C           C  [4.8373, 10.7563, 13.9466]
3    220          4         C           C  [4.7366, 10.9327, 12.5408]
4   6573          5         C           C  [1.9482, -3.8747, 19.6319]

C_atoms_a.head()

  id_a  index_a label_a species_a                    position
0   55        1       C         C    [6.609, 6.6024, 19.3302]
1   56        2       C         C  [4.8792, 11.9844, 14.6313]
2   57        3       C         C  [4.8372, 10.7565, 13.9467]
3   58        4       C         C  [4.7367, 10.9326, 12.5409]
4   59        5       C         C  [5.1528, 15.5976, 14.1249]

What I want to do is get a mapping of all of the id_all values to the id_a values where their position matches. You can see that for the first row of each dataframe (C_atoms_all.iloc[0]['id_all'] returns 217 and C_atoms_a.iloc[0]['id_a'] returns 55), the position values match within a small fudge factor, which the query should also allow for.

The problem I'm having is that I can't merge or filter on the position columns because lists aren't hashable in Python.

I'd ideally like to return a dataframe that looks like so:

  id_all  id_a                    position
     217    55    [6.609, 6.6024, 19.3301]
     ...   ...                        ...

for every row where the position values match.

ashkangh · Accepted Answer · 2021-01-29 05:41:29Z

2

You can do it as shown below. I renamed your C_atoms_all to df_all and C_atoms_a to df_a:

# First, extract the three coordinates from the "position" column of both dataframes.
df_all["val0"] = df_all["position"].str[0]
df_all["val1"] = df_all["position"].str[1]
df_all["val2"] = df_all["position"].str[2]
df_a["val0"] = df_a["position"].str[0]
df_a["val1"] = df_a["position"].str[1]
df_a["val2"] = df_a["position"].str[2]

# Because the positions only match within a small fudge factor,
# we round them to three decimal places
df_all.loc[:, ["val0", "val1", "val2"]] = df_all[["val0", "val1", "val2"]].round(3)
df_a.loc[:, ["val0", "val1", "val2"]] = df_a[["val0", "val1", "val2"]].round(3)
# We use loc to modify the original dataframe instead of a copy of it.

# Then we use merge on three extracted values from position column
df = df_all.merge(df_a, on=["val0", "val1", "val2"], left_index=False, right_index=False,
                 suffixes=(None, "_y"))

# Finally, keep only the desired columns
df = df[["id_all", "id_a", "position"]]

print(df)
   id_all  id_a                    position
0     217    55    [6.609, 6.6024, 19.3301]
1     218    56  [4.8792, 11.9845, 14.6312]
2     219    57  [4.8373, 10.7563, 13.9466]
3     220    58  [4.7366, 10.9327, 12.5408]

edited Jan 29, 2021 at 5:41

answered Jan 29, 2021 at 1:18


3 Comments

ad_intra Over a year ago

This seems to work for now! I do get some SettingWithCopyWarnings from pandas about trying to set values with a copy of a slice from the dataframe. Any way to get around this? It still seems to work, so not a big deal.

ashkangh Over a year ago

Sometimes SettingWithCopyWarning can be serious. I guess pandas raises this warning when we round the values, because the way we select those columns returns a copy of the dataframe, not the dataframe itself. There is a simple way around this, which I have edited into my answer. You can also read more about SettingWithCopyWarning here

ad_intra Over a year ago

I still get the SettingWithCopyWarning when doing the rounding, and I also get it when setting the vals using the str method to split the position values.
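
One possible way around the warning discussed in these comments is sketched below. It assumes the warning appears because C_atoms_all and C_atoms_a were themselves sliced from a larger dataframe, which the thread does not confirm. The idea is to take an explicit copy and build the coordinate columns in a single step instead of assigning into the slice. The helper name add_rounded_coords, the column names x, y, z, and the small stand-in frame are made up for illustration.

```python
import pandas as pd

def add_rounded_coords(df, decimals=3):
    """Return a copy of df with rounded x/y/z columns built from 'position'."""
    out = df.copy()  # explicit copy, so nothing is ever assigned into a slice
    coords = pd.DataFrame(
        out["position"].tolist(),   # one row of three floats per atom
        index=out.index,            # keep the original row labels for the join
        columns=["x", "y", "z"],
    ).round(decimals)
    return out.join(coords)

# Hypothetical stand-in for a frame that was sliced from a larger one
big = pd.DataFrame({
    "id_a": [55, 56, 99],
    "species_a": ["C", "C", "H"],
    "position": [[6.609, 6.6024, 19.3302],
                 [4.8792, 11.9844, 14.6313],
                 [0.0, 0.0, 0.0]],
})
C_atoms_a = big[big["species_a"] == "C"]   # a slice of `big`
df_a = add_rounded_coords(C_atoms_a)
print(df_a[["id_a", "x", "y", "z"]])
```

The same helper can be applied to C_atoms_all, after which the two results can be merged on x, y and z as in the answer above.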

Justin · Accepted Answer · 2021-01-29 01:06:07Z

1

This isn't pretty, but it might work for you

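import numpy as np
import pandas as pd

def do(x, df_a):
    # Return id_a of the first row whose position equals x exactly, or NaN if none does
    try:
        return next(df_a.loc[i, 'id_a'] for i in df_a.index if df_a.loc[i, 'position'] == x)
    except StopIteration:
        return np.nan  # the np.NAN alias was removed in NumPy 2.0

# Copy the two columns so that adding 'id_a' does not write into a slice
match = C_atoms_all[['id_all', 'position']].copy()
match['id_a'] = C_atoms_all['position'].apply(do, args=(C_atoms_a,))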
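
The exact list comparison above only pairs positions that are identical, so it does not cover the fudge factor mentioned in the question. A minimal sketch of a tolerance-based variant follows, using NumPy broadcasting to compare every pair of positions. The function name match_within_tol and the tolerance of 1e-3 are assumptions chosen for illustration, not values taken from the question.

```python
import numpy as np
import pandas as pd

def match_within_tol(df_all, df_a, tol=1e-3):
    """Pair rows whose positions agree in every coordinate to within tol."""
    pos_all = np.array(df_all["position"].tolist())  # shape (n_all, 3)
    pos_a = np.array(df_a["position"].tolist())      # shape (n_a, 3)
    # Broadcast to shape (n_all, n_a, 3), then require all three coordinates to be close
    close = (np.abs(pos_all[:, None, :] - pos_a[None, :, :]) <= tol).all(axis=2)
    i_all, i_a = np.nonzero(close)                   # positional indices of matching pairs
    return pd.DataFrame({
        "id_all": df_all["id_all"].to_numpy()[i_all],
        "id_a": df_a["id_a"].to_numpy()[i_a],
        "position": [df_all["position"].iloc[i] for i in i_all],
    })

C_atoms_all = pd.DataFrame({
    "id_all": [217, 218, 219, 220, 6573],
    "position": [[6.609, 6.6024, 19.3301], [4.8792, 11.9845, 14.6312],
                 [4.8373, 10.7563, 13.9466], [4.7366, 10.9327, 12.5408],
                 [1.9482, -3.8747, 19.6319]],
})
C_atoms_a = pd.DataFrame({
    "id_a": [55, 56, 57, 58, 59],
    "position": [[6.609, 6.6024, 19.3302], [4.8792, 11.9844, 14.6313],
                 [4.8372, 10.7565, 13.9467], [4.7367, 10.9326, 12.5409],
                 [5.1528, 15.5976, 14.1249]],
})

matches = match_within_tol(C_atoms_all, C_atoms_a)
print(matches)
```

The intermediate array holds n_all × n_a × 3 values, so this is only practical for modest table sizes. For larger structures a spatial index such as scipy.spatial.cKDTree may be a better fit.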

answered Jan 29, 2021 at 1:06


Juan Pablo · Accepted Answer · 2021-01-29 03:44:58Z

1

You can create a new column in both datasets that contains the hash of the position column and then merge both datasets by that new column.

# Custom hash function

def hash_position(position):
    return hash(tuple(position))


# Create the hash column "hashed_position"

C_atoms_all['hashed_position'] = C_atoms_all['position'].apply(hash_position)
C_atoms_a['hashed_position'] = C_atoms_a['position'].apply(hash_position)

# merge datasets

C_atoms_a.merge(C_atoms_all, how='inner', on='hashed_position')

# ... keep the columns you need
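
Since tuples are already hashable, the merge can also be done on a tuple column directly, and rounding the coordinates first lets positions that differ only in the fourth decimal share a key. The sketch below assumes three decimal places is an acceptable tolerance. The names position_key and key are hypothetical, and only three of the sample rows are used. Note that a value such as 11.9845 sits right on a three-decimal boundary, so whether it pairs with 11.9844 depends on the rounding rule; rounding-based keys can miss such pairs.

```python
import pandas as pd

def position_key(position, decimals=3):
    """Turn a list of floats into a hashable tuple of rounded values."""
    return tuple(round(v, decimals) for v in position)

C_atoms_all = pd.DataFrame({
    "id_all": [217, 220, 6573],
    "position": [[6.609, 6.6024, 19.3301],
                 [4.7366, 10.9327, 12.5408],
                 [1.9482, -3.8747, 19.6319]],
})
C_atoms_a = pd.DataFrame({
    "id_a": [55, 58, 59],
    "position": [[6.609, 6.6024, 19.3302],
                 [4.7367, 10.9326, 12.5409],
                 [5.1528, 15.5976, 14.1249]],
})

# assign() returns new frames, so the originals are left untouched
left = C_atoms_all.assign(key=C_atoms_all["position"].apply(position_key))
right = C_atoms_a.assign(key=C_atoms_a["position"].apply(position_key))

# Bring in only id_a from the right side so 'position' keeps its original name
merged = left.merge(right[["id_a", "key"]], on="key", how="inner")
result = merged[["id_all", "id_a", "position"]]
print(result)
```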

edited Jan 29, 2021 at 3:44

answered Jan 29, 2021 at 3:23


wwnde · Accepted Answer · 2021-01-29 01:46:15Z

Your question is not entirely clear, but it seems interesting to me. For that reason I have reproduced your data in a more usable format, in case someone can help more than I can.

Data

import numpy as np
import pandas as pd

C_atoms_all = pd.DataFrame({
    'id_all': [217, 218, 219, 220, 6573],
    'index_all': [1, 2, 3, 4, 5],
    'label_all': ['C', 'C', 'C', 'C', 'C'],
    'species_all': ['C', 'C', 'C', 'C', 'C'],
    'position': [[6.609, 6.6024, 19.3301], [4.8792, 11.9845, 14.6312],
                 [4.8373, 10.7563, 13.9466], [4.7366, 10.9327, 12.5408],
                 [1.9482, -3.8747, 19.6319]]})

C_atoms_a = pd.DataFrame({
    'id_a': [55, 56, 57, 58, 59],
    'index_a': [1, 2, 3, 4, 5],
    'label_a': ['C', 'C', 'C', 'C', 'C'],
    'species_a': ['C', 'C', 'C', 'C', 'C'],
    'position': [[6.609, 6.6024, 19.3302], [4.8792, 11.9844, 14.6313],
                 [4.8372, 10.7565, 13.9467], [4.7367, 10.9326, 12.5409],
                 [5.1528, 15.5976, 14.1249]]})

Solution

# New dataframe that puts both position columns side by side (rows are paired by index_all / index_a)
df3 = C_atoms_all.set_index('index_all').join(C_atoms_a.set_index('index_a').loc[:, 'position'].to_frame(), rsuffix='_r').reset_index()

# Temp column holding the per-coordinate differences, rounded to 4 decimals
df3['temp'] = df3.filter(regex='^position').apply(lambda x: np.round(np.array(x.iloc[0]) - np.array(x.iloc[1]), 4), axis=1)

# Treat a row as a match when at most one coordinate differs, i.e. at least two differences are exactly 0

C_atoms_all[df3.explode('temp').groupby(level=0)['temp'].apply(lambda x: x.eq(0).sum()).gt(1)]

   id_all  index_all label_all species_all                  position
0     217          1         C           C  [6.609, 6.6024, 19.3301]
