1

I have some chemical data that I'm trying to process using Pandas. I have two dataframes:

C_atoms_all.head()

   id_all  index_all label_all species_all                   position
0    217          1         C           C    [6.609, 6.6024, 19.3301]
1    218          2         C           C  [4.8792, 11.9845, 14.6312]
2    219          3         C           C  [4.8373, 10.7563, 13.9466]
3    220          4         C           C  [4.7366, 10.9327, 12.5408]
4   6573          5         C           C  [1.9482, -3.8747, 19.6319]

C_atoms_a.head()

  id_a  index_a label_a species_a                    position
0   55        1       C         C    [6.609, 6.6024, 19.3302]
1   56        2       C         C  [4.8792, 11.9844, 14.6313]
2   57        3       C         C  [4.8372, 10.7565, 13.9467]
3   58        4       C         C  [4.7367, 10.9326, 12.5409]
4   59        5       C         C  [5.1528, 15.5976, 14.1249]

What I want is a mapping of every id_all value to the id_a value whose position matches. For example, C_atoms_all.iloc[0]['id_all'] returns 217 and C_atoms_a.iloc[0]['id_a'] returns 55, and the position values of those two rows match to within a small fudge factor, which the query should also account for.

The problem I'm having is that I can't merge or filter on the position columns because lists aren't hashable in Python.

I'd ideally like to return a dataframe that looks like so:

  id_all  id_a                    position
     217    55    [6.609, 6.6024, 19.3301]
     ...   ...                        ...

for every row where the position values match.

4 Answers

2

You can do it as follows. I refer to your C_atoms_all as df_all and C_atoms_a as df_a:

# First, extract the three coordinates from the "position" column of both dataframes.
df_all["val0"] = df_all["position"].str[0]
df_all["val1"] = df_all["position"].str[1]
df_all["val2"] = df_all["position"].str[2]
df_a["val0"] = df_a["position"].str[0]
df_a["val1"] = df_a["position"].str[1]
df_a["val2"] = df_a["position"].str[2]

# The positions match only within a small fudge factor,
# so round the coordinates to three decimal places.
# Assigning through .loc writes to the dataframe itself rather than to a copy of it.
df_all.loc[:, ["val0", "val1", "val2"]] = df_all[["val0", "val1", "val2"]].round(3)
df_a.loc[:, ["val0", "val1", "val2"]] = df_a[["val0", "val1", "val2"]].round(3)

# Then merge on the three coordinates extracted from the position column.
df = df_all.merge(df_a, on=["val0", "val1", "val2"], left_index=False, right_index=False,
                 suffixes=(None, "_y"))

# Finally, keep just the desired columns.
df = df[["id_all", "id_a", "position"]]

print(df)
   id_all  id_a                    position
0     217    55    [6.609, 6.6024, 19.3301]
1     218    56  [4.8792, 11.9845, 14.6312]
2     219    57  [4.8373, 10.7563, 13.9466]
3     220    58  [4.7366, 10.9327, 12.5408]

3 Comments

This seems to work for now! I do get some SettingWithCopyWarnings from pandas about trying to set values with a copy of a slice from the dataframe. Any way to get around this? It still seems to work, so not a big deal.
Sometimes a SettingWithCopyWarning can be serious. I guess pandas raises this warning when we round the values, because the way we select those columns returns a copy of the dataframe rather than the dataframe itself. There is a simple way around this, which I have edited into my answer. You can also read more about SettingWithCopyWarning here
I still get the SettingWithCopyWarning when doing the rounding, and I also get it when setting the vals using the str method to split the position values.
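
One possible way around the warning, sketched under the assumption that C_atoms_all and C_atoms_a were themselves sliced out of a larger dataframe (the thread does not say how they were built): work on explicit copies and build the coordinate columns in a separate frame instead of assigning into the originals. The helper name `with_rounded_coords` and the column names `x`, `y`, `z` are made up for illustration, and the data is a three-row subset of the question's.

```python
import pandas as pd

def with_rounded_coords(df, decimals=3):
    """Return a copy of df with rounded x/y/z columns split out of 'position'."""
    out = df.copy()  # explicit copy, so nothing is ever assigned into a slice
    coords = pd.DataFrame(
        out["position"].tolist(), index=out.index, columns=["x", "y", "z"]
    ).round(decimals)
    return out.join(coords)

df_all = pd.DataFrame({
    "id_all": [217, 220, 6573],
    "position": [[6.609, 6.6024, 19.3301],
                 [4.7366, 10.9327, 12.5408],
                 [1.9482, -3.8747, 19.6319]],
})
df_a = pd.DataFrame({
    "id_a": [55, 58, 59],
    "position": [[6.609, 6.6024, 19.3302],
                 [4.7367, 10.9326, 12.5409],
                 [5.1528, 15.5976, 14.1249]],
})

# Merge on the rounded coordinates; keep the left-hand "position" column unsuffixed.
merged = with_rounded_coords(df_all).merge(
    with_rounded_coords(df_a), on=["x", "y", "z"], suffixes=(None, "_a")
)
result = merged[["id_all", "id_a", "position"]]
print(result)
```

Keep in mind that rounding is only an approximation of a tolerance: two values that sit on opposite sides of a rounding boundary (for example 1.0004 and 1.0006 at three decimals) end up with different keys even though they are very close.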
1

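This isn't pretty, but it might work for you:

import numpy as np
import pandas as pd

def do(x, df_a):
    # Return the id_a of the first row in df_a whose position equals x exactly
    # (no tolerance is applied), or NaN if no row matches.
    try:
        return next(df_a.loc[i, 'id_a'] for i in df_a.index if df_a.loc[i, 'position'] == x)
    except StopIteration:
        return np.nan

match = C_atoms_all[['id_all', 'position']].copy()
match['id_a'] = C_atoms_all['position'].apply(do, args=(C_atoms_a,))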
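
Because the comparison above is exact, it will not pair positions that differ in the fourth decimal place. As a rough sketch of a tolerance-based alternative (not the answerer's method), the positions can be compared pairwise with NumPy broadcasting. The function name `match_by_position` and the default tolerance `tol=1e-3` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def match_by_position(df_all, df_a, tol=1e-3):
    """Pair rows whose positions differ by at most tol in every coordinate."""
    pos_all = np.array(df_all["position"].tolist())  # shape (n_all, 3)
    pos_a = np.array(df_a["position"].tolist())      # shape (n_a, 3)
    # diff[i, j, k] = |pos_all[i, k] - pos_a[j, k]| for every pair of rows (i, j)
    diff = np.abs(pos_all[:, None, :] - pos_a[None, :, :])
    i_all, i_a = np.nonzero((diff <= tol).all(axis=2))
    return pd.DataFrame({
        "id_all": df_all["id_all"].to_numpy()[i_all],
        "id_a": df_a["id_a"].to_numpy()[i_a],
        "position": df_all["position"].to_numpy()[i_all],
    })

C_atoms_all = pd.DataFrame({
    "id_all": [217, 218, 219, 220, 6573],
    "position": [[6.609, 6.6024, 19.3301], [4.8792, 11.9845, 14.6312],
                 [4.8373, 10.7563, 13.9466], [4.7366, 10.9327, 12.5408],
                 [1.9482, -3.8747, 19.6319]],
})
C_atoms_a = pd.DataFrame({
    "id_a": [55, 56, 57, 58, 59],
    "position": [[6.609, 6.6024, 19.3302], [4.8792, 11.9844, 14.6313],
                 [4.8372, 10.7565, 13.9467], [4.7367, 10.9326, 12.5409],
                 [5.1528, 15.5976, 14.1249]],
})

result = match_by_position(C_atoms_all, C_atoms_a)
print(result)
```

The intermediate array grows with n_all * n_a, so this is only reasonable for moderately sized frames; for large ones a spatial index such as scipy.spatial.cKDTree is probably a better fit.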

Comments

1

You can create a new column in both datasets that contains the hash of the position column and then merge both datasets by that new column.

# Custom hash function

def hash_position(position):
    return hash(tuple(position))


# Create the hash column "hashed_position"

C_atoms_all['hashed_position'] = C_atoms_all['position'].apply(hash_position)
C_atoms_a['hashed_position'] = C_atoms_a['position'].apply(hash_position)

# merge datasets

C_atoms_a.merge(C_atoms_all, how='inner', on='hashed_position')

# ... keep the columns you need
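
Note that hashing the raw floats only pairs positions that are exactly identical. A variation on the same idea, sketched here as an assumption rather than as part of the answer above, is to merge directly on a tuple of rounded coordinates: tuples are hashable, so no separate hash column is needed and hash collisions are not a concern. The helper name `position_key` and the choice of three decimals are illustrative, and the data is a three-row subset of the question's.

```python
import pandas as pd

def position_key(position, decimals=3):
    """Turn a list of coordinates into a hashable tuple of rounded floats."""
    return tuple(round(float(v), decimals) for v in position)

C_atoms_all = pd.DataFrame({
    "id_all": [217, 220, 6573],
    "position": [[6.609, 6.6024, 19.3301],
                 [4.7366, 10.9327, 12.5408],
                 [1.9482, -3.8747, 19.6319]],
})
C_atoms_a = pd.DataFrame({
    "id_a": [55, 58, 59],
    "position": [[6.609, 6.6024, 19.3302],
                 [4.7367, 10.9326, 12.5409],
                 [5.1528, 15.5976, 14.1249]],
})

# assign() returns new frames, so the original dataframes are left untouched.
all_keyed = C_atoms_all.assign(position_key=C_atoms_all["position"].apply(position_key))
a_keyed = C_atoms_a.assign(position_key=C_atoms_a["position"].apply(position_key))

matched = all_keyed.merge(a_keyed, on="position_key", suffixes=(None, "_a"))
matched = matched[["id_all", "id_a", "position"]]
print(matched)
```

As with any rounding-based key, values that straddle a rounding boundary get different keys, so this is an approximation of a true tolerance check.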

Comments

0

Your question is not entirely clear to me, but it seems interesting. I have therefore reproduced your data in a more usable format, in case someone else can help more than I can.

Data

import numpy as np
import pandas as pd

C_atoms_all = pd.DataFrame({
    'id_all': [217,218,219,220,6573],
    'index_all': [1,2,3,4,5],
    'label_all': ['C','C','C','C','C'],
    'species_all': ['C','C','C','C','C'],
    'position':[[6.609, 6.6024, 19.3301],[4.8792, 11.9845, 14.6312],[4.8373, 10.7563, 13.9466],[4.7366, 10.9327, 12.5408],[1.9482,-3.8747, 19.6319]]})


C_atoms_a = pd.DataFrame({
    'id_a': [55,56,57,58,59],
    'index_a': [1,2,3,4,5],
    'label_a': ['C','C','C','C','C'],
    'species_a': ['C','C','C','C','C'],
    'position':[[6.609, 6.6024, 19.3302],[4.8792, 11.9844, 14.6313],[4.8372, 10.7565, 13.9467],[4.7367, 10.9326, 12.5409],[5.1528, 15.5976, 14.1249]]})

Solution

# New dataframe that brings the two position columns together, aligned on index_all / index_a
df3 = C_atoms_all.set_index('index_all').join(C_atoms_a.set_index('index_a').loc[:, 'position'].to_frame(), rsuffix='_r').reset_index()

# Temporary column holding the per-coordinate differences, rounded to 4 decimals
df3['temp'] = df3.filter(regex='^position').apply(lambda x: np.round(np.array(x.iloc[0]) - np.array(x.iloc[1]), 4), axis=1)

# Assume a match when at most one coordinate differs, i.e. more than one difference is exactly 0
C_atoms_all[df3.explode('temp').groupby(level=0)['temp'].apply(lambda x: x.eq(0).sum()).gt(1)]

   id_all  index_all label_all species_all                  position
0     217          1         C           C  [6.609, 6.6024, 19.3301]
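
If the rows really are paired by index_all and index_a, as this answer assumes, a different and looser criterion is to accept a pair when every coordinate agrees within an absolute tolerance. The sketch below uses np.isclose for that; the tolerance `atol=5e-4` and the variable names are assumptions for illustration, and the result differs from the output above because the matching rule is not the same.

```python
import numpy as np
import pandas as pd

C_atoms_all = pd.DataFrame({
    "id_all": [217, 218, 219, 220, 6573],
    "index_all": [1, 2, 3, 4, 5],
    "position": [[6.609, 6.6024, 19.3301], [4.8792, 11.9845, 14.6312],
                 [4.8373, 10.7563, 13.9466], [4.7366, 10.9327, 12.5408],
                 [1.9482, -3.8747, 19.6319]],
})
C_atoms_a = pd.DataFrame({
    "id_a": [55, 56, 57, 58, 59],
    "index_a": [1, 2, 3, 4, 5],
    "position": [[6.609, 6.6024, 19.3302], [4.8792, 11.9844, 14.6313],
                 [4.8372, 10.7565, 13.9467], [4.7367, 10.9326, 12.5409],
                 [5.1528, 15.5976, 14.1249]],
})

# Pair rows by their running index, keeping both position columns side by side.
paired = C_atoms_all.merge(
    C_atoms_a, left_on="index_all", right_on="index_a", suffixes=("_all", "_a")
)
pos_all = np.array(paired["position_all"].tolist())
pos_a = np.array(paired["position_a"].tolist())

# A pair matches when all three coordinates agree within the absolute tolerance.
close = np.isclose(pos_all, pos_a, rtol=0.0, atol=5e-4).all(axis=1)
matches = paired.loc[close, ["id_all", "id_a", "position_all"]]
print(matches)
```

With this rule the first four pairs are accepted and the fifth is rejected, since its coordinates differ by far more than the tolerance.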

Comments
