I have this fossil data and want to create a new column with a unique value for each unique combination of the following two columns:

GENUS = (['Microtherium', 'Bachitherium', 'Coelodonta', ..., 'Murina',
   'Boopsis', None], dtype=object)
SPECIES = (['Microtherium', 'Bachitherium', 'Coelodonta', ..., 'Murina',
   'Boopsis', None], dtype=object)

#dropping the duplicates
dffossil[['GENUS', 'SPECIES']].drop_duplicates

Now I want a new column with a unique number for each unique GENUS and SPECIES combination.

  • Do you want a unique number (i.e. an integer) for each combination, or just a unique identifier? If you only need a unique identifier, you could try hash(Genus_String + SPECIES_str) to create a hash value for each combination within the df. Commented Jul 23, 2022 at 14:10
  • Is this pseudocode, just to show what your columns are? Displaying an actual (small) DataFrame would be more helpful, and make this a minimal reproducible example. (Also, don't forget the parentheses when calling drop_duplicates()...) Commented Jul 23, 2022 at 15:03
  • Also relevant: Pandas-specific advice for minimal reproducible examples Commented Jul 23, 2022 at 15:06
  • Does this answer your question? How to create a unique identifier based on multiple columns? Commented Sep 1, 2022 at 13:03

1 Answer

If you simply want a unique identifier for each combination of GENUS and SPECIES, you can do the following.
Note: I have assumed that either GENUS or SPECIES can contain a None value, which complicates the process slightly.

So, given a DataFrame of the form:

    GENUS           SPECIES
0   Murina          Coelodonta
1   Murina          Microtherium
2   Microtherium    Murina
3   Bachitherium    Microtherium
4   Coelodonta      None
5   Coelodonta      Coelodonta
6   Microtherium    Coelodonta
7   Microtherium    Murina
8   Microtherium    Bachitherium
9   Murina          Microtherium

Add a column, here called 'ID', that uniquely identifies each combination of GENUS and SPECIES.

Define a function that hashes each pair of entries, taking into account the possibility of a None entry.

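def hashValues(g, s):
    # Replace a missing value with the string "None" so the concatenation below works.
    if g is None:
        g = "None"
    if s is None:
        s = "None"
    # Equal (GENUS, SPECIES) pairs give the same hash within one Python session.
    return hash(g + s)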
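
One caveat with hash(g + s): Python randomizes str hashes per interpreter process by default, so these IDs are only consistent within a single run, and plain concatenation cannot tell ('ab', 'c') apart from ('a', 'bc'). If the IDs need to be reproducible, a sketch along the following lines may help. It is not part of the original answer: stable_id is a hypothetical name, and truncating a SHA-256 digest to 64 bits is an arbitrary choice made here for illustration.

```python
import hashlib


def stable_id(g, s):
    """Hypothetical helper: deterministic 64-bit ID for a (genus, species) pair."""
    # repr() keeps a missing value (None) distinct from the string "None";
    # the "\x1f" separator keeps ("ab", "c") distinct from ("a", "bc").
    key = "\x1f".join(repr(v) for v in (g, s))
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer (assumption: 64 bits is enough).
    return int.from_bytes(digest[:8], "big")


print(stable_id("Murina", "Coelodonta"))
print(stable_id("Coelodonta", None))
```

It takes the same two arguments as hashValues, so it could be swapped in wherever that function is called.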

To add the column use the following:

df['ID'] = [hashValues(g, s) for g, s in zip(df['GENUS'], df['SPECIES'])]

which yields (the exact hash values vary from run to run):

    GENUS           SPECIES         ID
0   Murina          Coelodonta      -6583287505830614713
1   Murina          Microtherium    6019734726691011903
2   Microtherium    Murina          -2318069015748475190
3   Bachitherium    Microtherium    5795352218934423262
4   Coelodonta      None            4851538573581845777
5   Coelodonta      Coelodonta      -5115794138222494493
6   Microtherium    Coelodonta      2603682196287415014
7   Microtherium    Murina          -2318069015748475190
8   Microtherium    Bachitherium    -2746445536675711990
9   Murina          Microtherium    6019734726691011903
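
If small consecutive integers are preferred over hash values, as the question seems to ask, one option is to number each pair in order of first appearance. The following is a minimal sketch rather than part of the original answer: assign_ids is a hypothetical helper name, and it assumes missing values are stored as None, as in the example above.

```python
def assign_ids(genus, species):
    """Hypothetical helper: one consecutive integer per distinct (genus, species) pair."""
    ids = {}
    # setdefault returns the existing ID for a known pair; a new pair
    # receives the next unused integer, which is len(ids) at that moment.
    return [ids.setdefault(pair, len(ids)) for pair in zip(genus, species)]


# The same ten rows as in the example DataFrame above.
genus = ["Murina", "Murina", "Microtherium", "Bachitherium", "Coelodonta",
         "Coelodonta", "Microtherium", "Microtherium", "Microtherium", "Murina"]
species = ["Coelodonta", "Microtherium", "Murina", "Microtherium", None,
           "Coelodonta", "Coelodonta", "Murina", "Bachitherium", "Microtherium"]

ids = assign_ids(genus, species)
print(ids)  # [0, 1, 2, 3, 4, 5, 6, 2, 7, 1]
```

With the DataFrame above, this could be applied as df['ID'] = assign_ids(df['GENUS'], df['SPECIES']). Rows 2 and 7 share an ID, as do rows 1 and 9, matching the repeated hash values in the output above.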