1

I have a DataFrame with species and their occurence. An excerpt follows:

+-------------+------------+-------------------------------------------+--------------+-----------------+-------------+-----------+------------+----------+
| Family      | Genus      | Species                                   | Species code | Nearctic Mexico | Mesoamerica | Caribbean | Los Llanos | Amazonia |
+-------------+------------+-------------------------------------------+--------------+-----------------+-------------+-----------+------------+----------+
| Acanthaceae | Aphelandra | Aphelandra castaneifolia Britton ex Rusby | Aphecast     |                 |             |           |            |          |
+-------------+------------+-------------------------------------------+--------------+-----------------+-------------+-----------+------------+----------+
| Acanthaceae | Aphelandra | Aphelandra pilosa Leonard                 | Aphpilo      |                 | +           |           | +          | +        |
+-------------+------------+-------------------------------------------+--------------+-----------------+-------------+-----------+------------+----------+
| Acanthaceae | Aphelandra | Aphelandra pulcherrima (Jacq.) Kunth      | Aphepulc     |                 | +           | +         | +          | +        |
+-------------+------------+-------------------------------------------+--------------+-----------------+-------------+-----------+------------+----------+
| Acanthaceae | Aphelandra | Aphelandra sinclairiana Nees              | Aphesinc     |                 | +           |           |            |          |
+-------------+------------+-------------------------------------------+--------------+-----------------+-------------+-----------+------------+----------+

I need to analyze genera only, but preserve their occurences as follows. I'm sorry if that's a real rookie question, but I'm having trouble putting it in words what I intend to do. This is the expected result:

+-------------+------------+-----------------+-------------+-----------+------------+----------+
| Family      | Genus      | Nearctic Mexico | Mesoamerica | Caribbean | Los Llanos | Amazonia |
+-------------+------------+-----------------+-------------+-----------+------------+----------+
| Acanthaceae | Aphelandra |                 | +           | +         | +          | +        |
+-------------+------------+-----------------+-------------+-----------+------------+----------+

I have found similar problems, but dealing with numeric data and I couldn't manage to adapt the solution.

1
  • Do you want the string equivalent of a "max" of the location columns? So if there is a plus anywhere for that Genus then give it a plus? Commented Jan 28, 2021 at 7:08

1 Answer 1

1

Use GroupBy.first for first non misisng values for 5th+ columns:

#if necessary replace empty space to missing values
df = df.replace('',np.nan)

cols = df.columns[4:].tolist()
df = df.groupby(['Family','Genus'], as_index=False)[cols].first()
print (df)
        Family       Genus  Nearctic Mexico Mesoamerica Caribbean Los Llanos  \
0  Acanthaceae  Aphelandra              NaN           +         +          +   

  Amazonia  
0        +  
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.