
Hi, I would like to split the following data frame so that every time a new "sample" row appears, a new DataFrame is created.

For example, the algorithm should group all analytes, CAS numbers, and values below sample1 into one DataFrame, then start a new DataFrame once it hits sample2, and so on.

I'm new to pandas and Python, so thank you in advance.

[Image: LAB DATA]

  • Look for tutorials on for loops, lists and DF.loc. The combination of the three will help you. Commented Aug 1, 2020 at 20:21

1 Answer

import pandas as pd

# Create DataFrame
data = [{'analyte': 'sample1'},
        {'analyte': 'bacon', 'CAS1': 1},
        {'analyte': 'eggs', 'CAS1': 2},
        {'analyte': 'money', 'CAS1': 3, 'CAS2': 1, 'Value2': 1.11},
        {'analyte': 'shoe', 'CAS1': 4},
        {'analyte': 'boy', 'CAS1': 5},
        {'analyte': 'girl', 'CAS1': 6},
        {'analyte': 'onion', 'CAS1': 7, 'CAS2': 4, 'Value2': 6.53},
        {'analyte': 'sample2'},
        {'analyte': 'bacon', 'CAS1': 1},
        {'analyte': 'eggs', 'CAS1': 2, 'CAS2': 1, 'Value2': 7.88},
        {'analyte': 'money', 'CAS1': 3},
        {'analyte': 'shoe', 'CAS1': 4, 'CAS2': 3, 'Value2': 15.5},
        {'analyte': 'boy', 'CAS1': 5},
        {'analyte': 'girl', 'CAS1': 6},
        {'analyte': 'onion', 'CAS1': 7}]
df = pd.DataFrame(data)

# Create list of row indices for each sample name
# For above example: s = [0, 8, 16]
s = list(df['analyte'].index[df['analyte'].str[:6] == 'sample']) + [len(df)]

# Create new dataframes for each sample and print results
samples = {}
for i, j in zip(s, s[1:]):
    sample_df = df.iloc[i+1 : j, :].reset_index(drop=True)
    sample_name = df.iloc[i].loc['analyte']
    samples.update( {sample_name : sample_df} )

print(samples['sample2'])

Other options:

# if CAS1 cell of sample row is NaN (this option needs numpy)
import numpy as np
sample_indices = list(df['CAS1'].index[df['CAS1'].apply(np.isnan)]) + [len(df)]

# if CAS1 cell of sample row is NaN or None
sample_indices = list(df['CAS1'].index[df['CAS1'].isnull()]) + [len(df)]

# if CAS1 cell of sample row is an empty string
sample_indices = list(df['CAS1'].index[df['CAS1'] == '']) + [len(df)]
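
The options above each test a single condition. If a sample row's CAS1 cell might be either missing or an empty string, the two masks can be combined. The following is a minimal sketch on a small made-up DataFrame (the data and the name is_sample_row are illustrative, not from the original post). It uses the element-wise | operator, since Python's 'or' keyword raises a ValueError on a pandas Series.

```python
import pandas as pd

# Hypothetical data: one sample row has a missing CAS1 cell, the other an empty string
df = pd.DataFrame({
    'analyte': ['sample1', 'bacon', 'eggs', 'sample2', 'money'],
    'CAS1': [None, 1, 2, '', 3],
})

# Element-wise OR of the two boolean masks.
# The parentheses around the comparison are required because | binds tighter than ==.
is_sample_row = df['CAS1'].isnull() | (df['CAS1'] == '')

# Row positions of the sample rows, plus the end of the DataFrame as a final boundary
sample_indices = df.index[is_sample_row].tolist() + [len(df)]
print(sample_indices)  # [0, 3, 5]
```

As with the single-condition options, this only works if no analyte row below a sample has a missing or empty CAS1 cell.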

8 Comments

Awesome, this worked. Thank you! One question: if each of the output dataframes is df1, how do I access them separately?
If you don't want to use numpy, you can substitute .apply(np.isnan) with .isnull().
Thanks. One last question: does this method rely on CAS1 repeating itself? For example, would having a "sample 3" with CAS1 values of "18, 6, 3, 7, etc." throw off your algorithm? Thanks again.
It only matters what value is in the CAS1 column of the sample row, and that value cannot occur in any of the CAS1 cells below the sample. If the CAS1 sample cell is None or NaN, the algorithm will work using df['CAS1'].isnull(), but only if the CAS1 cells below the sample are not None or NaN. If the sample cell is an empty string, then replace df['CAS1'].isnull() with df['CAS1'] == "". This then requires the CAS1 cells under the sample to not be an empty string. You can also combine both checks with the element-wise | operator, such as (df['CAS1'].isnull() | (df['CAS1'] == "")). Python's 'or' keyword does not work here because it raises a ValueError on a pandas Series.
1. The example above has NaNs in CAS1, not blank spaces. 2. I would not use regex if the sample always begins with the same letters. I put another option above that works. 3. You could append both the sample name and the DataFrame to a 2D list or tuple, then select the DataFrame that you want by using the sample name or the index.
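
On the question of accessing each output DataFrame separately: here is a minimal sketch, assuming the DataFrames are stored in a dict keyed by sample name as in the answer. The data is a shortened, made-up version of the example, and the names first, pairs, second_name and second_df are illustrative only.

```python
import pandas as pd

# Shortened, hypothetical version of the example data
df = pd.DataFrame([
    {'analyte': 'sample1'},
    {'analyte': 'bacon', 'CAS1': 1},
    {'analyte': 'eggs', 'CAS1': 2},
    {'analyte': 'sample2'},
    {'analyte': 'money', 'CAS1': 3},
])

# Row positions of the sample rows, plus the end of the DataFrame
s = df.index[df['analyte'].str.startswith('sample')].tolist() + [len(df)]

# One DataFrame per sample, keyed by the sample name
samples = {}
for i, j in zip(s, s[1:]):
    samples[df.loc[i, 'analyte']] = df.iloc[i + 1:j].reset_index(drop=True)

# Access a single DataFrame by its sample name
first = samples['sample1']

# Loop over all samples
for name, sample_df in samples.items():
    print(name, len(sample_df))

# A list of (name, DataFrame) tuples allows access by position instead
pairs = list(samples.items())
second_name, second_df = pairs[1]
```

A dict keeps lookups by name simple. The list of tuples is only needed if the position of a sample matters more than its name.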