58

I'm new to using pandas and am writing a script where I read in a dataframe and then do some computation on some of the columns.

Sometimes I will have the column called "Met":

df = pd.read_csv(File, 
  sep='\t', 
  compression='gzip', 
  header=0, 
  names=["Chrom", "Site", "coverage", "Met"]
)

Other times I will have:

df = pd.read_csv(File, 
  sep='\t', 
  compression='gzip', 
  header=0, 
  names=["Chrom", "Site", "coverage", "freqC"]
)

I need to do some computation with the "Met" column, so if it isn't present I need to calculate it using:

df['Met'] = df['freqC'] * df['coverage'] 

Is there a way to check whether the "Met" column is present in the dataframe and, if not, add it?

4 Answers

101

You check it like this:

if 'Met' not in df:
    df['Met'] = df['freqC'] * df['coverage'] 
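
For a runnable illustration of the check above, here is a minimal sketch. The helper name `ensure_met` and the two small sample frames are hypothetical stand-ins for the two file layouts in the question; they are not part of the original answer.

```python
import pandas as pd

def ensure_met(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 'Met' column in place if the frame does not already have one."""
    # `'Met' in df` tests the column labels, like `'Met' in df.columns`.
    if 'Met' not in df:
        df['Met'] = df['freqC'] * df['coverage']
    return df

# Hypothetical sample data standing in for the two file layouts.
with_met = pd.DataFrame(
    {"Chrom": ["chr1"], "Site": [100], "coverage": [10], "Met": [4.0]}
)
with_freq = pd.DataFrame(
    {"Chrom": ["chr1"], "Site": [100], "coverage": [10], "freqC": [0.5]}
)

ensure_met(with_met)   # 'Met' already present, left untouched
ensure_met(with_freq)  # 'Met' computed from freqC * coverage
```

Because the product is only evaluated inside the `if`, a frame that has "Met" but no "freqC" never touches the missing column.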

1 Comment

See stackoverflow.com/a/62449676/14555505 for how to add in multiple columns iff they don't exist
11

To add a column conditionally inside a method chain, consider using pipe() with a lambda:

df.pipe(lambda d: (
    d.assign(Met=d['freqC'] * d['coverage'])
    if 'Met' not in d else d
))
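
To show how this might sit inside a longer chain, here is a small sketch. The named function `add_met_if_missing`, the sample data, and the follow-up `query` step are assumptions made for illustration; only the `pipe`/`assign` pattern comes from the answer.

```python
import pandas as pd

def add_met_if_missing(d: pd.DataFrame) -> pd.DataFrame:
    # assign() returns a new frame, so the input is left unchanged.
    if 'Met' in d:
        return d
    return d.assign(Met=d['freqC'] * d['coverage'])

# Hypothetical input without a 'Met' column.
raw = pd.DataFrame(
    {
        "Chrom": ["chr1", "chr1"],
        "Site": [100, 200],
        "coverage": [10, 4],
        "freqC": [0.5, 0.25],
    }
)

result = (
    raw
    .pipe(add_met_if_missing)   # conditional step
    .query("Met >= 5")          # later steps can rely on 'Met' existing
    .reset_index(drop=True)
)
```

A named function does the same job as the lambda; it may be easier to read once the condition grows beyond a single column.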

2 Comments

Nice terse solution for chaining
Better still, you can drop the pipe and drop the negation from the if statement: df.assign(Met=lambda d: d.Met if 'Met' in d else d.freqC * d.coverage).
6

If you were creating the dataframe from scratch, you could create the missing columns without a loop merely by passing the column names into the pd.DataFrame() call:

cols = ['column 1','column 2','column 3','column 4','column 5']
df = pd.DataFrame(list_or_dict, index=['a',], columns=cols)
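
As a concrete sketch of that idea, assuming the input is a dict of columns (the `record` data below is hypothetical): any name listed in `columns` that is absent from the dict is created and filled with NaN.

```python
import pandas as pd

# Hypothetical input that lacks two of the requested columns.
record = {"Chrom": ["chr1"], "Site": [100], "coverage": [10]}
cols = ["Chrom", "Site", "coverage", "freqC", "Met"]

# Columns named in `cols` but missing from `record` are created as NaN.
df = pd.DataFrame(record, columns=cols)
```

Note that after this call `'Met' in df` is true even though the column holds no values, so a check such as `df['Met'].isna().all()` would be needed before deciding whether to compute it.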

6

Alternatively you can use get:

df['Met'] = df.get('Met', df['freqC'] * df['coverage'])    

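If the column Met exists, its values are kept; otherwise the product of freqC and coverage is used. Note that Python evaluates the default argument before get is called, so the product is computed either way and freqC must exist even when Met is already present.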
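
A hedged sketch of a `get`-based variant that computes the product only when it is needed. The helper `met_or_computed` and the sample frames are hypothetical; the final `try` block simply demonstrates what happens to the one-liner on a frame that has "Met" but no "freqC".

```python
import pandas as pd

def met_or_computed(df: pd.DataFrame) -> pd.Series:
    """Return the 'Met' column, computing it only when it is absent."""
    met = df.get('Met')  # DataFrame.get returns None if the column is missing
    if met is None:
        met = df['freqC'] * df['coverage']
    return met

# Hypothetical frames: one with 'Met', one with 'freqC'.
with_met = pd.DataFrame({"coverage": [10], "Met": [4.0]})
with_freq = pd.DataFrame({"coverage": [10], "freqC": [0.5]})

with_met['Met'] = met_or_computed(with_met)
with_freq['Met'] = met_or_computed(with_freq)

# The one-liner builds its default argument before get() runs, so it
# raises KeyError on a frame that has 'Met' but no 'freqC'.
try:
    with_met.get('Met', with_met['freqC'] * with_met['coverage'])
except KeyError:
    one_liner_failed = True
else:
    one_liner_failed = False
```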

2 Comments

I think this solution is correct, but it's not as efficient as the others because the assignment and the product are always performed.
EDIT: In fact it fails whenever freqC is missing: the DataFrame has either Met or freqC but not both, and df['freqC'] is evaluated before get is called. To be correct you should do something like df['Met'] = df.get('Met', df.get('freqC') * df['coverage']) (notice the new get for freqC)
