
I have a dataframe like such:

>>> import pandas as pd

>>> pd.read_csv('csv/10_no_headers_with_com.csv')
                  //field  field2
0   //first field is time     NaN
1                 132605     1.0
2                 132750     2.0
3                 132772     3.0
4                 132773     4.0
5                 133065     5.0
6                 133150     6.0

I would like to add another field that says whether the value of the first field starts with the comment marker, //. So far I have something like this:

# may not have a heading value, so use the index not the key
df[0].str.startswith('//')  

What would be the correct way to add on a new column with this value, so that the result is something like:

>>> pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
                       0       1       _starts_with_comment
0                 //field  field2       True
1  //first field is time     NaN       True
2                 132605       1       False
3                 132750       2       False
4                 132772       3       False
  • In case you would rather optimize the import of named columns when dealing with commented headers, please consider looking at my edit below. Commented Dec 20, 2018 at 9:35

3 Answers


What is the issue with your command? Simply assign it to a new column:

df['comment_flag'] = df[0].str.startswith('//')

Or do you indeed have mixed type columns as mentioned by jpp?


EDIT:
I'm not quite sure, but from your comments I get the impression that you don't really need an additional column of comment flags. In case you want to load the data without the comments into a dataframe, but still use the field names hidden in the commented header as column names, you might want to check this out.
Based on this text file:

//field  field2
//first field is time     NaN
132605     1.0
132750     2.0
132772     3.0
132773     4.0
133065     5.0
133150     6.0

You could do:

cmt = '//'

header = []
with open(textfilename, 'r') as f:
    for line in f:
        if line.startswith(cmt):
            header.append(line)
        else:                      # drop this branch if collecting all comments of the entire file is OK/wanted
            break
print(header)
# ['//field  field2\n', '//first field is time     NaN\n']  

This way you have the header information prepared for being used for e.g. column names.
Getting the names from the first header line and using them for the pandas import looks like this:

nms = header[0][2:].split()
df = pd.read_csv(textfilename, comment=cmt, names=nms, sep=r'\s+', engine='python')

    field  field2
0  132605     1.0
1  132750     2.0
2  132772     3.0
3  132773     4.0
4  133065     5.0
5  133150     6.0

Comments


One way is to utilise pd.to_numeric, assuming non-numeric data in the first column must indicate a comment:

df = pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
df['_starts_with_comment'] = pd.to_numeric(df[0], errors='coerce').isnull()

Just note that this kind of mixing of types within a series is strongly discouraged. Your first two series will no longer support vectorised numeric operations, as they will be stored as object dtype series. You lose some of the main benefits of Pandas.
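To see what that means in practice, here is a minimal sketch (with made-up values standing in for the first column of the file): a column mixing comment strings and numbers ends up with object dtype, and coercing it to numeric turns the comment rows into NaN.

```python
import pandas as pd

# A column that mixes strings and numbers falls back to object dtype,
# so numeric vectorised operations no longer apply directly.
mixed = pd.Series(['//field', '132605', '132750'])
print(mixed.dtype)  # object

# Coercing to numeric turns the non-numeric (comment) rows into NaN,
# which is exactly what the isnull() trick above relies on:
coerced = pd.to_numeric(mixed, errors='coerce')
print(coerced.isnull().tolist())  # [True, False, False]
```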

A much better idea is to use the csv module to extract those attributes at the top of your file and store them as separate variables. Here's an example of how you can achieve this.
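The linked example isn't reproduced here, but the idea can be sketched roughly like this (using plain file iteration over an in-memory buffer rather than the csv module; the data is made up to mirror the file in the question): peel off the leading // lines first, keep them as separate variables, and hand only the data portion to pandas.

```python
import io

import pandas as pd

# Hypothetical data standing in for the file on disk.
raw = """//field  field2
//first field is time     NaN
132605     1.0
132750     2.0
"""

buf = io.StringIO(raw)

# Collect the leading comment lines separately, remembering where they end.
comments = []
pos = buf.tell()
for line in iter(buf.readline, ''):
    if line.startswith('//'):
        comments.append(line.rstrip('\n'))
        pos = buf.tell()
    else:
        break

# Rewind to the first data line and let pandas parse only the data,
# using the names recovered from the first comment line.
buf.seek(pos)
df = pd.read_csv(buf, sep=r'\s+', names=comments[0][2:].split())
print(comments)
print(df)
```

The comment metadata stays available as ordinary Python strings, while the dataframe itself holds clean numeric columns.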

2 Comments

thanks for mentioning this approach. About the answer you linked: what if the header itself has a comment? I've actually seen that quite frequently, to designate that the first row of the csv file is a header and not data.
@David542, You'll have to write some logic to store the header separately, then add it later via df.columns = [...], where [...] is a list of strings.
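A small sketch of what that comment describes (the names and values here are made up for illustration): build the dataframe with default integer column labels, then attach the recovered names afterwards.

```python
import pandas as pd

# Dataframe as read with header=None: columns are just 0 and 1.
df = pd.DataFrame({0: [132605, 132750], 1: [1.0, 2.0]})

# Names recovered separately (e.g. parsed from a commented header line)
# can be attached after the fact:
df.columns = ['field', 'field2']
print(df)
```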

Try this:

import pandas as pd
import numpy as np

df.loc[:,'_starts_with_comment'] = np.where(df[0].str.startswith(r'//'), True, False)

6 Comments

Or just df[0].str.startswith(r'//'); np.where is not necessary. You also don't need df.loc[:, '_starts_with_comment'].
@jorge could you please explain the difference between doing np.where and just doing it without that?
@David542, as @jpp pointed out, in this example there is no difference. If you have other options in column [0] that you want to use in the new column, you can nest more np.where calls inside the np.where that I wrote. Something like np.where(df[0].str.startswith(r'//'), 'starts with //', np.where(df[0] == 132750, 'number', 'something_else')). Just keep track of the parentheses and where you place them. I find np.where very useful in my work.
@Jorge thanks for the explanation. This may be a silly question, but does pandas automatically import numpy or do I need to import that separately?
@David542, no, pandas does NOT import numpy into your namespace. You need to import it separately. As for your second question: both produce the same results. You may get a 'warning' from pandas with df['_starts_with_comment']; using .loc is for indexing purposes. I found this site that explains some of the differences: shanelynn.ie/…