
I have a dataframe like such:

>>> import pandas as pd

>>> pd.read_csv('csv/10_no_headers_with_com.csv')
                  //field  field2
0   //first field is time     NaN
1                 132605     1.0
2                 132750     2.0
3                 132772     3.0
4                 132773     4.0
5                 133065     5.0
6                 133150     6.0

I would like to add another field that says whether the value of the first field starts with the comment marker, //. So far I have something like this:

# may not have a heading value, so use the index not the key
df[0].str.startswith('//')  

What would be the correct way to add on a new column with this value, so that the result is something like:

>>> pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
                       0       1       _starts_with_comment
0                 //field  field2       True
1  //first field is time     NaN       True
2                 132605       1       False
3                 132750       2       False
4                 132772       3       False
  • In case you would rather optimize the import of named columns when dealing with commented headers, please consider looking at my edit below. Commented Dec 20, 2018 at 9:35

3 Answers


What is the issue with your command? Simply assign it to a new column:

df['comment_flag'] = df[0].str.startswith('//')

Or do you indeed have mixed type columns as mentioned by jpp?


EDIT:
I'm not quite sure, but from your comments I get the impression that you don't really need an additional column of comment flags. In case you want to load the data without the comments into a dataframe, but still use the field names hidden in the commented header as column names, you might want to check this out.
Based on this text file:

//field  field2
//first field is time     NaN
132605     1.0
132750     2.0
132772     3.0
132773     4.0
133065     5.0
133150     6.0

You could do:

cmt = '//'

header = []
with open(textfilename, 'r') as f:
    for line in f:
        if line.startswith(cmt):
            header.append(line)
        else:                      # drop this branch if collecting all comments of the entire file is OK/wanted
            break
print(header)
# ['//field  field2\n', '//first field is time     NaN\n']  

This way you have the header information prepared for being used for e.g. column names.
Getting the names from the first header line and using them for the pandas import looks like this:

nms = header[0][2:].split()
df = pd.read_csv(textfilename, comment=cmt, names=nms, sep=r'\s+', engine='python')

    field  field2
0  132605     1.0
1  132750     2.0
2  132772     3.0
3  132773     4.0
4  133065     5.0
5  133150     6.0

Comments


One way is to utilise pd.to_numeric, assuming non-numeric data in the first column must indicate a comment:

df = pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
df['_starts_with_comment'] = pd.to_numeric(df[0], errors='coerce').isnull()

Just note that this kind of mixing of types within a series is strongly discouraged. Your first two series will no longer support vectorised numeric operations, as they will be stored as object dtype series. You lose some of the main benefits of Pandas.
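To see what that means in practice, here is a minimal sketch (with made-up values standing in for the first column of the file): a column mixing comment strings and numbers ends up with object dtype, and coercing it to numeric turns the comment rows into NaN.

```python
import pandas as pd

# A column that mixes strings and numbers falls back to object dtype,
# so numeric vectorised operations no longer apply directly.
mixed = pd.Series(['//field', '132605', '132750'])
print(mixed.dtype)  # object

# Coercing to numeric turns the non-numeric (comment) rows into NaN,
# which is exactly what the isnull() trick above relies on:
coerced = pd.to_numeric(mixed, errors='coerce')
print(coerced.isnull().tolist())  # [True, False, False]
```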

A much better idea is to use the csv module to extract those attributes at the top of your file and store them as separate variables. Here's an example of how you can achieve this.
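The linked example isn't reproduced here, but the idea can be sketched roughly like this (using plain file iteration over an in-memory buffer rather than the csv module; the data is made up to mirror the file in the question): peel off the leading // lines first, keep them as separate variables, and hand only the data portion to pandas.

```python
import io

import pandas as pd

# Hypothetical data standing in for the file on disk.
raw = """//field  field2
//first field is time     NaN
132605     1.0
132750     2.0
"""

buf = io.StringIO(raw)

# Collect the leading comment lines separately, remembering where they end.
comments = []
pos = buf.tell()
for line in iter(buf.readline, ''):
    if line.startswith('//'):
        comments.append(line.rstrip('\n'))
        pos = buf.tell()
    else:
        break

# Rewind to the first data line and let pandas parse only the data,
# using the names recovered from the first comment line.
buf.seek(pos)
df = pd.read_csv(buf, sep=r'\s+', names=comments[0][2:].split())
print(comments)
print(df)
```

The comment metadata stays available as ordinary Python strings, while the dataframe itself holds clean numeric columns.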

2 Comments

thanks for mentioning this approach. About the answer you linked: what if the header itself has a comment? I've actually seen that quite frequently, to designate that the first row of the csv file is a header and not data.
@David542, You'll have to write some logic to store the header separately, then add it later via df.columns = [...], where [...] is a list of strings.
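A small sketch of what that comment describes (the names and values here are made up for illustration): build the dataframe with default integer column labels, then attach the recovered names afterwards.

```python
import pandas as pd

# Dataframe as read with header=None: columns are just 0 and 1.
df = pd.DataFrame({0: [132605, 132750], 1: [1.0, 2.0]})

# Names recovered separately (e.g. parsed from a commented header line)
# can be attached after the fact:
df.columns = ['field', 'field2']
print(df)
```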

Try this:

import pandas as pd
import numpy as np

df.loc[:,'_starts_with_comment'] = np.where(df[0].str.startswith(r'//'), True, False)

6 Comments

Or just df[0].str.startswith(r'//'); np.where is not necessary. You also don't need df.loc[:, '_starts_with_comment'].
@jorge could you please explain the difference between doing np.where and just doing it without that?
@David542, as @jpp pointed out, in this example there is no difference. If you have other options in column [0] that you want to use in the new column, you can nest more np.where calls inside the np.where that I wrote. Something like np.where(df[0].str.startswith(r'//'), 'starts with //', np.where(df[0] == 132750, 'number', 'something_else')). Just keep track of the parentheses and where you place them. I find np.where very useful in my work.
@Jorge thanks for the explanation. This may be a silly question, but does pandas automatically import numpy or do I need to import that separately?
@David542, no, pandas does NOT import numpy into your namespace. You need to import it separately. As for your second question: both produce the same results. You may get a 'warning' from pandas with df['_starts_with_comment']; using .loc is for indexing purposes. I found this site that explains some of the differences: shanelynn.ie/…