How can I read only the header column of a CSV file using Python?

Question

I am looking for a way to read just the header row of a large number of large CSV files.

Using Pandas, I have this method available, for each csv file:

>>> df = pd.read_csv(PATH_TO_CSV)
>>> df.columns

I could do this with just the csv module:

>>> reader = csv.DictReader(open(PATH_TO_CSV))
>>> reader.fieldnames

The problem with these is that each CSV file is 500MB+ in size, and it seems to be a gigantic waste to read in the entire file of each just to pull the header lines.

My end goal of all of this is to pull out unique column names. I can do that once I have a list of column headers that are in each of these files.

How can I extract only the header row of a CSV file, quickly?

Note that DictReader doesn't read the entire file, so you could just use it iteratively over the required files and build a set. I've done something similar in an answer I've made.
– Jon Clements, Commented Jul 25, 2014 at 19:17

Jarno · Accepted Answer · 2018-11-29 09:46:08Z

46

Expanding on the answer given by Jeff: it is now possible to use pandas without actually reading any rows.

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.DataFrame(np.random.randn(10, 4), columns=list('abcd')).to_csv('test.csv', mode='w')

In [4]: pd.read_csv('test.csv', index_col=0, nrows=0).columns.tolist()
Out[4]: ['a', 'b', 'c', 'd']

pandas can have the advantage that it deals more gracefully with CSV encodings.

answered Nov 29, 2018 at 9:46

Jarno


2 Comments

Mark M Over a year ago

Hi and great tip. I found replacing index_col with header got me an additional field name that I was missing. Otherwise, the rest works perfectly!

Jarno Over a year ago

@MarkMoretto I think it depends upon whether you have an extra index column without a header in your CSV or not. If not it is probably clearest to set index_col=False as header=0 is already kind of the default.

Tyler · Accepted Answer · 2016-10-18 18:35:00Z

27

I might be a little late to the party but here's one way to do it using just the Python standard library. When dealing with text data, I prefer to use Python 3 because unicode. So this is very close to your original suggestion except I'm only reading in one row rather than the whole file.

import csv

with open(fpath, 'r', newline='') as infile:
    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames  # accessing this reads only the header row

Hopefully that helps!
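A small sketch to check the claim that only one row is read. It does not touch a real file: the `counting_lines` helper and the in-memory `rows` list are hypothetical stand-ins, used only to count how many lines `csv.DictReader` pulls from its input when `fieldnames` is accessed.

```python
import csv


def counting_lines(lines, counter):
    """Yield lines unchanged while recording how many were consumed."""
    for line in lines:
        counter["consumed"] += 1
        yield line


# Hypothetical in-memory "file": one header line plus 1000 data lines.
rows = ["a,b,c\n"] + [f"{i},{i},{i}\n" for i in range(1000)]
counter = {"consumed": 0}

# DictReader accepts any iterable of strings, not just file objects.
reader = csv.DictReader(counting_lines(rows, counter))
fieldnames = reader.fieldnames  # triggers reading of the header row only

print(fieldnames, counter["consumed"])  # ['a', 'b', 'c'] 1
```

With a real file the operating system may still buffer a few kilobytes, but the parser itself stops after the header, so file size should not matter much.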

edited Oct 18, 2016 at 18:35

answered Jul 30, 2016 at 14:48

Tyler


1 Comment

Genarito Over a year ago

This should be the new accepted answer. It's the fastest and clearest method.

Jon Clements · Accepted Answer · 2014-07-25 19:15:35Z

15

I've used iglob as an example to search for the .csv files, but one way is to use a set, then adjust as necessary, eg:

import csv
from glob import iglob

unique_headers = set()
for filename in iglob('*.csv'):
    with open(filename, newline='') as fin:  # Python 3: csv expects text mode
        csvin = csv.reader(fin)
        unique_headers.update(next(csvin, []))  # [] if the file is empty
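To spell out what the last line of the loop does: `next(csvin, [])` returns the first parsed row, or an empty list if the file has no rows, and `set.update` adds each item of that list to the set. A self-contained sketch follows, assuming Python 3; the function name `collect_unique_headers` and the temporary sample files are made up for illustration.

```python
import csv
import os
import tempfile


def collect_unique_headers(paths):
    """Return the set of column names found in the first row of each file."""
    unique = set()
    for path in paths:
        with open(path, newline="") as fin:
            # First row only; the [] default avoids StopIteration on empty files.
            unique.update(next(csv.reader(fin), []))
    return unique


# Hypothetical sample data written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    samples = {"one.csv": "a,b\n1,2\n", "two.csv": "b,c\n3,4\n", "empty.csv": ""}
    paths = []
    for name, text in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "w", newline="") as fout:
            fout.write(text)
        paths.append(path)
    headers = collect_unique_headers(paths)

print(sorted(headers))  # ['a', 'b', 'c']
```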

answered Jul 25, 2014 at 19:15

Jon Clements


3 Comments

Andy Over a year ago

I compared this to the answer Jeff provided. This one is running about 5 times faster than the pandas answer for a sample of my data set. I suspect it's because it doesn't read the extra data row (I appreciate the note about DictReader as well). Thanks

Jon Clements Over a year ago

@Andy I suspect the real difference is not really the un-necessary reading of an additional row, but more the overhead of creating a DataFrame to do so...

Erik Johnsson Over a year ago

May I know what this line means? "unique_headers.update(next(csvin, []))" @JonClements

mdubez · Accepted Answer · 2017-02-28 18:22:22Z

14

What about:

pandas.read_csv(PATH_TO_CSV, nrows=1).columns

That'll read only the header and the first data row, and return the columns found.

answered Feb 28, 2017 at 18:22

mdubez


1 Comment

greendino Over a year ago

This still creates an unnecessary DataFrame from the first row.

Jeff · Accepted Answer · 2014-07-25 19:14:36Z

13

Here's one way. You get 1 row.

In [9]: DataFrame(np.random.randn(10,4),columns=list('abcd')).to_csv('test.csv',mode='w')

In [10]: read_csv('test.csv',index_col=0,nrows=1)
Out[10]: 
          a         b         c         d
0  0.365453  0.633631 -1.917368 -1.996505

answered Jul 25, 2014 at 19:14

Jeff


9 Comments

Jon Clements Over a year ago

This does read an un-necessary row though for the sake of reading the header... but maybe I'm not entirely clear on what the OP wishes

Andy Over a year ago

I appreciate the answer, Jeff. I compared your answer to the one provided by Jon. Both work, but this one runs about 5 times slower than the one he provided.

furas Over a year ago

@Jon Clements OP needs only the headers, but read_csv() doesn't run with nrows=0; read_csv() needs to read at least one row.

Jeff Over a year ago

@Andy if that matters to you then use the other solution. This is the pandas method.

furas Over a year ago

@Jeff & Jon Clements: I think you could add header=None to get the headers as a normal row, without the first row of data.


Saurabh Chandra Patel · Accepted Answer · 2019-05-29 16:47:33Z

7

You have missed the nrows=1 parameter to read_csv:

>>> df = pd.read_csv(PATH_TO_CSV, nrows=1)
>>> df.columns

answered May 29, 2019 at 16:47

Saurabh Chandra Patel


Muhieddine Alkousy · Accepted Answer · 2018-09-28 09:39:25Z

1

It depends on what the header will be used for. If you need the headers for comparison purposes only (my case), this code is simple and very fast: it reads the whole header as one string. You can then transform the collected strings according to your needs:

import glob
import os

for filename in glob.glob(os.path.join(files_path, "*.csv")):
    with open(filename) as f:
        first_line = f.readline()  # raw header line, newline included
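If the raw string later needs to become a list of column names, one option is to feed that single line back through the csv module instead of splitting on commas, so quoted names containing commas survive. This is only a sketch; `header_line_to_columns` and the sample line are hypothetical.

```python
import csv


def header_line_to_columns(first_line):
    """Parse one raw header line into a list of column names."""
    # csv.reader accepts any iterable of strings, so wrap the line in a list.
    return next(csv.reader([first_line]), [])


# Hypothetical header line with a quoted name that contains a comma.
cols = header_line_to_columns('id,"last, first",score\n')
print(cols)  # ['id', 'last, first', 'score']
```

Note that comparing raw header strings directly is sensitive to column order, quoting, and whitespace, so two files with the same columns may still produce different strings.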

answered Sep 28, 2018 at 9:39

Muhieddine Alkousy


Sway Wu · Accepted Answer · 2020-09-24 23:44:06Z

1

This is easy; you can use:

df = pd.read_csv("path.csv", skiprows=0, nrows=2)
df.columns.to_list()

This way you read only a few rows to get your header.

answered Sep 24, 2020 at 23:44

Sway Wu


blessedk · Accepted Answer · 2022-01-12 16:53:51Z

1

If you are only interested in the headers and would like to use pandas, the only extra thing you need to pass in apart from the CSV file name is "nrows=0":

headers = pd.read_csv("test.csv", nrows=0).columns.tolist()

answered Jan 12, 2022 at 16:53

blessedk


Theo · Accepted Answer · 2019-03-23 14:49:04Z

-1

import pandas as pd

get_col = list(pd.read_csv("first_test_pipe.csv",sep="|",nrows=1).columns)
print(get_col)

edited Mar 23, 2019 at 14:49

Theo


answered Mar 23, 2019 at 14:43

Aaksh Kumar

