
I have tab-delimited data (exported from Excel) that I read in with Python pandas:

import pandas as pd
data = pd.read_csv('..../file.txt', sep='\t' )

the mock data looks like this:

unwantedjunkline1
unwantedjunkline2
unwantedjunkline3
 ID     ColumnA     ColumnB     ColumnC
 1         A          B            C
 2         A          B            C
 3         A          B            C
...

The data in this case contains 3 junk lines (lines I don't want to read in) before hitting the header, and sometimes it contains 4 or more such junk lines. So in this case I read in the data with:

data = pd.read_csv('..../file.txt', sep='\t', skiprows = 3 )

data looks like:

 ID     ColumnA     ColumnB     ColumnC
 1         A          B            C
 2         A          B            C
 3         A          B            C
...

But each time the number of unwanted lines is different. Is there a way to read in a table file using pandas without 'skiprows=', using instead some option that matches the header so it knows to start reading from the header line? Then I wouldn't have to open the file to count how many unwanted lines it contains each time and manually change the 'skiprows=' option.

  • Just skip the lines yourself and pass a file object Commented Dec 1, 2015 at 19:32
  • @Padraic Cunningham sorry? i don't follow Commented Dec 1, 2015 at 19:35
  • Use open to get a file object, iterate through it until you reach the end of your junk (you'll have to work out how to detect this), then pass the file object into pd.read_csv(fileobject, ..) instead of your filepath. Commented Dec 1, 2015 at 19:42
  • @Jessica, I added an answer; once you know the header you just pass that to the function. The logic with readline and tell is all that matters; you can do whatever you want with it, i.e. take more args or just reuse the logic, it is just an example Commented Dec 1, 2015 at 19:48
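The approach sketched in these comments might look like the following. This is a hedged illustration, not the asker's code: the function name read_from_header and the "ID" header prefix are assumptions based on the mock data in the question.

```python
import pandas as pd

def read_from_header(path, header_prefix, **kwargs):
    """Advance a file object past junk lines, then hand the same
    object to pandas so parsing starts at the header line.
    header_prefix is an assumption about what the header starts with."""
    f = open(path)
    pos = 0
    line = f.readline()
    # readline() returns "" at EOF, which ends the loop safely
    while line and not line.startswith(header_prefix):
        pos = f.tell()      # offset of the start of the next line
        line = f.readline()
    if not line:
        f.close()
        raise ValueError("header not found")
    f.seek(pos)             # rewind to the start of the header line
    return pd.read_csv(f, **kwargs)
```

With the question's data this would be called as read_from_header('..../file.txt', 'ID', sep='\t'), so no skiprows count is ever needed.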

2 Answers


If you know what the header startswith:

import os
import pandas as pd

def skip_to(fle, line, **kwargs):
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        # readline() returns "" at EOF, which also ends the loop
        while cur_line and not cur_line.startswith(line):
            pos = f.tell()
            cur_line = f.readline()
        if not cur_line:
            raise ValueError("Header line not found")
        f.seek(pos)  # rewind to the start of the header line
        return pd.read_csv(f, **kwargs)

Demo:

In [18]: cat test.txt
1,2
3,4
The,header
foo,bar
foobar,foo
In [19]: df = skip_to("test.txt","The,header", sep=",")

In [20]: df
Out[20]: 
      The header
0     foo    bar
1  foobar    foo

By calling f.tell() before each readline we keep track of where the current line begins, so when we hit the header we can seek back to the start of that line and just pass the file object to pandas.

Alternatively, match on the junk lines, if they all start with something in common:

def skip_to(fle, junk, **kwargs):
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        # keep skipping while lines still look like junk
        while cur_line.startswith(junk):
            pos = f.tell()
            cur_line = f.readline()
        f.seek(pos)
        return pd.read_csv(f, **kwargs)

df = skip_to("test.txt", "junk", sep="\t")

5 Comments

@Jessica, because you need to pass sep="\t", I added an option to pass keyword arguments through to read_csv. I also missed a not in my edit ;)
How do I change this function to read an Excel file instead of a .csv? When applying it to an Excel file, the error I keep getting is: UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 17: character maps to <undefined>
@Jessica, set encoding="utf8" in the open call
I tried encoding='latin-1'; no error messages, but no output either, the script seems to run forever. @Padraic Cunningham
@Jessica, you need to specify the correct encoding or you could end up with garbage; do you know the actual encoding the data was encoded to? As far as running forever goes: if your file is very large, it could take quite a while to parse; the while loop cannot run forever, as it stops on a match or at EOF.
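The encoding fix discussed in these comments amounts to a single change to the open call in the answer's function. A sketch, assuming the data really is Latin-1 encoded (the right value depends on how the file was written):

```python
import os
import pandas as pd

def skip_to(fle, line, encoding="utf-8", **kwargs):
    # Same logic as the answer above, but with an explicit encoding on
    # open() so non-ASCII bytes decode correctly (e.g. "latin-1").
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle, encoding=encoding) as f:
        pos = 0
        cur_line = f.readline()
        # "" at EOF ends the loop instead of spinning forever
        while cur_line and not cur_line.startswith(line):
            pos = f.tell()
            cur_line = f.readline()
        f.seek(pos)  # rewind to the start of the header line
        return pd.read_csv(f, **kwargs)
```

Note that this helps only with text files whose bytes decode under the given codec; a true binary .xlsx file needs pd.read_excel instead.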

Another simple way to achieve a dynamic skiprows would be something like this, which worked for me:

import pandas as pd

# Open the file and read all lines
with open('test.txt', encoding='utf-8') as readfile:
    ls_readfile = readfile.readlines()

# Find the skiprows number, matching the line that starts with 'ID'
skip = next(filter(lambda x: x[1].startswith('ID'), enumerate(ls_readfile)))[0]
print(skip)

# Import the file with the separator \t
df = pd.read_csv('test.txt', skiprows=skip, sep='\t')
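For very large files, the same idea works without holding every line in memory, by enumerating the open file handle directly. A sketch; the helper name find_skiprows and the 'ID' header prefix are illustrative assumptions:

```python
import pandas as pd

def find_skiprows(path, prefix="ID"):
    """Scan the file line by line and return the index of the first
    line starting with prefix, i.e. the value to pass as skiprows."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if line.startswith(prefix):
                return i
    raise ValueError("header not found")

# df = pd.read_csv("test.txt", skiprows=find_skiprows("test.txt"), sep="\t")
```

This trades the readlines() list for a single pass over the file, which matters only when the junk-plus-data file is big.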
