
I have tab-delimited data (exported from Excel) that I read in with Python pandas:

import pandas as pd
data = pd.read_csv('..../file.txt', sep='\t' )

the mock data looks like this:

unwantedjunkline1
unwantedjunkline2
unwantedjunkline3
 ID     ColumnA     ColumnB     ColumnC
 1         A          B            C
 2         A          B            C
 3         A          B            C
...

The data in this case contains 3 junk lines (lines I don't want to read in) before hitting the header, and sometimes it contains 4 or more such junk lines. So in this case I read in the data with:

data = pd.read_csv('..../file.txt', sep='\t', skiprows = 3 )

data looks like:

 ID     ColumnA     ColumnB     ColumnC
 1         A          B            C
 2         A          B            C
 3         A          B            C
...

But each time the number of unwanted lines is different. Is there a way to read in a table file using pandas without 'skiprows=', using instead some option that matches the header so it knows to start reading from the header line? Then I wouldn't have to open the file to count how many unwanted lines it contains each time and manually change the 'skiprows=' option.

  • Just skip the lines yourself and pass a file object Commented Dec 1, 2015 at 19:32
  • @Padraic Cunningham sorry? i don't follow Commented Dec 1, 2015 at 19:35
  • Use open to get a file object, iterate through it until you reach the end of your junk (you'll have to work out how to detect this), then pass the file object into pd.read_csv(fileobject, ..) instead of your filepath. Commented Dec 1, 2015 at 19:42
  • @Jessica, I added an answer; once you know the header you just pass that to the function. The logic with readline and tell is all that matters; you can do whatever you want with it, i.e. take more args or just reuse the logic, it is just an example Commented Dec 1, 2015 at 19:48
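The approach sketched in these comments might look like the following. This is a hedged illustration, not the asker's code: the function name read_from_header and the "ID" header prefix are assumptions based on the mock data in the question.

```python
import pandas as pd

def read_from_header(path, header_prefix, **kwargs):
    """Advance a file object past junk lines, then hand the same
    object to pandas so parsing starts at the header line.
    header_prefix is an assumption about what the header starts with."""
    f = open(path)
    pos = 0
    line = f.readline()
    # readline() returns "" at EOF, which ends the loop safely
    while line and not line.startswith(header_prefix):
        pos = f.tell()      # offset of the start of the next line
        line = f.readline()
    if not line:
        f.close()
        raise ValueError("header not found")
    f.seek(pos)             # rewind to the start of the header line
    return pd.read_csv(f, **kwargs)
```

With the question's data this would be called as read_from_header('..../file.txt', 'ID', sep='\t'), so no skiprows count is ever needed.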

2 Answers


If you know what the header startswith:

import os
import pandas as pd

def skip_to(fle, line, **kwargs):
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        # readline() returns "" at EOF, which also ends the loop
        while cur_line and not cur_line.startswith(line):
            pos = f.tell()
            cur_line = f.readline()
        if not cur_line:
            raise ValueError("Header line not found")
        f.seek(pos)  # rewind to the start of the header line
        return pd.read_csv(f, **kwargs)

Demo:

In [18]: cat test.txt
1,2
3,4
The,header
foo,bar
foobar,foo
In [19]: df = skip_to("test.txt","The,header", sep=",")

In [20]: df
Out[20]: 
      The header
0     foo    bar
1  foobar    foo

By calling f.tell() before each readline we keep track of where the current line begins, so when we hit the header we can seek back to the start of that line and just pass the file object to pandas.

Alternatively, match on the junk lines, if they all start with something in common:

def skip_to(fle, junk, **kwargs):
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        # keep skipping while lines still look like junk
        while cur_line.startswith(junk):
            pos = f.tell()
            cur_line = f.readline()
        f.seek(pos)
        return pd.read_csv(f, **kwargs)

df = skip_to("test.txt", "junk", sep="\t")

5 Comments

@Jessica, because you need to pass sep="\t", I added an option to pass keyword arguments through to read_csv. I also missed a not in my edit ;)
How do I change this function to read an Excel file instead of a .csv? When applying it to an Excel file, the error I keep getting is: UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 17: character maps to <undefined>
@Jessica, set encoding="utf8" in the open call
I tried encoding='latin-1'; no error messages, but no output either, the script seems to run forever. @Padraic Cunningham
@Jessica, you need to specify the correct encoding or you could end up with garbage; do you know the actual encoding the data was encoded to? As far as running forever goes: if your file is very large, it could take quite a while to parse; the while loop cannot run forever, as it stops on a match or at EOF.
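The encoding fix discussed in these comments amounts to a single change to the open call in the answer's function. A sketch, assuming the data really is Latin-1 encoded (the right value depends on how the file was written):

```python
import os
import pandas as pd

def skip_to(fle, line, encoding="utf-8", **kwargs):
    # Same logic as the answer above, but with an explicit encoding on
    # open() so non-ASCII bytes decode correctly (e.g. "latin-1").
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle, encoding=encoding) as f:
        pos = 0
        cur_line = f.readline()
        # "" at EOF ends the loop instead of spinning forever
        while cur_line and not cur_line.startswith(line):
            pos = f.tell()
            cur_line = f.readline()
        f.seek(pos)  # rewind to the start of the header line
        return pd.read_csv(f, **kwargs)
```

Note that this helps only with text files whose bytes decode under the given codec; a true binary .xlsx file needs pd.read_excel instead.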

Another simple way to achieve a dynamic skiprows would be something like this, which worked for me:

import pandas as pd

# Open the file and read all lines
with open('test.txt', encoding='utf-8') as readfile:
    ls_readfile = readfile.readlines()

# Find the skiprows number, matching the line that starts with 'ID'
skip = next(filter(lambda x: x[1].startswith('ID'), enumerate(ls_readfile)))[0]
print(skip)

# Import the file with the separator \t
df = pd.read_csv('test.txt', skiprows=skip, sep='\t')
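For very large files, the same idea works without holding every line in memory, by enumerating the open file handle directly. A sketch; the helper name find_skiprows and the 'ID' header prefix are illustrative assumptions:

```python
import pandas as pd

def find_skiprows(path, prefix="ID"):
    """Scan the file line by line and return the index of the first
    line starting with prefix, i.e. the value to pass as skiprows."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if line.startswith(prefix):
                return i
    raise ValueError("header not found")

# df = pd.read_csv("test.txt", skiprows=find_skiprows("test.txt"), sep="\t")
```

This trades the readlines() list for a single pass over the file, which matters only when the junk-plus-data file is big.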
