Getting a single file from a tar file using the tarfile lib in python

Question

I am trying to grab a single file from a tar archive. Using the tarfile library, I can find the members with the right extension, adapting the example from the tarfile documentation:

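def xml_member_files(self, members):
    # Yield only the members whose name ends in .xml
    for tarinfo in members:
        if os.path.splitext(tarinfo.name)[1] == ".xml":
            yield tarinfo


    # Elsewhere in the same class, with tar being the open TarFile:
    member_file = self.xml_member_files(tar)
    for m in member_file:
        print(m.name)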

This is great and the output is:

RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutBeta.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutGamma.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutSigma.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/product.xml

If I look for just product.xml, it doesn't work. I tried this:

    ti = tar.getmember('product.xml')
    print(ti.name)

and it doesn't find product.xml, I am guessing because of the path information in front of the file name. I have no idea how to retrieve just that path so I can get at product.xml once it is extracted (it feels like I am doing things the hard way anyway). How do I figure out that path so I can pass it to my other file functions, which read and load the XML file after it is the only file extracted from the tar file?

Please review my answer below, and upvote or mark it as accepted if it helped you think through the problem.
– Alex G Rice, Commented Dec 20, 2016 at 17:38

pbuck · Accepted Answer · 2016-12-17 00:43:52Z

3

Get the full path by iterating over the result of getnames(). For example, to get the full path for lutBeta.xml:

import os
import tarfile
tar = tarfile.open('mytarfile.tar')  # open() is the documented entry point and also handles compressed archives
membername = [x for x in tar.getnames() if os.path.basename(x) == 'lutBeta.xml'][0]
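
One thing to watch with the list comprehension above: indexing with [0] raises IndexError when nothing matches. A minimal sketch of a more defensive variant follows. find_member_name is a hypothetical helper (not part of tarfile), and the in-memory archive with its RS2_demo/ prefix is made-up demo data standing in for a real file.

```python
import io
import os
import tarfile


def find_member_name(tar, basename):
    """Return the first stored path whose basename matches, or None.

    Hypothetical helper for illustration; not part of the tarfile module.
    """
    return next(
        (name for name in tar.getnames() if os.path.basename(name) == basename),
        None,
    )


# Build a tiny in-memory archive (demo data only, stands in for a real tar).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as out:
    for name, payload in [
        ("RS2_demo/lutBeta.xml", b"<lut/>"),
        ("RS2_demo/product.xml", b"<product/>"),
    ]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        out.addfile(info, io.BytesIO(payload))
buf.seek(0)

with tarfile.open(fileobj=buf, mode="r") as tar:
    member_name = find_member_name(tar, "product.xml")
    missing_name = find_member_name(tar, "absent.xml")
    # extractfile() accepts either a member name or a TarInfo object.
    xml_bytes = tar.extractfile(member_name).read()

print(member_name)
```

The returned name includes the directory prefix, so it can be passed straight to getmember() or extractfile().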


Alex G Rice · Accepted Answer · 2016-12-16 22:19:02Z

1

I would first try TarFile.getnames(), which I imagine works a lot like tar tzf filename.tar.gz on the command line. Then you can find out what paths to feed to getmember().
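
To illustrate the idea, here is a rough sketch that assumes a small in-memory archive whose RS2_demo/ paths are invented for the demo. It lists the stored names, shows that getmember() rejects the bare file name with KeyError, and then looks up the member by its full path.

```python
import io
import tarfile

# Build a small in-memory archive as a stand-in for a real file (demo data only).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as out:
    for name in ("RS2_demo/lutBeta.xml", "RS2_demo/product.xml"):
        payload = b"<xml/>"
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        out.addfile(info, io.BytesIO(payload))
buf.seek(0)

with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()  # every member path stored in the archive
    for name in names:
        print(name)

    # The bare file name is not a stored member name, so this lookup fails.
    try:
        tar.getmember("product.xml")
        bare_name_found = True
    except KeyError:
        bare_name_found = False

    # The full stored path works and returns a TarInfo object.
    member = tar.getmember("RS2_demo/product.xml")
    print(member.name, member.size)
```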


CheekyBeeswaxer · Accepted Answer · 2022-07-31 20:06:34Z

You don't want to be iterating over the entire tar with getnames(), getmember() or getmembers(), because as soon as you find your file, you don't need to keep looking through the rest of the tar.

For example, it takes my machine about 47 ms to extract a single file from a 2 GB tar by iterating over all the file names:

import tarfile

with tarfile.open('/tmp/2GB-file.tar', mode='r:') as tar:
    # getnames() reads every header in the archive before the filter runs
    membername = [x for x in tar.getnames() if x.endswith('myfile.txt')][0]
    file = tar.extractfile(membername).read().decode()

But stopping as soon as the file is found takes me only 0.27 ms, nearly 175x faster.

file = None
with tarfile.open('/tmp/2GB-file.tar', mode='r:') as tar:
    for member in tar:
        if member.name.endswith('myfile.txt'):
            file = tar.extractfile(member).read().decode()
            break

Note that if the file you need is nearer the end of the archive, you probably won't notice much of a change in speed, but it is still good practice not to loop through the whole archive if you don't have to.
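
The early-exit loop could also be wrapped in a small function. This is only a sketch: read_first_match is a hypothetical name, and the demo archive is built in memory with made-up contents. It compares os.path.basename() instead of using endswith(), since endswith('myfile.txt') would also match a name like notmyfile.txt, and it skips non-file members because extractfile() returns None for those.

```python
import io
import os
import tarfile


def read_first_match(tar, basename):
    """Return the decoded text of the first regular file with this basename.

    Hypothetical helper for illustration. Returns None if nothing matches.
    """
    for member in tar:  # iterating a TarFile reads headers one at a time
        if member.isfile() and os.path.basename(member.name) == basename:
            return tar.extractfile(member).read().decode()
    return None


# Build a tiny in-memory archive (demo data only, stands in for a real tar).
demo = io.BytesIO()
with tarfile.open(fileobj=demo, mode="w") as out:
    for name, payload in [
        ("dir/notmyfile.txt", b"other"),
        ("dir/myfile.txt", b"wanted"),
        ("dir/later.txt", b"unused"),
    ]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        out.addfile(info, io.BytesIO(payload))
demo.seek(0)

with tarfile.open(fileobj=demo, mode="r:") as tar:
    text = read_first_match(tar, "myfile.txt")
    # A name that is not in the archive forces a full pass and yields None.
    missing = read_first_match(tar, "absent.txt")

print(text)
```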
