1

I am trying to grab a single file from a tar archive. Using the tarfile library, I can list the members that have the right extension, following the example in its documentation:

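def xml_member_files(self, members):
    # Yield only the archive members whose name ends in .xml (needs `import os`).
    for tarinfo in members:
        if os.path.splitext(tarinfo.name)[1] == ".xml":
            yield tarinfo


    # Elsewhere in the same class, with `tar` an open tarfile.TarFile:
    member_file = self.xml_member_files(tar)
    for m in member_file:
        print(m.name)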

This is great and the output is:

RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutBeta.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutGamma.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutSigma.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/product.xml

If I look for just product.xml, though, it doesn't work. I tried this:

    ti = tar.getmember('product.xml')
    print(ti.name)

It doesn't find product.xml, I'm guessing because of the directory path stored in front of the file name. I have no idea how to retrieve just that path so I can get at product.xml once it is extracted (it feels like I am doing things the hard way anyway). How do I figure out that path so I can join it with my other file functions to read and load the XML file, given that it is the only file I extract from the tar?

1
  • Please review my answer below, and upvote or mark it as accepted if it helped you in thinking through the problem. Commented Dec 20, 2016 at 17:38

3 Answers

3

Get the full path by iterating over the result of getnames(). For example, to get the full path of lutBeta.xml:

import os
import tarfile
tar = tarfile.open('mytarfile.tar')
membername = [x for x in tar.getnames() if os.path.basename(x) == 'lutBeta.xml'][0]
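Once the full stored path is known, it can be passed straight to extractfile() to read the file without unpacking anything to disk. A minimal sketch of that idea follows, assuming an in-memory demo archive with made-up contents; find_member_name and build_demo_archive are hypothetical helpers, not part of tarfile.

```python
import io
import os
import tarfile


def find_member_name(tar, basename):
    """Return the first stored path whose last component is `basename`, or None."""
    for name in tar.getnames():
        if os.path.basename(name) == basename:
            return name
    return None


def build_demo_archive():
    """Build a tiny in-memory tar with made-up contents, for demonstration only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path, data in [("scene/lutBeta.xml", b"<lut/>"),
                           ("scene/product.xml", b"<product/>")]:
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=build_demo_archive(), mode="r") as tar:
    name = find_member_name(tar, "product.xml")  # full path inside the archive
    xml_bytes = tar.extractfile(name).read()     # file contents as bytes
```

Returning None instead of indexing with [0] avoids an IndexError when the archive has no matching file.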

1

I would first try TarFile.getnames(), which I imagine works a lot like tar tzf filename.tar.gz on the command line. Then you can find out which full path to pass to getmember(), or to look for in the result of getmembers().
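As a rough illustration of that two-step approach, the sketch below lists the names, picks the full path that ends in product.xml, and hands it to getmember(). The archive is built in memory with made-up contents (the directory name SOME_SCENE_DIR is hypothetical), and parsing with xml.etree.ElementTree is just one assumed way of loading the XML afterwards.

```python
import io
import tarfile
import xml.etree.ElementTree as ET

# Build a small in-memory archive so the sketch runs on its own (made-up contents).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as out:
    data = b"<product><name>demo</name></product>"
    info = tarfile.TarInfo(name="SOME_SCENE_DIR/product.xml")
    info.size = len(data)
    out.addfile(info, io.BytesIO(data))
buf.seek(0)

with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()  # every stored path, directory prefix included
    # Pick the stored path whose last component is product.xml.
    full_path = next(n for n in names
                     if n == "product.xml" or n.endswith("/product.xml"))
    member = tar.getmember(full_path)  # succeeds because the full path is used
    root = ET.parse(tar.extractfile(member)).getroot()
```

getmember('product.xml') on its own raises KeyError here, because the name stored in the archive includes the directory prefix.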

1

You don't want to be iterating over the entire tar with getnames(), getmember() or getmembers(), because as soon as you find your file, you don't need to keep looking through the rest of the tar.

For example, it takes my machine about 47 ms to extract a single file from a 2 GB tar by iterating over all the file names:

import tarfile

with tarfile.open('/tmp/2GB-file.tar', mode='r:') as tar:
    # getnames() reads every header in the archive before the list is filtered
    membername = [x for x in tar.getnames() if x.endswith('myfile.txt')][0]
    file = tar.extractfile(membername).read().decode()

But stopping as soon as the file is found takes me only 0.27 ms, nearly 175x faster.

import tarfile

file = None
with tarfile.open('/tmp/2GB-file.tar', mode='r:') as tar:
    for member in tar:  # headers are read one at a time
        if member.name.endswith('myfile.txt'):
            file = tar.extractfile(member).read().decode()
            break  # stop scanning once the file is found

Note that if the file you need is closer to the end of the archive, you probably won't notice much of a change in speed, but it is still good practice not to loop through the whole file if you don't have to.
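The early-exit loop can also be wrapped in a small function. The sketch below is one possible shape, not a definitive implementation: read_first_match is a hypothetical helper, the archive is a made-up in-memory one, and it compares base names rather than using endswith() so that a file such as notmyfile.txt is not matched by accident.

```python
import io
import posixpath
import tarfile


def read_first_match(tar, basename):
    """Return the bytes of the first regular file named `basename`, or None.

    Hypothetical helper, not part of tarfile.
    """
    for member in tar:  # iteration yields members as their headers are read
        # Names inside a tar use "/" separators, hence posixpath.
        if member.isfile() and posixpath.basename(member.name) == basename:
            return tar.extractfile(member).read()  # returning ends the scan
    return None


# Tiny in-memory archive (made-up contents) so the sketch runs on its own.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as out:
    for path, data in [("a/notmyfile.txt", b"decoy"), ("a/myfile.txt", b"wanted")]:
        info = tarfile.TarInfo(name=path)
        info.size = len(data)
        out.addfile(info, io.BytesIO(data))
buf.seek(0)

with tarfile.open(fileobj=buf, mode="r:") as tar:
    found = read_first_match(tar, "myfile.txt")
```

With endswith('myfile.txt') the decoy entry would have been returned first; the base-name comparison skips it.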
