13

I am working on a Python project in which I need to extract only one subfolder of a tar archive, not all of its files. I tried:

tar = tarfile.open(tarfile)
tar.extract("dirname", targetdir)

But this does not work: it does not extract the given subdirectory, and no exception is thrown. I am a beginner in Python. Also, if the above function doesn't work for directories, what's the difference between this command and tar.extractfile()?

1
  • extractfile() doesn't write anything to disk; it just returns a file-like Python object for reading the member's contents. extract() writes the member to disk. Commented Nov 4, 2011 at 12:14
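To make the distinction concrete, here is a minimal sketch rather than a definitive recipe. The archive is built in memory and the names in it (dirname, hello.txt) are made up for illustration. It also suggests why the code in the question appears to do nothing: extract() on a directory member creates the directory itself but not the files stored beneath it.

```python
import io
import os
import tarfile
import tempfile

# Build a tiny archive in memory: a directory entry plus one file inside it.
# (The names "dirname" and "hello.txt" are illustrative assumptions.)
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    dir_info = tarfile.TarInfo("dirname")
    dir_info.type = tarfile.DIRTYPE
    dir_info.mode = 0o755
    tar.addfile(dir_info)

    data = b"hello\n"
    file_info = tarfile.TarInfo("dirname/hello.txt")
    file_info.size = len(data)
    tar.addfile(file_info, io.BytesIO(data))
buf.seek(0)

with tarfile.open(fileobj=buf) as tar, tempfile.TemporaryDirectory() as targetdir:
    # extractfile() returns a readable file object; nothing touches the disk.
    content = tar.extractfile("dirname/hello.txt").read()
    on_disk_after_extractfile = os.listdir(targetdir)

    # extract() writes the named member only. For a directory member that
    # means the directory itself, not the files stored beneath it.
    tar.extract("dirname", targetdir)
    dir_contents = os.listdir(os.path.join(targetdir, "dirname"))
```

On recent Python versions, calling extract() without a filter argument may emit a DeprecationWarning; that does not change what is written to disk here.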

3 Answers

23

Building on the second example from the tarfile module documentation, you could extract the contained sub-folder and all of its contents with something like this:

import tarfile

with tarfile.open("sample.tar") as tar:
    # Keep only the members whose path lies under "subfolder/".
    subdir_and_files = [
        tarinfo for tarinfo in tar.getmembers()
        if tarinfo.name.startswith("subfolder/")
    ]
    tar.extractall(members=subdir_and_files)

This builds a list of every member under the subfolder and passes it to extractall(), the recommended method, so that only those members are extracted. Replace "subfolder/" with the actual path of the sub-folder you want, relative to the root of the tar file. Keep the trailing slash so that similarly named siblings (for example "subfolder2/") are not matched.
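As a rough illustration of the selection step, the sketch below wraps the same prefix test in a helper and runs it against a small in-memory archive. The helper name select_subdir and the file names are hypothetical, not part of the tarfile module. Unlike the list comprehension above, it also matches a member named exactly like the directory, if the archive stores one.

```python
import io
import tarfile

def select_subdir(tar, subdir):
    """Return the members at or below `subdir` (hypothetical helper)."""
    subdir = subdir.rstrip("/")
    return [
        m for m in tar.getmembers()
        if m.name == subdir or m.name.startswith(subdir + "/")
    ]

# Build a small archive in memory with made-up file names.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name in ("subfolder/a.txt", "subfolder2/b.txt", "other.txt"):
        info = tarfile.TarInfo(name)
        info.size = 1
        tar.addfile(info, io.BytesIO(b"x"))
buf.seek(0)

with tarfile.open(fileobj=buf) as tar:
    # "subfolder2/b.txt" is not selected because the slash is part of the test.
    selected_names = [m.name for m in select_subdir(tar, "subfolder")]
```

The resulting list can be passed to extractall(members=...) exactly as in the answer.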


20

The other answer will retain the subfolder path, meaning that subfolder/a/b will be extracted to ./subfolder/a/b. To extract a subfolder to the root, so subfolder/a/b would be extracted to ./a/b, you can rewrite the paths with something like this:

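import tarfile

def members(tf):
    # Yield only the members under "subfolder/", with that prefix removed
    # from their paths so they extract relative to the target root.
    l = len("subfolder/")
    for member in tf.getmembers():
        if member.path.startswith("subfolder/"):
            member.path = member.path[l:]
            yield member

with tarfile.open("sample.tar") as tar:
    tar.extractall(members=members(tar))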

5 Comments

Works great. You can also rename the top-level folder with this style by doing member.path = os.path.join('new_dirname', member.path[l:])
Great tip. Having tarfile extracted with every useless subdirectory really bugged me.
This works great. Unfortunately, I skipped this answer before by only looking at the best answer.
I will delete it. Sorry. I was hoping for an answer to a very similar issue. By the way, this solution is not working for me. I'm getting an error: [Pyright reportGeneralTypeIssues] [E] Argument of type "Generator[TarInfo, None, None]" cannot be assigned to parameter "members" of type "List[TarInfo] | None" in function "extractall" Type "Generator[TarInfo, None, None]" cannot be assigned to type "List[TarInfo] | None" "Generator[TarInfo, None, None]" is incompatible with "List[TarInfo]" Cannot assign to "None"
Easier to open a new question than leaving a comment with not enough information. The code works correctly if you run it in Python. The error you're showing is a static typing error, not something that will stop the code from functioning. Fix for that error: github.com/python/typeshed/pull/5273
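If the type checker complaint is a nuisance, one possible workaround (a sketch, not the only option) is to materialize the generator into a list before passing it to extractall(). The function name extract_stripped and the prefix parameter are introduced here for illustration and are not part of the original answer.

```python
import tarfile

def members(tf, prefix="subfolder/"):
    # Same idea as the answer: keep members under `prefix`, strip the prefix.
    for member in tf.getmembers():
        if member.path.startswith(prefix):
            member.path = member.path[len(prefix):]
            yield member

def extract_stripped(archive_path, dest="."):
    """Extract the contents of "subfolder/" directly into `dest` (hypothetical helper)."""
    with tarfile.open(archive_path) as tar:
        # list(...) turns the generator into a list of TarInfo objects,
        # which matches the older, stricter type annotation.
        tar.extractall(path=dest, members=list(members(tar)))
```

Calling extract_stripped("sample.tar", "out") would then place subfolder/a/b at out/a/b. Runtime behavior should be the same as with the generator; only the static type differs.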
2

The problem with all of the other solutions is that they need to read to the end of the archive before extracting anything (getmembers() scans the whole file), which means that they cannot be applied to a stream that does not support seeking.

Starting with Python 3.11.4 (I haven't found a way with earlier versions):

import pathlib
import tarfile

strip1 = lambda member, path: member.replace(name=str(pathlib.Path(*pathlib.Path(member.path).parts[1:])))
with tarfile.open('file.tar.gz', mode='r:gz') as tar:
    tar.extractall(path=dest, filter=strip1)  # dest is the target directory

extractall() accepts a filter callable that is invoked for each member with its TarInfo and the destination path. Here the filter splits the member's path into parts, drops the first one, and reassembles the rest into the new name.
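A sketch of how this could be combined with a non-seekable stream follows, under a few assumptions: the Python version supports extraction filters, the archive is opened in one of tarfile's stream modes ("r|*"), and the helper names make_subdir_filter and extract_subdir_from_stream are made up for this example. The filter returns None for members outside the wanted sub-folder, which skips them, and chains the built-in tarfile.data_filter so the usual safety checks still apply. Link targets (linkname) are not rewritten here.

```python
import io
import tarfile

def make_subdir_filter(subdir):
    """Build a filter that keeps members under `subdir` and strips that prefix."""
    prefix = subdir.rstrip("/") + "/"

    def subdir_filter(member, dest_path):
        if not member.name.startswith(prefix):
            return None  # returning None skips this member
        stripped = member.replace(name=member.name[len(prefix):])
        # Chain the standard "data" filter so unsafe members are still rejected.
        return tarfile.data_filter(stripped, dest_path)

    return subdir_filter

def extract_subdir_from_stream(fileobj, subdir, dest):
    # "r|*" reads the archive strictly sequentially, so fileobj only needs
    # a read() method and does not have to support seeking.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        tar.extractall(path=dest, filter=make_subdir_filter(subdir))
```

For example, extract_subdir_from_stream(sys.stdin.buffer, "subfolder", "out") would be one way to unpack a piped archive, assuming the data arrives on standard input.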
