3

I need to iterate through a folder and find every instance where the filenames are identical (except for extension) and then zip (preferably using tarfile) each of these into one file.

So I have 5 files named: "example1" each with different file extensions. I need to zip them up together and output them as "example1.tar" or something similar.

This would be easy enough with a simple for loop such as:

import tarfile
from glob import glob

tar = tarfile.open('example1.tar', "w")
for output in glob('example1*'):
    tar.add(output)
tar.close()

However, there are 300 "example" files, and I need to iterate through each one and its 5 associated files to make this work. This is way over my head. Any advice is greatly appreciated.

6 Answers

2

The pattern you're describing generalizes to MapReduce. I found a simple implementation of MapReduce online, from which an even-simpler version is:

def map_reduce(data, mapper, reducer):
    d = {}
    for elem in data:
        key, value = mapper(elem)
        d.setdefault(key, []).append(value)
    for key, grp in d.items():
        d[key] = reducer(key, grp)
    return d

You want to group all files by their name without the extension, which you can get from os.path.splitext(fname)[0]. Then, you want to make a tarball out of each group by using the tarfile module. In code, that is:

import os
import tarfile

def make_tar(basename, files):
    tar = tarfile.open(basename + '.tar', 'w')
    for f in files:
        tar.add(f)
    tar.close()

map_reduce(os.listdir('.'),
           lambda x: (os.path.splitext(x)[0], x),
           make_tar)
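
To see what the grouping step produces before any archives are written, the same map_reduce can be run on a plain list of names with a reducer that simply returns the group. This is only a sketch: the file names and the `names` and `groups` variables are made up for illustration.

```python
import os

def map_reduce(data, mapper, reducer):
    d = {}
    for elem in data:
        key, value = mapper(elem)
        d.setdefault(key, []).append(value)
    for key, grp in d.items():
        d[key] = reducer(key, grp)
    return d

# Hypothetical file names, standing in for os.listdir('.')
names = ['example1.shp', 'example1.dbf', 'example2.shp', 'example2.prj']

# The reducer here only sorts and returns each group instead of writing a tarball
groups = map_reduce(names,
                    lambda x: (os.path.splitext(x)[0], x),
                    lambda key, grp: sorted(grp))
print(groups)
```

With make_tar as the reducer, the returned dictionary holds None for every key, because make_tar does its work as a side effect and returns nothing.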

Edit: If you want to group files in different ways, you just need to modify the second argument to map_reduce. The code above groups files that have the same value for the expression os.path.splitext(x)[0]. So to group by the base file name with all the extensions stripped off, you could replace that expression with strip_all_ext(x) and add:

def strip_all_ext(path):
    head, tail = os.path.split(path)
    basename = tail.split(os.extsep)[0]
    return os.path.join(head, basename)
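
For a name with several extensions, such as 'foobar.aux.xml', the two grouping keys give different results. A quick check, assuming nothing beyond the standard library (the variable names `one_ext` and `all_ext` are illustrative):

```python
import os

def strip_all_ext(path):
    head, tail = os.path.split(path)
    basename = tail.split(os.extsep)[0]
    return os.path.join(head, basename)

# os.path.splitext removes only the last extension
one_ext = os.path.splitext('foobar.aux.xml')[0]
# strip_all_ext removes everything after the first separator
all_ext = strip_all_ext('foobar.aux.xml')
print(one_ext, all_ext)
```

Note that a name beginning with a separator, such as '.bashrc', reduces to an empty base name under strip_all_ext, so hidden files may need to be filtered out first.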

1 Comment

Is there any way to alter this code, or use os.path.extsep, to split multiple extensions off one file, e.g. 'foobar.aux.xml'?
2

You could do this:

  • list all files in the directory
  • create a dictionary where the basename is the key and all the extensions are values
  • then tar all the files by dictionary key

Something like this:

import os
import tarfile
from collections import defaultdict

myfiles = os.listdir(".")   # List of all files
totar = defaultdict(list)

# now fill the defaultdict with entries; basename as keys, extensions as values
for name in myfiles:
    base, ext = os.path.splitext(name)
    totar[base].append(ext)

# iterate through all the basenames
for base in totar:
    files = [base+ext for ext in totar[base]]
    # now tar all the files in the list "files"
    tar = tarfile.open(base+".tar", "w")
    for item in files:    
        tar.add(item)
    tar.close()

1

You have two problems. Solve them separately.

  1. Finding matching names. Use a collections.defaultdict

  2. Creating tar files after you find the matching names. You've got that pretty well covered.

So. Solve problem 1 first.

Use glob to get all the names. Use os.path.basename to strip the directory part from each path. Use os.path.splitext to split the name and extension.

A dictionary of lists can be used to save all files that have the same name.
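
The steps for problem 1 could be sketched roughly as follows. The function name `group_by_stem` and its `pattern` parameter are made up for this example, and skipping directories is an added assumption rather than something the question asks for.

```python
import glob
import os
from collections import defaultdict

def group_by_stem(pattern):
    """Map each extension-less file name to the list of matching paths."""
    groups = defaultdict(list)
    for path in glob.glob(pattern):
        if not os.path.isfile(path):
            continue  # assumption: directories are not wanted in the archives
        # basename drops the directory part, splitext drops the last extension
        stem, _ext = os.path.splitext(os.path.basename(path))
        groups[stem].append(path)
    return groups
```

Calling `group_by_stem('*')` in the folder from the question would then give one entry per "example" name, each holding its files.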

Is that what you're doing in part 1?


Part 2 is putting the files into tar archives. For that, you've got most of the code you need.
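
For part 2, one possible shape is a function that takes a dictionary of lists like the one described above and writes one archive per key. The names `write_archives`, `groups` and `out_dir` are hypothetical, and writing the archives to a separate output directory is an assumption made so that the new .tar files are not mixed in with the inputs.

```python
import os
import tarfile

def write_archives(groups, out_dir):
    """Write one <stem>.tar per group into out_dir and return the archive paths."""
    created = []
    for stem, paths in groups.items():
        archive = os.path.join(out_dir, stem + '.tar')
        # The with-block closes the archive even if add() raises
        with tarfile.open(archive, 'w') as tar:
            for path in paths:
                # arcname stores only the file name, not the full source path
                tar.add(path, arcname=os.path.basename(path))
        created.append(archive)
    return created
```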

0

Try using the glob module: http://docs.python.org/library/glob.html
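
One way to extend the loop from the question with glob is to collect the distinct base names first and then glob once per name. This is a rough sketch under a few assumptions: the function name `tar_by_glob` is made up, existing .tar files are ignored, and names without any extension are skipped. Using the pattern 'example1.*' rather than 'example1*' keeps example1 from also picking up files such as example10.shp.

```python
import glob
import os
import tarfile

def tar_by_glob(directory):
    """For each distinct stem in directory, tar every file named <stem>.<ext>."""
    stems = {os.path.splitext(name)[0]
             for name in os.listdir(directory)
             if os.path.isfile(os.path.join(directory, name))
             and not name.endswith('.tar')}
    created = []
    for stem in sorted(stems):
        # glob.escape guards against names containing *, ? or [
        pattern = os.path.join(glob.escape(directory), glob.escape(stem) + '.*')
        members = [p for p in glob.glob(pattern) if not p.endswith('.tar')]
        if not members:
            continue  # e.g. a file with no extension at all
        with tarfile.open(os.path.join(directory, stem + '.tar'), 'w') as tar:
            for path in members:
                tar.add(path, arcname=os.path.basename(path))
        created.append(stem)
    return created
```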

0

#! /usr/bin/env python

import os
import tarfile

tarfiles = {}
for f in os.listdir ('files'):
    prefix = f [:f.rfind ('.') ]
    if prefix in tarfiles: tarfiles [prefix] += [f]
    else: tarfiles [prefix] = [f]

for k, v in tarfiles.items ():
    tf = tarfile.open ('%s.tar.gz' % k, 'w:gz')
    # file() exists only in Python 2, and a bare TarInfo has size 0, so add the file by path instead
    for f in v: tf.add ('files/%s' % f, arcname=f)
    tf.close ()

6 Comments

@Hyperboreus: -1 ... f = 'fubar'; prefix = f [:f.rfind ('.') ] produces 'fuba' ... use os.path.splitext()
@Hyperboreus: while you are at it, lose the ugly spaces before [ in slices and dict accesses and ( in function calls
@Hyperboreus: - thanks for your help. When using the above code I ended up with a .tar of each file instead of each unique filename? Thoughts? @John Machin: not sure about your os.path.splitext() reference.
@KennyC: It is all about using os.path.splitext() to remove the extension (if any) from the end of the path, which is the right thing to do and is used by 3 of the answers. If there is no extension, it returns the input unchanged. However, the gimmick code used by @Hyperboreus FAILS; it removes the last character (fubar -> fuba).
@KennyC: Not taking into consideration filenames without periods (my bad, but others already pointed out how to do this properly), the script packs tar.gz files, grouping the files by their name. Here is an example output:
-1

import os
import tarfile

allfiles = {}

for filename in os.listdir("."):
    basename = '.'.join (filename.split(".")[:-1] )
    if basename not in allfiles:
        allfiles[basename] = [filename]
    else:
        allfiles[basename].append(filename)

for basename, filenames in allfiles.items():
    if len(filenames) < 2:
        continue
    tardata = tarfile.open(basename+".tar", "w")
    for filename in filenames:
        tardata.add(filename)
    tardata.close()

1 Comment

-1 Use os.path.splitext() -- '.'.join ('fubar'.split(".")[:-1]) produces an empty string.
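
The difference between the three ways of stripping an extension discussed in these comments can be checked directly. A small comparison; the sample names and the `results` dictionary are chosen only for illustration:

```python
import os

results = {}
for name in ['fubar', 'fubar.txt', 'foobar.aux.xml']:
    by_rfind = name[:name.rfind('.')]           # rfind returns -1 when there is no dot
    by_join = '.'.join(name.split('.')[:-1])    # drops the only element when there is no dot
    by_splitext = os.path.splitext(name)[0]     # leaves a dot-less name unchanged
    results[name] = (by_rfind, by_join, by_splitext)
    print(name, results[name])
```

All three agree whenever the name contains a dot; they differ only for names without one, which is exactly where the slicing and joining versions go wrong.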
