Extract text files into multiple columns in Python

Question

I have several text files and I want to extract their values into a CSV file. Each file has the following format:

main cost: 30
additional cost: 5

I managed to do that, but I want the values from each file to go into a separate column. I also want the number of text files to be a command-line argument.

This is what I'm doing now:

  numFiles = sys.argv[1]
  d = [[] for x in xrange(numFiles+1)]
  for i in range(numFiles): 
      filename = 'mytext' + str(i) + '.text'
      with open(filename, 'r') as in_file:
      for line in in_file:
        items = line.split(' : ')
        num = items[1].split('\n')

        if i ==0:
            d[i].append(items[0])

        d[i+1].append(num[0])

        grouped = itertools.izip(*d[i] * 1)
        if i == 0:
            grouped1 = itertools.izip(*d[i+1] * 1)

        with open(outFilename, 'w') as out_file:
            writer = csv.writer(out_file)
            for j in range(numFiles):
                for val in itertools.izip(d[j]):
                    writer.writerow(val)

This is what I'm getting now, with everything in one column:

main cost   
additional cost   
30   
5   
40   
10

And I want it to be

main cost        | 30  | 40
additional cost  | 5   | 10

Where does the last column come from in the desired output? Are there only two lines in each input file?
– wwii, Commented Jul 29, 2016 at 22:57
I'm assuming the input file looks something like: main cost: 30 additional cost: 5 main cost: 40 additional cost: 10
– Michael, Commented Jul 29, 2016 at 22:57

Community · Accepted Answer · 2017-05-23 11:44:23Z

2

You could use a dictionary for this, where each key is the "header" you want to use and each value is a list of the numbers collected for that header.

So it would look like someDict = {'main cost': [30,40], 'additional cost': [5,10]}

edit2: Went ahead and cleaned up this answer so it makes a little more sense.

You can build the dictionary and iterate over it like this:

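from collections import OrderedDict

# Stand-in for the lines read from the input files
in_file = ['main cost : 30', 'additional cost : 5', 'main cost : 40', 'additional cost : 10']
someDict = OrderedDict()

for line in in_file:
    # Unpack the label and the number on either side of the separator
    key, val = line.split(' : ')
    num = int(val)
    if key not in someDict:
        someDict[key] = []  # first time this label is seen

    someDict[key].append(num)

# Print each label followed by all the values collected for it
for key in someDict:
    print(key)
    for value in someDict[key]:
        print(value)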

The code outputs:

main cost
30
40
additional cost
5
10

Should be pretty straightforward to modify the example to fit your desired output.
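
One way to get from that dictionary to the layout in the question is to write one CSV row per key, with the label first and one value per file after it. The sketch below is only an illustration: rows_from_dict and write_rows are hypothetical helper names, and it writes to an in-memory buffer rather than a real file.

```python
import csv
import io
from collections import OrderedDict


def rows_from_dict(some_dict):
    """Turn {'label': [v1, v2, ...]} into rows like ['label', v1, v2, ...]."""
    return [[key] + list(values) for key, values in some_dict.items()]


def write_rows(rows, out_file):
    """Write each row as one CSV line: label first, then one column per file."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerows(rows)


someDict = OrderedDict()
someDict["main cost"] = [30, 40]
someDict["additional cost"] = [5, 10]

buffer = io.StringIO()  # stand-in for open(outFilename, "w", newline="")
write_rows(rows_from_dict(someDict), buffer)
print(buffer.getvalue())
```

Opened in a spreadsheet, each comma-separated value lands in its own column, which matches the "label | value | value" layout asked for.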

I based this on the example in "append multiple values for one key in Python dictionary", and thanks to @wwii for some suggestions.

I used an OrderedDict since a plain dictionary won't keep keys in insertion order on Python 2 (plain dicts only guarantee that from Python 3.7 onward).

You can run my example @ https://ideone.com/myN2ge

edited May 23, 2017 at 11:44

CommunityBot


answered Jul 29, 2016 at 21:55

Michael


5 Comments

wwii Over a year ago

For this solution, you can be sure that there are only two keys, so you could construct the dictionary before-hand with those two keys and an empty list for values - then you can get rid of the if/else for the dictionary assignment. Alternatively if you are not sure about the keys beforehand you could use collections.defaultdict.

wwii Over a year ago

When you split text and plan on using the individual items later in your code, it is nice to give them names - it makes subsequent code easier to read. Take advantage of unpacking: in this case something like - key, value = line.split(':') ; value = value.strip()

Michael Over a year ago

Both great examples. For the first, I would probably keep it my way so in the future the file formats can change without having to modify the code. I agree with your second example.

wwii Over a year ago

Play around with collections.defaultdict, it solves the problem of trying to assign to a missing key without using if/thens or try/excepts.

Michael Over a year ago

That works as well unless you want to use an OrderedDict, which is probably what OP wants. Otherwise, it won't always output in the same order. I'll edit my example to include your first suggestion though. It's much easier to read that way.
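
The two suggestions from this thread (unpacking the split result into named variables, and collections.defaultdict) could be combined roughly as follows. This is a sketch rather than the answerer's code, and group_values is a hypothetical name. On Python 3.7 and later a defaultdict keeps insertion order like any dict, so the ordering concern applies mainly to older versions.

```python
from collections import defaultdict


def group_values(lines):
    """Group the value of each 'label: value' line under its label."""
    grouped = defaultdict(list)  # a missing key starts out as an empty list
    for line in lines:
        key, value = line.split(":", 1)  # unpack into named parts
        grouped[key.strip()].append(int(value.strip()))
    return grouped


lines = ["main cost: 30", "additional cost: 5", "main cost: 40", "additional cost: 10"]
result = group_values(lines)
print(dict(result))
```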

beroe · Accepted Answer · 2016-07-30 06:06:11Z

0

This is how I might do it. It assumes the fields are the same in all the files. Make a list of names, and a dictionary that uses those field names as keys and lists of values as the entries. Instead of naming file1.text, file2.text, etc. individually, run the script with file*.text as a command-line argument.

#!/usr/bin/env python

import sys

if len(sys.argv) < 2:
    print("Give file names to process, with wildcards")
else:
    FileList = sys.argv[1:]
    FileNum = 0
    outFilename = "myoutput.dat"
    NameList = []   # field names, in the order they appear in the first file
    ValueDict = {}  # field name -> list of values, one per input file
    for InfileName in FileList:
        with open(InfileName, 'r') as Infile:
            for Line in Infile:
                Line = Line.strip()
                if ":" not in Line:
                    continue  # skip blank lines and anything that is not "name: value"
                Name, Value = Line.split(":", 1)
                Name = Name.strip()  # strip before using it as a key so it matches NameList
                if FileNum == 0:
                    NameList.append(Name)
                ValueDict[Name] = ValueDict.get(Name, []) + [Value.strip()]
        FileNum += 1  # the last statement in the file loop
    # print(NameList)
    # print(ValueDict)

    with open(outFilename, 'w') as out_file:
        for N in NameList:
            OutString = "{},{}\n".format(N, ",".join(ValueDict.get(N)))
            out_file.write(OutString)

Output for my four fake files was:

main cost,10,10,40,10
additional cost,25.6,25.6,55.6,25.6
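
The file*.text approach relies on the shell expanding the wildcard before Python starts. Where that does not happen (for example in the Windows command prompt), the patterns could be expanded inside the script instead. A sketch, with expand_patterns as a hypothetical helper and made-up file names in a temporary directory:

```python
import glob
import os
import tempfile


def expand_patterns(patterns):
    """Expand each wildcard pattern into a sorted list of matching paths."""
    paths = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        # Keep the literal name if nothing matched, so a later open() reports it.
        paths.extend(matches if matches else [pattern])
    return paths


# Demonstration in a temporary directory with two made-up files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("mytext0.text", "mytext1.text"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("main cost: 30\n")
    found = expand_patterns([os.path.join(tmp, "mytext*.text")])
    names = [os.path.basename(path) for path in found]
    print(names)
```

In the script above, this would amount to passing sys.argv[1:] through such a helper before the file loop.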

edited Jul 30, 2016 at 6:06

answered Jul 29, 2016 at 23:23

beroe


3 Comments

Lily Over a year ago

Thanks @beroe, but I want the output to be saved in a CSV file, with each | representing a separate column.

Lily Over a year ago

This is what I get when I try the above code: TypeError: can only join an iterable

beroe Over a year ago

Insert a line that prints ValueDict and see what it says. Each value should be a list of strings (numbers) if the data match your example. If there are blank lines or header lines, you could insert a check in the loop before the ValueDict[Name]= part...
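
To cover both follow-ups in this thread, a check for blank or header lines and real CSV output, something like the sketch below might work. The helper names (parse_line, collect, write_csv) and the in-memory sample data are assumptions for illustration; the csv module calls are from the standard library. Stripping the name before using it as a dictionary key also avoids a mismatch between the stored key and the looked-up name when the input has a space before the colon.

```python
import csv
import io


def parse_line(line):
    """Return (name, value) for a 'name: value' line, or None for anything else."""
    if ":" not in line:
        return None  # blank line, or a header without a colon
    name, value = line.split(":", 1)
    return name.strip(), value.strip()


def collect(files):
    """files: iterable of iterables of lines. Returns (names, values_by_name)."""
    names, values = [], {}
    for lines in files:
        for line in lines:
            parsed = parse_line(line)
            if parsed is None:
                continue
            name, value = parsed
            if name not in values:
                names.append(name)  # remember first-seen order
                values[name] = []
            values[name].append(value)
    return names, values


def write_csv(names, values, out_file):
    """Write one row per name: the name, then one column per input file."""
    writer = csv.writer(out_file, lineterminator="\n")
    for name in names:
        writer.writerow([name] + values[name])


# Made-up sample data: one file with a blank line, one with a header line.
file_a = ["main cost : 30\n", "\n", "additional cost : 5\n"]
file_b = ["Costs\n", "main cost : 40\n", "additional cost : 10\n"]
out = io.StringIO()  # stand-in for open("myoutput.csv", "w", newline="")
write_csv(*collect([file_a, file_b]), out)
print(out.getvalue())
```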
