
I have a set of text files that I am trying to sort and get an output from. The idea is this: I have two files containing slightly matching data, e.g.

File 1:
000892834     13.663      0.098      0.871      0.093      0.745      4.611       4795

File 2:
892834  4916   75   37  4857 130 128  4795  4.61 -0.09    0 0

The main factor in matching both is the first number which is an ID number that does not change in either file save for the 000 in the front of file 1. I need to search both files, extract the rows that match that ID and output the results to a text file in which I can display the results side-by-side, such as:

output:
000892834     13.663      0.098      0.871      0.093      0.745      4.611       4795
892834  4916   75   37

The second part of the above output is not a typo: I also need the script to drop everything after the fourth data point in file 2 for each row. I have been debating whether I should read these two files into lists first and then work through them with list comprehensions, or whether it would be better to use something like a csv reader. Thanks for any help you can give.

EDIT:

1) The ID numbers in the two lists are not all the same: list 2 has IDs that list 1 doesn't, and vice versa.
2) I also need to filter out entire rows of data if certain parts do not meet a criterion; for example, if column 2 in a certain row does not meet a requirement, then that line is disregarded.
3) I just found out that I need to omit any IDs that are not in both file1 and file2, so an ID like the one above that is present in both needs to be included; otherwise it must be left out of the final text file.

example:

for mergedData[(a, b, c), (e, f, g), .....]: 
    if mergedData[(a, e, (all first sub-indices))] > 15
        <delete the entire line from the .txt file> and/or <create a new text file containing only lines that meet the criteria>
  • Is the data in the example files all on one line, like you seem to show it? (Jul 3, 2013 at 16:06)
  • Split your tasks: first read the files and parse them into suitable data types, then do whatever you need with the data types you are holding. (Jul 3, 2013 at 16:06)
  • In your recent edit, you talk about filtering out rows of data if they do not meet certain criteria. Can you give us an example? (Jul 3, 2013 at 18:33)
  • What do you mean by (a, b, c), (e, f, g), ...? (Jul 3, 2013 at 18:55)
  • @ImmortalxR Okay, I need more clarification on "each second number in each row". Do you mean every second number, i.e. the second, the fourth, the sixth, etc.? Also, you keep saying "row" when I think you are talking about what I understand to be a "column." A "row" would be a line in the file. When you say "rows of individual numbers" I'm thinking you mean each separate number in the line... I do have an answer for #3, though. (Jul 3, 2013 at 19:02)

1 Answer


Assuming your files are a series of lines, each line looking something like what you wrote, i.e.

000892834     13.663      0.098      0.871      0.093      0.745      4.611       4795

Then you can strip out the leading 0s with lstrip(). When you read the file you don't get integers, you get strings, so you have to strip the 0 characters yourself. (Alternatively, you could cast the number with leading 0s to an integer and then cast it back to a string to write it again, but you don't need to.)
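A quick sketch of both approaches, using the ID string from your sample row:

```python
raw_id = "000892834"

# lstrip('0') removes every leading '0' character from the string
print(raw_id.lstrip("0"))   # 892834

# the int round trip mentioned above yields the same string
print(str(int(raw_id)))     # 892834
```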

Use a dictionary to pair the lines by the ID: the ID is the key, and the value is a list in which you store the line from the first file and the line from the second file.

mergedData = {}
with open('file1.txt', 'r') as file1, open('file2.txt', 'r') as file2, open('mergedData.txt', 'w') as outfile:
    for line in file1:
        # rstrip('\n') keeps the trailing newline out of the joined output later
        mergedData[line.split()[0].lstrip('0')] = [line.rstrip('\n')]
    for line in file2:
        mergedData[line.split()[0]].append(" ".join(line.split()[:4]))
    for k in mergedData:
        outfile.write("\n".join(mergedData[k]) + "\n")

If your data has keys in the second file which are not in the first, you should use a defaultdict for mergedData instead. (This solves #1 in your edit.)

from collections import defaultdict
mergedData = defaultdict(list)
with open('file1.txt', 'r') as file1, open('file2.txt', 'r') as file2, open('mergedData.txt', 'w') as outfile:
    for line in file1:
        mergedData[line.split()[0].lstrip('0')].append(line.rstrip('\n'))
    for line in file2:
        mergedData[line.split()[0]].append(" ".join(line.split()[:4]))
    ...

If you need to write only data which meets a particular requirement, you can use filter() to keep only the elements which meet it. filter() takes a filter function which must return True if the element meets that requirement. This is a good chance to use a lambda expression for a quick inline function.

    ...
    filteredMergedData = filter(lambda x: (len(x[1]) == 2) and (float(x[1][0].split()[1]) > 15 and float(x[1][1].split()[1]) > 15), mergedData.iteritems())
    for d in filteredMergedData:
        outfile.write("\n".join(d[1]) + "\n")

That was pretty convoluted, but basically, iteritems() turns the key, value pairs in the dictionary into (key, value) tuples, and filter() iterates through them, keeping only those for which the lambda returns True. The lambda takes the value part - the list, as you recall - and checks both second columns for a value greater than 15. It has to cast those values to float because they're read in as strings, which won't compare to a number (and file 1's second column, 13.663, isn't even a valid int). In order for the subindexing to work, you also have to check that the value list contains two lines - this also takes care of #3 for you.
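Here is a toy illustration of the same idea, using .items() (the Python 3 spelling of iteritems()); the numbers are adapted from your sample rows, and the threshold is lowered to 10 so the sample survives (13.663 is not greater than 15):

```python
merged = {
    "892834": [
        "000892834 13.663 0.098 0.871",   # line kept from file 1
        "892834 4916 75 37",              # trimmed line from file 2
    ],
    "111111": ["000111111 9.999 0.001"],  # ID only in file 1 -> dropped
}

# keep only IDs present in both files whose second columns exceed 10
keep = lambda kv: (len(kv[1]) == 2
                   and float(kv[1][0].split()[1]) > 10
                   and float(kv[1][1].split()[1]) > 10)

for key, lines in filter(keep, merged.items()):
    print(key)   # only 892834 survives the filter
```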

Now, if you want to put this all together and support an arbitrary criteria and arbitrary filenames, you should put this code into a function and make it take four arguments: the three filenames, as well as a function (yes, you can take functions as arguments) to act as the filter function.

from collections import defaultdict

def mergeData(file1name, file2name, outfilename, a_filter_func):
    """ Merge the data of two files. """
    mergedData = defaultdict(list)
    with open(file1name, 'r') as file1, open(file2name, 'r') as file2, open(outfilename, 'w') as outfile:
        for line in file1:
            mergedData[line.split()[0].lstrip('0')].append(line.rstrip('\n'))
        for line in file2:
            mergedData[line.split()[0]].append(" ".join(line.split()[:4]))
        filteredMergedData = filter(a_filter_func, mergedData.iteritems())
        for d in filteredMergedData:
            outfile.write("\n".join(d[1]) + "\n")

# finally, call the function.
filter_func = lambda x: (len(x[1]) == 2) and (float(x[1][0].split()[1]) > 15 and float(x[1][1].split()[1]) > 15)
mergeData('file1.txt', 'file2.txt', 'mergedData.txt', filter_func)

Just pass something other than that lambda as filter_func if you want other criteria - you can also create a named function with def and pass that, e.g. if you have def foo(x):, pass foo as the argument. Just make sure it returns True or False.
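For example, a hypothetical named criterion equivalent to the lambda above (the name is made up; float() is used since file 1's columns are decimals):

```python
def both_second_columns_over_15(item):
    """item is a (key, lines) pair from the merged dictionary.

    Return True only when the ID appears in both files and the
    second column of each stored line exceeds 15.
    """
    key, lines = item
    return (len(lines) == 2
            and float(lines[0].split()[1]) > 15
            and float(lines[1].split()[1]) > 15)

# mergeData('file1.txt', 'file2.txt', 'mergedData.txt', both_second_columns_over_15)
```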


Edit: on second thought, the lambda-based solution requires four linear iterations. Here's an optimized (and probably simpler) version:

def mergeData(file1name, file2name, outfilename, a_filter_func):
    """ Merge the data of two files. """
    mergedData = defaultdict(list)
    with open(file1name, 'r') as file1, open(file2name, 'r') as file2, open(outfilename, 'w') as outfile:
        for line in file1:
            splt = line.split()
            if a_filter_func(splt[1]):
                mergedData[splt[0].lstrip('0')].append(line.rstrip('\n'))
        for line in file2:
            splt = line.split()
            if a_filter_func(splt[1]):
                mergedData[splt[0]].append(" ".join(splt[:4]))
        for k in mergedData:
            outfile.write("\n".join(mergedData[k]) + "\n")

Now a_filter_func receives the second column as a string, so it may be something as simple as:

lambda x: float(x) > 15

(Note the float() cast - the column is still a string at that point, and a bare x > 15 would not compare correctly.)

In my excitement at getting to use "functional programming" functions (such as filter()) I forgot that it could be simpler. This version also splits each line only once, rather than multiple times.
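To see the optimized version end to end, here is a self-contained sketch run on the two sample rows from the question. The threshold is lowered to 10 so file 1's second column (13.663) passes, and the len(...) > 1 guard mentioned in the answer comments is included so only IDs present in both files are written (edit #3):

```python
from collections import defaultdict

def mergeData(file1name, file2name, outfilename, a_filter_func):
    """Merge rows from two files by ID, keeping rows whose second column passes the filter."""
    mergedData = defaultdict(list)
    with open(file1name) as file1, open(file2name) as file2, open(outfilename, 'w') as outfile:
        for line in file1:
            splt = line.split()
            if a_filter_func(splt[1]):
                # strip the ID's leading zeros; drop the trailing newline
                mergedData[splt[0].lstrip('0')].append(line.rstrip('\n'))
        for line in file2:
            splt = line.split()
            if a_filter_func(splt[1]):
                # keep only the first four data points from file 2
                mergedData[splt[0]].append(" ".join(splt[:4]))
        for lines in mergedData.values():
            if len(lines) > 1:          # only IDs present in both files
                outfile.write("\n".join(lines) + "\n")

# sample rows from the question
with open('file1.txt', 'w') as f:
    f.write("000892834     13.663      0.098      0.871      0.093      0.745      4.611       4795\n")
with open('file2.txt', 'w') as f:
    f.write("892834  4916   75   37  4857 130 128  4795  4.61 -0.09    0 0\n")

mergeData('file1.txt', 'file2.txt', 'mergedData.txt', lambda x: float(x) > 10)

with open('mergedData.txt') as f:
    print(f.read())
```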


Comments

  • @ImmortalxR Yeah, whoops, I wrote file1.txt instead of file1.
  • @ImmortalxR Do you have keys in the first file which are not present in the second, as well?
  • @ImmortalxR I edited my answer to deal with this case; please see my other comment in response to your question with regard to the other edit you made.
  • @ImmortalxR Please give me a moment to address all of these extra requirements - I have something else I need to do. I'll be back.
  • @ImmortalxR Sure. It's been a while since I looked at this but if I recall correctly you only wanted to write out the data which had two entries. I don't think I see that in my last section of code, but it'll be as easy as an if len(mergedData[k]) > 1: before the outfile.write() part.
