Assuming your files are a series of lines, each line looking something like what you wrote, i.e.
000892834 13.663 0.098 0.871 0.093 0.745 4.611 4795
Then you can strip out the leading 0s using lstrip(). When you read the file, you don't get integers, you get strings, so you have to strip the '0' characters. (Alternatively, you could cast the number with leading 0s to an integer and then back to a string before writing it out, but you don't need to.)
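For example, both approaches produce the same result on an ID like the one in your sample line:

```python
# Removing leading zeros from an ID that was read as a string.
record_id = "000892834"

stripped = record_id.lstrip('0')   # strip the '0' characters directly
via_int = str(int(record_id))      # round-trip through int (equivalent here)

print(stripped)  # 892834
print(via_int)   # 892834
```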
Use a dictionary to pair the lines by the ID, and have its key be a list, in which you store the line from the first file and the line from the second file.
mergedData = {}
with open('file1.txt', 'r') as file1, open('file2.txt', 'r') as file2, open('mergedData.txt', 'w') as outfile:
    for line in file1:
        # Key on the ID with leading zeros removed; store the whole line.
        # rstrip('\n') drops the trailing newline so the join below
        # doesn't produce blank lines.
        mergedData[line.split()[0].lstrip('0')] = [line.rstrip('\n')]
    for line in file2:
        # The second file's IDs have no leading zeros, so no lstrip() here.
        mergedData[line.split()[0]].append(" ".join(line.split()[:4]))
    for k in mergedData:
        outfile.write("\n".join(mergedData[k]) + "\n")
If your data has keys in the second file which are not in the first, you should use a defaultdict for mergedData instead. (This solves #1 in your edit.)
from collections import defaultdict

mergedData = defaultdict(list)
with open('file1.txt', 'r') as file1, open('file2.txt', 'r') as file2, open('mergedData.txt', 'w') as outfile:
    for line in file1:
        mergedData[line.split()[0].lstrip('0')].append(line.rstrip('\n'))
    for line in file2:
        mergedData[line.split()[0]].append(" ".join(line.split()[:4]))
    ...
If you need to write only the data which meets a particular requirement, you can use filter() to keep only the elements that satisfy it. filter() takes a filter function which must return True for each element that should be kept. This is a good chance to use a lambda expression as a quick inline function.
...
filteredMergedData = filter(lambda x: len(x[1]) == 2
                            and float(x[1][0].split()[1]) > 15
                            and float(x[1][1].split()[1]) > 15,
                            mergedData.items())
for d in filteredMergedData:
    outfile.write("\n".join(d[1]) + "\n")
That was pretty convoluted, but basically: items() turns the key/value pairs of the dictionary into (key, value) tuples, and filter() iterates through them, keeping those for which the lambda returns True. The lambda takes the value part - which, as you recall, is the list of lines - and checks the second column of both lines for a value greater than 15. It has to cast those values to float (the columns contain decimals like 13.663) because they are read as strings, which won't compare numerically. For the subindexing to work, it also has to check that the value holds two lines - this also takes care of #3 for you.
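To see what the filter is working with, here's a minimal standalone sketch (sample values invented for illustration):

```python
# Made-up sample entries, shaped like the merged dictionary:
mergedData = {
    "892834": [
        "892834 16.663 0.098 0.871",    # line kept from file1
        "892834 17.2 0.5 0.3",          # first four columns from file2
    ],
    "111111": ["111111 20.0 0.1 0.2"],  # key that appeared in only one file
}

# items() yields (key, value) tuples; x[0] is the key, x[1] the list of lines.
kept = list(filter(lambda x: len(x[1]) == 2
                   and float(x[1][0].split()[1]) > 15
                   and float(x[1][1].split()[1]) > 15,
                   mergedData.items()))

print(kept[0][0])  # 892834 -- the only entry with two lines, both > 15
```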
Now, if you want to put this all together and support an arbitrary criteria and arbitrary filenames, you should put this code into a function and make it take four arguments: the three filenames, as well as a function (yes, you can take functions as arguments) to act as the filter function.
from collections import defaultdict

def mergeData(file1name, file2name, outfilename, a_filter_func):
    """ Merge the data of two files. """
    mergedData = defaultdict(list)
    with open(file1name, 'r') as file1, open(file2name, 'r') as file2, open(outfilename, 'w') as outfile:
        for line in file1:
            mergedData[line.split()[0].lstrip('0')].append(line.rstrip('\n'))
        for line in file2:
            mergedData[line.split()[0]].append(" ".join(line.split()[:4]))
        filteredMergedData = filter(a_filter_func, mergedData.items())
        for d in filteredMergedData:
            outfile.write("\n".join(d[1]) + "\n")
# finally, call the function.
filter_func = lambda x: (len(x[1]) == 2
                         and float(x[1][0].split()[1]) > 15
                         and float(x[1][1].split()[1]) > 15)
mergeData('file1.txt', 'file2.txt', 'mergedData.txt', filter_func)
Just pass something other than that lambda as filter_func if you want other criteria. You can also create a named, "def"'d function and pass that, e.g. if you have def foo(x): you can pass foo as the argument. Just make sure it returns True or False.
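For example, a named equivalent of the lambda above (the function name is just illustrative):

```python
def both_over_15(item):
    """Filter function: item is one (key, value) tuple from mergedData.items()."""
    key, lines = item
    return (len(lines) == 2
            and float(lines[0].split()[1]) > 15
            and float(lines[1].split()[1]) > 15)

# Passed in place of the lambda:
# mergeData('file1.txt', 'file2.txt', 'mergedData.txt', both_over_15)

# Quick sanity check on hand-made entries:
print(both_over_15(("892834", ["892834 16.0 0.1", "892834 17.5 0.2"])))  # True
print(both_over_15(("111111", ["111111 20.0 0.1"])))                     # False
```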
Edit: on second thought, the lambda-based solution requires four linear iterations. Here's an optimized (and probably simpler) version:
def mergeData(file1name, file2name, outfilename, a_filter_func):
    """ Merge the data of two files. """
    mergedData = defaultdict(list)
    with open(file1name, 'r') as file1, open(file2name, 'r') as file2, open(outfilename, 'w') as outfile:
        for line in file1:
            splt = line.split()
            # Filter while reading: only store lines that pass the criterion.
            # Cast the column to float here so the filter can compare numbers.
            if a_filter_func(float(splt[1])):
                mergedData[splt[0].lstrip('0')].append(line.rstrip('\n'))
        for line in file2:
            splt = line.split()
            if a_filter_func(float(splt[1])):
                mergedData[splt[0]].append(" ".join(splt[:4]))
        for k in mergedData:
            outfile.write("\n".join(mergedData[k]) + "\n")
Now a_filter_func may be something as simple as:
lambda x: x > 15
In my excitement at getting to use "functional programming" functions (such as filter()), I forgot that it could be simpler. This version also splits each line only once, rather than multiple times.