
I'm trying to process a pipe-separated text file with the following format:

18511|1|2587198|2004-03-31|0|100000|0|1.97|0.49988|100000||||
18511|2|2587198|2004-06-30|0|160000|0|3.2|0.79669|60000|60|||
18511|3|2587198|2004-09-30|0|160000|0|2.17|0.79279|0|0|||
18511|4|2587198|2004-09-30|0|160000|0|1.72|0.79118|0|0|||
18511|5|2587198|2005-03-31|0|0|0|0|0|-160000|-100|||19
18511|1|2587940|2004-03-31|0|240000|0|0.78|0.27327|240000||||
18511|2|2587940|2004-06-30|0|560000|0|1.59|0.63576|320000|133.33||24|
18511|3|2587940|2004-09-30|0|560000|0|1.13|0.50704|0|0|||
18511|4|2587940|2004-09-30|0|560000|0|0.96|0.50704|0|0|||
18511|5|2587940|2005-03-31|0|0|0|0|0|-560000|-100|||14

For each line I want to isolate the second field and write that line out to a file whose name contains that field, e.g. issue1.txt, issue2.txt, where the number is the second field in the file excerpt above. This number can be in the range 1 to 56. My code is shown below:

with open('d:\\tmp\issueholding.txt') as f, open('d:\\tmp\issue1.txt', 'w') as out_f1,\
open('d:\\tmp\issue2.txt', 'w') as out_f2,open('d:\\tmp\issue3.txt', 'w') as out_f3,\
open('d:\\tmp\issue4.txt', 'w') as out_f4,open('d:\\tmp\issue5.txt', 'w') as out_f5,\
open('d:\\tmp\issue6.txt', 'w') as out_f6,open('d:\\tmp\issue7.txt', 'w') as out_f7,\
open('d:\\tmp\issue8.txt', 'w') as out_f8,open('d:\\tmp\issue9.txt', 'w') as out_f9,\
open('d:\\tmp\issue10.txt', 'w') as out_f10,open('d:\\tmp\issue11.txt', 'w') as out_f11,\
open('d:\\tmp\issue12.txt', 'w') as out_f12,open('d:\\tmp\issue13.txt', 'w') as out_f13,\
open('d:\\tmp\issue14.txt', 'w') as out_f14,open('d:\\tmp\issue15.txt', 'w') as out_f15,\
open('d:\\tmp\issue16.txt', 'w') as out_f16,open('d:\\tmp\issue17.txt', 'w') as out_f17,\
open('d:\\tmp\issue18.txt', 'w') as out_f18,open('d:\\tmp\issue19.txt', 'w') as out_f19,\
open('d:\\tmp\issue20.txt', 'w') as out_f20,open('d:\\tmp\issue21.txt', 'w') as out_f21,\
open('d:\\tmp\issue22.txt', 'w') as out_f22,open('d:\\tmp\issue23.txt', 'w') as out_f23,\
open('d:\\tmp\issue24.txt', 'w') as out_f24,open('d:\\tmp\issue25.txt', 'w') as out_f25,\
open('d:\\tmp\issue32.txt', 'w') as out_f32,open('d:\\tmp\issue33.txt', 'w') as out_f33,\
open('d:\\tmp\issue34.txt', 'w') as out_f34,open('d:\\tmp\issue35.txt', 'w') as out_f35,\
open('d:\\tmp\issue36.txt', 'w') as out_f36,open('d:\\tmp\issue37.txt', 'w') as out_f37,\
open('d:\\tmp\issue38.txt', 'w') as out_f38,open('d:\\tmp\issue39.txt', 'w') as out_f39,\
open('d:\\tmp\issue40.txt', 'w') as out_f40,open('d:\\tmp\issue41.txt', 'w') as out_f41,\
open('d:\\tmp\issue42.txt', 'w') as out_f42,open('d:\\tmp\issue43.txt', 'w') as out_f43,\
open('d:\\tmp\issue44.txt', 'w') as out_f44,open('d:\\tmp\issue45.txt', 'w') as out_f45,\
open('d:\\tmp\issue46.txt', 'w') as out_f46,open('d:\\tmp\issue47.txt', 'w') as out_f47,\
open('d:\\tmp\issue48.txt', 'w') as out_f48,open('d:\\tmp\issue49.txt', 'w') as out_f49,\
open('d:\\tmp\issue50.txt', 'w') as out_f50,open('d:\\tmp\issue51.txt', 'w') as out_f51,\
open('d:\\tmp\issue52.txt', 'w') as out_f52,open('d:\\tmp\issue53.txt', 'w') as out_f53,\
open('d:\\tmp\issue54.txt', 'w') as out_f54,open('d:\\tmp\issue55.txt', 'w') as out_f55,\
open('d:\\tmp\issue56.txt', 'w') as out_f56:
    for line in f:
        field1_end = line.find('|') +1
        field2_end = line.find('|',field1_end)
        f2=line[field1_end:field2_end]
        out_f56.write(line)

My two issues are:

1) When trying to run the above I get the following error message:

File "", line unknown SyntaxError: too many statically nested blocks

2) How do I change the line out_f56.write(line) so that I can use the variable f2 to select the file object, rather than hard-coding the handle?
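
In other words, what I am hoping for inside the loop is something like the following, where out_files is hypothetical: some mapping from the second field to an already-open file object.

    # hypothetical: out_files maps the second field ('1'..'56') to an
    # already-open file object
    out_files[f2].write(line)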

I am running this in a Jupyter notebook with Python 3 under Windows. To be clear, the input file has approximately 235 million records, so performance is key.

Appreciate any help or suggestions

  • This isn't an answer, but in my experience the bottleneck often isn't Python but disk I/O: no matter how much the Python is optimized, disk I/O may still dominate.

1 Answer


The SyntaxError comes from a hard limit in the CPython compiler: it allows at most 20 statically nested blocks, and every context manager chained into a single with statement counts towards that limit, so opening 50+ files in one statement cannot compile. Rather than keeping every output file open at once, try something like this (see comments in code for explanation):

with open(R"d:\tmp\issueholding.txt") as f:
    for line in f:
        # splitting line into list of strings at '|' character
        fields = line.split('|')

        # defining output file name according to issue code in second field
        # NB: list-indexes are zero-based, therefore use 1
        out_name = R"d:\tmp\issue%s.txt" % fields[1]

        # opening output file and writing current line to it
        # NB: make sure you use the 'a+' mode to append to existing file
        with open(out_name, 'a+') as ff:
            ff.write(line)

To avoid opening files repeatedly inside the reading loop, you could do the following:

from collections import defaultdict

with open(R"D:\tmp\issueholding.txt") as f:

    # setting up dictionary to hold lines grouped by issue code
    # using a defaultdict here to automatically create a list when inserting
    # the first item
    collected_issues = defaultdict(list)

    for line in f:
        # splitting line into list of strings at '|' character and retrieving
        # current issue code from second token
        issue_code = line.split('|')[1]
        # appending current line to list of collected lines associated with
        # current issue code
        collected_issues[issue_code].append(line)
    # after the whole file has been read, write each group out in one go
    for issue_code in collected_issues:
        # defining output file name according to issue code
        out_name = R"D:\tmp\issue%s.txt" % issue_code
        # opening output file and writing collected lines to it
        with open(out_name, 'a+') as ff:
            ff.write("".join(collected_issues[issue_code]))

This of course builds an in-memory dictionary holding every line retrieved from the input file. Given that your input has roughly 235 million records, that may well not be feasible on your machine. An alternative is to split the input up and process it chunk by chunk. This can be done by defining a helper that reads a fixed number of lines (here: 1000) from the input file at a time. A possible final solution could then look like this:

from itertools import islice
from collections import defaultdict


def get_chunk_of_lines(file, N):
    """
    Retrieves N lines from specified opened file.
    """
    return [x.strip() for x in islice(file, N)]


def collect_issues(lines):
    """
    Collects and groups issues from specified lines.
    """
    collected_issues = defaultdict(list)

    for line in lines:
        # splitting line into list of strings at '|' character and retrieving
        # current issue code from second token
        issue_code = line.split('|')[1]
        # appending current line to list of collected lines associated with
        # current issue code
        collected_issues[issue_code].append(line)

    return collected_issues


def export_grouped_issues(issues):
    """
    Exports collected and grouped issues.
    """
    for issue_code in issues:
        # defining output file name according to issue code
        out_name = R"D:\tmp\issue%s.txt" % issue_code
        # opening output file and writing collected lines to it
        # NB: get_chunk_of_lines strips the trailing newlines, so they have
        # to be re-added here when joining
        with open(out_name, 'a+') as f:
            f.write("\n".join(issues[issue_code]) + "\n")


with open(R"D:\tmp\issueholding.txt") as issue_src:

    chunk_cnt = 0

    while True:
        # retrieving 1000 input lines at a time
        line_chunk = get_chunk_of_lines(issue_src, 1000)

        # exiting while loop if no more chunk is left
        if not line_chunk:
            break

        chunk_cnt += 1
        print("+ Working on chunk %d" % chunk_cnt)

        # collecting, grouping and exporting issues
        issues = collect_issues(line_chunk)
        export_grouped_issues(issues)
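
One further option, not covered in the answer above: if it is acceptable to keep all 56 output files open for the whole run, contextlib.ExitStack can enter any number of context managers under a single with statement, which sidesteps the nested-block limit, and a dictionary keyed by the issue code replaces the hard-coded out_f56 handle. A minimal sketch, assuming the paths and the 1-to-56 code range from the question:

from contextlib import ExitStack

with open(R"D:\tmp\issueholding.txt") as f, ExitStack() as stack:
    # one writable file object per possible issue code, keyed by the
    # string form of the code so no per-line conversion is needed
    out_files = {
        str(i): stack.enter_context(open(R"D:\tmp\issue%d.txt" % i, 'w'))
        for i in range(1, 57)
    }

    for line in f:
        # the second pipe-separated field selects the output file
        out_files[line.split('|')[1]].write(line)

Every file is opened exactly once and every line is written exactly once in a single pass, which matters for an input of roughly 235 million records.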

2 Comments

Sorry, I should have said earlier: performance is key here. The input file has approx 235 million records, so I can't be opening files inside the read loop. Will update the original question with this info.
Great answer le_affan, thanks for helping out a Python newbie. If anyone is interested: with some slight modifications, e.g. increasing the chunk size to 1M, this ran in 1033 secs on my desktop PC. Not too shabby.
