1

I have a text file that is structured like so:

SOURCE: RCM
DESTINATIONS BEGIN
JCK SF3
DESTINATIONS END
SOURCE: TRO
DESTINATIONS BEGIN
GFN SF3
SYD SF3 DH4
DESTINATIONS END

I am trying to build a nested dictionary from it, so that the result looks like this:

handout_routes = {
'RCM': {'JCK': ['SF3']},
'TRO': {'GFN': ['SF3'], 'SYD': ['SF3', 'DH4']}
}

This is just a sample of the data, but when reading the file we can assume the following: The very first line begins with SOURCE: followed by a three-letter IATA airport code. The line after every line that begins with SOURCE: is DESTINATIONS BEGIN. There are one or more lines between DESTINATIONS BEGIN and DESTINATIONS END. Every DESTINATIONS BEGIN line has a corresponding DESTINATIONS END line. The lines between DESTINATIONS BEGIN and DESTINATIONS END start with a three-letter IATA airport code, which is followed by one or more three-character alphanumeric plane codes. Each code is separated by a space. The line after DESTINATIONS END either begins with SOURCE: or is the end of the file.

So far I've tried:

with open("file_path", encoding='utf-8') as text_data:
    answer = {}
    for line in text_data:
        line = line.split()
        if not line:  # empty line?
            continue
        answer[line[0]] = line[1:]
    print(answer)

But it returns the data like this:

{'SOURCE:': ['WYA'], 'DESTINATIONS': ['END'], 'KZN': ['146'], 'DYU': ['320']}

I think it's how I structured the code to read the file. Any help will be appreciated. It's possible my code is way too simple for what needs to be done with the file. Thank you.

3 Answers

1

Here's a program I wrote that works quite well:

def unpack(text):
    """Parse the file's text into {source: {destination: [plane codes]}}."""
    contents = {}
    source = None

    for line in text.splitlines():
        line = line.strip()

        if not line or line.startswith('DESTINATIONS'):
            # Blank lines and the BEGIN/END markers carry no data, so skip them.
            continue

        if line.startswith('SOURCE'):
            # The airport code is the last whitespace-separated token.
            source = line.split()[-1]
            contents.setdefault(source, {})
        else:
            # First token is the destination; the rest are plane codes.
            destination, *planes = line.split()
            contents[source][destination] = planes

    return contents


with open('file.txt') as file:
    handout_routes = unpack(file.read())
    print(handout_routes)
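
Since the real dataset is described as large, the same idea can also be applied one line at a time instead of reading the whole file into a string first. The following is only a minimal sketch: the name `unpack_lines` is made up for illustration, it assumes `SOURCE:` is always followed by a space and the airport code, and it uses an in-memory `io.StringIO` as a stand-in for the open file.

```python
import io


def unpack_lines(lines):
    """Build {source: {destination: [plane codes]}} from an iterable of lines.

    Hypothetical helper: accepts an open file, a StringIO, or a list of strings.
    """
    routes = {}
    source = None
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] == 'DESTINATIONS':
            continue  # blank line or a BEGIN/END marker
        if tokens[0] == 'SOURCE:':
            source = tokens[1]
            routes.setdefault(source, {})
        else:
            destination, *planes = tokens
            routes[source][destination] = planes
    return routes


# In-memory stand-in for the file, using the sample from the question.
sample = io.StringIO(
    "SOURCE: RCM\n"
    "DESTINATIONS BEGIN\n"
    "JCK SF3\n"
    "DESTINATIONS END\n"
    "SOURCE: TRO\n"
    "DESTINATIONS BEGIN\n"
    "GFN SF3\n"
    "SYD SF3 DH4\n"
    "DESTINATIONS END\n"
)
handout_routes = unpack_lines(sample)
print(handout_routes)
```

With a real file, passing the open file object itself (for example `unpack_lines(file)` inside the `with` block) should behave the same way, because iterating a text file yields its lines.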

4 Comments

This is putting me on the right track but it only returns this: {'AER': {}} Perhaps I'm implementing your code incorrectly? What does it return for you?
That's odd, for me it returns {'RCM': {'JCK': ['SF3']}, 'TRO': {'GFN': ['SF3'], 'SYD': ['SF3', 'DH4']}}, exactly the dict you said it should return. Could you show me the file you're trying to open?
Sure, it's a .dat file. It's a large dataset so how can I show you?
Never mind, it was my implementation that was wrong. This works great! Thanks!
0

I know there's already an accepted answer, but I used an approach that may actually help you find the formatting errors in your file, rather than just ignoring the extra bits:

from tokenize import TokenInfo, tokenize, ENCODING, ENDMARKER, NEWLINE, NAME
from typing import Callable, Generator

class TripParseException(Exception):
    pass

def assert_token_string(token:TokenInfo, expected_string: str):
    if token.string != expected_string:
        raise TripParseException("Unable to parse trip file: expected {}, found {} in line {} ({})".format(
            expected_string, token.string, str(token.start[0]), token.line
        ))

def assert_token_type(token: TokenInfo, expected_type: int):
    if token.type != expected_type:
        raise TripParseException("Unable to parse trip file: expected type {}, found type {} in line {} ({})".format(
            expected_type, token.type, str(token.start[0]), token.line
        ))

def parse_destinations(token_stream: Generator[TokenInfo, None, None])->dict:
    destinations = dict()
    assert_token_string(next(token_stream), "DESTINATIONS")
    assert_token_string(next(token_stream), "BEGIN")
    assert_token_type(next(token_stream), NEWLINE)
    current_token = next(token_stream)
    while(current_token.string != "DESTINATIONS"):
        assert_token_type(current_token, NAME)
        destination = current_token.string
        plane_codes = list()
        current_token = next(token_stream)
        while(current_token.type != NEWLINE):
            assert_token_type(current_token, NAME)
            plane_codes.append(current_token.string)
            current_token = next(token_stream)
        destinations[destination] = plane_codes
        # current token is NEWLINE, get the first token on the next line.
        current_token = next(token_stream)


    # Just parsed "DESTINATIONS", expecting "DESTINATIONS END"
    assert_token_string(next(token_stream), "END")
    assert_token_type(next(token_stream), NEWLINE)
    return destinations

def parse_trip(token_stream: Generator[TokenInfo, None, None]):
    current_token = next(token_stream)
    if(current_token.type == ENDMARKER):
        return None, None
    assert_token_string(current_token, "SOURCE")
    assert_token_string(next(token_stream), ":")
    tok_origin = next(token_stream)
    assert_token_type(tok_origin, NAME)
    assert_token_type(next(token_stream), NEWLINE)
    destinations = parse_destinations(token_stream)

    return tok_origin.string, destinations

def parse_trips(readline: Callable[[], bytes]) -> dict:
    token_gen = tokenize(readline)
    assert_token_type(next(token_gen), ENCODING)
    trips = dict()
    while(True):
        origin, destinations = parse_trip(token_gen)
        if(origin is not None and destinations is not None):
            trips[origin] = destinations
        else:
            break

    return trips

You would then call it like this:

import pprint

with open("trips.dat", "rb") as trips_file:
    trips = parse_trips(trips_file.readline)
    pprint.pprint(
        trips
    )

which yields the expected result:

{'RCM': {'JCK': ['SF3']}, 'TRO': {'GFN': ['SF3'], 'SYD': ['SF3', 'DH4']}}

This approach is also more flexible if you later want to add other information to your files.
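
One thing to watch for: the output shown in the question contains plane codes such as 146 and 320, and Python's tokenizer reports those as NUMBER tokens rather than NAME, so the NAME checks above would reject them. If that matters for your data, the same kind of error reporting can be done on plain lines. This is only a rough sketch under the format rules stated in the question; `RouteFormatError` and `parse_routes` are names invented for illustration.

```python
class RouteFormatError(ValueError):
    """Raised when a line does not fit the expected layout (hypothetical name)."""


def parse_routes(lines):
    """Parse an iterable of lines, reporting the line number of any problem."""
    routes = {}
    source = None      # current SOURCE code, or None between blocks
    in_block = False   # True between DESTINATIONS BEGIN and DESTINATIONS END

    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith("SOURCE:"):
            if in_block:
                raise RouteFormatError(f"line {number}: SOURCE inside a destinations block")
            source = line[len("SOURCE:"):].strip()
            if not source:
                raise RouteFormatError(f"line {number}: SOURCE without an airport code")
            routes.setdefault(source, {})
        elif line == "DESTINATIONS BEGIN":
            if source is None or in_block:
                raise RouteFormatError(f"line {number}: unexpected DESTINATIONS BEGIN")
            in_block = True
        elif line == "DESTINATIONS END":
            if not in_block:
                raise RouteFormatError(f"line {number}: unexpected DESTINATIONS END")
            in_block = False
            source = None
        else:
            if not in_block:
                raise RouteFormatError(f"line {number}: unexpected line {line!r}")
            destination, *planes = line.split()
            if not planes:
                raise RouteFormatError(f"line {number}: destination without plane codes")
            routes[source][destination] = planes

    if in_block:
        raise RouteFormatError("file ended before DESTINATIONS END")
    return routes


sample = [
    "SOURCE: RCM", "DESTINATIONS BEGIN", "JCK SF3", "DESTINATIONS END",
    "SOURCE: TRO", "DESTINATIONS BEGIN", "GFN SF3", "SYD SF3 DH4", "DESTINATIONS END",
]
print(parse_routes(sample))
```

Because the codes are treated as plain strings here, a line like `KZN 146` parses without complaint, while a missing `DESTINATIONS END` or a stray line raises an error that names the offending line.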

0
from itertools import takewhile
import re


def destinations(lines):
    if next(lines).startswith('DESTINATIONS BEGIN'):
        dest = takewhile(lambda l: not l.startswith('DESTINATIONS END'), lines)
        yield from map(str.split, dest)


def sources(lines):
    source = re.compile(r'SOURCE:\s*(\w+)')
    while m := source.match(next(lines, '')):
        yield (m.group(1),
               {dest: crafts for dest, *crafts in destinations(lines)})


# Use a context manager so the file is closed once parsing is done.
with open('file_path', encoding='utf-8') as lines:
    handout_routes = dict(sources(lines))
print(handout_routes)
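
The code above depends on both generators pulling from the same iterator, and on `takewhile` consuming the `DESTINATIONS END` line that stops it, so the next line read is the following `SOURCE:` line. A small sketch of that behavior, using a hand-written list of lines in place of the file:

```python
from itertools import takewhile

# A shared iterator standing in for the open file.
lines = iter([
    "JCK SF3\n",
    "DESTINATIONS END\n",
    "SOURCE: TRO\n",
])

# takewhile stops at the first line that fails the test, and that line
# (DESTINATIONS END) has already been taken from the iterator by then.
block = list(takewhile(lambda l: not l.startswith("DESTINATIONS END"), lines))
remaining = next(lines)

print(block)      # the destination lines only
print(remaining)  # the line that follows DESTINATIONS END
```

This is why `sources` can call `next(lines, '')` right after a destinations block and expect to see either the next `SOURCE:` line or the end of the file.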
