Parsing graph data file with Python

Question

I have one relatively small issue, but I can't seem to wrap my head around it. I have a text file which holds information about a graph, and the structure is as follows:

the first line contains the number of nodes
a blank line is used for separation
information about the nodes follows, each chunk separated from the next by an empty line
each chunk contains the node ID on one line, the node type on the second, and information about edges after that
there are two types of edges, "up" and "down"; the first number after the node type is the number of "up" edges, and their IDs follow on the next line (if that number is 0, there are no "up" edges and the next number is the count of "down" edges)
the same goes for the "down" edges: their count, then their IDs on the line below

In my sample data with three nodes, node 1 has type 1, two up edges (2 and 3), and no down edges. Node 2 has type 1, zero up edges, and two down edges (1 and 3). Node 3 has type 2, one up edge (1), and one down edge (2).
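
Going by this description, the sample file presumably looks like the text embedded in the snippet below. This is a reconstruction from the prose, not the original data, and the name `SAMPLE` is made up for illustration. The snippet only splits the text on blank lines, to show that the chunks come out with different lengths depending on whether a node has edges on each side.

```python
# Hypothetical reconstruction of the sample file from the description above;
# the original data was not preserved.
SAMPLE = """3

1
1
2
2 3
0

2
1
0
2
1 3

3
2
1
1
1
2
"""

# Split on blank lines: the first block is the node count, the rest are nodes.
blocks = [block.split("\n") for block in SAMPLE.strip().split("\n\n")]
node_count = int(blocks[0][0])
chunks = blocks[1:]

for chunk in chunks:
    print(len(chunk), chunk)
```

A chunk has four lines when the node has no edges at all, five when only one side has edges, and six when both sides do, which is why a fixed-position lookup is not enough.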

This info is clearly readable by a human, but I am having trouble writing a parser that takes this information and stores it in a usable form.

I have written some sample code:

f = open('C:\\data', 'r')
lines = f.readlines()
num_of_nodes = lines[0]
nodes = {}
counter = 0
skip_next = False
for line in lines[1:]:
    new = False
    left = False
    right = False
    if line == "\n":
        counter += 1
        nodes[counter] = []
        new = True
        continue
    nodes[counter].append(line.replace("\n", ""))

This more or less gets me the info split up per node. I would like something like a dictionary that holds the ID and the up and down neighbors for each node (or False if there are none). I suppose I could now go through this list of nodes again and parse each one on its own, but I am wondering whether I can modify the loop I have to do that cleanly in the first place.

Your definition of "clearly readable by human" is different from mine, but I'm thinking of a solution for your problem.
– puredevotion, Commented Dec 15, 2013 at 20:47
Haha, well, I have definitely read more readable things in my life, but I was trying to say that the data structure is "defined", meaning that when I look at the series of numbers I can easily picture it in my mind: node ID, its type, and neighbors (if it has them) on each side. That clause, "if it has them", seems to be the critical part that I can't express in code.
– wont_compile, Commented Dec 15, 2013 at 20:50
Could you consider giving your question a less vague title than "parse text file with Python"? Something specific to the data you're trying to read.
– Iguananaut, Commented Dec 15, 2013 at 20:51
@puredevotion Something like this: nodes = {node_id: {'ups': [], 'downs': []}}, or something along those lines.
– wont_compile, Commented Dec 15, 2013 at 21:22

bruno desthuilliers · Accepted Answer · 2013-12-15 21:37:20Z

2

Is this what you want?

{1: {'downs': [], 'ups': [2, 3], 'node_type': 1}, 
 2: {'downs': [1, 3], 'ups': [], 'node_type': 1}, 
 3: {'downs': [2], 'ups': [1], 'node_type': 2}}

Then here's the code:

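def parse_chunk(chunk):
    # chunk is a list of stripped, non-empty lines describing one node
    node_id = int(chunk[0])
    node_type = int(chunk[1])

    # the "up" count is followed by a line of IDs only when it is non-zero
    nb_up = int(chunk[2])
    if nb_up:
        ups = [int(n) for n in chunk[3].split()]
        next_pos = 4
    else:
        ups = []
        next_pos = 3

    # same layout for the "down" edges
    nb_down = int(chunk[next_pos])
    if nb_down:
        downs = [int(n) for n in chunk[next_pos + 1].split()]
    else:
        downs = []

    return node_id, dict(
        node_type=node_type,
        ups=ups,
        downs=downs
        )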

def collect_chunks(lines):
    chunk = []
    for line in lines:
        line = line.strip()
        if line:
            chunk.append(line)
        else:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def parse(stream):
    # the first line holds the node count
    nb_nodes = int(next(stream).strip())
    if not nb_nodes:
        return {}
    next(stream)  # skip the blank separator line
    return dict(parse_chunk(chunk) for chunk in collect_chunks(stream))

def main(*args):
    with open(args[0], "r") as f:
        print(parse(f))

if __name__ == "__main__":
    import sys
    main(*sys.argv[1:])
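
For comparison, here is a variation on the same idea, offered as a sketch rather than as part of the answer. It reads each chunk through an iterator, so the "if it has them" rule lives in one small helper instead of the `next_pos` bookkeeping. The names `read_edges` and `parse_graph` are made up for this example, the use of `itertools.groupby` to split chunks is a swapped-in technique, and the check that the declared count matches the number of IDs is an added assumption about what counts as a valid file.

```python
import io
from itertools import groupby


def read_edges(lines):
    """Read an edge count, then the ID line only if the count is non-zero."""
    count = int(next(lines))
    if count == 0:
        return []
    ids = [int(n) for n in next(lines).split()]
    if len(ids) != count:
        raise ValueError("expected %d edge ids, got %d" % (count, len(ids)))
    return ids


def parse_graph(stream):
    """Hypothetical helper: parse the whole file into {node_id: {...}}."""
    stripped = (line.strip() for line in stream)
    # groupby(..., key=bool) yields alternating runs of non-empty and empty lines
    chunks = [list(group)
              for has_text, group in groupby(stripped, key=bool)
              if has_text]
    nodes = {}
    for chunk in chunks[1:]:  # chunks[0] is the node-count line
        lines = iter(chunk)
        node_id = int(next(lines))
        node_type = int(next(lines))
        ups = read_edges(lines)
        downs = read_edges(lines)
        nodes[node_id] = {"node_type": node_type, "ups": ups, "downs": downs}
    return nodes


# Sample text reconstructed from the question's description (an assumption).
sample = io.StringIO(
    "3\n\n1\n1\n2\n2 3\n0\n\n2\n1\n0\n2\n1 3\n\n3\n2\n1\n1\n1\n2\n"
)
result = parse_graph(sample)
print(result)
```

Because the chunks are split on runs of blank lines, this version also tolerates several consecutive empty lines between nodes, at the cost of holding all chunks in memory at once.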

edited Dec 15, 2013 at 21:37

answered Dec 15, 2013 at 21:18


2 Comments

wont_compile Over a year ago

This is perfect! I only modified this a bit, as I need slightly different output for further processing, so I added this:

nodes = {}
for node in list_of_nodes:
    nodes[node['node_id']] = {'type': node['node_type'], 'right': node['right'], 'left': node['left']}

wont_compile Over a year ago

It's perfect now! Thank you for the blazingly fast answer and for your time and help!

puredevotion · Answer · 2013-12-15 21:35:21Z

1

I would do it as presented below. I would also wrap the file reading in a try/except block and open your files with a with statement.

nodes = {}
counter = 0
with open(node_file, 'r', encoding='utf-8') as file:
     file.readline()                              # skip first line, not a node
     for line in file.readline():
         if line == "\n":
             line = file.readline()               # read next line
             counter = line[0]
             nodes[counter] = {}                  # create a nested dict per node
             line = file.readline() 
             nodes[counter]['type'] = line[0]     # add node type
             line = file.readline()
             if line[0] != '0':
                 line = file.readline()           # there are many ways
                 up_edges = line[0].split()       # you can store edges
                 nodes[counter]['up'] = up_edges  # here a list
                 line = file.readline()
             else: 
                 line = file.readline()
             if line[0] != '0':
                 line = file.readline()
                 down_edges = line[0].split()     # store down-edges as a list  
                 nodes[counter]['down'] = down_edges  
             # end of chunk/node-set, let for-loop read next line
         else:
              print("this should never happen! line: ", line[0])

This reads the file line by line. I'm not sure how large your data files are, but this approach is easier on memory. The trade-off, if memory is what you are optimizing for, is that it may be slower in terms of HDD reads (although an SSD does miracles).

Haven't tested the code, but the concept is clear :)
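
For what it's worth, the line-by-line concept can be made to work along these lines. This is a sketch under assumptions, not the code above with its bugs patched: the helper names `next_data_line` and `read_nodes` are made up, blank lines are simply skipped instead of being used as chunk markers, the declared counts are trusted, and the edge IDs are stored as ints rather than strings.

```python
import io


def next_data_line(stream):
    """Return the next non-blank line, stripped, or None at end of file."""
    while True:
        line = stream.readline()
        if line == "":  # readline() returns "" only at EOF
            return None
        line = line.strip()
        if line:
            return line


def read_nodes(stream):
    """Hypothetical helper: parse nodes while holding one line at a time."""
    nodes = {}
    # The first data line is the node count; it is read and ignored here.
    if next_data_line(stream) is None:
        return nodes
    while True:
        node_id = next_data_line(stream)
        if node_id is None:  # no more chunks
            return nodes
        node = {"type": int(next_data_line(stream))}
        for key in ("up", "down"):
            count = int(next_data_line(stream))
            # the ID line exists only when the count is non-zero
            if count:
                node[key] = [int(n) for n in next_data_line(stream).split()]
            else:
                node[key] = []
        nodes[int(node_id)] = node


# Sample text reconstructed from the question's description (an assumption).
sample = io.StringIO(
    "3\n\n1\n1\n2\n2 3\n0\n\n2\n1\n0\n2\n1 3\n\n3\n2\n1\n1\n1\n2\n"
)
nodes = read_nodes(sample)
print(nodes)
```

A truncated file would make `next_data_line` return None in the middle of a node and raise a `TypeError`, so real input would want explicit error handling around the `int(...)` calls.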

edited Dec 15, 2013 at 21:35

answered Dec 15, 2013 at 21:27


5 Comments

bruno desthuilliers Over a year ago

Your code won't work - you're reading strings but comparing to ints.

puredevotion Over a year ago

ah, thought it would read the numbers as ints, but that will be a small edit coming up...

wont_compile Over a year ago

Yup, not working. There are some weird things like: 'line = file.readline()' and in the next line 'counter = line[0]', which brings some errors.

puredevotion Over a year ago

ok, then I will test it :) --> or not, since @bruno's answer was correct

wont_compile Over a year ago

Yeah, it seems he delivered what I was asking, so no need to spend your valuable time on this issue anymore. Thanks! :)
