
I have the following file:

1    real madrid,barcelona,chelsea,arsenal,cska
2    chelsea,arsenal,milan,napoli,juventus
5    bayern,dortmund,celtic,napoli
7    cska,psg,arsenal,atalanta
9    atletic bilbao,las palmas,milan,barcelona

and I want to produce a new file with the output below. Where the first column used to hold the nodes, it now holds each team, and the second column lists the nodes that have this team as an attribute:

real madrid    1
barcelona    1,9
chelsea    1,2
arsenal    1,2,7
cska    1,7
milan    2,9
etc...

First of all, I opened the file and saved each column to a list:

file1 = open("myfile.txt","r")
lines1 = file1.readlines()
nodes1 = []
attrs1 = []


for x in lines1:
    x = x.strip()
    x = x.split('\t')
    nodes1.append(x[0])
    attrs1.append(x[1].split(','))

But now, how can I go through the attrs and nodes to produce the output file?

5 Answers

Better, create a dictionary when reading the file:

line_map = {}
for x in lines1:
    (row_no, teams) = x.strip().split("\t")
    for i in teams.split(","):
        if i not in line_map:
            line_map[i] = set()
        line_map[i].add(row_no)

Now line_map maps each team name to the set of row numbers it appears on. Sets are unordered, so the numbers may print in any order. You can easily print the mapping:

for (k, v) in line_map.items():
    print("%s: %s" % (k, ",".join(v)))

if I am not much mistaken...

Edit: append should have been add.


2 Comments

Only thing I would add is to use a defaultdict with string keys and set values to avoid the check whether i is in line_map already.
line_map[i] is a set, therefore use .add()

You can create a dictionary to hold your teams and populate it with nodes as you encounter them:

import collections

teams = collections.defaultdict(set)  # initiate each team with a set for nodes
with open("myfile.txt", "r") as f:  # open the file for reading
    for line in f:  # read the file line by line
        row = line.strip().split("\t")  # assuming a tab separator as in your code
        if len(row) < 2:  # skip blank or malformed lines (no tab separator)
            continue
        for team in row[1].split(","):  # split and iterate over each team
            teams[team].add(row[0].strip())  # add a node to the current team

# and you can now print it out:
for team, nodes in teams.items():
    print("{}\t{}".format(team, ",".join(nodes)))

This will yield:

arsenal    2,1,7
atalanta    7
chelsea 2,1
cska    1,7
psg 7
juventus    2
real madrid 1
barcelona   9,1
dortmund    5
celtic  5
napoli  2,5
milan   9,2
las palmas  9
atletic bilbao  9
bayern  5

for your data. The order is not guaranteed, since both the dictionary keys and the node sets come out in arbitrary order, but you can always apply sorted() to get them in the order you want.
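
As a minimal sketch of that sorting step: the `teams` mapping below is a small hand-built stand-in for the one collected above, `format_rows` is a hypothetical helper name, and sorting the nodes with `key=int` assumes every node id is an integer.

```python
import collections

# hypothetical stand-in for the `teams` mapping built while reading the file
teams = collections.defaultdict(set)
teams["milan"].update({"9", "2"})
teams["arsenal"].update({"7", "1", "2"})
teams["barcelona"].update({"1", "9"})


def format_rows(mapping):
    """Return 'team<TAB>nodes' rows: teams alphabetical, nodes in numeric order."""
    rows = []
    for team in sorted(mapping):  # alphabetical team order
        nodes = sorted(mapping[team], key=int)  # assumes integer node ids
        rows.append("{}\t{}".format(team, ",".join(nodes)))
    return rows


for row in format_rows(teams):
    print(row)
```

Sorting with `key=int` keeps a node such as 10 after 9; a plain string sort would put it before 2.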

UPDATE: To save the result into a file all you need is to use handle.write():

with open("out_file.txt", "w") as f:  # open the file for writing
    for team, nodes in teams.items():  # iterate through the collected team-node pairs
        f.write("{}\t{}\n".format(team, ",".join(nodes)))  # write each as a new line

1 Comment

Your code works fine, but when I try to save the result to a text file, it all ends up on one line. How can I save it in the same format as it is printed?

Here's an approach(?) using regular expressions. Happy coding :)

#!/usr/bin/env python3.6
import re, io, itertools

if __name__ == '__main__':
    groups = [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|')
              for line in io.StringIO(open('f.txt').read())]
    enums = sorted([[word, n] for group, n in groups for word in group.split(',')], key=lambda x: x[0])
    for a, b in itertools.groupby(enums, lambda x: x[0]):
        print(a, ','.join(sorted(map(lambda x: x[1], b), key=int)))

Explanation (of sorts)

#!/usr/bin/env python3.6
import re, io, itertools

if __name__ == '__main__':
    # ('\d*') <-- match and capture leading integers
    # '\s*' <---- match but don't capture intervening space
    # ('.*') <--- match and capture the everything else

    # ('\g<2>|\g<1>') <--- swaps the second capture group with the first
    #                      and puts a "|" in between for easy splitting

    # io.StringIO is a great wrapper for a string, makes it easy to process text

    # re.subn is used to perform the regex swapping
    groups = [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|') for line in io.StringIO(open('f.txt').read())]

    # convert [['place1,place2', 1], ['place3,place4', 2], ...] -> [[place1, 1], [place2, 1], [place3, 2], [place4, 2], ...]
    enums = sorted([[word, n] for group, n in groups for word in group.split(',')], key=lambda x: x[0])
    # group together, extract numbers, ...?, profit!
    for a, b in itertools.groupby(enums, lambda x: x[0]):
        print(a, ','.join(sorted(map(lambda x: x[1], b), key=int)))
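
To make the swapping step easier to follow, here is a minimal sketch of what `re.subn` does to a single line. The sample line is made up to match the question's format, and raw strings are used so the backslashes reach the regex engine unchanged.

```python
import re

# one sample line in the same shape as the input file (hypothetical data)
line = "1    real madrid,barcelona,chelsea\n"

# subn returns (new_string, number_of_substitutions); count=1 limits it to
# the first match, which covers the whole line up to the newline
swapped, count = re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, count=1)

# the team list now comes first, so splitting on "|" yields [teams, node]
teams, node = swapped.strip().split('|')

print(repr(swapped))
print(teams)
print(node)
```

Using "|" as the separator only works as long as no team name contains that character.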

Bonus: one line "piss off your coworkers" edition

#!/usr/bin/env python3.6
import io
import itertools
import re

if __name__ == '__main__':
    groups = [[place, lines]
              for a, b in itertools.groupby(sorted([[word, n]
              for line in io.StringIO(open('f.txt').read())
              for group, n in [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|')]
              for word in group.split(',')], key=lambda x: x[0]), key=lambda x: x[0])
              for place, lines in [[a, ','.join(sorted(map(lambda x: x[1], b), key=int))]]]

    for place, lines in groups:
        print(place, lines)

"Bonus" #2: write output directly to file, piss-off-co-worker-no-life edition v1.2

#!/usr/bin/env python3.6
import io
import itertools
import re

if __name__ == '__main__':
    with open('output.txt', 'w') as f:
        groups = [print(place, lines, file=f)
                  for a, b in itertools.groupby(sorted([[word, n]
                  for line in io.StringIO(open('f.txt').read())
                  for group, n in [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|')]
                  for word in group.split(',')], key=lambda x: x[0]), key=lambda x: x[0])
                  for place, lines in [[a, ','.join(sorted(map(lambda x: x[1], b), key=int))]]]

"Bonus" #3: terminal-tables-because-I-got-fired-for-pissing-off-my-coworkers-so-I-have-free-time-edition v75.2

Note: requires terminaltables 3rd party library

#!/usr/bin/env python3.6
import io
import itertools
import re
import terminaltables

if __name__ == '__main__':
    print(terminaltables.AsciiTable(
        [['Places', 'Line No.'], *[[place, lines]
          for a, b in itertools.groupby(sorted([[word, n]
          for line in io.StringIO(open('f.txt').read())
          for group, n in [re.subn(r'(\d*)\s*(.*)', r'\g<2>|\g<1>', line, 1)[0].strip().split('|')]
          for word in group.split(',')], key=lambda x: x[0]), key=lambda x: x[0])
          for place, lines in [[a, ','.join(sorted(map(lambda x: x[1], b), key=int))]]]]).table)

Output:

+----------------+----------+
| Places         | Line No. |
+----------------+----------+
| arsenal        | 1,2,7    |
| atalanta       | 7        |
| atletic bilbao | 9        |
| barcelona      | 1,9      |
| bayern         | 5        |
| celtic         | 5        |
| chelsea        | 1,2      |
| cska           | 1,7      |
| dortmund       | 5        |
| juventus       | 2        |
| las palmas     | 9        |
| milan          | 2,9      |
| napoli         | 2,5      |
| psg            | 7        |
| real madrid    | 1        |
+----------------+----------+

# for this example, instead of reading a file just include the contents as string ..
file1 = """
1\treal madrid,barcelona,chelsea,arsenal,cska
2\tchelsea,arsenal,milan,napoli,juventus
5\tbayern,dortmund,celtic,napoli
7\tcska,psg,arsenal,atalanta
9\tatletic bilbao,las palmas,milan,barcelona
"""

# .. which can be split into a list (same result as with readlines)
lines1 = file1.strip().split('\n')
print(lines1)

# using separate lists requires handling indexes, so I'd use a dictionary instead
output_dict = {}

# iterate as before
for x in lines1:
    # you can chain the methods, and assign both parts of the line 
    # simultaneously (must be two parts exactly, so one TAB, or there
    # will be an error (Exception))
    node, attrs = x.strip().split('\t')

    # separate the list of clubs
    clubs = attrs.split(',')

    # collect each club in the output ..
    for club in clubs:
        # and with it, a list of the node(s)
        if club in output_dict:
            # add entry to the list for the existing club
            output_dict[club].append(node)
        else:
            # insert the club with a new list containing the first entry
            output_dict[club] = [node]

    # that should be it, let's see ..

# iterate the dict(ionary)
for club in output_dict:
    # convert list of node(s) to a string by joining the elements with a comma
    nodestr = ','.join(output_dict[club])

    # create a formatted string with the club and its nodes
    clubstr = "{:20}\t{}".format(club, nodestr)

    # print to stdout (e.g. console)
    print( clubstr )

prints

['1\treal madrid,barcelona,chelsea,arsenal,cska', '2\tchelsea,arsenal,milan,napoli,juventus', '5\tbayern,dortmund,celtic,napoli', '7\tcska,psg,arsenal,atalanta', '9\tatletic bilbao,las palmas,milan,barcelona']
real madrid             1
barcelona               1,9
chelsea                 1,2
arsenal                 1,2,7
cska                    1,7
milan                   2,9
napoli                  2,5
juventus                2
bayern                  5
dortmund                5
celtic                  5
psg                     7
atalanta                7
atletic bilbao          9
las palmas              9
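
Since the question asks for a new file rather than console output, the same idea can write its result instead of printing it. The following is only a sketch: `build_index` and `write_index` are hypothetical helper names, `dict.setdefault` replaces the if/else above, and an in-memory `io.StringIO` buffer stands in for a real output file.

```python
import io


def build_index(lines):
    """Map each club to the list of nodes it appears under, in input order."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        node, attrs = line.split("\t")  # expects exactly one TAB per line
        for club in attrs.split(","):
            # setdefault inserts an empty list the first time a club is seen
            index.setdefault(club, []).append(node)
    return index


def write_index(index, handle):
    """Write one 'club<TAB>nodes' line per club to an open text handle."""
    for club, nodes in index.items():
        handle.write("{}\t{}\n".format(club, ",".join(nodes)))


sample = ["1\tchelsea,arsenal", "2\tarsenal,milan"]
buffer = io.StringIO()  # stands in for open("out_file.txt", "w")
write_index(build_index(sample), buffer)
print(buffer.getvalue(), end="")
```

Passing a handle from `open(..., "w")` instead of the buffer would write the same lines to disk; the explicit "\n" is what keeps each club on its own line.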


Here is a solution with pandas (why not)

import pandas as pd
# raw strings, so that "\t" in the path is not read as a tab character
path_file_input = r'path\to\input_file.txt'
path_file_output = r'path\to\output_file.txt'

# Read the data from a txt file (with a tab separating the columns)
data = pd.read_csv(path_file_input, sep ='\t', header=None, names=[ 'Nodes', 'List Teams'], dtype=str)
# Create a column with all couple team-node
data_split = data['List Teams'].str.split(',', expand=True).stack().reset_index(level=0)\
                .set_index('level_0').rename(columns={0:'Teams'}).join(data.drop(columns='List Teams'), how='left')
# Merge the data per team and join the nodes
data_merged = data_split.groupby('Teams')['Nodes'].apply(','.join).reset_index()

# Save as a txt file
data_merged.to_csv(path_file_output, sep='\t', index=False, header=False, float_format = str)
# or display the data
print (data_merged.to_csv(sep='\t', header=False, index=False))

See "normalizing data by duplication" for a really good explanation of the line starting with data_split.
