1

I have a file with lines, each line is split on "|", I want to compare arguments 5 from each line and if intersect, then proceed. This gets to a second part:

  • first arguments1,2 are compared by dictionary, and if they are same AND
  • if arguments5,6 are overlapping, then those lines get concatenated.

How to compare intersection of values under the same key? The code below works cross-key but not within same key:

from functools import reduce 
reduce(set.intersection, (set(val) for val in query_dict.values()))

Here is an example of lines: text1|text2|text3|text4 text 5| text 6| text 7 text 8| text1|text2|text12|text4 text 5| text 6| text 7| text9|text10|text3|text4 text 5| text 11| text 12 text 8|

The output should be: text1|text2|text3;text12|text4;text5;text4;text5|text6;text7 text8;text6

In other words, only those lines that are matching by 1st,2nd arguments (cells equal) and if 5th,6th arguments are overlapping (intersection) are concatenated.

Here is input file:

Angela Darvill|19036321|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['GB','US']|['Salford', 'Eccles', 'Manchester']
Helen Stanley|19036320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
Angela Darvill|190323121|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['US']|['Brighton', 'Eccles', 'Manchester']
Helen Stanley|19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']

The output should look like:

Angela Darvill|19036321;190323121|...
Helen Stanley|19036320;19576876320|...

Angela Darvill gets stacked because two records share same name, same country and same city(-ies).

2
  • Can you provide an example of inputs and outputs ? And what do you mean by "intersect" for "arguments" ? Commented Jan 27, 2022 at 15:16
  • @Lenormju see updated Commented Jan 27, 2022 at 16:36

2 Answers 2

1

Based on your improved question :

import itertools


data = """\
Angela Darvill|19036321|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['GB','US']|['Salford', 'Eccles', 'Manchester']
Helen Stanley|19036320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
Angela Darvill|190323121|School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.|['US']|['Brighton', 'Eccles', 'Manchester']
Helen Stanley|19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']|['Brighton', 'Brighton']
"""

lines = tuple(tuple(line.split('|')) for line in data.splitlines())

results = []
for line_a_index, line_a in enumerate(lines):
    # we want to compare each line with each other, so we start at index+1
    for line_b_index, line_b in enumerate(lines[line_a_index+1:], start=line_a_index+1):
        assert len(line_a) >= 5, f"not enough cells ({len(line_a)}) in line {line_a_index}"
        assert len(line_b) >= 5, f"not enough cells ({len(line_b)}) in line {line_b_index}"
        assert all(isinstance(cell, str) for cell in line_a)
        assert all(isinstance(cell, str) for cell in line_b)

        columns0_are_equal = line_a[0] == line_b[0]
        columns1_are_equal = line_a[1] == line_b[1]
        columns3_are_overlap = set(line_a[3]).issubset(set(line_b[3])) or set(line_b[3]).issubset(set(line_a[3]))
        columns4_are_overlap = set(line_a[4]).issubset(set(line_b[4])) or set(line_b[4]).issubset(set(line_a[4]))
        print(f"between lines index={line_a_index} and index={line_b_index}, {columns0_are_equal=} {columns1_are_equal=} {columns3_are_overlap=} {columns4_are_overlap=}")
        if (
            columns0_are_equal
            # and columns1_are_equal
            and (columns3_are_overlap or columns4_are_overlap)
        ):
            print("MATCH!")
            results.append(
                (line_a_index, line_b_index,) + tuple(
                    ((cell_a or "") + (";" if (cell_a or cell_b) else "") + (cell_b or "")) if cell_a != cell_b
                    else cell_a
                    for cell_a, cell_b in itertools.zip_longest(line_a, line_b)
                )
            )

print("Fancy output :")
lines_to_display = set(itertools.chain.from_iterable((lines[result[0]], lines[result[1]], result[2:]) for result in results))
columns_widths = (max(len(str(index)) for result in results for index in (result[0], result[1])),) + tuple(
    max(len(cell) for cell in column)
    for column in zip(*lines_to_display)
)

for width in columns_widths:
    print("-" * width, end="|")
print("")

for result in results:
    for line_index, original_line in zip((result[0], result[1]), (lines[result[0]], lines[result[1]])):
        for column_index, cell in zip(itertools.count(), (str(line_index),) + original_line):
            if cell:
                print(cell.ljust(columns_widths[column_index]), end='|')
        print("", end='\n')  # explicit newline
    for column_index, cell in zip(itertools.count(), ("=",) + result[2:]):
        if cell:
            print(cell.ljust(columns_widths[column_index]), end='|')
    print("", end='\n')  # explicit newline

for width in columns_widths:
    print("-" * width, end="|")
print("")

expected_outputs = """\
Angela Darvill|19036321;190323121|...
Helen Stanley|19036320;19576876320|...
""".splitlines()

for result, expected_output in itertools.zip_longest(results, expected_outputs):
    actual_output = "|".join(result[2:])
    assert actual_output.startswith(expected_output[:-3])  # minus the "..."
-|--------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------|------------------|------------------------------------------------------------------------|
0|Angela Darvill|19036321            |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.                                                   |['GB','US']       |['Salford', 'Eccles', 'Manchester']                                     |
2|Angela Darvill|190323121           |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.                                                   |['US']            |['Brighton', 'Eccles', 'Manchester']                                    |
=|Angela Darvill|19036321;190323121  |School of Nursing, University of Salford, Peel House Eccles, Manchester M30 0NN, UK.                                                   |['GB','US'];['US']|['Salford', 'Eccles', 'Manchester'];['Brighton', 'Eccles', 'Manchester']|
1|Helen Stanley |19036320            |Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']            |['Brighton', 'Brighton']                                                |
3|Helen Stanley |19576876320         |Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']            |['Brighton', 'Brighton']                                                |
=|Helen Stanley |19036320;19576876320|Senior Lecturer, Institute of Nursing and Midwifery, University of Brighton, Westlain House, Village Way, Falmer, BN1 9PH Brighton, UK.|['US']            |['Brighton', 'Brighton']                                                |
-|--------------|--------------------|---------------------------------------------------------------------------------------------------------------------------------------|------------------|------------------------------------------------------------------------|

You can see that the lines index 0 and 2 have been merged, same for the lines index 1 and 3.

Sign up to request clarification or add additional context in comments.

Comments

1
from itertools import zip_longest


data = """\
text1|text2|text3|text4 text 5| text 6| text 7 text 8|
text1|text2|text12|text4 text 5| text 6| text 7| text9|text10|text3|text4 text 5| text 11| text 12 text 8|
"""

lines = tuple(line.split('|') for line in data.splitlines())
number_of_lines = len(lines)
print(f"number of lines : {number_of_lines}")
print(f"number of cells in line 1 : {len(lines[0])}")
print(f"number of cells in line 2 : {len(lines[1])}")
print(f"{lines[0]=}")
print(f"{lines[1]=}")

result = []

# we want to compare each line with each other :
for line_a_index, line_a in enumerate(lines):
    for line_b_index, line_b in enumerate(lines[line_a_index+1:]):
        assert len(line_a) >= 5, f"not enough cells ({len(line_a)}) in line {line_a_index}"
        assert len(line_b) >= 5, f"not enough cells ({len(line_b)}) in line {line_b_index}"
        assert all(isinstance(cell, str) for cell in line_a)
        assert all(isinstance(cell, str) for cell in line_b)

        if line_a[0] == line_b[0] and line_a[1] == line_b[1] and (
                line_a[5] in line_b[5] or line_a[6] in line_b[6]  # A in B
            or line_b[5] in line_a[5] or line_b[6] in line_a[6]  # B in A
        ):
            result.append(tuple(
                ((cell_a or "") + (";" if (cell_a or cell_b) else "") + (cell_b or "")) if cell_a != cell_b else cell_a
                for cell_a, cell_b in zip_longest(line_a[:5+1], line_b[:5+1])  # <-- here I truncated the lines
            ))

# I decided to have a fancy output, but I made some simplifying assumptions to make it simple
if len(result) > 1:
    raise NotImplementedError
widths = tuple(max(len(a) if a is not None else 0, len(b) if b is not None else 0, len(c) if c is not None else 0)
               for a, b, c in zip_longest(lines[0], lines[1], result[0]))
length = max(len(lines[0]), len(lines[1]), len(result[0]))
for line in (lines[0], lines[1], result[0]):
    for index, cell in zip_longest(range(length), line):
        if cell:
            print(cell.ljust(widths[index]), end='|')
    print("", end='\n')  # explicit newline

original_expected_output = "text1|text2|text3;text12|text4;text5;text4;text5|text6;text7 text8;text6"
print(f"{original_expected_output}         <-- expected")

lenormju_expected_output = "text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7"
print(f"{lenormju_expected_output}             <-- fixed")

output

number of lines : 2
number of cells in line 1 : 7
number of cells in line 2 : 13
lines[0]=['text1', 'text2', 'text3', 'text4 text 5', ' text 6', ' text 7 text 8', '']
lines[1]=['text1', 'text2', 'text12', 'text4 text 5', ' text 6', ' text 7', ' text9', 'text10', 'text3', 'text4 text 5', ' text 11', ' text 12 text 8', '']
text1|text2|text3       |text4 text 5| text 6| text 7 text 8        |
text1|text2|text12      |text4 text 5| text 6| text 7               | text9|text10|text3|text4 text 5| text 11| text 12 text 8|
text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7|
text1|text2|text3;text12|text4;text5;text4;text5|text6;text7 text8;text6         <-- expected
text1|text2|text3;text12|text4 text 5| text 6| text 7 text 8; text 7             <-- fixed

EDIT:

from dataclasses import dataclass
from itertools import zip_longest


data = """\
text1|text2|text3|text4 text 5| text 6| text 7 text 8|
text1|text2|text12|text4 text 5| text 6| text 7| text9|text10|text3|text4 text 5| text 11| text 12 text 8|
"""


@dataclass
class Match:  # a convenient way to store the solutions
    line_a_index: int
    line_b_index: int
    line_result: tuple


lines = tuple(line.split('|') for line in data.splitlines())

results = []
for line_a_index, line_a in enumerate(lines):
    for line_b_index, line_b in enumerate(lines[line_a_index+1:], line_a_index+1):
        assert len(line_a) >= 5, f"not enough cells ({len(line_a)}) in line {line_a_index}"
        assert len(line_b) >= 5, f"not enough cells ({len(line_b)}) in line {line_b_index}"
        assert all(isinstance(cell, str) for cell in line_a)
        assert all(isinstance(cell, str) for cell in line_b)

        if line_a[0] == line_b[0] and line_a[1] == line_b[1] and (
                line_a[5] in line_b[5] or line_a[6] in line_b[6]  # A in B
            or line_b[5] in line_a[5] or line_b[6] in line_a[6]  # B in A
        ):
            line_result = tuple(
                ((cell_a or "") + (";" if (cell_a or cell_b) else "") + (cell_b or "")) if cell_a != cell_b else cell_a
                for cell_a, cell_b in zip_longest(line_a[:5+1], line_b[:5+1])  # <-- here I truncated the lines
            )
            results.append(Match(line_a_index=line_a_index, line_b_index=line_b_index, line_result=line_result))

# simple output of the solution
for result in results:
    print(f"line n°{result.line_a_index} matches with n°{result.line_b_index} : {result.line_result}")
line n°0 matches with n°1 : ('text1', 'text2', 'text3;text12', 'text4 text 5', ' text 6', ' text 7 text 8; text 7')

7 Comments

thank you, it is indeed fancy. It gives 3 rows as output: 2 those that were compared and 1 concatenated. How/where can one change a code so it gives only row which gets concatenated OR in case if they are not concatenated just 2 of them. So it shouldn’t be 3 rows in any case - either 1 concatenated or if expression doesn’t hold true - each row separately. Also how do you start the code from file not from data = """ string"""? When I do from file, I see IndexError: list index out of range at for a, b, c in zip_longest(lines[0], lines[1], result[0])) position
also if I add more lines even in data = """ string""", it gives NotImplementedError at raise NotImplementedError. I am supposed to have many lines
You can remove the prints you don't want. You did not ask for a fancy output, but I thought it would be more clear that way to present the different solutions, so I made one specifically for your example case of 2 lines, and added a check to make sure it would not run on more. Just remove everything under the I decided to have a fancy output ... comment and it will run fine.
My goal was to show that the expected output you provided seems incorrect to me, it does not match the description of the problem you gave. So I wanted to hightlight visually the differences, so I used a fancy formatting. But if you don't care about that, I have edited my answer to add a simpler way to display ALL the solutions.
it works but outputs only 1 common line, so if I would have many common lines in a result, how would that be? I have def combinations functions from itertools.combinations but it is pretty slow function when you make pairs. I will have millions of lines to compare. Any more efficient solution?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.