Python - Merging two strings that overlap

Question

I am looking into trying to create a full address, but the data I have comes in the form of:

Line 1                     | Line 2                   | Postcode
1, First Street, City, X13
1, First Street             First Street, City          X13 
1                           1, First Street, City, X13  X13

There are a few other permutations of how this data is created, but I want to be able to merge all this into one string where there is no overlap. So I want to create the string:
1, First Street, City, X13

But not 1, First Street, First Street, City, X13 etc.

How can I concatenate or merge these without duplicating data that is already there? Some rows, like the first one, have nothing beyond the first cell.

How do you decide what is a valid combination, or are you sure words won't repeat?
– Padraic Cunningham, Commented Dec 10, 2015 at 10:59

Kasravnd · Accepted Answer · 2015-12-10 10:56:12Z

2

If you have plain text, you can split it on \n to get the lines, then split each line on , to get the separate fields:

>>> s = """1, First Street, City, X13
... 1, First Street             First Street, City,          X13 
... 1                           1, First Street, City, X13  X13"""
>>> 
>>> lines = s.split('\n')
>>> 
>>> splitted_lines = [line.split(',') for line in lines]

A more Pythonic way is to read the text with the csv module, specifying the comma , as the delimiter.

import csv
with open('file_name', newline='') as f:
    # csv.reader is lazy, so materialize the rows before the file is closed
    splitted_lines = list(csv.reader(f, delimiter=','))

Then you can use the following list comprehension to get the unique fields in each column:

>>> import re
>>> ' '.join([set([set(re.split(r'\s{2,}',i)).pop() for i in column]).pop() for column in zip(*splitted_lines)])
'1  First Street  City'

Here zip() gives you the columns, and re.split() with the regex r'\s{2,}' splits each item on runs of two or more whitespace characters; set() then keeps only the unique items. Keep in mind that a set is unordered, so pop() returns an arbitrary element.

Note: If you care about the order, you can use collections.OrderedDict instead of set:

>>> from collections import OrderedDict
>>> 
>>> d = OrderedDict()
>>> # list(...)[0] takes the first key; keys() is not subscriptable in Python 3
>>> ' '.join([list(d.fromkeys([set(re.split(r'\s{2,}',i)).pop() for i in column]))[0] for column in zip(*splitted_lines)])
'1  First Street  City  X13'
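
The one-liners above can be hard to follow, so here is a minimal sketch of the same column-wise idea unpacked into steps. It is not the answer's original code: the function name first_piece_per_column is hypothetical, it strips surrounding whitespace from each piece, and it relies on dict.fromkeys preserving insertion order (Python 3.7+) in place of OrderedDict.

```python
import re

def first_piece_per_column(rows):
    """Return the first non-empty piece found in each comma-separated column."""
    result = []
    for column in zip(*rows):  # transpose the rows into columns
        pieces = []
        for cell in column:
            # split each cell on runs of two or more whitespace characters
            pieces.extend(p.strip() for p in re.split(r"\s{2,}", cell))
        # dict.fromkeys keeps first-seen order, unlike set
        unique = [p for p in dict.fromkeys(pieces) if p]
        result.append(unique[0] if unique else "")
    return result

text = """1, First Street, City, X13
1, First Street             First Street, City,          X13
1                           1, First Street, City, X13  X13"""

rows = [line.split(",") for line in text.split("\n")]
merged = ", ".join(first_piece_per_column(rows))
print(merged)
```

Like the original, this assumes every line splits into the same number of comma-separated columns, because zip() stops at the shortest row.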

edited Dec 10, 2015 at 10:56

answered Dec 10, 2015 at 10:33

Kasravnd


3 Comments

Abi Over a year ago

The contents are each in different cells of a pandas table, so I need a way to merge the contents of the cells without repeating words.

Kasravnd Over a year ago

@Abi You can read the table, put the rows in an iterable object like splitted_lines, and then pass it to the aforementioned list comprehension.

Kasravnd Over a year ago

@PadraicCunningham Yep, I added an OrderedDict approach too, and the missing X13 was because of an omitted delimiter.

Padraic Cunningham · Accepted Answer · 2015-12-10 10:56:44Z

2

If you don't mind losing punctuation:

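from collections import OrderedDict
from string import punctuation

with open("test.txt") as f:
    next(f)  # skip the header row
    # fromkeys keeps the first occurrence of each word, in order;
    # join with a space so the words stay separated
    print(" ".join(OrderedDict.fromkeys(word.strip(punctuation) for line in f
          for word in line.split())))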

1 First Street City X13

If words can legitimately repeat, you won't be able to use this approach. Based on your input, though, there is no way to know which combinations are possible, unless the second line is always intact, in which case you would just need to pull the second line.
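
One possible middle ground, sketched here as an assumption and not as part of the answer above, is to de-duplicate whole comma-separated fields per row instead of individual words. That keeps multi-word fields such as "First Street" intact, though it still breaks if a field legitimately appears twice in one address. The name merge_cells is hypothetical, and the sketch assumes every cell is a string (or empty/None).

```python
def merge_cells(cells, sep=","):
    """Join the cells of one row, keeping the first occurrence of each field."""
    fields = []
    for cell in cells:
        if not cell:  # skip empty or missing cells
            continue
        fields.extend(part.strip() for part in cell.split(sep))
    # dict.fromkeys preserves first-seen order (Python 3.7+)
    unique = [field for field in dict.fromkeys(fields) if field]
    return ", ".join(unique)

rows = [
    ("1, First Street, City, X13", "", ""),
    ("1, First Street", "First Street, City", "X13"),
    ("1", "1, First Street, City, X13", "X13"),
]
merged = [merge_cells(row) for row in rows]
for address in merged:
    print(address)
```

Each of the three sample rows from the question comes out as 1, First Street, City, X13. For data held in a table, the same function could be called once per row on that row's cell values.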

edited Dec 10, 2015 at 10:56

answered Dec 10, 2015 at 10:47

Padraic Cunningham
