7

I am looking into trying to create a full address, but the data I have comes in the form of:

Line 1                     | Line 2                     | Postcode
1, First Street, City, X13 |                            |
1, First Street            | First Street, City         | X13
1                          | 1, First Street, City, X13 | X13

There are a few other permutations of how this data is created, but I want to be able to merge all this into one string where there is no overlap. So I want to create the string:
1, First Street, City, X13

But not 1, First Street, First Street, City, X13 etc.

How can I concat or merge these without duplicating data already there? There are also some cells like on the top line where there is no information past the first cell.

    how do you decide what is a valid combination or are you sure words won't repeat? Commented Dec 10, 2015 at 10:59

2 Answers

2

If you have plain text, you can split it on \n to get the lines, then split each line on , to get the separate fields:

>>> s = """1, First Street, City, X13
... 1, First Street             First Street, City,          X13 
... 1                           1, First Street, City, X13  X13"""
>>> 
>>> lines = s.split('\n')
>>> 
>>> splitted_lines = [line.split(',') for line in lines]

Note that a more Pythonic way is to use the csv module to read your text, specifying the comma , as the delimiter.

import csv
with open('file_name') as f:
    # csv.reader is lazy, so read all rows into a list before the file closes
    splitted_lines = list(csv.reader(f, delimiter=','))

Then you can use the following list comprehension to get the unique fields in each column:

>>> import re
>>> ' '.join([set([set(re.split(r'\s{2,}',i)).pop() for i in column]).pop() for column in zip(*splitted_lines)])
'1  First Street  City'

Note that here zip() gives you the columns, and re.split() with the regex r'\s{2,}' splits each item on runs of two or more whitespace characters. You can then use set() to keep only the unique items.
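The same column-wise idea can be written out step by step. This is only a sketch, not the one-liner above: instead of set().pop(), whose result depends on set ordering, it takes the first non-empty piece in each column. The helper name first_piece is made up for illustration, and the rows are the sample data after splitting on commas.

```python
import re

# Sample rows, already split on commas (same data as the example above)
splitted_lines = [
    ["1", " First Street", " City", " X13"],
    ["1", " First Street             First Street", " City", "          X13 "],
    ["1                           1", " First Street", " City", " X13  X13"],
]

def first_piece(column):
    """Return the first non-empty piece in a column (hypothetical helper)."""
    for cell in column:
        # Split on runs of two or more whitespace characters
        for piece in re.split(r"\s{2,}", cell.strip()):
            if piece:
                return piece
    return ""

# zip(*rows) transposes the rows into columns
merged = ", ".join(first_piece(column) for column in zip(*splitted_lines))
print(merged)
```

This assumes every row has the same number of comma-separated fields, since zip() stops at the shortest row.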

Note: If you care about the order, you can use collections.OrderedDict instead of set:

>>> from collections import OrderedDict
>>> 
>>> d = OrderedDict()
>>> ' '.join([list(d.fromkeys([set(re.split(r'\s{2,}', i)).pop() for i in column]))[0] for column in zip(*splitted_lines)])
'1  First Street  City  X13'
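As a small illustration of why OrderedDict helps here, fromkeys() drops repeated keys while keeping the first occurrence of each in its original position. The field list below is invented for the example.

```python
from collections import OrderedDict

# Illustrative fields with repeats (not taken from a real file)
fields = ["1", "First Street", "First Street", "City", "X13", "X13"]

# Keys are unique, and insertion order is preserved
unique_in_order = list(OrderedDict.fromkeys(fields))
print(unique_in_order)
```

On Python 3.7 and later a plain dict preserves insertion order as well, so dict.fromkeys(fields) should behave the same way.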

3 Comments

Each of the contents are in different cells in a pandas table. So I need a way to merge the contents of the cells without repeating words.
@Abi You can read the table and put the rows in an iterable object like splitted_lines then put it in aforementioned list comprehension.
@PadraicCunningham Yep, I added an OrderedDict approach too, and the missing X13 was because of an omitted delimiter.
2

If you don't mind losing punctuation:

from collections import OrderedDict
from string import punctuation

with open("test.txt") as f:
    next(f)  # skip the header row
    # Strip punctuation from each word and keep only its first occurrence, in order
    print(" ".join(OrderedDict.fromkeys(word.strip(punctuation) for line in f
                                        for word in line.split())))

1 First Street City X13

If your data contains legitimately repeated words, you won't be able to use this approach. Based on your input, though, there is no way to know which combinations are possible, unless the second line is always intact, in which case you would just need to pull the second line.
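If losing the commas is a problem, one possible variation is to deduplicate whole comma-separated fields rather than single words. The sketch below is an assumption-laden example, not part of the answer above: merge_cells is a made-up helper, cells are assumed to be strings (or None for an empty cell), and the same caveat applies if a field can legitimately repeat.

```python
def merge_cells(*cells):
    """Join comma-separated fields from several cells, skipping repeats.

    Hypothetical helper: assumes each cell is a string or None.
    """
    seen = {}  # a plain dict keeps insertion order on Python 3.7+
    for cell in cells:
        for field in (cell or "").split(","):
            field = field.strip()
            if field:
                seen.setdefault(field, None)
    return ", ".join(seen)

# The three sample rows from the question: (Line 1, Line 2, Postcode)
rows = [
    ("1, First Street, City, X13", "", ""),
    ("1, First Street", "First Street, City", "X13"),
    ("1", "1, First Street, City, X13", "X13"),
]
for row in rows:
    print(merge_cells(*row))
```

Each of the three sample rows comes out as 1, First Street, City, X13, because fields are compared only after stripping surrounding whitespace.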
