Parse CSV file with commas in some columns Python

Question

I have a file with the below example lines:

(22642441022L, u'<a href="http://example.com">Click</a>', u'fox, dog, cat are examples http://example.com')
(1153634043, u'<a href="http://example.com">Click</a>', u"I learned so much from my mistakes, I think I'm gonna make some more")

I'm trying to parse it to a list of objects with this code:

import csv

file_path = 'Data/example.txt'
data = []

with open(file_path, 'r') as f:
    reader = csv.reader(f, skipinitialspace=True)
    for row in reader:
        data.append({'id' : row[0], 'source' : row[1], 'content' : row[2]})

As expected, the content is truncated due to the ',' in the content column. Is there any package that can help me parse this out of the box?

Yes, unfortunately. I don't know which language was used to write such a file, but it's a dataset I need to load.
– Mokhtar Ashour, Commented Dec 28, 2017 at 17:28
You can't parse this data with Python 3: your numbers have the long suffix L at the end. My guess is someone foolishly wrote the str() of a list of tuples into a file using Python 2. Please kick them.
– cs95, Commented Dec 28, 2017 at 17:31
That looks like a pure Python print of a list of tuples... eval comes to mind, although it is probably not such a great idea (stackoverflow.com/questions/1832940/…)
– Savir, Commented Dec 28, 2017 at 17:35

cs95 · Accepted Answer · 2017-12-28 18:16:36Z

2

Looking at your data, someone has dumped the str version of a list into a file as-is, using python2.

One thing's for sure - you can't use a CSV reader for this data. You can't even use a JSON parser (which would've been the next best thing).
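To see why the CSV reader cannot cope, here is a small sketch that feeds one line from the question to csv.reader. The io.StringIO wrapper is only a stand-in for the real file. The reader treats double quotes as quoting only at the start of a field, so the u'...' values are not protected and every comma splits the row into more than three fields.

```python
import csv
import io

# One line copied from the question; io.StringIO stands in for the real file.
line = ("(22642441022L, u'<a href=\"http://example.com\">Click</a>', "
        "u'fox, dog, cat are examples http://example.com')\n")

row = next(csv.reader(io.StringIO(line), skipinitialspace=True))

# Single-quoted Python literals mean nothing to the csv module, so the
# commas inside the last value are treated as field separators.
print(len(row))
for field in row:
    print(repr(field))
```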

What you can do, is use ast.literal_eval. With python2, this works out of the box.

import ast

data = []
with open('file.txt') as f:
    for line in f:
        try:
            data.append(ast.literal_eval(line))
        except (SyntaxError, ValueError):
            pass

data should look something like this -

[(22642441022L,
  '<a href="http://example.com">Click</a>',
  'fox, dog, cat are examples http://example.com'),
 (1153634043,
  '<a href="http://example.com">Click</a>',
  "I learned so much from my mistakes, I think I'm gonna make some more")]

You can then pass data into a DataFrame as-is -

import pandas as pd

df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df

             A                                       B  \
0  22642441022  <a href="http://example.com">Click</a>   
1   1153634043  <a href="http://example.com">Click</a>   

                                                   C  
0      fox, dog, cat are examples http://example.com  
1  I learned so much from my mistakes, I think I'...

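If you want this to work with Python 3, you need to get rid of the long suffix L. The unicode prefix u is accepted again as of Python 3.3, so stripping it is optional; the snippet below removes both. You can do this using re.sub from the re module.

import ast
import re

data = []
with open('file.txt') as f:
    for line in f:
        try:
            i = re.sub(r'\b(\d+)L\b', r'\1', line)      # remove L suffix
            j = re.sub(r'(?<=,\s)u(?=[\'"])', '', i)    # remove u prefix
            data.append(ast.literal_eval(j))
        except (SyntaxError, ValueError):
            pass  # skip lines that still fail to parse

Notice the added re.sub(r'\b(\d+)L\b', r'\1', line), which removes the L suffix at the end of a run of digits. Be aware that it would also alter such a pattern if it appeared inside a quoted string.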
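As a self-contained check under Python 3, the sketch below applies the same idea to the longer row posted in the comments. The helper name parse_py2_tuple and the LONG_SUFFIX pattern are illustrative, not part of any library. It assumes no quoted value contains digits immediately followed by L, since the substitution does not distinguish string contents from numbers.

```python
import ast
import re

# Digits followed by the Python 2 long suffix, e.g. 22642586115L.
# Caveat: this would also rewrite such a pattern inside a quoted string.
LONG_SUFFIX = re.compile(r'\b(\d+)L\b')


def parse_py2_tuple(line):
    """Parse one line holding a Python 2 tuple repr (hypothetical helper)."""
    return ast.literal_eval(LONG_SUFFIX.sub(r'\1', line.strip()))


# The longer row from the comments, split across string literals for width.
sample = ("(22642586115L, 7248952, 1283282654000L, 0, -1, -1, None, "
          "-999999.0, -999999.0, "
          "u'<a href=\"example.com\" rel=\"nofollow\">Ping.fm</a>', 0, 0, "
          "u'CPPRI Recruitment 2010 at example.com', -1, u'', u'')")

record = parse_py2_tuple(sample)
print(len(record), record[0], record[6], record[12])
```

Because ast.literal_eval understands None, negative numbers, floats and u-prefixed strings, no per-column handling is needed once the L suffix is gone.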

edited Dec 28, 2017 at 18:16

answered Dec 28, 2017 at 17:43

cs95


5 Comments

Mokhtar Ashour Over a year ago

Thanks for your answer, but it didn't work. It throws a syntax error on every line. I'm using Python 3.6, BTW.

cs95 Over a year ago

@MokhtarAshour If you could show me one of the lines that are erroring out, that would help... because, it works for the data you posted ;-(

Mokhtar Ashour Over a year ago

(22642586115L, 7248952, 1283282654000L, 0, -1, -1, None, -999999.0, -999999.0, u'<a href="example.com" rel="nofollow">Ping.fm</a>', 0, 0, u'CPPRI Recruitment 2010 at example.com', -1, u'', u'')

cs95 Over a year ago

@MokhtarAshour Weird that ast sometimes parses unicode, and sometimes not. This would've worked on python2, but I'll need regex to get rid of the unicodes. Give me a few minutes.

cs95 Over a year ago

@MokhtarAshour Edits made, try it now and let me know.

Savir · Answer · 2017-12-28 18:03:48Z

1

So it looks like the file was generated doing something like this (a pure dump of a Python str() or print):

data_list = [
    (22642441022L, u'<a href="http://example.com">Click</a>', u'fox, dog, cat are examples http://example.com'),
    (1153634043, u'<a href="http://example.com">Click</a>', u"I learned so much from my mistakes, I think I'm gonna make some more")
]  # List of tuples

with open('./stack_084.txt', 'w') as f:
    f.write('\n'.join([str(data) for data in data_list]))

Regular expressions come to mind (assuming that the values in your second "column" always start with <a and end with a>):

import pprint
import re

line_re = re.compile(
    r'\('
    r'(?P<num>\d+)L?'                        # leading id, optional L suffix
    r'.+?'
    r'[\'\"](?P<source>\<a.+?a\>)[\"\']'     # quoted <a ...>...</a> value
    r'.+?'
    r'[\'\"](?P<content>.+?)[\"\']'          # last quoted value on the line
    r'\)'
)

data = []
with open('./stack_084.txt', 'r') as f:
    for line in f:
        match = line_re.match(line)
        if match:
            data.append({
                'id': int(match.groupdict()['num']),
                'source': match.groupdict()['source'],
                'content': match.groupdict()['content']
            })

# You should see parsed data here:
print(pprint.pformat(data))

This outputs:

[{'content': 'fox, dog, cat are examples http://example.com',
  'id': 22642441022,
  'source': '<a href="http://example.com">Click</a>'},
 {'content': "I learned so much from my mistakes, I think I'm gonna make some "
             'more',
  'id': 1153634043,
  'source': '<a href="http://example.com">Click</a>'}]
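For contrast, if you control whatever generates the file, writing the rows with csv.writer avoids the problem entirely: values containing commas or double quotes are quoted on output, and csv.reader restores them. This is only a sketch under that assumption. The rows are the two samples from the question, and io.StringIO stands in for a real file opened with newline=''.

```python
import csv
import io

rows = [
    (22642441022, '<a href="http://example.com">Click</a>',
     'fox, dog, cat are examples http://example.com'),
    (1153634043, '<a href="http://example.com">Click</a>',
     "I learned so much from my mistakes, I think I'm gonna make some more"),
]

buffer = io.StringIO()              # stand-in for open(path, 'w', newline='')
csv.writer(buffer).writerows(rows)  # quotes fields with commas or quotes

buffer.seek(0)
parsed = [
    {'id': int(r[0]), 'source': r[1], 'content': r[2]}
    for r in csv.reader(buffer)
]
print(parsed[0]['content'])
```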

answered Dec 28, 2017 at 18:03

Savir


4 Comments

Mokhtar Ashour Over a year ago

I see you use regex to handle it, but the actual lines in the file are longer (I included a subset). This is one line of the real dataset: (22642586115L, 7248952, 1283282654000L, 0, -1, -1, None, -999999.0, -999999.0, u'<a href="example.com/" rel="nofollow">Ping.fm</a>', 0, 0, u'CPPRI Recruitment 2010 at example.com/', -1, u'', u''). This way I will need to carefully write a regex that handles the whole line.

Savir Over a year ago

Numbers are easy... Just more of the (\d+)L{0,1} groups... And those None... Those are suspicious. I imagine you have rows where the 7th value is not None? (I don't imagine that whatever generated your data is gonna have ALL None(s), right?)

Mokhtar Ashour Over a year ago

Yes, some rows contain something like u'MyComfyCat'

Mokhtar Ashour Over a year ago

I have accepted the first answer, but I still like your answer, so I'm upvoting.
