1

I have a file with the below example lines:

(22642441022L, u'<a href="http://example.com">Click</a>', u'fox, dog, cat are examples http://example.com')
(1153634043, u'<a href="http://example.com">Click</a>', u"I learned so much from my mistakes, I think I'm gonna make some more")

I'm trying to parse it to a list of objects with this code:

import csv

file_path = 'Data/example.txt'
data = []

with open(file_path, 'r') as f:
    reader = csv.reader(f, skipinitialspace=True)
    for row in reader:
        data.append({'id' : row[0], 'source' : row[1], 'content' : row[2]})

As expected, the content is truncated due to the ',' in the content column. Is there any package that can help me parse this out of the box?

8
  • Does the file actually contain ( and u' ? Commented Dec 28, 2017 at 17:26
  • Does the file look exactly like this? Commented Dec 28, 2017 at 17:26
  • Yes, unfortunately. I don't know which language was used to write such a file, but it's a dataset I need to load Commented Dec 28, 2017 at 17:28
  • 1
    You can't parse this code with python3. Your numbers have the Long L suffix at the end. My guess is someone foolishly strd a list of tuples into a file using python2. Please kick them. Commented Dec 28, 2017 at 17:31
  • That looks like a pure python print of a list of tuples... Eval comes to mind, although is probably not such a great idea (stackoverflow.com/questions/1832940/…) Commented Dec 28, 2017 at 17:35

2 Answers 2

2

Looking at your data, someone has dumped the str version of a list into a file as-is, using python2.

One thing's for sure - you can't use a CSV reader for this data. You can't even use a JSON parser (which would've been the next best thing).

What you can do, is use ast.literal_eval. With python2, this works out of the box.

import ast

data = []
with open('file.txt') as f:
    for line in f:
        try:
            data.append(ast.literal_eval(line))
        except (SyntaxError, ValueError):
            pass

data should look something like this -

[(22642441022L,
  '<a href="http://example.com">Click</a>',
  'fox, dog, cat are examples http://example.com'),
 (1153634043,
  '<a href="http://example.com">Click</a>',
  "I learned so much from my mistakes, I think I'm gonna make some more")]

You can then pass data into a DataFrame as-is -

df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df

             A                                       B  \
0  22642441022  <a href="http://example.com">Click</a>   
1   1153634043  <a href="http://example.com">Click</a>   

                                                   C  
0      fox, dog, cat are examples http://example.com  
1  I learned so much from my mistakes, I think I'...  

If you want this to work with python3, you'll need to get rid of the long suffix L, and the unicode prefix u. You might be able to do this using re.sub from the re module.

import re
for line in f:
    try:
        i = re.sub('(\d+)L', r'\1', line)       # remove L suffix
        j = re.sub('(?<=,\s)u(?=\')', '', i)    # remove u prefix
        data.append(ast.literal_eval(j))
    except (SyntaxError, ValueError):
        pass

Notice the added re.sub('(\d+)L', r'\1', line), which removes the L suffix at the end of a string of digits.

Sign up to request clarification or add additional context in comments.

5 Comments

Thanks for you answer, but it didn't work. It throws syntax error at all lines. I'm using python 3.6 BTW.
@MokhtarAshour If you could show me one of the lines that are erroring out, that would help... because, it works for the data you posted ;-(
(22642586115L, 7248952, 1283282654000L, 0, -1, -1, None, -999999.0, -999999.0, u'<a href="example.com" rel="nofollow">Ping.fm</a>', 0, 0, u'CPPRI Recruitment 2010 at example.com', -1, u'', u'')
@MokhtarAshour Weird that ast sometimes parses unicode, and sometimes not. This would've worked on python2, but I'll need regex to get rid of the unicodes. Give me a few minutes.
@MokhtarAshour Edits made, try it now and let me know.
1

So it looks like the file was generated doing something like this (a pure dump of a Python str() or print):

data_list = [
    (22642441022L, u'<a href="http://example.com">Click</a>', u'fox, dog, cat are examples http://example.com'),
    (1153634043, u'<a href="http://example.com">Click</a>', u"I learned so much from my mistakes, I think I'm gonna make some more")
]  # List of tuples

with open('./stack_084.txt', 'w') as f:
    f.write('\n'.join([str(data) for data in data_list]))

Regular expressions come to mind (assuming that the values on your second "column") always start with <a and end with a>:

import pprint
import re

line_re = re.compile(
    r'\('
    r'(?P<num>\d+)L{0,1}.'
    r'+?'
    r'[\'\"](?P<source>\<a.+?a\>)[\"\']'
    r'.+?'
    r'[\'\"](?P<content>.+?)[\"\']'
    r'\)'
)

data = []
with open('./stack_084.txt', 'r') as f:
    for line in f:
        match = line_re.match(line)
        if match:
            data.append({
                'id': int(match.groupdict()['num']),
                'source': match.groupdict()['source'],
                'content': match.groupdict()['content']
            })

# You should see parsed data here:
print(pprint.pformat(data))

This outputs:

[{'content': 'fox, dog, cat are examples http://example.com',
  'id': 22642441022,
  'source': '<a href="http://example.com">Click</a>'},
 {'content': "I learned so much from my mistakes, I think I'm gonna make some "
             'more',
  'id': 1153634043,
  'source': '<a href="http://example.com">Click</a>'}]

4 Comments

I see you use Regex to handle it, but the actual lines in the file are longer (I included a subset). This is one line of the real dataset : (22642586115L, 7248952, 1283282654000L, 0, -1, -1, None, -999999.0, -999999.0, u'<a href="example.com/"; rel="nofollow">Ping.fm</a>', 0, 0, u'CPPRI Recruitment 2010 at example.com/';, -1, u'', u'') This way I will need carefully to write a REGEX that handles the whole line
Numbers are easy... Just more of the (\d+)L{0,1} groups... And those None... Those are suspicious. I imagine you have rows where the 7th value is not None? (I don't imagine that whatever generated your data is gonna have ALL None(s), right?)
Yes, some rows contain something like u'MyComfyCat'
I have accepted the first answer, still like your answer though. so I'm up voting

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.