Parsing a column of JSON strings

Question

I have a tab-separated flat file, one column of which is JSON data stored as a string, e.g.

Col1        Col2                    Col3
1491109818  2017-08-02 00:00:09.250 {"type":"Tipper"}
1491110071  2017-08-02 00:00:19.283 {"type":"HGV"}
1491110798  2017-08-02 00:00:39.283 {"type":"Tipper"}
1491110798  2017-08-02 00:00:39.283 \N
...

What I want to do is load the table as a pandas DataFrame and, for Col3, replace each value with the string stored under the type key. Where there is no JSON, or the JSON has no type key, I want to return None.

e.g.

Col1        Col2                    Col3
1491109818  2017-08-02 00:00:09.250 Tipper
1491110071  2017-08-02 00:00:19.283 HGV
1491110798  2017-08-02 00:00:39.283 Tipper
1491110798  2017-08-02 00:00:39.283 None
...

The only way I can think of to do this is with iterrows; however, this is very slow when dealing with large files.

for index, row in df.iterrows():
    try:
        df.loc[index, 'Col3'] = json.loads(row['Col3'])['type']
    except:
        df.loc[index, 'Col3'] = None

Any suggestions on a quicker approach?

cs95 · Accepted Answer · 2017-09-12 11:47:10Z


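Using np.vectorize and json.loads

import json
import numpy as np

def foo(x):
    try:
        return json.loads(x)['type']
    # ValueError: not valid JSON (e.g. \N); KeyError: no 'type' key;
    # TypeError: x is not a string (e.g. NaN) or the JSON is not an object
    except (ValueError, KeyError, TypeError):
        return None

# otypes=[object] keeps None as None; without it the output dtype is inferred
# from the first result, and if that is a str, None becomes the string 'None'
v = np.vectorize(foo, otypes=[object])
df.Col3 = v(df.Col3)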

Note that a bare except is not recommended, as it can inadvertently catch and silently drop errors you didn't mean to handle.
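To make the point about a bare except concrete, here is a small sketch. The function names, the sample rows, and the deliberate typo (`jsn`) are all made up for the demonstration; they are not part of the question's data or code.

```python
import json

def parse_bare(x):
    try:
        # Deliberate typo: 'jsn' is undefined, so this line raises NameError
        return jsn.loads(x)['type']
    except:
        # The bare except swallows the NameError along with everything else
        return None

def parse_narrow(x):
    try:
        return json.loads(x)['type']
    except (ValueError, KeyError, TypeError):
        # Only the failures we expect from bad or missing JSON are handled
        return None

rows = ['{"type":"Tipper"}', '{"colour":"red"}', '\\N']

bare_result = [parse_bare(r) for r in rows]      # the bug is hidden: every row is None
narrow_result = [parse_narrow(r) for r in rows]  # only the rows without a type are None
print(bare_result)
print(narrow_result)
```

With the narrow version, the same typo would surface immediately as a NameError instead of quietly producing a column full of None.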

df

         Col1                     Col2    Col3
0  1491109818  2017-08-02 00:00:09.250  Tipper
1  1491110071  2017-08-02 00:00:19.283     HGV
2  1491110798  2017-08-02 00:00:39.283  Tipper
3  1491110798  2017-08-02 00:00:39.283    None
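For comparison, here is a self-contained sketch of the same extraction using Series.map instead of np.vectorize. The helper name `extract_type` and the in-memory copy of the sample rows are assumptions for illustration, and `quoting=csv.QUOTE_NONE` is one way (not the only one) to stop the parser from treating the double quotes in the JSON as field quoting. The NumPy documentation describes np.vectorize as provided primarily for convenience, not performance, with an implementation that is essentially a for loop, so how the two approaches compare on a large file is worth measuring on your own data.

```python
import csv
import io
import json

import pandas as pd

# Sample rows from the question, tab-separated (Col3 holds JSON or \N)
raw = (
    'Col1\tCol2\tCol3\n'
    '1491109818\t2017-08-02 00:00:09.250\t{"type":"Tipper"}\n'
    '1491110071\t2017-08-02 00:00:19.283\t{"type":"HGV"}\n'
    '1491110798\t2017-08-02 00:00:39.283\t{"type":"Tipper"}\n'
    '1491110798\t2017-08-02 00:00:39.283\t\\N\n'
)

# QUOTE_NONE leaves the double quotes inside the JSON untouched
df = pd.read_csv(io.StringIO(raw), sep='\t', quoting=csv.QUOTE_NONE)

def extract_type(x):
    """Return the value under 'type', or None if x is not JSON with that key."""
    try:
        return json.loads(x)['type']
    except (ValueError, KeyError, TypeError):
        return None

# Apply the helper to every value in the column
df['Col3'] = df['Col3'].map(extract_type)
print(df)
```

Depending on the pandas version, the missing entry in Col3 may be displayed as None or as NaN; `pd.isna` treats both as missing.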

edited Sep 12, 2017 at 11:47

answered Sep 12, 2017 at 11:40


4 Comments

James Over a year ago

Doesn't vectorize just hide the for loop?

cs95 Over a year ago

@James It hides, and it speeds it up. You need to see this: stackoverflow.com/a/46163829/4909087 Also, the try-except approach is fast -- (faster than if conditions, EAFP and so on), so I'd recommend sticking with it.

James Over a year ago

Interesting, does the speed up come from broadcasting to the underlying numpy array?

cs95 Over a year ago

@James I'm not sure broadcasting is applicable here, but I definitely know there's some amount of parallelisation going on.
