0

I have a tab seperated flatfile, one column of which is JSON data stored as a string, e.g.

Col1        Col2                    Col3
1491109818  2017-08-02 00:00:09.250 {"type":"Tipper"}
1491110071  2017-08-02 00:00:19.283 {"type":"HGV"}
1491110798  2017-08-02 00:00:39.283 {"type":"Tipper"}
1491110798  2017-08-02 00:00:39.283 \N
...

What I want to do is load the table as a pandas dataframe, and for col3 change the data to a string with just the information from the type key. Where there is no JSON or a JSON without a type key I want to return None.

e.g.

Col1        Col2                    Col3
1491109818  2017-08-02 00:00:09.250 Tipper
1491110071  2017-08-02 00:00:19.283 HGV
1491110798  2017-08-02 00:00:39.283 Tipper
1491110798  2017-08-02 00:00:39.283 None
...

The only way I can think to do this is with iterrows, however this is very slow when dealing with large files.

for index, row in df.iterrows():
    try:
        df.loc[index, 'Col3'] = json.loads(row['Col3'])['type']
    except:
        df.loc[index, 'Col3'] = None

Any suggestions on a quicker approach?

0

1 Answer 1

2

Using np.vectorize and json.loads

import json

def foo(x):
    try:
        return json.loads(x)['type']
    except (ValueError, KeyError):
        return None

v = np.vectorize(foo)
df.Col3 = v(df.Col3)

Note that it is never recommended to use a bare except, as you can inadvertently catch and drop errors you didn't mean to.


df

         Col1                     Col2    Col3
0  1491109818  2017-08-02 00:00:09.250  Tipper
1  1491110071  2017-08-02 00:00:19.283     HGV
2  1491110798  2017-08-02 00:00:39.283  Tipper
3  1491110798  2017-08-02 00:00:39.283    None
Sign up to request clarification or add additional context in comments.

4 Comments

Doesn't vectorize just hide the for loop?
@James It hides, and it speeds it up. You need to see this: stackoverflow.com/a/46163829/4909087 Also, the try-except approach is fast -- (faster than if conditions, EAFP and so on), so I'd recommend sticking with it.
Interesting, does the speed up come from broadcasting to the underlying numpy array?
@James I'm not sure broadcasting is applicable here, but I definitely know there's some amount of parallelisation going on.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.