
This may be redundant, but after reading previous posts and answers I still have not gotten my code to work. I have a very large file containing multiple JSON objects that are not separated by any delimiter:

{"_index": "1234", "_type": "11", "_id": "1234", "_score": 0.0, "fields": {"c_u": ["url.com"], "tawgs.id": ["p6427"]}}{"_index": "1234", "_type": "11", "_id": "786fd4ad2415aa7b", "_score": 0.0, "fields": {"c_u": ["url2.com"], "tawgs.id": ["p12519"]}}{"_index": "1234", "_type": "11", "_id": "5826e7cbd92d951a", "_score": 0.0, "fields": {"tawgs.id": ["p8453", "p8458"]}}

I've read that this is what concatenated (streamed) JSON is supposed to look like, but I still can't open/parse the file to create a dataframe in Python.

I tried something along these lines:

import json

def iter_objects(s):
    d = json.JSONDecoder()
    i = 0
    while True:
        try:
            # raw_decode returns the object and the index just past it
            obj, i = d.raw_decode(s, i)
        except ValueError:
            return
        yield obj

but it didn't work.

I've also tried a basic:

with open('output.json','r') as f:
    data = json.load(f)

but was thrown the error:

JSONDecodeError: Extra data: line 1 column 184 (char 183) 

Trying json.loads() line by line with append didn't work either; data came back empty ([]):

data = []
with open('es-output.json', 'r') as f:
    for line in f:
        try:
            data.append(json.loads(line))
        except json.decoder.JSONDecodeError:
            pass # skip this line 

2 Answers


This will try to decode the JSON stream inside s iteratively:

s = '''{"_index": "1234", "_type": "11", "_id": "1234", "_score": 0.0, "fields": {"c_u": ["url.com"], "tawgs.id": ["p6427"]}}{"_index": "1234", "_type": "11", "_id": "786fd4ad2415aa7b", "_score": 0.0, "fields": {"c_u": ["url2.com"], "tawgs.id": ["p12519"]}}{"_index": "1234", "_type": "11", "_id": "5826e7cbd92d951a", "_score": 0.0, "fields": {"tawgs.id": ["p8453", "p8458"]}}'''

import json

d = json.JSONDecoder()

idx = 0
while idx < len(s):
    # raw_decode returns the decoded object and how many characters it consumed
    data, consumed = d.raw_decode(s[idx:])
    idx += consumed
    print(data)
    print('*' * 80)

Prints:

{'_index': '1234', '_type': '11', '_id': '1234', '_score': 0.0, 'fields': {'c_u': ['url.com'], 'tawgs.id': ['p6427']}}
********************************************************************************
{'_index': '1234', '_type': '11', '_id': '786fd4ad2415aa7b', '_score': 0.0, 'fields': {'c_u': ['url2.com'], 'tawgs.id': ['p12519']}}
********************************************************************************
{'_index': '1234', '_type': '11', '_id': '5826e7cbd92d951a', '_score': 0.0, 'fields': {'tawgs.id': ['p8453', 'p8458']}}
********************************************************************************
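If the objects live in a file rather than a string literal, a minimal sketch of the same idea (the file name data.json and the generator name iter_concatenated_json are assumptions, not from the original answer) wraps the raw_decode loop in a generator and collects the results:

import json

def iter_concatenated_json(text):
    """Yield each JSON object found back-to-back in text."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # skip any whitespace between objects before decoding the next one
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

# With a file: text = open('data.json', 'r').read()
# Here, a shortened two-object sample in the same shape as the question's data:
s = ('{"_index": "1234", "_id": "1234", "fields": {"c_u": ["url.com"]}}'
     '{"_index": "1234", "_id": "786fd4ad2415aa7b", "fields": {"c_u": ["url2.com"]}}')
records = list(iter_concatenated_json(s))

The resulting list of dicts can then be handed to pandas, e.g. pd.DataFrame(records) or pd.json_normalize(records) to flatten the nested fields key.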

5 Comments

so if my "s" value comes from a JSON file rather than a string, because I used json.dump() to write the file, how would I convert my file into a JSON string? When I try to use json.dumps() to write my file, I get back an empty set
@aesthetics Just load the content of the file with JSON objects inside s: s = open('your_file.txt', 'r').read()
@aesthetics I don't have any experience with elasticsearch, but you either already have some string containing JSON values or you will need to load that string from a file.
as a next step, and just for clarification, would there be a way to flatten the json if the type is a string to get it into a dataframe?
@aesthetics That's a question for Pandas/NumPy specialists, but I bet there are methods for loading data directly from JSON. You may open another question; these comments are not suitable for it.

The problem is in the data itself: it contains three values but no keys.

The first one is :

{"_index".... ["p6427"]}}

The second one is :

{"_index".... ["p12519"]}}

The third one is :

{"_index".... ["p8458"]}}

You should assign a key to each value, for example:

{
  "k1": {"_index": "1234", "_type": "11", "_id": "1234", "_score": 0.0, "fields": {"c_u": ["url.com"], "tawgs.id": ["p6427"]}},
  "k2": {"_index": "1234", "_type": "11", "_id": "786fd4ad2415aa7b", "_score": 0.0, "fields": {"c_u": ["url2.com"], "tawgs.id": ["p12519"]}},
  "k3": {"_index": "11_20190714_184325_01", "_type": "11", "_id": "5826e7cbd92d951a", "_score": 0.0, "fields": {"tawgs.id": ["p8453", "p8458"]}}
}

This way everything will parse correctly and the data will load.
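With keys added, the file becomes one valid top-level JSON object, so the plain json.load approach from the question succeeds. A minimal sketch of that, using the keyed structure inlined as a string rather than read from a file (the names keyed and rows are illustrative):

import json

keyed = '''{
  "k1": {"_index": "1234", "_id": "1234", "fields": {"tawgs.id": ["p6427"]}},
  "k2": {"_index": "1234", "_id": "786fd4ad2415aa7b", "fields": {"tawgs.id": ["p12519"]}}
}'''

data = json.loads(keyed)      # one top-level object, so this succeeds
rows = list(data.values())    # the inner objects, ready for further processing

With a real file this would be data = json.load(open('output.json')) instead of json.loads on a string.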

3 Comments

hmm, I am pulling my data down with elasticsearch-py, so I am not sure how to manipulate it and introduce keys? Also I'm very new to integrating Python and Elasticsearch :/
Try to identify some unique characteristics to create keys!
how am I able to create keys in the JSON if I am unable to initially read/open the file that contains the data?
