I have multiple documents totalling approximately 400 GB, and I want to convert them to JSON so I can load them into Elasticsearch for analysis.

Each file is approximately 200 MB.

The original files look like this:

IUGJHHGF@BERLIN:lhfrjy
0t7yfudf@WARSAW:qweokm246
0t7yfudf@CRACOW:Er747474
0t7yfudf@cracow:kui666666
000t7yf@Vienna:1йй2ц2й2цй2цц3у

The data contains non-English characters. key1 is always followed by @, and the city is followed by either ; or :

I parsed it with this code:

#!/usr/bin/env python
# coding: utf8
import json


with open('2') as f:
   for line in f:
      line = line.rstrip("\r\n")  # drop the line ending before slicing
      s1 = line.find("@")
      rest = line[s1+1:]
      if rest.find(";") != -1:
         if rest.find(":") != -1:
            print("FOUND BOTH : ; ")
            continue  # ambiguous separator, skip this line
         else:
            s2 = s1+1+rest.find(";")
      elif rest.find(":") != -1:
         s2 = s1+1+rest.find(":")
      else:
         print("FOUND NO : ; ")
         continue  # no separator after the city, skip this line

      key1 = line[:s1]
      city = line[s1+1:s2]
      description = line[s2+1:]

After parsing, the whole file looks like this:

RRS12345 Cracow Sunflowers
RRD12345 Berin Data

From that parsed data, I want to get this output:

{
   "location_data":[
      {
         "key1":"RRS12345",
         "city":"Cracow",
         "description":"Sunflowers"
      },
      {
         "key1":"RRD123dsd45",
         "city":"Berlin",
         "description":"Data"
      },
      {
         "key1":"RRD123dsds45",
         "city":"Berlin",
         "description":"1йй2ц2й2цй2цц3у"
      }
   ]
}

How can I convert it to the required JSON format quickly, given that the data contains non-English characters?

  • Can you show what you tried and describe how exactly it failed? Commented May 23, 2018 at 12:10
  • Do you need to use Python in particular, or would a faster non-Python solution do? Commented May 23, 2018 at 12:10
  • Do any of the cities have spaces in their names? Or spaces in the descriptions? If so, what does that look like? Commented May 23, 2018 at 12:29
  • No spaces in the names exist. The language does not matter. Commented May 23, 2018 at 12:42
  • I could theoretically print at the end of the script that I have written and force that JSON syntax manually, but that is just such a dumb solution. Commented May 23, 2018 at 12:43

2 Answers

3
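import json


def process_text_to_json():
    location_data = []
    # Read as UTF-8 so non-English characters are decoded correctly
    with open("file.txt", encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            location_data.append({"key1": fields[0], "city": fields[1], "description": fields[2]})

    location_data = {"location_data": location_data}
    # ensure_ascii=False keeps non-English characters readable instead of \uXXXX escapes
    return json.dumps(location_data, ensure_ascii=False)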

Output sample:

{"location_data": [{"city": "Cracow", "key1": "RRS12345", "description": "Sunflowers"}, {"city": "Berin", "key1": "RRD12345", "description": "Data"}, {"city": "Cracow2", "key1": "RRS12346", "description": "Sunflowers"}, {"city": "Berin2", "key1": "RRD12346", "description": "Data"}, {"city": "Cracow3", "key1": "RRS12346", "description": "Sunflowers"}, {"city": "Berin3", "key1": "RRD12346", "description": "Data"}]}
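
On the non-English characters: by default json.dumps escapes anything outside ASCII as \uXXXX sequences, which is still valid JSON and decodes back to the same text. Passing ensure_ascii=False keeps the characters readable in the output. A small sketch, using a made-up record built from the sample line in the question:

```python
import json

# Illustrative record; the values are taken from the question's sample line
record = {"key1": "000t7yf", "city": "Vienna", "description": "1йй2ц2й2цй2цц3у"}

# Default behaviour: non-ASCII characters become \uXXXX escapes
escaped = json.dumps(record)

# ensure_ascii=False: characters are written as-is (write the file as UTF-8)
readable = json.dumps(record, ensure_ascii=False)

# Both strings decode back to the same dictionary
assert json.loads(escaped) == json.loads(readable) == record
```

Either form should be acceptable to a JSON parser; the second is just easier to read and somewhat smaller on disk.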

0

Iterate over each line and form your dict.

Example:

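import json

filename = "file.txt"  # path to the parsed input file
d = {"location_data": []}
with open(filename, "r", encoding="utf-8") as infile:
    for line in infile:
        val = line.split()
        d["location_data"].append({"key1": val[0], "city": val[1], "description": val[2]})

# Serialize as JSON; print(d) alone would show a Python dict repr, not JSON
print(json.dumps(d, ensure_ascii=False))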
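
If you would rather go straight from the original key1@city:description lines to JSON, something like the following might work. It is only a sketch under a few assumptions: parse_line and convert are hypothetical helper names, the input is UTF-8, the line is split at the first ; or : after the @ (so later ; or : characters stay in the description, unlike the question's script, which rejects such lines), and the output is one JSON object per line rather than a single wrapped object, so that nothing large is held in memory.

```python
import json
import re

# key1 '@' city (';' or ':') description, split at the first ';' or ':' after '@'
LINE_RE = re.compile(r"^([^@]*)@([^;:]*)[;:](.*)$")


def parse_line(line):
    """Return a record dict, or None if the line does not match the pattern."""
    match = LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    key1, city, description = match.groups()
    return {"key1": key1, "city": city, "description": description}


def convert(src_path, dst_path):
    """Stream src_path into dst_path, one JSON object per line.

    Returns a (written, skipped) tuple of line counts.
    """
    written = skipped = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = parse_line(line)
            if record is None:
                skipped += 1  # malformed line: no '@' or no ';' / ':' after it
                continue
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written, skipped
```

If you need the wrapped {"location_data": [...]} form instead, you could collect the dicts returned by parse_line in a list, as in the snippet above, and dump that once per file.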
