
I have a very large chunk of data that looks like this:

#Software: SGOS 5.4.3.7
#Version: 1.0
#Start-Date: 2011-07-22 20:34:51
#Date: 2011-06-02 10:20:47
#Fields: date time time-taken c-ip cs-username cs-auth-group x-exception-id sc-filter-result cs-categories cs(Referer) sc-status s-action cs-method rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-virus-id
#Remark: 2610140037 "SG-42" "82.137.200.42" "main"
2011-07-22 20:34:51 282 ce6de14af68ce198 - - - OBSERVED "unavailable" http://www.surfjunky.com/members/sj-a.php?r=44864  200 TCP_NC_MISS GET text/html http www.surfjunky.com 80 /members/sj-a.php ?r=66556 php "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.65 Safari/534.24" 82.137.200.42 1395 663 -
2011-07-22 20:34:51 216 6154d919f8d56690 - - - OBSERVED "unavailable" http://x31.iloveim.com/build_3.9.2.1/comet.html  200 TCP_NC_MISS GET text/html;charset=UTF-8 http x31.iloveim.com 80 /servlets/events ?1122064400327 - "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18" 82.137.200.42 473 1129 -
2011-07-22 20:34:51 307 6d98469a3f1de6f4 - - - OBSERVED "unavailable" http://www.xnxx.com/  200 TCP_MISS GET image/jpeg http img100.xvideos.com 80 /videos/thumbsl/2/e/5/2e5fd679f1118757314fb9a94c0f626c.25.jpg - jpg "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C)" 82.137.200.42 16188 415 -

I have about 10-15 files ranging from 700 MB to 20 GB. I need to convert them to JSON so I can use them to test some JSON analytics software. I have the start of something, but I am getting errors and have some questions.

This is what I have for a conversion script:

#!/usr/bin/python
import csv

jsfile = file('sg_main__test.json', 'w')
jsfile.write('[\r\n')

with open('sg_main__test.log', 'r') as f:
    next(f)
    reader = csv.reader(f, delimiter='\t')

    row_count = len(list(reader))
    ite = 0

    f.seek(0)
    next(f)

    for date, time, time_taken, c_ip, cs_username, cs_auth_group, x_exception_id, sc_filter_result, cs_categories, cs_Referer, sc_status, s_action, cs_method, rs_Content_Type, cs_uri_scheme, cs_host, cs_uri_port, cs_uri_path, cs_uri_query, cs_uri_extension, cs_User_Agent, s_ip, sc_bytes, cs_bytes, x_virus_id in reader:

        ite += 1

        jsfile.write('\t{\r\n')

        d = '\t\t"date": "' + date + '",\r\n'
        t = '\t\t"time": "' + time + '",\r\n'
        tt = '\t\t"time_taken": "' + time_taken + '",\r\n'
        ci = '\t\t"c_ip": "' + c_ip + '",\r\n'
        c = '\t\t"cs_username": "' + cs_username + '",\r\n'
        ca = '\t\t"cs_auth_group": "' + cs_auth_group + '",\r\n'
        xe = '\t\t"x_exception_id": "' + x_exception_id + '",\r\n'
        sf = '\t\t"sc_filter_result": "' + sc_filter_result + '",\r\n'
        cc = '\t\t"cs_categories": "' + cs_categories + '",\r\n'
        cr = '\t\t"cs_Referer": "' + cs_Referer + '",\r\n'
        ss = '\t\t"sc_status": "' + sc_status + '",\r\n'
        sa = '\t\t"s_action": "' + s_action + '",\r\n'
        cm = '\t\t"cs_method": "' + cs_method + '",\r\n'
        rc = '\t\t"rs_Content_Type": "' + rs_Content_Type + '",\r\n'
        cu = '\t\t"cs_uri_scheme": "' + cs_uri_scheme + '",\r\n'
        ch = '\t\t"cs_host": "' + cs_host + '",\r\n'
        cp = '\t\t"cs_uri_port": "' + cs_uri_port + '",\r\n'
        cpa = '\t\t"cs_uri_path": "' + cs_uri_path + '",\r\n'
        cq = '\t\t"cs_uri_query": "' + cs_uri_query + '",\r\n'
        ce = '\t\t"cs_uri_extension": "' + cs_uri_extension + '",\r\n'
        cua = '\t\t"cs_User_Agent": "' + cs_User_Agent + '",\r\n'
        si = '\t\t"s_ip": "' + s_ip + '",\r\n'
        sb = '\t\t"sc_bytes": "' + sc_bytes + '",\r\n'
        cb = '\t\t"cs_bytes": "' + cs_bytes + '",\r\n'
        xv = '\t\t"x_virus_id": "' + x_virus_id + '",\r\n'

        jsfile.write(d)
        jsfile.write(t)
        jsfile.write(tt)
        jsfile.write(ci)
        jsfile.write(c)
        jsfile.write(ca)
        jsfile.write(xe)
        jsfile.write(sf)
        jsfile.write(cc)
        jsfile.write(cr)
        jsfile.write(ss)
        jsfile.write(sa)
        jsfile.write(cm)
        jsfile.write(rc)
        jsfile.write(cu)
        jsfile.write(ch)
        jsfile.write(cp)
        jsfile.write(cpa)
        jsfile.write(cq)
        jsfile.write(ce)
        jsfile.write(cua)
        jsfile.write(si)
        jsfile.write(sb)
        jsfile.write(cb)
        jsfile.write(xv)

        jsfile.write('\t}')

        if ite < row_count:
            jsfile.write('\r\n')

        jsfile.write('\r\n')

jsfile.write(']')
jsfile.close()

Error returned on execution -

    for date,time,time_taken,c_ip,cs_username,cs_auth_group,x_exception_id,sc_filter_result,cs_categories,cs_Referer,sc_status,s_action,cs_method,rs_Content_Type,cs_uri_scheme,cs_host,cs_uri_port,cs_uri_path,cs_uri_query,cs_uri_extension,cs_User_Agent,s_ip,sc_bytes,cs_bytes,x_virus_id in reader:
ValueError: need more than 1 value to unpack

Questions - Is there something in the context I'm missing as to why it is returning the "need more than 1 value to unpack" error?

Is there a way I can have it read a directory and convert all the files in that directory without having to define the input file name?

Following on from the question above, can I have it save the exported file under the original filename but with a .json file extension, without having to manually define the output filename?
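For context, the kind of directory handling I'm after could be sketched like this (the `*.log` pattern and the current-directory default are just placeholders):

```python
import pathlib

def json_name(log_path):
    """Swap the input file's extension for .json."""
    return pathlib.Path(log_path).with_suffix('.json')

def find_logs(directory):
    """Every *.log file in the directory, non-recursive."""
    return sorted(pathlib.Path(directory).glob('*.log'))

for src in find_logs('.'):
    dst = json_name(src)  # e.g. sg_main__test.log -> sg_main__test.json
    # ... run the conversion from src to dst here ...
```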

  • why not the tablib package? It will load and export your data in a few lines of code Commented Feb 12, 2018 at 17:36
  • @sKwa - I didn't use this because I didn't know it existed. Will give it a spin. Thanks. Commented Feb 12, 2018 at 18:12
  • csv.reader defaults to fields separated by commas. If you have no commas, you get one big field. Commented Feb 12, 2018 at 18:21
  • are you sure that it's TSV, i.e. delimited by tab ('\t')? Commented Feb 12, 2018 at 18:55
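The delimiter point raised in the comments can be reproduced in a couple of lines (using a shortened sample line):

```python
import csv
import io

# With csv.reader's default comma delimiter, a space-separated log line
# comes back as a single field, so unpacking it into 25 names fails.
line = '2011-07-22 20:34:51 282'
assert next(csv.reader(io.StringIO(line))) == ['2011-07-22 20:34:51 282']
assert next(csv.reader(io.StringIO(line), delimiter=' ')) == \
    ['2011-07-22', '20:34:51', '282']
```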

2 Answers


Your sample data isn't in TSV format; it isn't consistently delimited at all.

#!/usr/bin/env python

import csv
import json
from collections import OrderedDict

src = '/tmp/data.log'
dst = '/tmp/data.json'
header = [
    'date', 'time', 'time_taken', 'c_ip', 'cs_username', 'cs_auth_group',
    'x_exception_id', 'sc_filter_result', 'cs_categories', 'cs_Referer',
    'sc_status', 's_action', 'cs_method', 'rs_Content_Type', 'cs_uri_scheme',
    'cs_host', 'cs_uri_port', 'cs_uri_path', 'cs_uri_query',
    'cs_uri_extension', 'cs_User_Agent', 's_ip', 'sc_bytes', 'cs_bytes',
    'x_virus_id'
]

data = []
with open(src, 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    for row in reader:
        if not row or row[0].lstrip().startswith('#'):
            continue  # skip blank lines and the #-prefixed header lines
        row = [field for field in row if field]  # drop empties from repeated spaces
        data.append(OrderedDict(zip(header, row)))

with open(dst, 'w') as jsonfile:
    json.dump(data, jsonfile, indent=2)

Which gives me this output:

$ cat data.json
[
  {
    "date": "2011-07-22", 
    "time": "20:34:51", 
    "time_taken": "282", 
    "c_ip": "ce6de14af68ce198", 
    "cs_username": "-", 
    "cs_auth_group": "-", 
    "x_exception_id": "-", 
    "sc_filter_result": "OBSERVED", 
    "cs_categories": "unavailable", 
    "cs_Referer": "http://www.surfjunky.com/members/sj-a.php?r=44864", 
    "sc_status": "200", 
    "s_action": "TCP_NC_MISS", 
    "cs_method": "GET", 
    "rs_Content_Type": "text/html", 
    "cs_uri_scheme": "http", 
    "cs_host": "www.surfjunky.com", 
    "cs_uri_port": "80", 
    "cs_uri_path": "/members/sj-a.php", 
    "cs_uri_query": "?r=66556", 
    "cs_uri_extension": "php", 
    "cs_User_Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.65 Safari/534.24", 
    "s_ip": "82.137.200.42", 
    "sc_bytes": "1395", 
    "cs_bytes": "663", 
    "x_virus_id": "-"
  }, 
  {
    "date": "2011-07-22", 
    "time": "20:34:51", 
    "time_taken": "216", 
    "c_ip": "6154d919f8d56690", 
    "cs_username": "-", 
    "cs_auth_group": "-", 
    "x_exception_id": "-", 
    "sc_filter_result": "OBSERVED", 
    "cs_categories": "unavailable", 
    "cs_Referer": "http://x31.iloveim.com/build_3.9.2.1/comet.html", 
    "sc_status": "200", 
    "s_action": "TCP_NC_MISS", 
    "cs_method": "GET", 
    "rs_Content_Type": "text/html;charset=UTF-8", 
    "cs_uri_scheme": "http", 
    "cs_host": "x31.iloveim.com", 
    "cs_uri_port": "80", 
    "cs_uri_path": "/servlets/events", 
    "cs_uri_query": "?1122064400327", 
    "cs_uri_extension": "-", 
    "cs_User_Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18", 
    "s_ip": "82.137.200.42", 
    "sc_bytes": "473", 
    "cs_bytes": "1129", 
    "x_virus_id": "-"
  }, 
  {
    "date": "2011-07-22", 
    "time": "20:34:51", 
    "time_taken": "307", 
    "c_ip": "6d98469a3f1de6f4", 
    "cs_username": "-", 
    "cs_auth_group": "-", 
    "x_exception_id": "-", 
    "sc_filter_result": "OBSERVED", 
    "cs_categories": "unavailable", 
    "cs_Referer": "http://www.xnxx.com/", 
    "sc_status": "200", 
    "s_action": "TCP_MISS", 
    "cs_method": "GET", 
    "rs_Content_Type": "image/jpeg", 
    "cs_uri_scheme": "http", 
    "cs_host": "img100.xvideos.com", 
    "cs_uri_port": "80", 
    "cs_uri_path": "/videos/thumbsl/2/e/5/2e5fd679f1118757314fb9a94c0f626c.25.jpg", 
    "cs_uri_query": "-", 
    "cs_uri_extension": "jpg", 
    "cs_User_Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C)", 
    "s_ip": "82.137.200.42", 
    "sc_bytes": "16188", 
    "cs_bytes": "415", 
    "x_virus_id": "-"
  }
]

3 Comments

You're right, it's not a TSV, just a messy log. Thanks
@JasonBelanger: can you define the log format? It would be much better if you could define a "good" format (no double spaces between fields, for example), so the parsing process can be more robust.
It's not my data. The data is BlueCoat device logs from site link[/link]

I ended up using

#!/usr/bin/python

import pandas as pd

master_df = pd.read_csv('sg_main__test.txt',
                        parse_dates={'$date': ['date', 'time']},
                        sep=r'\s+', encoding='utf-8')

master_df.to_json('sg_main__test.json', date_format='iso', orient='records')

But the issue with this is date formatting.

I need my data set to export as "Timestamp": {"$date": "2018-02-12T01:50:37.000Z"}, but currently it exports as {"$date": "2011-07-22T20:34:51.000Z"}.

There doesn't seem to be an easy solution.
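One workaround (a post-processing sketch, not a pandas feature) is to reload the exported records and move each date under a "Timestamp" key:

```python
import json

def rewrap(records):
    """Move each record's '$date' value under a 'Timestamp' wrapper key."""
    out = []
    for rec in records:
        rec = dict(rec)  # leave the input untouched
        rec['Timestamp'] = {'$date': rec.pop('$date')}
        out.append(rec)
    return out

# e.g. after pandas writes sg_main__test.json:
# records = json.load(open('sg_main__test.json'))
# json.dump(rewrap(records), open('sg_main__test.json', 'w'), indent=2)
```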
