
I have a very large chunk of data that looks like this:

#Software: SGOS 5.4.3.7
#Version: 1.0
#Start-Date: 2011-07-22 20:34:51
#Date: 2011-06-02 10:20:47
#Fields: date time time-taken c-ip cs-username cs-auth-group x-exception-id sc-filter-result cs-categories cs(Referer) sc-status s-action cs-method rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-virus-id
#Remark: 2610140037 "SG-42" "82.137.200.42" "main"
2011-07-22 20:34:51 282 ce6de14af68ce198 - - - OBSERVED "unavailable" http://www.surfjunky.com/members/sj-a.php?r=44864  200 TCP_NC_MISS GET text/html http www.surfjunky.com 80 /members/sj-a.php ?r=66556 php "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.65 Safari/534.24" 82.137.200.42 1395 663 -
2011-07-22 20:34:51 216 6154d919f8d56690 - - - OBSERVED "unavailable" http://x31.iloveim.com/build_3.9.2.1/comet.html  200 TCP_NC_MISS GET text/html;charset=UTF-8 http x31.iloveim.com 80 /servlets/events ?1122064400327 - "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18" 82.137.200.42 473 1129 -
2011-07-22 20:34:51 307 6d98469a3f1de6f4 - - - OBSERVED "unavailable" http://www.xnxx.com/  200 TCP_MISS GET image/jpeg http img100.xvideos.com 80 /videos/thumbsl/2/e/5/2e5fd679f1118757314fb9a94c0f626c.25.jpg - jpg "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C)" 82.137.200.42 16188 415 -

I have about 10-15 files ranging from 700 MB to 20 GB. I need to convert them to JSON so I can use them to test some JSON analytics software. I have the start of something, but I am getting errors and have some questions.

This is what I have for a conversion script:

#!/usr/bin/python
import csv

jsfile = file('sg_main__test.json', 'w')
jsfile.write('[\r\n')

with open('sg_main__test.log', 'r') as f:
    next(f)
    reader = csv.reader(f, delimiter='\t')

    row_count = len(list(reader))
    ite = 0

    f.seek(0)
    next(f)

    for date, time, time_taken, c_ip, cs_username, cs_auth_group, x_exception_id, sc_filter_result, cs_categories, cs_Referer, sc_status, s_action, cs_method, rs_Content_Type, cs_uri_scheme, cs_host, cs_uri_port, cs_uri_path, cs_uri_query, cs_uri_extension, cs_User_Agent, s_ip, sc_bytes, cs_bytes, x_virus_id in reader:

        ite += 1

        jsfile.write('\t{\r\n')

        d = '\t\t"date": "' + date + '",\r\n'
        t = '\t\t"time": "' + time + '",\r\n'
        tt = '\t\t"time_taken": "' + time_taken + '",\r\n'
        ci = '\t\t"c_ip": "' + c_ip + '",\r\n'
        c = '\t\t"cs_username": "' + cs_username + '",\r\n'
        ca = '\t\t"cs_auth_group": "' + cs_auth_group + '",\r\n'
        xe = '\t\t"x_exception_id": "' + x_exception_id + '",\r\n'
        sf = '\t\t"sc_filter_result": "' + sc_filter_result + '",\r\n'
        cc = '\t\t"cs_categories": "' + cs_categories + '",\r\n'
        cr = '\t\t"cs_Referer": "' + cs_Referer + '",\r\n'
        ss = '\t\t"sc_status": "' + sc_status + '",\r\n'
        sa = '\t\t"s_action": "' + s_action + '",\r\n'
        cm = '\t\t"cs_method": "' + cs_method + '",\r\n'
        rc = '\t\t"rs_Content_Type": "' + rs_Content_Type + '",\r\n'
        cu = '\t\t"cs_uri_scheme": "' + cs_uri_scheme + '",\r\n'
        ch = '\t\t"cs_host": "' + cs_host + '",\r\n'
        cp = '\t\t"cs_uri_port": "' + cs_uri_port + '",\r\n'
        cpa = '\t\t"cs_uri_path": "' + cs_uri_path + '",\r\n'
        cq = '\t\t"cs_uri_query": "' + cs_uri_query + '",\r\n'
        ce = '\t\t"cs_uri_extension": "' + cs_uri_extension + '",\r\n'
        cua = '\t\t"cs_User_Agent": "' + cs_User_Agent + '",\r\n'
        si = '\t\t"s_ip": "' + s_ip + '",\r\n'
        sb = '\t\t"sc_bytes": "' + sc_bytes + '",\r\n'
        cb = '\t\t"cs_bytes": "' + cs_bytes + '",\r\n'
        xv = '\t\t"x_virus_id": "' + x_virus_id + '",\r\n'

        jsfile.write(d)
        jsfile.write(t)
        jsfile.write(tt)
        jsfile.write(ci)
        jsfile.write(c)
        jsfile.write(ca)
        jsfile.write(xe)
        jsfile.write(sf)
        jsfile.write(cc)
        jsfile.write(cr)
        jsfile.write(ss)
        jsfile.write(sa)
        jsfile.write(cm)
        jsfile.write(rc)
        jsfile.write(cu)
        jsfile.write(ch)
        jsfile.write(cp)
        jsfile.write(cpa)
        jsfile.write(cq)
        jsfile.write(ce)
        jsfile.write(cua)
        jsfile.write(si)
        jsfile.write(sb)
        jsfile.write(cb)
        jsfile.write(xv)

        jsfile.write('\t}')

        if ite < row_count:
            jsfile.write('\r\n')

        jsfile.write('\r\n')

jsfile.write(']')
jsfile.close()

Error returned on execution -

    for date,time,time_taken,c_ip,cs_username,cs_auth_group,x_exception_id,sc_filter_result,cs_categories,cs_Referer,sc_status,s_action,cs_method,rs_Content_Type,cs_uri_scheme,cs_host,cs_uri_port,cs_uri_path,cs_uri_query,cs_uri_extension,cs_User_Agent,s_ip,sc_bytes,cs_bytes,x_virus_id in reader:
ValueError: need more than 1 value to unpack

Questions - Is there something in the context I'm missing as to why it is returning the "need more than 1 value to unpack" error?

Is there a way I can have it read a directory and convert all the files in that directory without having to define the input file name?

Following on from the question above, can I have it save the exported file under the original filename but with a .json file extension, without having to manually define the output filename?
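For context, the kind of directory handling I'm after could be sketched like this (the `*.log` pattern and the current-directory default are just placeholders):

```python
import pathlib

def json_name(log_path):
    """Swap the input file's extension for .json."""
    return pathlib.Path(log_path).with_suffix('.json')

def find_logs(directory):
    """Every *.log file in the directory, non-recursive."""
    return sorted(pathlib.Path(directory).glob('*.log'))

for src in find_logs('.'):
    dst = json_name(src)  # e.g. sg_main__test.log -> sg_main__test.json
    # ... run the conversion from src to dst here ...
```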

  • why not the tablib package? It will load and export your data in a few lines of code Commented Feb 12, 2018 at 17:36
  • @sKwa - I didn't use this because I didn't know it existed. Will give it a spin. Thanks. Commented Feb 12, 2018 at 18:12
  • csv.reader defaults to fields separated by commas. If you have no commas, you get one big field. Commented Feb 12, 2018 at 18:21
  • are you sure that it's TSV, i.e. delimited by tab ('\t')? Commented Feb 12, 2018 at 18:55
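The delimiter point raised in the comments can be reproduced in a couple of lines (using a shortened sample line):

```python
import csv
import io

# With csv.reader's default comma delimiter, a space-separated log line
# comes back as a single field, so unpacking it into 25 names fails.
line = '2011-07-22 20:34:51 282'
assert next(csv.reader(io.StringIO(line))) == ['2011-07-22 20:34:51 282']
assert next(csv.reader(io.StringIO(line), delimiter=' ')) == \
    ['2011-07-22', '20:34:51', '282']
```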

2 Answers


Your sample data isn't in TSV format; it isn't consistently delimited at all.

#!/usr/bin/env python

import csv
import json
from collections import OrderedDict

src = '/tmp/data.log'
dst = '/tmp/data.json'
header = [
    'date', 'time', 'time_taken', 'c_ip', 'cs_username', 'cs_auth_group',
    'x_exception_id', 'sc_filter_result', 'cs_categories', 'cs_Referer',
    'sc_status', 's_action', 'cs_method', 'rs_Content_Type', 'cs_uri_scheme',
    'cs_host', 'cs_uri_port', 'cs_uri_path', 'cs_uri_query',
    'cs_uri_extension', 'cs_User_Agent', 's_ip', 'sc_bytes', 'cs_bytes',
    'x_virus_id'
]

data = []
with open(src, 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    for row in reader:
        if not row or row[0].lstrip().startswith('#'):
            continue  # skip blank lines and the #-prefixed header lines
        row = [field for field in row if field]  # drop empties from repeated spaces
        data.append(OrderedDict(zip(header, row)))

with open(dst, 'w') as jsonfile:
    json.dump(data, jsonfile, indent=2)

Which gives me this output:

$ cat data.json
[
  {
    "date": "2011-07-22", 
    "time": "20:34:51", 
    "time_taken": "282", 
    "c_ip": "ce6de14af68ce198", 
    "cs_username": "-", 
    "cs_auth_group": "-", 
    "x_exception_id": "-", 
    "sc_filter_result": "OBSERVED", 
    "cs_categories": "unavailable", 
    "cs_Referer": "http://www.surfjunky.com/members/sj-a.php?r=44864", 
    "sc_status": "200", 
    "s_action": "TCP_NC_MISS", 
    "cs_method": "GET", 
    "rs_Content_Type": "text/html", 
    "cs_uri_scheme": "http", 
    "cs_host": "www.surfjunky.com", 
    "cs_uri_port": "80", 
    "cs_uri_path": "/members/sj-a.php", 
    "cs_uri_query": "?r=66556", 
    "cs_uri_extension": "php", 
    "cs_User_Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.65 Safari/534.24", 
    "s_ip": "82.137.200.42", 
    "sc_bytes": "1395", 
    "cs_bytes": "663", 
    "x_virus_id": "-"
  }, 
  {
    "date": "2011-07-22", 
    "time": "20:34:51", 
    "time_taken": "216", 
    "c_ip": "6154d919f8d56690", 
    "cs_username": "-", 
    "cs_auth_group": "-", 
    "x_exception_id": "-", 
    "sc_filter_result": "OBSERVED", 
    "cs_categories": "unavailable", 
    "cs_Referer": "http://x31.iloveim.com/build_3.9.2.1/comet.html", 
    "sc_status": "200", 
    "s_action": "TCP_NC_MISS", 
    "cs_method": "GET", 
    "rs_Content_Type": "text/html;charset=UTF-8", 
    "cs_uri_scheme": "http", 
    "cs_host": "x31.iloveim.com", 
    "cs_uri_port": "80", 
    "cs_uri_path": "/servlets/events", 
    "cs_uri_query": "?1122064400327", 
    "cs_uri_extension": "-", 
    "cs_User_Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18", 
    "s_ip": "82.137.200.42", 
    "sc_bytes": "473", 
    "cs_bytes": "1129", 
    "x_virus_id": "-"
  }, 
  {
    "date": "2011-07-22", 
    "time": "20:34:51", 
    "time_taken": "307", 
    "c_ip": "6d98469a3f1de6f4", 
    "cs_username": "-", 
    "cs_auth_group": "-", 
    "x_exception_id": "-", 
    "sc_filter_result": "OBSERVED", 
    "cs_categories": "unavailable", 
    "cs_Referer": "http://www.xnxx.com/", 
    "sc_status": "200", 
    "s_action": "TCP_MISS", 
    "cs_method": "GET", 
    "rs_Content_Type": "image/jpeg", 
    "cs_uri_scheme": "http", 
    "cs_host": "img100.xvideos.com", 
    "cs_uri_port": "80", 
    "cs_uri_path": "/videos/thumbsl/2/e/5/2e5fd679f1118757314fb9a94c0f626c.25.jpg", 
    "cs_uri_query": "-", 
    "cs_uri_extension": "jpg", 
    "cs_User_Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C)", 
    "s_ip": "82.137.200.42", 
    "sc_bytes": "16188", 
    "cs_bytes": "415", 
    "x_virus_id": "-"
  }
]

3 Comments

You're right, it's not a TSV, just a messy log. Thanks
@JasonBelanger: can you define the log format? It would be much better if you could define a "good" format (no double spaces between fields, for example), so the parsing process can be more robust.
It's not my data. The data is BlueCoat device logs from site link[/link]

I ended up using

#!/usr/bin/python

import pandas as pd

master_df = pd.read_csv('sg_main__test.txt',
                        parse_dates={'$date': ['date', 'time']},
                        sep=r'\s+', encoding='utf-8')

master_df.to_json('sg_main__test.json', date_format='iso', orient='records')

But the issue with this is date formatting.

I need my data set to export as "Timestamp": {"$date": "2018-02-12T01:50:37.000Z"}, but currently it exports as {"$date": "2011-07-22T20:34:51.000Z"}.

There doesn't seem to be an easy solution.
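One workaround (a post-processing sketch, not a pandas feature) is to reload the exported records and move each date under a "Timestamp" key:

```python
import json

def rewrap(records):
    """Move each record's '$date' value under a 'Timestamp' wrapper key."""
    out = []
    for rec in records:
        rec = dict(rec)  # leave the input untouched
        rec['Timestamp'] = {'$date': rec.pop('$date')}
        out.append(rec)
    return out

# e.g. after pandas writes sg_main__test.json:
# records = json.load(open('sg_main__test.json'))
# json.dump(rewrap(records), open('sg_main__test.json', 'w'), indent=2)
```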
