python - Split CSV column into two

Question

I am trying to split the process_hash field into two fields, so that the columns are "md5", "sha256", "process_name" and "process_effective_reputation". I've tried the code below, but I get:

row = {'md5': data['process_hash'][0], 'sha256': data['process_hash'][1]}
IndexError: list index out of range

JSON data:

{'results': [{'device_name': 'faaadc2',
          'device_timestamp': '2020-10-27T00:50:46.176Z',
          'event_id': '9b1bvfaa11eb81b',
          'process_effective_reputation': 'LIST5',
          'process_hash': ['bfc7dcf5935f3avda9df8e9b6425c37a',
                           'ca9f3a2450asd518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
          'process_name': 'c:\\program files '
                          '(x86)\\to122soft\\thcaadf3\\tohossce.exe',
          'process_username': ['JOHN\\user1']},
         {'device_name': 'fk6saadc2',
          'device_timestamp': '2020-10-27T00:50:46.176Z',
          'event_id': '9b151f6e17ee11eb81b',
          'process_effective_reputation': 'LIST1',
          'process_hash': ['bfc7dcf5935f3a9df8e9baaa425c37a',
                           'ca9f3aaa506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
          'process_name': 'c:\\program files '
                          '(x86)\\oaaft\\tf3\\toaaotsice.exe',
          'process_username': ['JOHN\\user2']},
         {'device_name': 'sdddsdc2',
          'device_timestamp': '2020-10-27T00:50:46.176Z',
          'event_id': '9b151f698e11eb81b',
          'process_effective_reputation': 'LIST',
          'process_hash': ['9df8ebfc7dcf5935830f3a9b6asdcd7a',
                           'ca9f3a24506cc518fdfrcv39a33c100b2d557f96e040f7124641ad1734e2f19'],
          'process_name': 'c:\\program files '
                          '(x86)\\toht\\thaa3\\toasce.exe',
          'process_username': ['JOHN\\user3']}]}

response = json.loads(r.text)
r = response['results']

selected_fields = []
for d in r:
    selected_fields.append({k: d[k] for k in ("process_hash", "process_name", "process_effective_reputation")})

new_data = []
for data in selected_fields:
    fieldnames = 'md5 sha256 process_name process_effective_reputation'.split()
    row = {'md5': data['process_hash'][0], 'sha256': data['process_hash'][1]}
    # Copy process_name and process_effective_reputation fields.
    row.update({fieldname: data[fieldname] for fieldname in fieldnames[-2:]})
    new_data.append(row)
return new_data

Current CSV data:

process_hash    process_name    process_effective_reputation
 ['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2']    c:\windows\system32\delltpad\apmsgfwd.exe   ADAPTIVE_WHITE_LIST
 ['73ca11f2acf1adb7802c2914e1026db899a3c851cd9500378c0045e0']    c:\users\zdr3dds01\documents\sap\sap gui\export.mhtml   NOT_LISTED
 ['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2']    c:\windows\system32\delltpad\apmsgfwd.exe   ADAPTIVE_WHITE_LIST
 ['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2']    c:\windows\system32\delltpad\apmsgfwd.exe   ADAPTIVE_WHITE_LIST
 ['582f018bc7a732d63f624d6f92b3d143', '66505bcb9975d61af14dd09cddd9ac0d11a3e2b5ae41845c65117e7e2b046d37']    c:\users\jij09\appdata\local\kingsoft\power word 2016\2016.3.3.0368\powerword.exe   ADAPTIVE_WHITE_LIST

What I'm trying to achieve with the CSV file:

 md5   sha256   process_name  process_effective_reputation

Thank you

Update: Thanks buran, that code works, but now it's returning duplicates again and the CSV is not formatted properly. For example, if there is only one type of hash, the row shifts and the columns no longer line up. Sorry, I'm still learning Python. Please help.

md5 sha256  process_name    process_effective_reputation
082642cf23a33a9c6fd1e5e671c075e4    ad0020c2b55708528edb7e54dc35878b7309084d011357398051d2644fe707c7    \\plaapp01\hupzar\winsad\winsadib.exe   ADAPTIVE_WHITE_LIST
082642cf23a33a9c6fd1e5e671c075e4    ad0020c2b55708528edb7e54dc35878b7309084d011357398051d2644fe707c7    \\plaapp01\hupzar\winsad\winsadib.exe   ADAPTIVE_WHITE_LIST
5c3471076193ef7c1d0df4cd42b58249bfd49fd68332d38c645c35d709b449d9    c:\users\it\appdata\local\temp\{a70cbf04-a246-434a-bd96-b5cfd84e765d}\qualcomm atheros ethernet driver installer.msi    NOT_LISTED  
082642cf23a33a9c6fd1e5e671c075e4    ad0020c2b55708528edb7e54dc35878b7309084d011357398051d2644fe707c7    \\plaapp01\hupzar\winsad\winsadib.exe   ADAPTIVE_WHITE_LIST

I posted an answer that solves the problem with shifted md5/sha256 columns and missing values that you describe in the last paragraph of your question.
– Arty, Commented Nov 6, 2020 at 12:32

Arty · Accepted Answer · 2020-11-06 11:41:04Z

Here's my solution. The md5 column is filled with the first value in the list of hashes whose length is at least 30 and less than 40 characters; if there is no such value, the md5 column is left empty. The sha256 column works the same way with lengths of at least 60 and less than 70. I use ranges because in your data the md5 and sha256 values are, for some reason, sometimes 31 or 32 and 63 or 64 characters long. If you want strictly 32 and 64, replace the ranges in my code with (32, 33) and (64, 65). If a list contains several md5 values, the leftmost one is taken, and the same goes for sha256. The remaining columns (process_name and process_effective_reputation) are just copied over.

To create the CSV I use the default settings (see the csv.DictWriter(...) call in my code), which means the delimiter is a comma. To use tabs instead, as in your question, pass the extra argument delimiter = '\t', as in csv.DictWriter(..., delimiter = '\t', ...). You may want to set other CSV writing parameters as well; they are described in the documentation of Python's csv module.
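
As a minimal sketch of the tab-delimited variant mentioned above: the helper name rows_to_tsv is hypothetical, and the sample row is taken from the question's update (with its md5 left empty). Only the delimiter argument differs from the default settings.

```python
import csv
import io

FIELDNAMES = ["md5", "sha256", "process_name", "process_effective_reputation"]


def rows_to_tsv(rows):
    """Serialize dict rows as tab-separated text with a header line."""
    buf = io.StringIO()
    # delimiter="\t" replaces the default comma; everything else is unchanged.
    writer = csv.DictWriter(buf, fieldnames=FIELDNAMES, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# One row from the question's update: it has a SHA-256 value but no MD5.
sample = [{
    "md5": "",
    "sha256": "5c3471076193ef7c1d0df4cd42b58249bfd49fd68332d38c645c35d709b449d9",
    "process_name": r"c:\users\it\appdata\local\temp\{a70cbf04-a246-434a-bd96-b5cfd84e765d}\qualcomm atheros ethernet driver installer.msi",
    "process_effective_reputation": "NOT_LISTED",
}]

tsv_text = rows_to_tsv(sample)
print(tsv_text)
```

Because the md5 key is present with an empty string, the row still has four tab-separated fields, so the columns stay aligned.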

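import csv
import io

def create_csv(json_data):
    # Build the CSV in memory first so it can be both printed and saved.
    fbuf = io.StringIO()
    writer = csv.DictWriter(fbuf, fieldnames = [
        'md5', 'sha256', 'process_name', 'process_effective_reputation'])
    writer.writeheader()
    for device in json_data['results']:
        writer.writerow({
            # First hash whose length falls in [l0, l1), or '' if none does.
            **{h : ([e for e in device['process_hash'] if l0 <= len(e) < l1] + [''])[0]
                for h, l0, l1 in (('md5', 30, 40), ('sha256', 60, 70))},
            # The remaining columns are copied over unchanged.
            **{e : device[e] for e in ('process_name', 'process_effective_reputation')},
        })
    print(fbuf.getvalue())
    # newline = '' keeps the csv module's own \r\n line endings intact on Windows.
    with open('output.csv', 'w', encoding = 'utf-8', newline = '') as f:
        f.write(fbuf.getvalue())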

json_data = {
    "results": [
        {
            "device_name": "faaadc2",
            "device_timestamp": "2020-10-27T00:50:46.176Z",
            "event_id": "9b1bvfaa11eb81b",
            "process_effective_reputation": "LIST5",
            "process_hash": [
                "bfc7dcf5935f3avda9df8e9b6425c37a",
                "ca9f3a2450asd518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19",
            ],
            "process_name": "c:\\program files "
            "(x86)\\to122soft\\thcaadf3\\tohossce.exe",
            "process_username": ["JOHN\\user1"],
        },
        {
            "device_name": "fk6saadc2",
            "device_timestamp": "2020-10-27T00:50:46.176Z",
            "event_id": "9b151f6e17ee11eb81b",
            "process_effective_reputation": "LIST1",
            "process_hash": [
                "bfc7dcf5935f3a9df8e9baaa425c37a",
                "ca9f3aaa506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19",
            ],
            "process_name": "c:\\program files " "(x86)\\oaaft\\tf3\\toaaotsice.exe",
            "process_username": ["JOHN\\user2"],
        },
        {
            "device_name": "sdddsdc2",
            "device_timestamp": "2020-10-27T00:50:46.176Z",
            "event_id": "9b151f698e11eb81b",
            "process_effective_reputation": "LIST",
            "process_hash": [
                "9df8ebfc7dcf5935830f3a9b6asdcd7a",
                "ca9f3a24506cc518fdfrcv39a33c100b2d557f96e040f7124641ad1734e2f19",
            ],
            "process_name": "c:\\program files " "(x86)\\toht\\thaa3\\toasce.exe",
            "process_username": ["JOHN\\user3"],
        },
    ]
}

create_csv(json_data)

Output:

md5,sha256,process_name,process_effective_reputation
bfc7dcf5935f3avda9df8e9b6425c37a,ca9f3a2450asd518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19,c:\program files (x86)\to122soft\thcaadf3\tohossce.exe,LIST5
bfc7dcf5935f3a9df8e9baaa425c37a,ca9f3aaa506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19,c:\program files (x86)\oaaft\tf3\toaaotsice.exe,LIST1
9df8ebfc7dcf5935830f3a9b6asdcd7a,ca9f3a24506cc518fdfrcv39a33c100b2d557f96e040f7124641ad1734e2f19,c:\program files (x86)\toht\thaa3\toasce.exe,LIST
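
The question's update also mentions duplicate rows, which the code above writes out as they come. One possible way to drop them, sketched here under the assumption that two rows count as duplicates when all four output columns match, is to remember the rows already seen. The helper names pick_hash and unique_rows are hypothetical, and the sample entries are modeled on the rows shown in the question's update.

```python
import csv
import io

FIELDNAMES = ["md5", "sha256", "process_name", "process_effective_reputation"]


def pick_hash(hashes, lo, hi):
    """Return the first hash with lo <= len < hi, or '' if there is none."""
    return next((h for h in hashes if lo <= len(h) < hi), "")


def unique_rows(results):
    """Yield one row per distinct combination of the four output columns."""
    seen = set()
    for device in results:
        row = (
            pick_hash(device["process_hash"], 30, 40),
            pick_hash(device["process_hash"], 60, 70),
            device["process_name"],
            device["process_effective_reputation"],
        )
        if row not in seen:  # skip rows that were already emitted
            seen.add(row)
            yield dict(zip(FIELDNAMES, row))


# Two identical entries and one entry that carries only a SHA-256 value.
winsad = {
    "process_hash": [
        "082642cf23a33a9c6fd1e5e671c075e4",
        "ad0020c2b55708528edb7e54dc35878b7309084d011357398051d2644fe707c7",
    ],
    "process_name": r"\\plaapp01\hupzar\winsad\winsadib.exe",
    "process_effective_reputation": "ADAPTIVE_WHITE_LIST",
}
installer = {
    "process_hash": ["5c3471076193ef7c1d0df4cd42b58249bfd49fd68332d38c645c35d709b449d9"],
    "process_name": r"c:\users\it\appdata\local\temp\{a70cbf04-a246-434a-bd96-b5cfd84e765d}\qualcomm atheros ethernet driver installer.msi",
    "process_effective_reputation": "NOT_LISTED",
}
results = [winsad, dict(winsad), installer]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerows(unique_rows(results))
print(buf.getvalue())
```

The set keeps the first occurrence of each row and preserves input order. If duplicates should instead be judged by hash alone, the key stored in seen would need to change accordingly.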

buran · Accepted Answer · 2020-11-06 10:07:46Z

You're overcomplicating this:

import csv

json_data = {...}
# json_data = r.json() # you can use convenient method provided in requests

fieldnames = ("md5", "sha256", "process_name", "process_effective_reputation")

with open('output.csv', 'w', newline='') as f:
    wrtr = csv.DictWriter(f, fieldnames=fieldnames)
    wrtr.writeheader()
    for device in json_data['results']:
        # Assumes process_hash always holds exactly [md5, sha256], in that order.
        data = device['process_hash'] + [device['process_name'], device['process_effective_reputation']]
        wrtr.writerow(dict(zip(fieldnames, data)))

output.csv

md5,sha256,process_name,process_effective_reputation
bfc7dcf5935f3avda9df8e9b6425c37a,ca9f3a2450asd518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19,c:\program files (x86)\to122soft\thcaadf3\tohossce.exe,LIST5
bfc7dcf5935f3a9df8e9baaa425c37a,ca9f3aaa506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19,c:\program files (x86)\oaaft\tf3\toaaotsice.exe,LIST1
9df8ebfc7dcf5935830f3a9b6asdcd7a,ca9f3a24506cc518fdfrcv39a33c100b2d557f96e040f7124641ad1734e2f19,c:\program files (x86)\toht\thaa3\toasce.exe,LIST
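
The positional pairing above relies on every process_hash list holding exactly two values. The following sketch, using a row from the question's update that has a single hash, illustrates how the columns drift when that does not hold, and one possible alternative that pairs hashes by length. It assumes that an MD5 hex digest is 32 characters and a SHA-256 hex digest is 64; the names shifted, by_length and aligned are only illustrative.

```python
fieldnames = ("md5", "sha256", "process_name", "process_effective_reputation")

# Row taken from the question's update: only a SHA-256 value is present.
device = {
    "process_hash": ["5c3471076193ef7c1d0df4cd42b58249bfd49fd68332d38c645c35d709b449d9"],
    "process_name": r"c:\users\it\appdata\local\temp\{a70cbf04-a246-434a-bd96-b5cfd84e765d}\qualcomm atheros ethernet driver installer.msi",
    "process_effective_reputation": "NOT_LISTED",
}

# Positional pairing: three values are matched against four column names,
# so the SHA-256 value lands under "md5" and the last column gets no value.
data = device["process_hash"] + [device["process_name"], device["process_effective_reputation"]]
shifted = dict(zip(fieldnames, data))

# Pairing by length instead: index the hashes by their length and look up
# the expected sizes, falling back to "" when a kind of hash is missing.
by_length = {len(h): h for h in device["process_hash"]}
aligned = {
    "md5": by_length.get(32, ""),
    "sha256": by_length.get(64, ""),
    "process_name": device["process_name"],
    "process_effective_reputation": device["process_effective_reputation"],
}

print(shifted)
print(aligned)
```

Note that the anonymized sample data in the question contains hashes of 31 and 63 characters, which a strict length lookup like this would leave empty.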

2 Comments

user3704597 Over a year ago

Update: Thanks buran, that code works, but now it's returning duplicates again and the CSV is not formatted properly. For example, if there is only one type of hash, the row shifts and the columns no longer line up. Sorry, I'm still learning Python. Please help.

buran Over a year ago

We are working with the sample data you provided, and it contains no information about possibly missing hashes. If there is just one hash, how do we know which kind it is, md5 or sha256? From its length? I cannot say anything about duplicates. Where do you see them?
