Parsing .txt files to a single .csv output

Question

I am currently trying to parse two text files and produce a single .csv output. One contains a list of file paths, and the other contains additional information about each of those paths.

1st text file contains (path.txt):

C:/Windows/System32/vssadmin.exe
C:/Users/Administrator/Desktop/google.com

2nd text file contains (filelist.txt):

-= List of files in hash: =-

$VAR1 = {
          'File' => [
                      {
                        'RootkitInfo' => 'Normal',
                        'FileVersionLabel' => '6.1.7600.16385',
                        'ProductVersion' => '6.1.7601.17514',
                        'Path' => 'C:/Windows/System32/vssadmin.exe',
                        'Signer' => 'Microsoft Windows',
                        'Size' => '210944',
                        'SHA1' => 'da39a3ee5e6b4b0d3255bfef95601890afd80709'
                        },
                        {
                        'RootkitInfo' => 'Normal',
                        'FileVersionLabel' => '6.1.7600.16385',
                        'ProductVersion' => '6.1.7601.17514',
                        'Path' => 'C:/Users/Administrator/Desktop/steam.exe',
                        'Signer' => 'Valve Inc.',
                        'Size' => '300944',
                        'SHA1' => 'cf23df2207d99a74fbe169e3eba035e633b65d94'
                        },
                        {
                        'RootkitInfo' => 'Normal',
                        'FileVersionLabel' => '6.1.7600.16385',
                        'ProductVersion' => '6.1.7601.17514',
                        'Path' => 'C:/Users/Administrator/Desktop/google.com',
                        'Signer' => 'Valve Inc.',
                        'Size' => '300944',
                        'SHA1' => 'cf23df2207d99a74fbe169e3eba035e633b78987'
                        },
                        .
                        .
                        .
                    ]
          }

How do I produce a .csv output containing each file path with its corresponding hash value? Also, how would I add further columns of information corresponding to each path?

Sample table output:

    <table>
      <tr>
        <th>File Path</th>
        <th>Hash Value</th> 
      </tr>
      <tr>
        <td>C:/Windows/System32/vssadmin.exe</td>
        <td>da39a3ee5e6b4b0d3255bfef95601890afd80709</td> 
      </tr>
      <tr>
        <td>C:/Users/Administrator/Desktop/google.com</td>
        <td>cf23df2207d99a74fbe169e3eba035e633b78987</td> 
      </tr>
    </table>

Your question doesn't show own effort and is too broad. It contains at least: parsing of PHP data definition, joining the data from two sources and formatting as HTML output.
– Michael Butscher, Commented May 23, 2019 at 0:45
The second file is not a .txt and will require fairly more effort to parse, since it doesn't nicely evaluate into a dict or other data structure. What have you tried so far?
– C.Nivs, Commented May 23, 2019 at 1:14

wwii · Accepted Answer · 2019-05-23 02:19:42Z

You could construct a regex pattern that matches what you are looking for:

pattern = r"""{.*?(C:/Windows/System32/vssadmin.exe).*?'SHA1' => '([^']*)'.*?}"""

To use it with multiple file names in a loop, turn that pattern into a format string:

fmt = r"""{{.*?({}).*?'SHA1' => '([^']*)'.*?}}"""

Something like this:

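import re

# fmt is the format string defined above
with open('filelist.txt') as f:
    s = f.read()
with open('path.txt') as f:
    for line in f:
        fname = line.strip()
        if not fname:
            continue  # skip blank lines
        # re.escape keeps '.' and other regex metacharacters in the path literal
        pattern = fmt.format(re.escape(fname))
        m = re.search(pattern, s, flags=re.DOTALL)
        if m:
            print(m.groups())
        else:
            print('no match for', fname)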

It's a little inefficient, since the whole of filelist.txt is searched once per path, and it depends on the contents of the files being exactly as you showed them, including matching capitalization.
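As a rough sketch of one way around both caveats: a single regular-expression pass can collect every Path/SHA1 pair into a dict, with keys lowercased so that capitalization differences don't matter. The names `RECORD_RE`, `hashes_by_path` and `lookup` are made up for illustration, and the pattern assumes `'Path'` appears before `'SHA1'` inside each record, as in the sample shown in the question.

```python
import re

# One record's Path, then the SHA1 that follows it inside the same {...} block.
# [^{}]*? stops the match from spilling over into a neighbouring record.
RECORD_RE = re.compile(
    r"'Path'\s*=>\s*'([^']*)'[^{}]*?'SHA1'\s*=>\s*'([^']*)'"
)

def hashes_by_path(dump_text):
    """Return {lowercased path: sha1} for every record found in the text."""
    return {path.lower(): sha1 for path, sha1 in RECORD_RE.findall(dump_text)}

def lookup(paths, dump_text):
    """Pair each requested path with its hash, or None when it is not listed."""
    table = hashes_by_path(dump_text)
    return [(p, table.get(p.lower())) for p in paths]
```

With the two files from the question, `lookup` would be called with the stripped lines of path.txt and the full text of filelist.txt.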

Or without regular expressions: iterate over the lines of filelist.txt; find the Path line; extract the path with a slice, see if it is a path from path.txt; find the very next SHA1 line; extract the hash with a slice. This relies on the position of the two lines relative to each other and the position of the characters in each line. This will probably be more efficient.

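with open('path.txt') as f:
    fnames = set(line.strip() for line in f)
with open('filelist.txt') as f:
    for line in f:
        line = line.strip()
        if line.startswith("'Path'"):
            # "'Path' => '" is 11 characters long; rstrip drops the
            # closing quote and the trailing comma, if there is one
            name = line[11:].rstrip("',")
            if name not in fnames:
                continue
            # advance to the SHA1 line that follows this Path line
            while not line.startswith("'SHA1'"):
                line = next(f).strip()
            print((name, line[11:].rstrip("',")))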

This one also assumes the text files are as you represented them.

C.Nivs · Accepted Answer · 2019-05-23 01:40:56Z

To parse the second file (which is not really plain text, despite the .txt extension), you will need to restructure it so that it looks like a normal Python data structure. It's pretty close, and there are ways to coerce it into one:

import ast

contents = "" # this will be to hold the read contents of that file
filestart = False 

with open('filelist.txt') as fh:
    for line in fh:
        if not filestart and not line.startswith("$VAR"):
            continue
        elif line.startswith("$VAR"):
            contents+="{" # start the dictionary
            filestart = True # to kill the first if statement
        else:
            contents += line # fill out with rest of file


# create dictionary, we use ast here because json will fail
result = ast.literal_eval(contents.replace("=>", ":"))

# {'File': [{'RootkitInfo': 'Normal', 'FileVersionLabel': '6.1.7600.16385', 'ProductVersion': '6.1.7601.17514', 'Path': 'C:/Windows/System32/vssadmin.exe', 'Signer': 'Microsoft Windows', 'Size': '210944', 'SHA1': 'da39a3ee5e6b4b0d3255bfef95601890afd80709'}, {'RootkitInfo': 'Normal', 'FileVersionLabel': '6.1.7600.16385', 'ProductVersion': '6.1.7601.17514', 'Path': 'C:/Users/Administrator/Desktop/steam.exe', 'Signer': 'Valve Inc.', 'Size': '300944', 'SHA1': 'cf23df2207d99a74fbe169e3eba035e633b65d94'}, {'RootkitInfo': 'Normal', 'FileVersionLabel': '6.1.7600.16385', 'ProductVersion': '6.1.7601.17514', 'Path': 'C:/Users/Administrator/Desktop/google.com', 'Signer': 'Valve Inc.', 'Size': '300944', 'SHA1': 'cf23df2207d99a74fbe169e3eba035e633b78987'}]}

files = result["File"] # get your list from here

Now that it's in a tolerable format, I'd convert it to a dict of path: hash key-value pairs for easy lookup against your other file:

files_dict = {file['Path']: file['SHA1'] for file in files}

# now grab your other file, and lookups should be quite simple

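import csv

with open("path.txt") as fh:
    paths = [line.strip() for line in fh if line.strip()]

# Now you can put that to a csv; newline="" lets the csv module control line endings
with open("paths.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["File Path", "Hash Value"])  # write the header
    for filepath in paths:
        # .get() returns None for paths missing from files_dict, written as an empty cell
        writer.writerow([filepath, files_dict.get(filepath)])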

There are better ways to do this, but those are left as an exercise for the reader.
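One possible refinement, which also covers the question's request for additional columns, is to keep the whole record for each path and let `csv.DictWriter` pick out the wanted fields. This is only a sketch: `write_report` and its parameters are hypothetical names, and `records` is assumed to be a list of dicts like the `files` list parsed above.

```python
import csv

def write_report(paths, records, out_path, columns=("Path", "SHA1")):
    """Write one CSV row per requested path, limited to the chosen fields."""
    by_path = {rec["Path"]: rec for rec in records}
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(
            fh,
            fieldnames=columns,
            restval="",             # blank cell when a field is missing
            extrasaction="ignore",  # drop record keys not listed in columns
        )
        writer.writeheader()
        for path in paths:
            # Paths with no matching record still get a row, with empty cells
            writer.writerow(by_path.get(path, {"Path": path}))
```

Adding a column is then a matter of extending `columns`, for example `columns=("Path", "SHA1", "Signer", "Size")`. The header row uses the field names as they appear in the data, so it reads `Path,SHA1` rather than `File Path,Hash Value`.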
