-3

My log file format is as below :

2016-09-28 04:30:30, Info                  CBS    Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll
2016-09-28 04:30:31, Info                  CSI    00000001@2016/9/27:20:30:31.455 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fef9fb9b6d @0x7fef9f8358f @0xff83e97c @0xff83d799 @0xff83db2f)
2016-09-28 04:30:31, Info                  CSI    00000002@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa006ade @0x7fef9fd2984 @0x7fef9f83665 @0xff83e97c @0xff83d799)
2016-09-28 04:30:31, Info                  CSI    00000003@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa1c8728 @0x7fefa1c8856 @0xff83e474 @0xff83d7de @0xff83db2f)
2016-09-28 04:30:31, Info                  CBS    Ending TrustedInstaller initialization.
2016-09-28 04:30:31, Info                  CBS    Starting the TrustedInstaller main loop.
2016-09-28 04:30:31, Info                  CBS    TrustedInstaller service starts successfully.
2016-09-28 04:30:31, Info                  CBS    SQM: Initializing online with Windows opt-in: False
2016-09-28 04:30:31, Info                  CBS    SQM: Cleaning up report files older than 10 days.
2016-09-28 04:30:31, Info                  CBS    SQM: Requesting upload of all unsent reports.

Taken from here https://github.com/logpai/loghub/blob/master/Windows/Windows_2k.log

I want to read this file in a pandas dataframe with proper column headings. It is having multiple delimeters as , space and tab. Please share your thoughts with a running code sample.

5
  • Please use code block formatting so the data is readable. As posted it is impossible to tell where each record starts and ends. Commented Feb 22 at 13:48
  • 2
    Sorry but you're literally just asking us to do your job for you. This is way too broad. Commented Feb 22 at 14:11
  • function like read_csv() has option to use regex in sep= - so it allows to use | to create multiple delimiters, and even \s to match whitespace (it means space tab newline) as delimiters (or \s+ for multiple spaces, multiple tabs, and mix of them) Commented Feb 22 at 14:23
  • sometimes it is simple to read it as normal text and split every line using line.split() or slicing line[star:end] (especially if columns have constant sizes) Commented Feb 22 at 14:28
  • if you know widths of columns then you can use pandas.read_fwf for fixed-width format. I think it may remove all spaces tabs automatically but it may need to remove , on your own and later convert string to datetime object. Commented Feb 22 at 14:32

1 Answer 1

0

read_csv allows to use regex in sep= so you can use | (as OR) to define many chars (or even strings) as delimiters.

Because there are spaces in text which shouldn't be used as separator so I use , (comma + space) and \s{2,} (2 or more whitespaces) to detect correct places (but this keeps date and time as single string):

sep=r', |\s{2,}', engine='python'

Minimal working code.

I use io only to create file like object in memory so everyone can simply copy and test it - but you could use filename (or even directly url to raw data read_csv("https://raw.githubusercontent.com/logpai/loghub/refs/heads/master/Windows/Windows_2k.log", ...))

text = r'''
2016-09-28 04:30:30, Info                  CBS    Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll
2016-09-28 04:30:31, Info                  CSI    00000001@2016/9/27:20:30:31.455 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fef9fb9b6d @0x7fef9f8358f @0xff83e97c @0xff83d799 @0xff83db2f)
2016-09-28 04:30:31, Info                  CSI    00000002@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa006ade @0x7fef9fd2984 @0x7fef9f83665 @0xff83e97c @0xff83d799)
2016-09-28 04:30:31, Info                  CSI    00000003@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa1c8728 @0x7fefa1c8856 @0xff83e474 @0xff83d7de @0xff83db2f)
2016-09-28 04:30:31, Info                  CBS    Ending TrustedInstaller initialization.
2016-09-28 04:30:31, Info                  CBS    Starting the TrustedInstaller main loop.
2016-09-28 04:30:31, Info                  CBS    TrustedInstaller service starts successfully.
2016-09-28 04:30:31, Info                  CBS    SQM: Initializing online with Windows opt-in: False
2016-09-28 04:30:31, Info                  CBS    SQM: Cleaning up report files older than 10 days.
2016-09-28 04:30:31, Info                  CBS    SQM: Requesting upload of all unsent reports.
'''

import pandas as pd
import io

file_like_object = io.StringIO(text)

df = pd.read_csv(file_like_object, sep=r', |\s{2,}', engine='python', names=['date', 'info', 'other', 'text'])

print(df)

Result:

                  date  info other                                               text
0  2016-09-28 04:30:30  Info   CBS  Loaded Servicing Stack v6.1.7601.23505 with Co...
1  2016-09-28 04:30:31  Info   CSI  00000001@2016/9/27:20:30:31.455 WcpInitialize ...
2  2016-09-28 04:30:31  Info   CSI  00000002@2016/9/27:20:30:31.458 WcpInitialize ...
3  2016-09-28 04:30:31  Info   CSI  00000003@2016/9/27:20:30:31.458 WcpInitialize ...
4  2016-09-28 04:30:31  Info   CBS            Ending TrustedInstaller initialization.
5  2016-09-28 04:30:31  Info   CBS           Starting the TrustedInstaller main loop.
6  2016-09-28 04:30:31  Info   CBS      TrustedInstaller service starts successfully.
7  2016-09-28 04:30:31  Info   CBS  SQM: Initializing online with Windows opt-in: ...
8  2016-09-28 04:30:31  Info   CBS  SQM: Cleaning up report files older than 10 days.
9  2016-09-28 04:30:31  Info   CBS      SQM: Requesting upload of all unsent reports.

Other idea: pandas has function read_fwf to read fixed-width formatted data - and you can use width of columns to split it in correct way.

It seems it automatically splits your data but it needs to remove , from date after reading. But it also splits date and time as separate columns. It treats date (without time) as index.

df = pd.read_fwf(file_like_object, names=['date', 'info', 'other', 'text'])

df['date'] = df['date'].str.rstrip(',')

print(df)

Using widths you can resolve this problem - but this needs to calculate all widths manually. And this still needs to remove ,

df = pd.read_fwf(file_like_object, names=['date', 'info', 'other', 'text'], widths=[21,22,7,1000])

df['date'] = df['date'].str.rstrip(',')

Probably using colspecs you can define [start,end[ for every column and skip , but this still need to calculate values start,end manually.


Other idea: sometimes it is simpler to read it as normal text and use line.split() or slicing line[start:end] (and other string functions, and also if/elif/else) to split columns and later use DataFrame(data). It can be useful when file has more complex data (ie. text in many lines)


Doc: read_csv, read_fwf

Sign up to request clarification or add additional context in comments.

3 Comments

While reading the file directly with the following options df = pd.read_csv("resources/Windows_2k.log", sep=r', |\s{2,}', engine='python', names=['date', 'info', 'other', 'text']) i get the error: pandas.errors.ParserError: Expected 4 fields in line 11, saw 5. Error could possibly be due to quotes being ignored when a multi-char delimiter is used. Did u try reading the file directly ? reading the file from txt = ' ' works fine
I didn't test it with full file. I think that it doesn't care of quotes when it uses own separator. But maybe one column has longer text and two columns are too close (less than 2 spaces) and it may treat it as single column. I will check full file later. maybe it will need to use read_fwf. Or it may need to write own method to read it (line-by-line).
I checked full file - problem is , in line 11 in last column in with ... \*_std.sqm, flags: ... because sep=r', |\s{2,}' treats it as another separator and it creates 5 columns instead of 4 columns (only for this line). But methods with read_fwf works with full file.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.