How to read log files in Pandas dataframe having multiple delimiters? [closed]

Question

Closed 9 months ago.

My log file format is as follows:

2016-09-28 04:30:30, Info                  CBS    Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll
2016-09-28 04:30:31, Info                  CSI    00000001@2016/9/27:20:30:31.455 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fef9fb9b6d @0x7fef9f8358f @0xff83e97c @0xff83d799 @0xff83db2f)
2016-09-28 04:30:31, Info                  CSI    00000002@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa006ade @0x7fef9fd2984 @0x7fef9f83665 @0xff83e97c @0xff83d799)
2016-09-28 04:30:31, Info                  CSI    00000003@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa1c8728 @0x7fefa1c8856 @0xff83e474 @0xff83d7de @0xff83db2f)
2016-09-28 04:30:31, Info                  CBS    Ending TrustedInstaller initialization.
2016-09-28 04:30:31, Info                  CBS    Starting the TrustedInstaller main loop.
2016-09-28 04:30:31, Info                  CBS    TrustedInstaller service starts successfully.
2016-09-28 04:30:31, Info                  CBS    SQM: Initializing online with Windows opt-in: False
2016-09-28 04:30:31, Info                  CBS    SQM: Cleaning up report files older than 10 days.
2016-09-28 04:30:31, Info                  CBS    SQM: Requesting upload of all unsent reports.

Taken from https://github.com/logpai/loghub/blob/master/Windows/Windows_2k.log

I want to read this file into a pandas DataFrame with proper column headings. It has multiple delimiters: comma, space, and tab. Please share your thoughts with a running code sample.

Please use code block formatting so the data is readable. As posted it is impossible to tell where each record starts and ends.
– OldBoy, Commented Feb 22 at 13:48
Sorry but you're literally just asking us to do your job for you. This is way too broad.
– roganjosh, Commented Feb 22 at 14:11
A function like read_csv() has an option to use a regex in sep=, so you can use | to combine multiple delimiters, and even \s to match whitespace (space, tab, newline) as a delimiter (or \s+ for runs of spaces, tabs, or a mix of them).
– furas, Commented Feb 22 at 14:23
Sometimes it is simpler to read it as normal text and split every line using line.split() or slicing line[start:end] (especially if the columns have constant sizes).
– furas, Commented Feb 22 at 14:28
If you know the widths of the columns, you can use pandas.read_fwf for the fixed-width format. I think it may remove all spaces and tabs automatically, but you may need to remove the , yourself and later convert the string to a datetime object.
– furas, Commented Feb 22 at 14:32

furas · Accepted Answer · 2025-02-22 15:35:55Z

read_csv allows a regex in sep=, so you can use | (as OR) to define several characters (or even strings) as delimiters.

Because the text contains spaces that shouldn't be used as separators, I use , followed by a space (comma + space) and \s{2,} (2 or more whitespace characters) to detect the correct places (this keeps the date and time as a single string):

sep=r', |\s{2,}', engine='python'
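To see what this separator does, one line from the question can be split by hand with Python's re.split. This is only a rough approximation of how the python engine applies the pattern (an assumption, not something taken from the pandas source), but it shows where the regex matches:

```python
import re

# One line copied from the question; the exact number of spaces between
# the fields does not matter as long as there are at least two.
line = "2016-09-28 04:30:31, Info                  CBS    SQM: Cleaning up report files older than 10 days."

# Same pattern as in sep=: either ", " or a run of 2+ whitespace characters
parts = re.split(r', |\s{2,}', line)

print(parts)
```

Single spaces inside the message are not matched, so the message stays in one piece.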

Minimal working code.

I use io only to create a file-like object in memory so that everyone can simply copy and test it, but you could use a filename (or even pass the URL of the raw data directly: read_csv("https://raw.githubusercontent.com/logpai/loghub/refs/heads/master/Windows/Windows_2k.log", ...))

text = r'''
2016-09-28 04:30:30, Info                  CBS    Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll
2016-09-28 04:30:31, Info                  CSI    00000001@2016/9/27:20:30:31.455 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fef9fb9b6d @0x7fef9f8358f @0xff83e97c @0xff83d799 @0xff83db2f)
2016-09-28 04:30:31, Info                  CSI    00000002@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa006ade @0x7fef9fd2984 @0x7fef9f83665 @0xff83e97c @0xff83d799)
2016-09-28 04:30:31, Info                  CSI    00000003@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa1c8728 @0x7fefa1c8856 @0xff83e474 @0xff83d7de @0xff83db2f)
2016-09-28 04:30:31, Info                  CBS    Ending TrustedInstaller initialization.
2016-09-28 04:30:31, Info                  CBS    Starting the TrustedInstaller main loop.
2016-09-28 04:30:31, Info                  CBS    TrustedInstaller service starts successfully.
2016-09-28 04:30:31, Info                  CBS    SQM: Initializing online with Windows opt-in: False
2016-09-28 04:30:31, Info                  CBS    SQM: Cleaning up report files older than 10 days.
2016-09-28 04:30:31, Info                  CBS    SQM: Requesting upload of all unsent reports.
'''

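import pandas as pd
import io

# Wrap the string in a file-like object so read_csv can read it like a file
file_like_object = io.StringIO(text)

# A regex separator requires engine='python'; names= supplies the column headings
df = pd.read_csv(file_like_object, sep=r', |\s{2,}', engine='python', names=['date', 'info', 'other', 'text'])

print(df)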

Result:

                  date  info other                                               text
0  2016-09-28 04:30:30  Info   CBS  Loaded Servicing Stack v6.1.7601.23505 with Co...
1  2016-09-28 04:30:31  Info   CSI  00000001@2016/9/27:20:30:31.455 WcpInitialize ...
2  2016-09-28 04:30:31  Info   CSI  00000002@2016/9/27:20:30:31.458 WcpInitialize ...
3  2016-09-28 04:30:31  Info   CSI  00000003@2016/9/27:20:30:31.458 WcpInitialize ...
4  2016-09-28 04:30:31  Info   CBS            Ending TrustedInstaller initialization.
5  2016-09-28 04:30:31  Info   CBS           Starting the TrustedInstaller main loop.
6  2016-09-28 04:30:31  Info   CBS      TrustedInstaller service starts successfully.
7  2016-09-28 04:30:31  Info   CBS  SQM: Initializing online with Windows opt-in: ...
8  2016-09-28 04:30:31  Info   CBS  SQM: Cleaning up report files older than 10 days.
9  2016-09-28 04:30:31  Info   CBS      SQM: Requesting upload of all unsent reports.

Another idea: pandas has the function read_fwf for reading fixed-width formatted data, and you can use the column widths to split it correctly.

It seems to split your data automatically, but the , must be removed from the date after reading. However, it also splits the date and the time into separate columns and treats the date (without the time) as the index.

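# Rewind the buffer: read_csv above already read it to the end
file_like_object.seek(0)

df = pd.read_fwf(file_like_object, names=['date', 'info', 'other', 'text'])

# Strip the trailing comma left on the 'date' column
df['date'] = df['date'].str.rstrip(',')

print(df)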

Using widths resolves this problem, but you have to calculate all the widths manually, and you still need to remove the , afterwards.

# Rewind the buffer before reading it again
file_like_object.seek(0)

df = pd.read_fwf(file_like_object, names=['date', 'info', 'other', 'text'], widths=[21,22,7,1000])

df['date'] = df['date'].str.rstrip(',')

Using colspecs you can probably define a half-open [start, end) range for every column and skip the , but you still need to calculate the start and end values manually.
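A minimal sketch of the colspecs idea follows. It assumes the column layout implied by widths=[21,22,7,1000] above, and the two sample rows are built with padding to match that layout; check the offsets against the real file before relying on them.

```python
import io
import pandas as pd

# Two messages taken from the question, padded to the assumed layout:
# 21 + 22 + 7 characters before the message starts.
rows = [
    ("2016-09-28 04:30:31,", "Info", "CBS", "Ending TrustedInstaller initialization."),
    ("2016-09-28 04:30:31,", "Info", "CBS", "SQM: Initializing online with Windows opt-in: False"),
]
sample = "".join(f"{d:<21}{i:<22}{o:<7}{t}\n" for d, i, o, t in rows)

# Half-open (start, end) ranges; the first one ends before the comma
colspecs = [
    (0, 19),     # date and time without the trailing comma
    (21, 43),    # level, e.g. "Info"
    (43, 50),    # component, e.g. "CBS" or "CSI"
    (50, 1000),  # message; 1000 is an arbitrary upper bound, as in widths above
]

df = pd.read_fwf(io.StringIO(sample), colspecs=colspecs,
                 names=['date', 'info', 'other', 'text'])

print(df)
```

Because the first range stops at position 19, no rstrip(',') step is needed afterwards.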

Another idea: sometimes it is simpler to read the file as normal text and use line.split() or slicing line[start:end] (plus other string functions and if/elif/else) to split the columns, and then build the table with DataFrame(data). This can be useful when the file has more complex data (e.g. text spanning many lines).
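The line-by-line idea could look roughly like the sketch below. The helper parse_line is hypothetical, and the third sample line is made up (it is not from the real log) to show a message containing a comma followed by a space. The sketch assumes every line has all four fields; a line with an empty message would raise a ValueError.

```python
import pandas as pd

lines = [
    "2016-09-28 04:30:31, Info                  CBS    Ending TrustedInstaller initialization.\n",
    "2016-09-28 04:30:31, Info                  CBS    SQM: Cleaning up report files older than 10 days.\n",
    # Made-up line (not from the real log) with ", " inside the message
    "2016-09-28 04:30:32, Info                  CBS    hypothetical message, with a comma\n",
]

def parse_line(line):
    # Split only at the first ", " so later commas stay in the message
    date, rest = line.rstrip("\n").split(", ", 1)
    # split(None, 2) splits on runs of whitespace, at most twice,
    # so the message keeps its internal spaces
    info, other, text = rest.split(None, 2)
    return date, info, other, text

data = [parse_line(line) for line in lines if line.strip()]
df = pd.DataFrame(data, columns=['date', 'info', 'other', 'text'])

print(df)
```

For a real file, the list could be replaced by iterating over the open file object.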

Doc: read_csv, read_fwf

When reading the file directly with the following options, df = pd.read_csv("resources/Windows_2k.log", sep=r', |\s{2,}', engine='python', names=['date', 'info', 'other', 'text']), I get the error: pandas.errors.ParserError: Expected 4 fields in line 11, saw 5. Error could possibly be due to quotes being ignored when a multi-char delimiter is used. Did you try reading the file directly? Reading from txt = ' ' works fine.
I didn't test it with the full file. I don't think it cares about quotes when it uses its own separator. But maybe one column has longer text and two columns are too close (fewer than 2 spaces apart), so it may treat them as a single column. I will check the full file later. Maybe it will need read_fwf, or you may need to write your own method to read it (line by line).
I checked the full file. The problem is the , followed by a space in line 11, in the last column, in the text ... \*_std.sqm, flags: ..., because sep=r', |\s{2,}' treats it as another separator and creates 5 columns instead of 4 (only for this line). But the read_fwf methods work with the full file.
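One way to sidestep the extra-separator problem, sketched here as an assumption rather than something tested on the full file, is to read whole lines and pull the fields out with a single regex via Series.str.extract, letting the last group capture the rest of the line. The second sample line is made up to imitate a message containing a comma followed by a space.

```python
import pandas as pd

lines = pd.Series([
    "2016-09-28 04:30:31, Info                  CBS    SQM: Requesting upload of all unsent reports.",
    # Made-up line (not from the real log) with ", " inside the message
    "2016-09-28 04:30:32, Info                  CBS    hypothetical message, flags: none",
])

# Named groups become the column names; the last group takes the rest of the line
pattern = (
    r"^(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\s+"
    r"(?P<info>\S+)\s+"
    r"(?P<other>\S+)\s+"
    r"(?P<text>.*)$"
)

df = lines.str.extract(pattern)

# Convert the timestamp strings to datetime objects
df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%d %H:%M:%S")

print(df)
```

Lines that do not match the pattern come back as rows of NaN, which makes them easy to find and inspect.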
