read_csv allows to use regex in sep= so you can use | (as OR) to define many chars (or even strings) as delimiters.
Because there are spaces in text which shouldn't be used as separator so I use , (comma + space) and \s{2,} (2 or more whitespaces) to detect correct places (but this keeps date and time as single string):
sep=r', |\s{2,}', engine='python'
Minimal working code.
I use io only to create file like object in memory so everyone can simply copy and test it - but you could use filename (or even directly url to raw data read_csv("https://raw.githubusercontent.com/logpai/loghub/refs/heads/master/Windows/Windows_2k.log", ...))
text = r'''
2016-09-28 04:30:30, Info CBS Loaded Servicing Stack v6.1.7601.23505 with Core: C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_6.1.7601.23505_none_681aa442f6fed7f0\cbscore.dll
2016-09-28 04:30:31, Info CSI 00000001@2016/9/27:20:30:31.455 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fef9fb9b6d @0x7fef9f8358f @0xff83e97c @0xff83d799 @0xff83db2f)
2016-09-28 04:30:31, Info CSI 00000002@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa006ade @0x7fef9fd2984 @0x7fef9f83665 @0xff83e97c @0xff83d799)
2016-09-28 04:30:31, Info CSI 00000003@2016/9/27:20:30:31.458 WcpInitialize (wcp.dll version 0.0.0.6) called (stack @0x7fed806eb5d @0x7fefa1c8728 @0x7fefa1c8856 @0xff83e474 @0xff83d7de @0xff83db2f)
2016-09-28 04:30:31, Info CBS Ending TrustedInstaller initialization.
2016-09-28 04:30:31, Info CBS Starting the TrustedInstaller main loop.
2016-09-28 04:30:31, Info CBS TrustedInstaller service starts successfully.
2016-09-28 04:30:31, Info CBS SQM: Initializing online with Windows opt-in: False
2016-09-28 04:30:31, Info CBS SQM: Cleaning up report files older than 10 days.
2016-09-28 04:30:31, Info CBS SQM: Requesting upload of all unsent reports.
'''
import pandas as pd
import io
file_like_object = io.StringIO(text)
df = pd.read_csv(file_like_object, sep=r', |\s{2,}', engine='python', names=['date', 'info', 'other', 'text'])
print(df)
Result:
date info other text
0 2016-09-28 04:30:30 Info CBS Loaded Servicing Stack v6.1.7601.23505 with Co...
1 2016-09-28 04:30:31 Info CSI 00000001@2016/9/27:20:30:31.455 WcpInitialize ...
2 2016-09-28 04:30:31 Info CSI 00000002@2016/9/27:20:30:31.458 WcpInitialize ...
3 2016-09-28 04:30:31 Info CSI 00000003@2016/9/27:20:30:31.458 WcpInitialize ...
4 2016-09-28 04:30:31 Info CBS Ending TrustedInstaller initialization.
5 2016-09-28 04:30:31 Info CBS Starting the TrustedInstaller main loop.
6 2016-09-28 04:30:31 Info CBS TrustedInstaller service starts successfully.
7 2016-09-28 04:30:31 Info CBS SQM: Initializing online with Windows opt-in: ...
8 2016-09-28 04:30:31 Info CBS SQM: Cleaning up report files older than 10 days.
9 2016-09-28 04:30:31 Info CBS SQM: Requesting upload of all unsent reports.
Other idea: pandas has function read_fwf to read fixed-width formatted data - and you can use width of columns to split it in correct way.
It seems it automatically splits your data but it needs to remove , from date after reading. But it also splits date and time as separate columns. It treats date (without time) as index.
df = pd.read_fwf(file_like_object, names=['date', 'info', 'other', 'text'])
df['date'] = df['date'].str.rstrip(',')
print(df)
Using widths you can resolve this problem - but this needs to calculate all widths manually. And this still needs to remove ,
df = pd.read_fwf(file_like_object, names=['date', 'info', 'other', 'text'], widths=[21,22,7,1000])
df['date'] = df['date'].str.rstrip(',')
Probably using colspecs you can define [start,end[ for every column and skip , but this still need to calculate values start,end manually.
Other idea: sometimes it is simpler to read it as normal text and use line.split() or slicing line[start:end] (and other string functions, and also if/elif/else) to split columns and later use DataFrame(data). It can be useful when file has more complex data (ie. text in many lines)
Doc: read_csv, read_fwf
read_csv()has option to use regex insep=- so it allows to use|to create multiple delimiters, and even\sto matchwhitespace(it means space tab newline) as delimiters (or\s+for multiple spaces, multiple tabs, and mix of them)line.split()or slicingline[star:end](especially if columns have constant sizes)fixed-width format. I think it may remove all spaces tabs automatically but it may need to remove,on your own and later convert string to datetime object.