I have a pipe-delimited (|) CSV file with data as shown below.
AccountID|BounceSubcategory|BounceTypeID|BounceType|SMTPBounceReason|SMTPMessage|SMTPCode|TriggererSendDefinitionObjectID|TriggeredSendCustomerKey|IsFalseBounce|dwHashKey
1111111111|Unknown|1|delayed|5.0.350 Remote server returned an error|{"Source":"MITS_SMTP", "Machine":"ATL1S11MITS107"}|580|||True|
1212121212|Mailbox Full|1|delayed|5.2.2 Generating server: MW4PR84MB2114.NAMPRD84.PROD.OUTLOOK.COM
[email protected]
Remote server returned 554 5.2.2 mailbox full;
STOREDRV.Deliver.Exception:=
QuotaExceededException; Failed to process message due to a permanent
except=
ion wit|{"Source":"MITS_SMTP", "Machine":"ATL1S11MITS117"}|534|||False|
However, when the CSV is read into a DataFrame, every physical line of the file becomes its own row, and the output looks like this:
|AccountID|BounceSubcategory|BounceTypeID|BounceType|SMTPBounceReason|SMTPMessage|SMTPCode|TriggererSendDefinitionObjectID|TriggeredSendCustomerKey|IsFalseBounce|dwHashKey
|1111111111|Unknown|1|delayed|5.0.350 Remote server returned an error|{"Source":"MITS_SMTP", "Machine":"ATL1S11MITS107"}|580|NULL|NULL|True|NULL|
|1212121212|Mailbox Full|1|delayed|5.2.2 Generating server: MW4PR84MB2114.NAMPRD84.PROD.OUTLOOK.COM|NULL|NULL|NULL|NULL|NULL|NULL|
|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|
|[email protected]|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|
|Remote server returned 554 5.2.2 mailbox full;|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|
|STOREDRV.Deliver.Exception:=|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|
|QuotaExceededException; Failed to process message due to a permanent|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|
|except=|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|
|ion wit|{"Source":"MITS_SMTP", "Machine":"ATL1S11MITS117"}|534|NULL|NULL|False|NULL|NULL|NULL|NULL|
How can I read the data in PySpark so that the line breaks inside the SMTPBounceReason field do not split the record, and each record ends up as a single DataFrame row?
Expected output:
AccountID|BounceSubcategory|BounceTypeID|BounceType|SMTPBounceReason|SMTPMessage|SMTPCode|TriggererSendDefinitionObjectID|TriggeredSendCustomerKey|IsFalseBounce|dwHashKey
1111111111|Unknown|1|delayed|5.0.350 Remote server returned an error|{"Source":"MITS_SMTP", "Machine":"ATL1S11MITS107"}|580|||True|
1212121212|Mailbox Full|1|delayed|5.2.2 Generating server: MW4PR84MB2114.NAMPRD84.PROD.OUTLOOK.COM [email protected] Remote server returned 554 5.2.2 mailbox full; STOREDRV.Deliver.Exception:= QuotaExceededException; Failed to process message due to a permanent except=ion wit|{"Source":"MITS_SMTP", "Machine":"ATL1S11MITS117"}|534|||False|
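Have you tried .option("multiline","true")? Note that it only helps when the field containing the line breaks is wrapped in quotes, e.g. "5.2.2 Generating server: ... ion wit". If you can change how the source file is generated, I would fix it there. Otherwise, you need to repair the file either manually or by processing it with some raw operations before parsing it as CSV.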
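One possible shape for the raw-operations route is sketched below in plain Python. It relies on two assumptions that are not confirmed by the question: no field ever contains a literal | character, and the last column (dwHashKey) never contains a line break. Under those assumptions every complete record has as many delimiters as the header, so physical lines can be merged until that count is reached. The function name stitch_records and the choice to join fragments with a single space are illustrative only, and the sample is abridged from the lines in the question.

```python
def stitch_records(lines, delimiter="|", joiner=" "):
    """Merge physical lines into logical records by counting delimiters.

    Assumes fields never contain the delimiter itself and that the
    last column never contains a line break.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\r\n")
    expected = header.count(delimiter)  # delimiters per complete record
    records = [header.split(delimiter)]
    pending, seen = [], 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # drop blank lines left by the embedded line breaks
        pending.append(line)
        seen += line.count(delimiter)
        if seen > expected:
            raise ValueError("too many delimiters in record: %r" % pending)
        if seen == expected:
            records.append(joiner.join(pending).split(delimiter))
            pending, seen = [], 0
    if pending:
        raise ValueError("incomplete record at end of input: %r" % pending)
    return records


# Abridged sample based on the rows shown in the question.
sample = [
    "AccountID|BounceSubcategory|BounceTypeID|BounceType|SMTPBounceReason|"
    "SMTPMessage|SMTPCode|TriggererSendDefinitionObjectID|"
    "TriggeredSendCustomerKey|IsFalseBounce|dwHashKey\n",
    '1111111111|Unknown|1|delayed|5.0.350 Remote server returned an error|'
    '{"Source":"MITS_SMTP", "Machine":"ATL1S11MITS107"}|580|||True|\n',
    "1212121212|Mailbox Full|1|delayed|5.2.2 Generating server: "
    "MW4PR84MB2114.NAMPRD84.PROD.OUTLOOK.COM\n",
    "\n",
    "Remote server returned 554 5.2.2 mailbox full;\n",
    "except=\n",
    'ion wit|{"Source":"MITS_SMTP", "Machine":"ATL1S11MITS117"}|534|||False|\n',
]

records = stitch_records(sample)
```

If the file fits on the driver, the result could then be handed to Spark with something like spark.createDataFrame(records[1:], schema=records[0]), which yields all-string columns. For larger files the same counting logic would have to run as a separate preprocessing step that writes a cleaned file for Spark to read.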