I have a pipe-delimited (|) CSV file with data as shown below. One of the fields contains line breaks and is not quoted.

AccountID|BounceSubcategory|BounceTypeID|BounceType|SMTPBounceReason|SMTPMessage|SMTPCode|TriggererSendDefinitionObjectID|TriggeredSendCustomerKey|IsFalseBounce|dwHashKey
1111111111|Unknown|1|delayed|5.0.350  Remote server returned an error|{"Source":"MITS_SMTP", "Machine":"ATL1S11MITS107"}|580|||True|
1212121212|Mailbox Full|1|delayed|5.2.2  Generating server: MW4PR84MB2114.NAMPRD84.PROD.OUTLOOK.COM

[email protected]
Remote server returned 554 5.2.2 mailbox full;
STOREDRV.Deliver.Exception:=
QuotaExceededException; Failed to process message due to a permanent
except=
ion wit|{"Source":"MITS_SMTP", "Machine":"ATL1S11MITS117"}|534|||False|

However, when the CSV is read into a DataFrame, every physical line becomes its own row, and the output looks like this:

|AccountID|BounceSubcategory|BounceTypeID|BounceType|SMTPBounceReason|SMTPMessage|SMTPCode|TriggererSendDefinitionObjectID|TriggeredSendCustomerKey|IsFalseBounce|dwHashKey
|1111111111|Unknown|1|delayed|5.0.350  Remote server returned an error|{"Source":"MITS_SMTP", "Machine":"ATL1S11MITS107"}|580|NULL|NULL|True|NULL|
|1212121212|Mailbox Full|1|delayed|5.2.2  Generating server: MW4PR84MB2114.NAMPRD84.PROD.OUTLOOK.COM|NULL|NULL|NULL|NULL|NULL|NULL|
|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|
|[email protected]|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|
|Remote server returned 554 5.2.2 mailbox full;|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|
|STOREDRV.Deliver.Exception:=|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|
|QuotaExceededException; Failed to process message due to a permanent|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|
|except=|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|
|ion wit|{"Source":"MITS_SMTP", "Machine":"ATL1S11MITS117"}|534|NULL|NULL|False|NULL|NULL|NULL|NULL|

How can I read the data in PySpark so that each multi-line record ends up in a single DataFrame row, as shown below?

Expected output is as below:

AccountID|BounceSubcategory|BounceTypeID|BounceType|SMTPBounceReason|SMTPMessage|SMTPCode|TriggererSendDefinitionObjectID|TriggeredSendCustomerKey|IsFalseBounce|dwHashKey
1111111111|Unknown|1|delayed|5.0.350  Remote server returned an error|{"Source":"MITS_SMTP", "Machine":"ATL1S11MITS107"}|580|||True|
1212121212|Mailbox Full|1|delayed|5.2.2  Generating server: MW4PR84MB2114.NAMPRD84.PROD.OUTLOOK.COM [email protected] Remote server returned 554 5.2.2 mailbox full; STOREDRV.Deliver.Exception:= QuotaExceededException; Failed to process message due to a permanent except=ion wit|{"Source":"MITS_SMTP", "Machine":"ATL1S11MITS117"}|534|||False| 
  • Did you try .option("multiline","true")? Commented May 12 at 19:14
  • Yes, the multiline option did not work. Commented May 12 at 19:17
  • Please edit your question to show a minimal reproducible example demonstrating what you've tried so far. It would also help us if you could provide some example CSV data in the proper format. Commented May 12 at 19:24
  • Please show us the actual raw csv file contents. The image you posted looks like it is an Excel screenshot, which does not give us enough information. Commented May 12 at 19:36
  • I don't think multiLine will work unless the field containing newlines is enclosed in double quotes. In your case, you need double quotes around the text "5.2.2 Generating server: ... ion wit". If you can change how the source file is generated, I would fix it there. Otherwise, you need to repair the file either manually or with some raw text processing. Commented May 14 at 2:59

1 Answer

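This should return the results you expect, provided the multi-line field is enclosed in double quotes in the file (multiline parsing only spans newlines that sit inside a quoted field). The file is pipe-delimited, so the separator also has to be set explicitly:

df = (spark.read.option('header', True)
               .option('sep', '|')           # the file is pipe-delimited, not comma-delimited
               .option('multiline', True)    # let a quoted field span several lines
               .option('mode', 'PERMISSIVE') # keep malformed rows instead of failing
               .option('quote', '"')
               .option('escape', '"')
               .csv('yourpath/CSVFilename.csv'))
display(df)  # notebook helper (e.g. Databricks); elsewhere use df.show(truncate=False)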
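
If the source file cannot be regenerated with quoted fields, one possible workaround is to repair the broken records before Spark parses them. Below is a minimal sketch in plain Python. It assumes that no field contains a literal `|` and that every complete record has the same number of delimiters as the header. The function name `stitch_records`, its parameters, and the shortened sample data are made up for illustration and are not part of the question's file.

```python
def stitch_records(lines, sep="|", joiner=" "):
    """Merge physical lines into logical records.

    A record is considered complete once it contains as many
    delimiters as the header line (assumption: no field contains `sep`).
    """
    lines = iter(lines)
    header = next(lines).rstrip("\r\n")
    expected = header.count(sep)  # delimiters per complete record
    records = [header]
    parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # drop blank lines, including those inside a broken field
        parts.append(line)
        candidate = joiner.join(parts)
        if candidate.count(sep) >= expected:
            records.append(candidate)
            parts = []
    if parts:
        # trailing fragment that never reached the expected delimiter count
        records.append(joiner.join(parts))
    return records


# Shortened, hypothetical sample with 4 columns instead of the question's 11
sample = [
    "AccountID|BounceSubcategory|SMTPBounceReason|SMTPCode",
    "1111111111|Unknown|5.0.350  Remote server returned an error|580",
    "1212121212|Mailbox Full|5.2.2  Generating server:",
    "",
    "Remote server returned 554 5.2.2 mailbox full;",
    "ion wit|534",
]
result = stitch_records(sample)
for record in result:
    print(record)
```

The repaired lines could then be written back to a file, or parallelized into an RDD of strings and passed to `spark.read.csv(rdd, sep='|', header=True)`, which accepts an RDD of CSV rows as well as a path. Note that this sketch joins every continuation line with a space, so `except=` followed by `ion wit` comes out as `except= ion wit` rather than `except=ion wit`; adjust the joiner if that matters.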