
I am trying to load data into Redshift from a JSON file on S3.

But this file contains a format error: each line is wrapped in '$' quote characters:

${"id":1,"title":"title 1"}$
${"id":2,"title":"title 2"}$

The error was introduced while exporting the data from PostgreSQL.

Now, when I try to load the data into Redshift, I get the error "Invalid value" for raw_line "$".

Is there any way to escape these symbols using the Redshift COPY command and avoid re-uploading or transforming the data?

MY COMMANDS

-- CREATE TABLE
create table my_table (id BIGINT, title VARCHAR);

-- COPY DATA FROM S3
copy my_table from 's3://my-bucket/my-file.json'
credentials 'aws_access_key_id=***;aws_secret_access_key=***'
format as json 'auto';

Thanks in advance!

  • Also, I'm now wondering how to modify the data on S3 without downloading the files locally (given the huge size of the data). Commented Dec 22, 2020 at 18:02

1 Answer

I don't think there is a simple "ignore this" option that will work in your case. You could try NULL AS '$', but I expect that will just confuse things in different ways.

Your best bet is to filter the files and replace the originals with the fixed versions. As you note in your comment, downloading them to your own system, modifying them, and pushing them back is not a good option due to their size: you would pay in transfer speed over the internet and in data-out costs from S3. You want to do this "inside" AWS.

There are a number of ways to do this, and I expect the best choice will be based on what you can do quickly rather than the absolute best way. (It sounds like this is a one-time fix operation.) Here are a few:

  • Fire up an EC2 instance and do the download-modify-upload process to this system inside of AWS. Remember to have an S3 endpoint in your VPC.
  • Create a Lambda function to stream the data in, modify it, and push back to S3. Just do this as a streaming process since you won't want to download very large files to Lambda in their entirety.
  • Define a Glue process to strip out the unwanted characters. This will need some custom coding as your files are not in a valid json format.
  • Use CloudShell to download the files, modify them, and upload. There's a 1GB storage limit on CloudShell, so this will need to work on smallish chunks of your data, but it doesn't require starting an EC2 instance. This is a new service, so there may be other issues with this path, but it could be an interesting choice.
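The core transform for any of these options is the same: read the export line by line, strip the stray '$' wrappers, and write the cleaned lines back out. Here's a minimal sketch in Python that processes the data as a stream (so it also fits the Lambda option), assuming '$' only ever appears as the line wrapper and not inside your values; the file-like objects here stand in for an S3 StreamingBody and an upload buffer, which you would wire up with boto3 in practice:

```python
import io

def clean_stream(src, dst):
    """Strip the stray '$' wrappers from each line of a JSON-lines export.

    src and dst are binary file-like objects. In a real Lambda you would
    pass the S3 object's StreamingBody as src and upload dst's contents
    back to S3 with boto3 (names and wiring here are illustrative only).
    """
    for raw in src:
        line = raw.decode("utf-8").strip()
        if not line:
            continue  # skip blank lines
        # '${"id":1,"title":"title 1"}$' -> '{"id":1,"title":"title 1"}'
        dst.write(line.strip("$").encode("utf-8") + b"\n")

# Demo with the lines from the question:
src = io.BytesIO(b'${"id":1,"title":"title 1"}$\n${"id":2,"title":"title 2"}$\n')
dst = io.BytesIO()
clean_stream(src, dst)
# dst now holds valid JSON lines that COPY ... format as json 'auto' accepts
```

Because the transform never holds more than one line in memory, it works on files far larger than the Lambda or CloudShell disk limits.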

There are other possible choices (EMR, for example), but these seem like the likely ones. I like playing with new things (especially when they are free), so if it were me, I'd try CloudShell.


1 Comment

Thanks @BillWeiner for the detailed answer !
