
I need to import huge XML files into a database. After that, I need to transform the data into another format.

At the moment I'm trying to do that with Postgres.

I've already imported a 250 MB file to a table using

INSERT INTO test
(name, "element")
SELECT
     (xpath('//title/text()', myTempTable.myXmlColumn))[1]::text AS name
     ,myTempTable.myXmlColumn as "element"
FROM unnest(
    xpath
    (    '//test'
        ,XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('test.xml'), 'UTF8'))
    )
) AS myTempTable(myXmlColumn)
;

But with bigger files (I tried one > 1 GB) I get

SQL Error [22023]: ERROR: requested length too large

My goal is to import and transform files with a size ~50 GB.

Any suggestions/alternatives?

Update:

The idea is not to import 1 GB files into one field. The code above was able to load AND unnest my 250 MB file into 1773844 rows in 3m 57s on my machine; I think that's not bad. After the file is imported, I can transform the data relatively fast because Postgres is good at that.

Any better ideas?

  • Consider an alternative language that connects to postgres and has support for a streaming XML reader. Commented Feb 21, 2018 at 8:40
  • Sure, I tried that using C# and it worked, but iterating through huge files always takes a long time. I'm now considering splitting the files and using the above script. Commented Feb 21, 2018 at 8:49
  • The only limitation of your code is that the file has to be on the database server, which is not always a possible approach. To be more flexible, better use COPY from STDIN Commented Feb 21, 2018 at 18:18

1 Answer


Have you tried this combination of \COPY + UNNEST?

Using an intermediate table ..

CREATE TABLE tmp_tb (tmp_xml XML);

Perform the import using psql ..

cat huge.xml | psql db -c "\COPY tmp_tb (tmp_xml) FROM STDIN;"

Once you have your XML loaded, you can internally parse it ..

INSERT INTO tb (test) 
SELECT UNNEST(XPATH('//test', tmp_xml)) FROM tmp_tb;
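One caveat: COPY's text format treats every newline as a row separator and backslash as an escape character, so the raw cat huge.xml | psql pipe only works if the whole document sits on one line and contains no backslashes. A hedged preprocessing sketch (the function name is illustrative; tabs and carriage returns may need similar treatment depending on your data):

```shell
# xml_to_copy_row FILE -- flatten a multi-line XML document into a single
# row of COPY text format: strip newlines/CRs, escape literal backslashes
xml_to_copy_row() {
  tr -d '\n\r' < "$1" | sed 's/\\/\\\\/g'
}
# usage (names from the answer above):
#   xml_to_copy_row huge.xml | psql db -c "\COPY tmp_tb (tmp_xml) FROM STDIN;"
```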

2 Comments

Unfortunately this won't work if your XML source has any formatting errors. You'll receive errors like "ERROR: invalid XML content; DETAIL: line 1: Premature end of data in tag x line 1"
@BenWilson, well, with an invalid XML doc it is OK to get an error :-D Perhaps you should remove the invalid chars with sed or even perl, e.g. cat huge.xml | perl -pe 's/\\\n/\n/g' | psql db -c "COPY ...;"
