
I want to check PostgreSQL's XML capability before moving large amounts of data into the database.

I have a simple test for this. It works with 3,000,000 characters of XML text data but fails with 4,000,000:

select xml_is_well_formed('<test>' || repeat('월', 3000000) || '</test>');

select xml_is_well_formed('<test>' || repeat('월', 4000000) || '</test>');

This is what my XML capability test script looks like (XML_Test.sh):

#!/bin/bash
#### XML capabilities
sql1="select xml_is_well_formed('<test>' || repeat('월', 3000000) || '</test>');"
sql2="select xml_is_well_formed('<test>' || repeat('월', 4000000) || '</test>');"

status="OK:"

# Run the query and capture the single-value result ('t' or 'f')
ret=$(echo "$sql1" | psql -At -U "$user" -h "$host" "$db")

if [ "$ret" != "t" ]
then
  status="FAILED:"
fi
echo "  $status XML capability (test 1/libxml)"

status="OK:"

ret=$(echo "$sql2" | psql -At -U "$user" -h "$host" "$db")

if [ "$ret" != "t" ]
then
  status="FAILED:"
fi
echo "  $status XML capability (test 2/libxml)"

I'm using an Amazon Linux AMI, my PostgreSQL version is 9.2.24, and I'm using the default PostgreSQL configuration.

Edit: My total system memory is 32 GB.

Running the commands below only indicates whether the test passes or fails:

$ echo "select xml_is_well_formed('<test>' || repeat('월', 4000000) || '</test>')" | psql -At -U USER -h localhost DB
f


$ echo "select xml_is_well_formed('<test>' || repeat('월', 3000000) || '</test>')" | psql -At -U USER -h localhost DB
t
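
If you want the parser's actual error rather than just t/f, one option (xml_is_well_formed deliberately swallows the error and returns a boolean) is to cast the string to xml, which lets the underlying libxml2 error surface:

$ echo "select ('<test>' || repeat('월', 4000000) || '</test>')::xml" | psql -At -U USER -h localhost DB

This should fail with an error like the one quoted in the answer below.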

Thank you!

Comments:
  • If you don't tell us the error message, nobody can help you. Please edit the question for that. Commented Dec 2, 2019 at 7:05
  • @Muhaddis the only thing this shows is that the test is invalid. There's only one XML element with a huge amount of text that can't be queried; there's no structure there. The function xml_is_well_formed isn't used in querying and doesn't have to work with such amounts anyway. It's the loader's job to ensure the input is valid, and a single XML element with 1 GB of text isn't a valid choice. It's a BLOB and should be treated as one. The query doesn't test loading, insertion, indexing, or querying. Commented Dec 2, 2019 at 8:52
  • @Muhaddis the only way this input can be used is as raw text for a full-text search query. It's not even big data, it's just a single BLOB. Commented Dec 2, 2019 at 8:55
  • @Muhaddis in all XML systems and libraries, large amounts of data are handled using SAX interfaces. When you have 100 MB of data to insert, it makes no sense to wait for all 100 MB to be read into memory before you start processing it. SAX and SAX-like interfaces read and parse elements as they appear in the input stream, which means that by the time they finish reading the input, the output may already be available. Commented Dec 2, 2019 at 8:59
  • @Muhaddis databases, even document databases, shred XML docs into individual elements and index those, allowing them to be queried quickly. That's what makes this test unsuitable: there's only one element with an opaque value. There's nothing to query or index there. Commented Dec 2, 2019 at 9:00

1 Answer


I think it is a restriction in the underlying XML parser library, libxml2.

=> SELECT xmlparse(document '<?xml version="1.0"?><test>' || repeat('월', 4000000) || '</test>');
ERROR:  invalid XML document
DETAIL:  line 1: xmlSAX2Characters: huge text node
...월월월월월월월월월월월월월월월월월월월월월월월월월월
                                                                               ^
line 1: Extra content at the end of the document
...월월월월월월월월월월월월월월월월월월월월월월월월월월
                                                                               ^

This appears to be libxml2's cap on the size of a single text node: without the XML_PARSE_HUGE parser option (which PostgreSQL does not set), libxml2 rejects any text node larger than 10 MB (XML_MAX_TEXT_LENGTH), which is what the "huge text node" detail is reporting.
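
That matches the boundary in the question: assuming a UTF-8 database, '월' is 3 bytes, so the passing test is 9 MB of text and the failing one is 12 MB. You can confirm the byte counts directly:

=> SELECT octet_length(repeat('월', 3000000));  -- 9000000 bytes: under the 10 MB cap
=> SELECT octet_length(repeat('월', 4000000));  -- 12000000 bytes: over it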

To be honest, if you are planning to work with multi-GB XML documents, I suspect you want a special-purpose system rather than a general RDBMS.
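
If it really is a per-node limit rather than a per-document one, the same amount of text split across many smaller elements should parse; a sketch (element names and sizes are arbitrary, and I haven't verified this on 9.2):

=> SELECT xml_is_well_formed('<test>' || repeat('<e>' || repeat('월', 1000) || '</e>', 4000) || '</test>');

Each <e> node holds only 3 kB of text, so this should return t even though the whole document is still about 12 MB.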


3 Comments

Databases can handle much bigger BLOBs just fine. The OP tried to validate this single-element, 3M-character XML string in memory. Any document, graph, or relational database would choke on that, and none of them needs to do anything like it: their XML contents are already parsed, shredded, and valid. It's the loader that needs to validate.
@PanagiotisKanavos Thanks for the clarification. I think parsing is not the word to use here.
@Muhaddis you should explain what you want to do, and what the real problem is, in the question itself then. What you tried doesn't demonstrate anything. I suspect the function can handle far bigger XML strings if it doesn't have to cache individual multi-MB tokens in memory.
