
I would want to experiment with the root structure of XML.

Make it possible to have multiple root elements. One problem is that however big an XML document is, it has to be closed at the end, so it ends up as one atomic element; you can't correctly parse an XML document only partially.



XSLT 3.0 allows streaming for large XML files. See e.g. https://www.saxonica.com/html/documentation10/sourcedocs/str...


Isn't that what SAX (event based) and reader (pull based) XML parsing is for? Those allow you to incrementally/partially parse XML.
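A minimal sketch of the event-based (SAX) style in Python, using the standard library (the ItemCounter handler and `<item>` elements are made up for illustration):

```python
import io
import xml.sax

# The handler sees each element as it streams past, so processing can
# start before the document is fully read (or even fully written).
class ItemCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1

handler = ItemCounter()
xml.sax.parse(io.BytesIO(b"<root><item/><item/><item/></root>"), handler)
print(handler.count)  # 3
```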


Well... there is nothing stopping you from partially parsing an XML document. What you can't do is validate it. Which is the same for any other file format: you can't be sure a file is fully valid without fully parsing the whole thing.

However, the only order in which you can partially parse an XML file is linear order, which clashes badly with the fact that it's a tree-based format. Depending on how tree-like your schema is, this might be a massive hindrance.

This flaw isn't unique to XML. All text encoded file formats share this characteristic, and can only be parsed linearly. If you move across to the world of binary file formats, it's extremely common for them to have indexes of offsets so a parser can navigate a tree-like structure in tree order without having to fully parse it, along with other types of non-linear data structures.
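As an illustration of the offset-index idea, here's a toy, entirely made-up binary layout in Python: a record count, then an offset table, then fixed-size records. A reader can jump straight to any record without scanning the ones before it:

```python
import io
import struct

# Toy layout, purely illustrative: [count][offset table][records].
records = [b"alpha---", b"beta----", b"gamma---"]  # 8 bytes each
header_size = 4 + 4 * len(records)
offsets = [header_size + 8 * i for i in range(len(records))]

buf = io.BytesIO()
buf.write(struct.pack("<I", len(records)))   # record count
for off in offsets:
    buf.write(struct.pack("<I", off))        # offset index
for rec in records:
    buf.write(rec)

# Random access: read record 2's offset from the index and seek
# straight to it, skipping the earlier records entirely.
buf.seek(4 + 4 * 2)
(target,) = struct.unpack("<I", buf.read(4))
buf.seek(target)
rec = buf.read(8)
print(rec)  # b'gamma---'
```

Real formats of course add checksums, variable-length records and so on, but the seek-via-index principle is the same.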


> All text encoded file formats share this characteristic, and can only be parsed linearly. If you move across to the world of binary file formats, it's extremely common for them to have indexes of offsets so a parser can navigate a tree-like structure in tree order without having to fully parse it, along with other types of non-linear data structures.

You don't always have the luxury of randomly accessing a file (obvious example: shell pipelines with a producer and consumer exchanging lots of temporary data), so taking advantage of indexing might require saving a temporary file and stalling processing until the file is ready.

Personally, I'm used to parsing large XML files with event-based APIs and throwing away data aggressively, keeping in memory only one unfinished element of interest, the stack of its ancestors, and my collected data (instead of a DOM for the whole document).
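One way to sketch that style in Python is ElementTree's iterparse with aggressive clear() calls (the `<log>`/`<record>` schema here is invented):

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input: a large document with many <record> children.
xml_bytes = (
    b"<log>"
    + b"".join(b"<record n='%d'/>" % i for i in range(1000))
    + b"</log>"
)

total = 0
# iterparse yields each element as its end tag arrives; clearing the
# element afterwards keeps memory bounded regardless of document size.
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "record":
        total += 1
        elem.clear()  # discard children/text we no longer need
print(total)  # 1000
```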


It's super common for XML parsers to support partial documents.


and, AIUI, XMPP is based entirely around that very idea so it's not even a niche concept (well, aside from XMPP being arguably a niche protocol)


What do you mean by "partially parse"?


If you imagine a log file that gets lines added to it now and then, it's actually quite fiddly to do that with XML because of the closing outer tag.
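A quick Python sketch of the problem (the `<log>`/`<entry>` tags are invented): a log file mid-append hasn't had its closing tag written yet, so a naive whole-document parse fails outright:

```python
import xml.etree.ElementTree as ET

# A writer is still appending entries; </log> hasn't been emitted yet.
partial = "<log><entry>started</entry><entry>ready</entry>"
try:
    ET.fromstring(partial)
    status = "parsed"
except ET.ParseError:
    status = "still open"
print(status)  # still open
```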


Well, I see the awkwardness of emitting a log like that, but it's not any worse than emitting a single JSON array and waiting for that closing ']'.

Both can be emitted and parsed in a streaming fashion, though I wouldn't say that either XML or JSON is suitable for logs. Maybe NDJSON, but that's more of a hack around this limitation.


> but I wouldn't say that either XML or JSON is suitable for logs

Of course. Better use a binary format for this.

https://systemd.io/JOURNAL_FILE_FORMAT/

:-)


For JSON it's common to have multiple records per file. Then it's just one object per line, for example.


Yes, that's NDJSON, as in newline delimited JSON.

http://ndjson.org/
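A tiny Python sketch of the format (the log records here are made up): one complete JSON object per line, so each line parses on its own and an appender just writes a new line, with no trailing ']' to maintain:

```python
import json

records = [
    {"level": "info", "msg": "started"},
    {"level": "warning", "msg": "disk slow"},
]

# Emit: one JSON object per line.
lines = "\n".join(json.dumps(r) for r in records)

# Consume: each line is parsed independently, even mid-stream.
parsed = [json.loads(line) for line in lines.splitlines()]
print(parsed == records)  # True
```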



