Show HN: Unicode Separated Values (USV) – Active Internet-Draft

Leftium · on March 6, 2024

Also take a look at WSV: Whitespace Separated Values

And related formats like OML:

- https://hw.leftium.com/#/item/39139115

rhelz · on March 6, 2024

The central problem (I'd almost say the fatal mistake) of these formats is that the text we want to send is very likely to contain the characters used to delimit it.

Like null pointers, this is something which is inherently error prone, insecure, and a continual source of bug reports.
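
To make the failure mode concrete, here is a minimal sketch in Python. The sample row is made up for illustration. It shows a naive comma join losing the field boundaries, and the quoting that the standard `csv` module has to add to get them back.

```python
import csv
import io

# Hypothetical sample row: the first field contains the delimiter itself.
row = ["Smith, Jane", 'said "hi"']

# A naive join/split cannot tell the data comma from the delimiter comma.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")  # three fields come back, not two

# The csv module recovers the row, but only by quoting and doubling quotes.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
quoted_line = buffer.getvalue()
round_tripped = next(csv.reader(io.StringIO(quoted_line)))
```

Every reader and writer has to agree on those quoting rules, and that agreement is where the bug reports come from.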

The genius of ASCII (and the perceptiveness of the proposers of formats like USV) is that way back in the early 1960s, its designers foresaw this problem and gave us a solution to it: ASCII has a rich set of delimiting characters which are deliberately kept separate from the characters you are most likely to transmit as text.

At one stroke, this eliminates all the questions of which delimiting characters to use (commas? whitespace? vertical bars? newline? return? return and newline?), how to escape delimiting characters (put them in quotes? prefix them with a backslash? use the ASCII "ESC" character?), and how to escape the escape characters (multiple ways to quote a string: single quotes, double quotes, etc.).

And it eliminates the bugs caused by the subtle difficulties of making such escaping work in a secure fashion, along with the need to update all your code when somebody releases a new version of their CSV package, which fixes some bugs which other packages use as features....

ASCII gives us delimiting characters on 4 levels: field, record, group and file. It is far more capable than CSV files. It's powerful enough to represent the tabular data of a relational database, or a bunch of related spreadsheets. It even lets us specify a header where we can define column names, etc. It really is a shame we haven't been using those all along.
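
As a rough sketch of the idea (not the USV spec's exact syntax), the snippet below joins fields, records, and groups with the raw ASCII separator codes. The function names are hypothetical, and rejecting fields that already contain a separator is an assumption of this sketch, not something the comment prescribes.

```python
# ASCII separator control characters (code points from the ASCII table).
FS = "\x1c"  # file separator
GS = "\x1d"  # group separator
RS = "\x1e"  # record separator
US = "\x1f"  # unit separator, used here between fields


def encode_group(records):
    """Encode one table: a list of records, each a list of field strings."""
    for record in records:
        for field in record:
            # Assumption of this sketch: refuse data that contains a separator.
            if any(sep in field for sep in (FS, GS, RS, US)):
                raise ValueError("field contains a separator character")
    return RS.join(US.join(record) for record in records)


def decode_group(text):
    """Split one encoded table back into records and fields."""
    return [record.split(US) for record in text.split(RS)]


def encode_file(groups):
    """Encode several tables (for example, related spreadsheets) as one string."""
    return GS.join(encode_group(group) for group in groups)


def decode_file(text):
    """Split an encoded string back into its tables."""
    return [decode_group(group) for group in text.split(GS)]


# Commas, quotes, and newlines in the data need no escaping at all.
people = [["name", "quote"], ["Smith, Jane", 'She said "hi"\nand left']]
cities = [["city"], ["Paris"], ["Lima"]]
restored = decode_file(encode_file([people, cities]))
```

The first record of each group can serve as the header row of column names, as in `people` above.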

It's actually comparable to JSON in its ability to represent articulated data--arguably better, because it is table-based and not just key-value based. Any method used to swizzle JSON into tabular form for a relational database can encode JSON using these delimiting characters. And without having to quote all the key strings!!

rhelz · on March 6, 2024

This is a great idea. I remember reading a blog entitled something like "ascii-separated values" as a replacement for CSV files. So I took another look at the ascii table...

...and when I did, I was surprised to find not just field, record, group and file separators, but also values for packet-based networking, values for network handshaking, support for synchronous and asynchronous data transmission....even support for sessions, heartbeats, etc.

... basically in an embryonic form (or a fat-free, pre-crufty form, depending on your perspective) everything you needed to do anything from creating a format to store relational data on a disk to packet-switched networking. Come to think of it, teletypes themselves were networked nodes; in retrospect it shouldn't be surprising that ASCII would have rich support for the kinds of devices it was used on.

I really hope this proposal has legs. CSV files are in desperate need of replacing. They are ambiguous, insecure, and non-standardized. No matter how careful you are, you'll just get an endless stream of obscure bug reports, where somebody has escaped something weirdly, or forgotten to escape something, or switched delimiting characters midstream, etc.

And from the beginning, they were a completely unnecessary hack and a self-inflicted wound: dedicated characters do an end run around all such potential problems, and we've had them since the '60s.

And what other things might we fix by taking another look at the legacy we inherited from our ancestors?

jph · on March 6, 2024

This is really good info, thank you. Can you explain more about what you learned about the streaming, transmission, sessions, etc.? I'm the author of the USV spec and very much interested.

rhelz · on March 6, 2024

Sure...and thanks again for writing up this spec.

On the hardware, physical networking layer, ASCII gives us SYN, a synchronous idle, which is a bit pattern chosen to make it easy for devices to sync their carrier frequencies and transmission speeds. We also have DC1-4 to specify commands to the networking hardware for additional settings and control. We have EM (end of medium) to indicate we want to stop communicating on this connection. And most charmingly of all, we even have a BEL we can ring to wake up the late-night operator! :-)

We also have characters to quite richly format a data packet: start of heading (SOH), start of text (STX), end of text (ETX). We can indicate whether the content is text or binary using DLE (data link escape), or we can indicate any encoding we want with the SI and SO (shift in, shift out) characters and some extra user-defined protocol. We even have ETB (end of transmission block) to indicate that the content has been broken into several packets.
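
A minimal sketch of that packet layout, assuming the simplest possible framing: SOH, a heading, STX, the text, ETX. The `frame` and `unframe` helpers and the heading contents are hypothetical; real link protocols built on these characters add more (block checks, DLE transparency, and so on).

```python
SOH = "\x01"  # start of heading
STX = "\x02"  # start of text
ETX = "\x03"  # end of text


def frame(heading, text):
    """Wrap a heading and a text body in SOH ... STX ... ETX."""
    for part in (heading, text):
        # Assumption of this sketch: payloads never contain framing characters.
        if any(ch in part for ch in (SOH, STX, ETX)):
            raise ValueError("payload contains a framing character")
    return SOH + heading + STX + text + ETX


def unframe(packet):
    """Return (heading, text) from a packet produced by frame()."""
    if not (packet.startswith(SOH) and packet.endswith(ETX)):
        raise ValueError("not a framed packet")
    heading, separator, text = packet[1:-1].partition(STX)
    if not separator:
        raise ValueError("missing STX")
    return heading, text


# The heading is plain text here; its key=value layout is made up.
packet = frame("to=node7;len=5", "hello")
heading, text = unframe(packet)
```

Because the framing characters never appear in ordinary text, the heading can carry plain-text metadata such as an address or a length without any escaping.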

And as your proposal notes, we have 4 layers of separator characters we can use to indicate fields, records, etc within the text of the packet. Spreadsheets, and tabular data from relational databases have a simple and direct encoding, as do X-Y plots and vector-based data. It's comparable to JSON in its ability to articulate structured data--arguably better, since it's table-based and not just key-value based. And since it specifies a flattened-table format instead of a tree format like JSON, we can even directly encode things like circular linked-lists or graphs with cycles in them.
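
The point about cycles can be sketched with a circular linked list stored as a flat table, one record per node with the fields id, value, and next id. The table layout and the helper names are assumptions made for this illustration.

```python
RS = "\x1e"  # record separator
US = "\x1f"  # unit separator, used here between fields


def encode_nodes(nodes):
    """nodes maps a node id to (value, next id); emit one record per node."""
    return RS.join(
        US.join([node_id, value, next_id])
        for node_id, (value, next_id) in nodes.items()
    )


def decode_nodes(text):
    """Rebuild the id -> (value, next id) mapping from the flat table."""
    nodes = {}
    for record in text.split(RS):
        node_id, value, next_id = record.split(US)
        nodes[node_id] = (value, next_id)
    return nodes


def walk(nodes, start, steps):
    """Follow next-id references for a fixed number of steps."""
    values = []
    current = start
    for _ in range(steps):
        value, current = nodes[current]
        values.append(value)
    return values


# A three-node ring: c points back to a, which a plain tree could not express.
ring = {"a": ("red", "b"), "b": ("green", "c"), "c": ("blue", "a")}
decoded = decode_nodes(encode_nodes(ring))
visited = walk(decoded, "a", 4)
```

The cycle survives because nodes refer to each other by id rather than by nesting.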

For session management, we have ENQ (enquiry) to start a session/initiate login/solicit credentials, ACK and NAK for heartbeating, CAN (cancel) to provide out-of-band directives for session control (such as halting a large file transfer or to interrupt a slow process.) To logout/terminate the session we have EOT (end of transmission).

And the great part is that because these are all dedicated characters--distinct from the characters we would typically use to transmit text--we don't have to do any kind of tedious and spoofable text escaping. And yet, we can still use plain text to encode any additional protocols we want (like specifying a network address, or indicating how long the text is, or checksums) because we can delimit such plain text with the reserved characters.

ASCII takes 100 years of experience in telegraphy and phone networking, and distills it into a microcosm of everything we needed to create the networked/connected world we have today. All this in the 1960's!!!

Sadly, but perhaps inevitably, the adoption of ASCII was way faster than the understanding of ASCII, so lots of the characters have acquired unintended meanings. Even worse, we invented abominations like CSV files....

Also, perhaps inevitably, Unicode just doesn't seem to have the kind of condensed, crystalline beauty which ASCII has...but what can you do. I can remember, when I first looked at ASCII, it seemed weird to have letters like "ABC" sitting cheek by jowl with data control commands and ringing bells. What kind of alphabet is this?? chuckle. Seemed like a category error to me. Well, once again we're having to expand our conceptions of what a "character" is....