BespON Specification

This is an overview of the BespON specification. A more formal, more detailed specification will follow as soon as the Python implementation is refined further. Additional details are available in the Python implementation, particularly in grammar.py and re_patterns.py.

File format

The default encoding for BespON files is UTF-8. The default extension is .bespon.

As a text-based format, BespON is specified at the level of Unicode code points. Some code points are always invalid as literals within encoded BespON data, and may only be transmitted in backslash-escaped form within strings.

  • All code points with Unicode General_Category Cc (control characters) are prohibited as literals, with the exception of the horizontal tab U+0009 (\t) and line feed U+000A (\n). The carriage return U+000D (\r) is also allowed as a literal if followed immediately by a line feed, in which case the \r\n sequence is normalized to \n. Otherwise, literal \r is prohibited.
  • The line separator and paragraph separator U+2028 and U+2029 are prohibited as literals.
  • All code points with Unicode property Bidi_Control are prohibited as literals. Since these code points can easily cause the appearance of text to differ from the actual order in which code points are entered in a file, they are not appropriate in a text-based data format meant for human editing.
  • The use of a BOM (U+FEFF) is discouraged, since the default encoding is UTF-8. A single BOM at the start of a file or stream is allowed, and discarded. A BOM at any other location is prohibited.
  • All code points with Unicode property Noncharacter_Code_Point are prohibited as literals.
  • All Unicode surrogate code points (U+D800 – U+DFFFF) are prohibited as literals, except when properly paired in an implementation in a language whose string representation requires them (UTF-16, UCS-2, etc.).

None type

None/null/nil is represented as none. All other capitalizations of none are invalid and must produce errors. Unlike YAML, uppercase and titlecase variants of keywords words are NOT equivalent to their lowercase versions, and other capitalization variants are NOT valid unquoted strings. The literal string "none", in any capitalization, can only be represented by using a quoted string.

Bool

Boolean values are represented as true and false. As with none, all other capitalizations are invalid and must produce errors. The only way to obtain the literal strings "true" and "false", in any capitalization, is to use quoted strings. Unlike TOML, the meanings of true and false are NOT context-dependent; they are not strings when used as dict keys, but booleans when used as dict values.

Numbers

Integers

Decimal, hexadecimal, octal, and binary integers are supported. Base prefixes for non-decimal numbers are 0x, 0o, and 0b; other capitalizations are prohibited. A single underscore is allowed after a base prefix before the first digit, and single underscores are allowed between digits (for example, 0x_12_34). Hexadecimal values may use either uppercase A-F or lowercase a-f, but mixing cases is prohibited.

Signs + and - may be separated from the first digit character by spaces or tabs, but this is discouraged. Breaking a line between a sign and the number to which it belongs is prohibited.

Implementations in languages that lack an integer type must interpret integers as floats.

Implementations are expected to use 32-bit unsigned integers at minimum. Any integer value that cannot fit in an implementation's integer type must result in an error, rather than an integer overflow or undefined behavior.

Floats

Decimal and hexadecimal floats are supported. Decimal floats use e or E for the exponent, while p or P are used for hex. Single underscores are allowed after the base prefix, between digits, and before exponents (for example, 0x_12_34_p5_6).

Signs + and - may be separated from the first digit character by spaces or tabs, but this is discouraged. Breaking a line between a sign and the number to which it belongs is prohibited.

Infinity is represented as inf, and not-a-number as nan. All other capitalizations are invalid, and the literal strings "inf" and "nan" require quoted strings.

Implementations are expected to use float types compatible with IEEE 754 binary64. Any non-inf float value that cannot fit in an implementation's float type must result in an error by default, rather than overflowing to inf.

Unicode strings

Unicode normalization is not performed on the Unicode string type, to avoid potential information loss issues. Unicode normalization before saving data in BespON format, or after loading data, is encouraged as appropriate.

Unquoted strings

Unquoted strings that fit the pattern for an ASCII identifier and are not a variant of none/true/false/inf/nan or a reserved word based on inf or nan may be used anywhere, including as dict keys. These must match the regular expression _*[A-Za-z][0-9A-Z_a-z]*.

Implementations should provide an option to enable unquoted strings based on Unicode identifiers, using Unicode properties XID_Start and XID_Continue with the omission of the Hangul filler code points U+115F, U+1160, U+3164, and U+FFA0 (which typically are rendered as whitespace). This is not the default because of the confusability and security implications of using unquoted non-ASCII code points.

Implementations should also provide an option to disable unquoted strings. This can be useful when it is desirable to avoid absolutely any potential for ambiguity.

Quoted inline strings

Quoted inline strings are delimited with one or more single quotation marks ', double quotation marks ", or backticks `.

Backslash escapes are enabled within strings delimited by single and double quotation marks. Escapes of the forms \xHH, \uHHHH, \U00HHHHHH, and \u{H...H} are all allowed, with the restriction that all non-numeric hex characters must have the same case (uppercase or lowercase) within a given escape.

Strings delimited by single and double quotation marks follow the same rules. A string that starts with one quotation mark ' or " ends at the next identical quotation mark that is not backslash-escaped. The sequences '' and "" represent the empty string. Longer delimiter sequences such as ''' and """ may be used to include any unescaped delimiter sequence that is shorter or longer than the delimiters. Such delimiter sequences must be multiples of 3 characters in length and no longer than 90 characters.

Strings delimited by backticks begin with a delimiter sequence of length 1, 2, 3, or a multiple of 3 no longer than 90 characters, and end when the same delimiter sequence is next encountered. This is similar to Markdown. These are literal strings; there are no backslash escapes. If the first non-space character at the beginning or end of a literal string is a backtick, then one space at that end of the string is stripped. This allows, for example, the sequence ` ``` ` to represent the triple backticks ```.

Inline quoted strings may be wrapped over multiple lines. When this is done, the indentation of the first continuation line must be equal to or greater than that of the line on which the string begins, and any subsequent continuation lines must have that same indentation. Wrapped quoted strings are unwrapped by replacing each line break with a space if the last character before the break does not have the Unicode property White_Space, and simply stripping breaks otherwise.

Quoted multiline strings

Quoted multiline strings preserve literal line breaks and indentation relative to the closing delimiter. They start with a pipe | followed by a sequence of single or double quotation marks or backticks. The rest of the line after this opening sequence must contain nothing but whitespace (space, tab, line feed). The sequence must be a multiple of 3 characters in length and no longer than 90 characters. The string ends when the same sequence of quotation marks/backticks is found between a pipe | and a slash /. The sequence must not appear unescaped anywhere else within the string. Shorter or longer sequences of quotation marks/backticks are allowed unescaped.

For example,

|```
First line
    second line
|```/

would be equivalent to "First line\n second line\n". Note that literal line breaks and indentation relative to the closing delimiter are preserved. When a multiline string starts at the beginning of a line or is only preceded by whitespace, the opening and closing delimiters must have the same indentation. Otherwise, the closing delimiter must have indentation greater than or equal to that of the line on which the multiline string begins.

The choice of ' and " versus ` in the delimiters determines whether backslash escapes are enabled. When they are, a backslash followed immediately by zero or more spaces and a literal newline becomes the empty string. Thus,

|"""
First line\
Second line
|"""/

would be equivalent to "First lineSecond line".

Indentation

Collection types (lists and dicts) may be represented with an indentation-based syntax. Both spaces and tabs are allowed as indentation, although use of spaces is strongly encouraged.

If spaces and tabs are mixed in indentation, total indentation level is determined by the exact, literal sequence of spaces and tabs, never by converting tabs into some equivalent number of spaces.

Lists

Lists are ordered collections of objects. Objects are not all required to be of the same type.

By default, an implementation must raise an error if the total nesting depth of all collection objects (lists and dicts) exceeds 100.

There are two list syntaxes. In indentation-based syntax, list elements are denoted with asterisks *. An asterisk must not be preceded on its line by anything except for indentation. For example,

* 'First element'
* 'Second element'

Indentation between the asterisk and the beginning of the following object is strongly encouraged, but not required. Indentation may be spaces or tabs. If the indentation character immediately before and after the * is a tab, then the * is ignored in calculating the total indentation; otherwise it is treated as equivalent to a space. In indentation-based syntax, all * in a list must have the same indentation, and all objects that follow them must also have the same indentation.

When a list is within a list, each successive asterisk must be on a new line. Asterisks in sub-lists must be indented relative to those in higher-level lists. Thus,

*
  * text

is equivalent to

[[text]]

An asterisk cannot be used alone without a following object to represent the empty list. The empty list must always be represented explicitly as [].

There is also a more compact, inline syntax for lists, in which the list is delimited by square brackets [...] and individual list elements are separated by commas. For example,

['First element', 'Second element']

Everything within an inline list must have indentation greater than or equal to that of the line on which the list starts. The only exception is if an inline list occurs within an inline collection type, in which case it inherits the indentation level of the parent collection, and everything within it must have indentation greater than or equal to that indentation level. While logical and consistent indentation beyond this is encouraged, it is not enforced in any way.

Dicts

Dicts are mappings of keys to values.

By default, an implementation must raise an error if the total nesting depth of all collection objects (lists and dicts) exceeds 100.

Only none, true, false, integers, strings, byte strings, and derived types are allowed as keys. Floating point numbers and collection types are specifically excluded. Floating point numbers are problematic as keys due to rounding, and because nan is not equal to itself. Collection types can be problematic as keys depending on how mutability is handled, and are not supported as keys in some programming languages.

Duplicate keys are strictly prohibited.

As with lists, there are two syntaxes for dicts. In indentation-based syntax, all keys must have the same indentation. A key must not be preceded on its line by anything except for indentation (and an asterisk * if in a list, or a tag if explicitly typed). Values must have indentation consistent with their keys. Values that do not follow their keys on the same line must be indented relative to their keys. Keys and values are separated by equals signs =. For example,

key = value
another_key = 'another value that
    continues'
yet_another_key =
    sub_dict_key = sub_dict_value

Quoted values that start immediately after their keys (on the same line) may have the same indentation as their keys, following the rules for continuation lines for quoted strings. This is convenient when it is desirable to avoid indentation. However, indenting values is encouraged.

There is also a more compact, inline syntax for dicts that parallels that for inline lists. A dict is delimited by curly braces {...} and individual key-value pairs are separated by commas. For example,

{key = value, another_key = another_value}

As with inline lists, everything within an inline dict must have indentation greater than or equal to that of the line on which the dict started. The only exception is if an inline dict occurs within an inline collection type, in which case it inherits the indentation level of the parent collection, and everything within it must have indentation greater than or equal to that indentation level. While logical and consistent indentation beyond this is encouraged, it is not enforced in any way.

In an indentation-style dict, an equals sign = must be on the same line as the end of its key. In an inline-style dict, an equals sign is permitted to be the first thing on the line immediately following the end of its key to provide resiliance against poor line breaking, but purposely doing this is strongly discouraged.

Key paths

Key paths provide a compact syntax for expressing nested dicts (and to a lesser extent lists). A key path is a sequence of period-separated, unquoted strings that are suitable as dict keys. For example,

key.subkey.subsubkey = 123

would be equivalent to

key = {subkey = {subsubkey = 123}}

A key path can only pass through dict nodes that do not yet exist, or that do exist but have only ever been visited previously as part of a key path. The final element of a key path must not have been defined previously, because that would violate the prohibition on duplicate keys. Thus,

key.subkey.subsubkey = 123
key.subkey.another_subsubkey = 456

would be valid and equivalent to

key = {subkey = {subsubkey = 123, another_subsubkey = 456}}

However,

key.subkey = {}
key.subkey.another_subsubkey = 456

would be invalid, because it is attempting to define key.subkey both as an empty dict and also as a dict {another_subsubkey = 456}.

The final element of a key path may alternatively be an asterisk *, in which case it adds a list element. Thus,

key.subkey.* = 123
key.subkey.* = 456

would be equivalent to

key = {subkey = [123, 456]}

Key paths are scoped. All key path nodes created from within a given dict are accessible from within that dict. However, all of those nodes become inaccessible once the dict is closed. Thus, in

key =
    subkey.subsubkey = 123

it would be possible to add subkey.another_subsubkey at the subkey level, because key paths based on subkey are still within scope. However, it would not be possible to use key.subkey.another_subkey at the key level, since that is outside the scope. Scoping ensures that all data for a given dict is somewhat localized.

Sections

Sections provide a way to use indentation-based syntax without using deep indentation. A section is started by a pipe | followed by a sequence of equals signs = whose length is a multiple of 3 and is no longer than 90 characters. This is followed by a key path or a key, which must terminate on the line where the key path begins (line breaks are not permitted). Everything after the section start is included under the specified key path or key. For example,

|=== section.subsection
key = value
another_key = another_value

would be equivalent to

section = {subsection = {key = value, another_key = another_value}}

Sections may also use the asterisk * as a key, to assemble a list while avoiding the associated indentation in normal syntax. For example,

|=== *
key = value

|=== *
another_key = another_value

is equivalent to

[{key = value}, {another_key = another_value}]

A section ends at the next section. Alternatively, it is possible to return to the top (root) level of the data structure by using the section closing element, which has the form |===/. This parallels the delimiters for multiline strings. However, when this is used to close a section, the number of equals signs = in the closing element must match the number used to open the section. Furthermore, if the closing element is ever used, then all sections must be closed explicitly.

Aliases

Aliases are used to reference objects that have previously been labeled. Aliases consist of a dollar sign $ followed immediately by a label, which is an unquoted string.

When an alias refers to a dict, then key path-style $alias.key.subkey notation is supported to refer to objects inside the dict. This avoids having to label each dict value explicitly.

There are also two special aliases that always exist. $~ refers to the top level of the data structure. Since ~ is not a valid unquoted string, this alias can never be overwritten by the user. Similarly, $_ is an alias to the current collection. This can be useful in a list to insert a reference to itself ($_) or in a dict to reuse the value of an earlier key ($_.earlier_key).

Implementations must provide an option to disable aliases, since they add some additional complexity and indirection, and it may be desirable to avoid using them under some circumstances. (The complexity involved is not as great as might be imagined, though. In the Python implementation, resolving aliases only takes around 140 lines of code more than the non-alias case.)

Aliases are not permitted as dict keys. This would make it significantly harder for a user to determine whether duplicate keys exist. It would also make $alias.key.subkey syntax ambiguous. (Is it a single key located at $alias.key.subkey, or a key path consisting of $alias followed by key.subkey?)

By default, implementations must check for circular references and raise an error when they are detected. Implementations should provide optional support for circular references.

Tags

BespON provides for explicitly typed objects, with the tagging syntax (tag)>. By default, tags will only be used to support binary types and perhaps a few other types beyond those already discussed. Tag syntax will eventually provide an optional extension mechanism for user-defined types.

Tags also allow data to be labeled for later referencing via aliases, allow the indentation and newlines in a string to be specified in a shorthand form, and permit inheritance for collections.

Binary types

Tag syntax makes possible byte strings.

(bytes)> 'Some text'

It also makes possible arbitrary binary data.

* (base16)> '536F6D652074657874'
* (base64)> 'U29tZSB0ZXh0'

When a binary type is applied to a string delimited by double or single quotation marks, only \xHH escapes are permitted.

The bytes type can only be applied to strings containing literal code points in the ASCII range (before any backslash-escaping is applied).

The exact requirements for base16 and base64 typed strings may still be refined in the future. Using a string containing backslash-escapes for base16 or base64 is currently permitted but discouraged. The current string requirements for both types are relatively strict. It is possible that this may be relaxed somewhat.

The base16 type must only be applied to strings in which all non-numeric hex characters have the same case (uppercase or lowercase). Multiline strings and inline strings wrapped over multiple lines are permitted (all trailing spaces and line feeds on each line are stripped before processing), but leading whitespace on a line and trailing empty lines at the end of a string are prohibited. Each two-character sequence (representing a byte) may be separated from the next two-character sequence with a single space, but only if this is done throughout the entire string; otherwise, internal whitespace is prohibited.

The base64 type follows similar rules to base16. Multiline strings are permitted (all trailing spaces and line feeds on each line are stripped before processing), but leading whitespace on a line and trailing empty lines at the end of a string are prohibited. Internal whitespace within a line is prohibited, and as a result, using base64 with wrapped inline strings is not possible. Inline strings that are not wrapped, with no leading or trailing whitespace, are permitted.

Labels

Tags allow data objects to be labeled, for later referencing via aliases. For example,

(bytes, label=bin_data)> 'Some text'

A label must always be an unquoted string. Allowing arbitrary code points in labels would have confusability implications.

Labels are currently not allowed for dict keys.

When a label and a type occur in a tag, the type must always be first. The type may be omitted when the default type for the following object is desired and the default type may be inferred. In practice, this means that for standard, non-binary types, the type may always be omitted except in the case of dicts in indentation-style syntax. In those cases, the type is currently required. This restriction may be removed once label and alias rules are further refined. (Dict keys are not currently allowed to be labeled, to parallel the fact that aliases currently are not allowed as keys. However, if dict keys were allowed to be labeled, then that could make a label at the beginning of an indentation-style dict ambiguous. Does it apply to the dict or to the first key?)

Strings

The tag keywords indent and newline may be applied to strings to customize indentation and newline sequences. These are only allowed for multiline strings.

indent specifies a sequences of spaces and tabs that is added to the beginning of each line that follows a literal (not escaped) line break (as well as the first line).

newline specifies a sequence of code points that is used to replace all literal line feeds (\n). Only the empty string and code point sequences considered a newline in the Unicode standard (chapter 5.8, "Newline Guidelines") are permitted:

['', '\r\n', '\n', '\v', '\f', '\r', '\x85', '\u2028', '\u2029']

There is the further restriction that only newline sequences in the ASCII range are permitted for strings that are explicitly typed with a binary type (for example, bytes).

Inheritance

Dicts support two tag keywords for inheritance. init specifies an alias or inline list of aliases to dicts that are used for initialization. All the keys supplied by these dicts must not duplicate each other or duplicate the keys directly entered in a dict using init. The prohibition on duplicate keys still holds. default specifies an alias or inline list of aliases to dicts that are used to supply defaults. Because these are default, fallback values, duplicate keys are allowed, and do not overwrite existing keys. For example,

>>> bespon.loads("""
initial =
    (dict, label=init)>
    first = a
default =
    (dict, label=def)>
    last = z
    k = default_v
settings =
    (dict, init=$init, default=$def)>
    k = v
""")
{'initial': {'first': 'a'},
 'default': {'last': 'z', 'k': 'default_v'},
 'settings': {'first': 'a', 'k': 'v', 'last': 'z'}}

Lists also support two tag keywords for inheritance. init specifies an alias or inline list of aliases to lists whose elements are inserted at the beginning of a list using init. extend specifies an alias or inline list of aliases to lists whose elements are appended at the end of a list using extend, after all explicit elements have been inserted. For example,

>>> bespon.loads("""
initial_1 = (label=init1)> [1, 2]
initial_2 = (label=init2)> [3, 4]
extend = (label=ex)> [7]
settings = (init=[$init1, $init2], extend=$ex)> [5, 6]
""")
{'initial_1': [1, 2],
 'initial_2': [3, 4],
 'extend': [7],
 'settings': [1, 2, 3, 4, 5, 6, 7]}

Additional inheritance-related features, including a recursive merge for dicts, are under consideration.

Right-to-left code points

BespON is intended for human editing. This can be difficult with code points from right-to-left languages (code points with Unicode Bidi_Class R or AL), because it is possible for a key and a value to be reversed in visual representation due to the bidirectional text rendering algorithms. For example, suppose we have the data

א =
  1
ב =
  2

In this form, the meaning is clear, as a mapping of Hebrew letters to integers. However, put these values on a single line, and the order of keys and values is lost at the visual level (though still retained in terms of logical code point ordering).

{א = 1, ב = 2}

This is the same as

{\u05D0 = 1, \u05D1 = 2}

Under ideal circumstances, a rendering engine would identify that this is a data as opposed to text context, and provide a useful rendering. Given that this will rarely be possible, the next best thing would be to require that right-to-left code points only appear in contexts in which this sort of key-value reversal in visual rendering is not possible.

By default, whenever a string contains code points with Unicode Bidi_Class R or AL on its last line, no string, number, or comment is allowed to follow it on that line. Any following object must be on a subsequent line, to avoid the potential of confusion due to bidirectional rendering. Such a string may still be followed by a comma ,, bracket ], brace }, equals sign =, or other non-string, non-digit element, since this will not introduce the possibility of ambiguous rendering.

Comments

Comments come in two forms.

Line comments start with a single number sign # that is not followed immediately by another #, and go to the end of the line. Line comments may optionally be preserved in round-tripping that only modifies data, as opposed to adding or deleting values. However, line comments cannot in general survive round-tripping, because they may appear anywhere, with any indentation, in any quantity. So far as syntax is concerned, they are not uniquely associated with any particular data element.

Doc comments are uniquely associated with individual data objects. Each data object may have at most one doc comment. The doc comment must come before the object, and may only be separated from it by a tag (doc comments come before tags). Doc comments come in two forms. Inline doc comments start with three number signs ###, or a sequence that has a length that is a multiple of 3 and is no longer than 90. They follow the same rules as inline quoted strings, with two additional restriction in indentation-based syntax:

  • A doc comment must have the same indentation as its object, unless the object is tagged, in which case the doc comment must have the same indentation as the tag.
  • If a doc comment starts at the beginning of a line, it cannot be followed on its last line by anything other than a line comment.

Doc comments also come in a multiline form, that follows the same rules as multiline strings:

|###
Multiline doc comments.

These can contain empty lines.
|###/

The rules about indentation-based syntax apply to these as well.

Because doc comments are uniquely associated with individual data objects, they may survive round-tripping even when data is added or removed.

Extended types

An implementation may provide optional, non-default support for additional types. The exact set of recommended optional types, and their precise definition, may be revised in the future; this is not guaranteed to be stable. Even once the set of recommended optional types stabilizes, implementations may support additional, optional, language-specific data types that are in common use; for example, the Python implementation provides optional support for tuples.

Numbers

  • Rational number literals use the form 1/2, where the numerator and denominator must both be decimal integers, and any sign must come before the fraction (with optional, discouraged whitespace after the sign and on both sides of the slash).
  • Complex number literals use the general form 1.0+2.0i, where the real part is optional, the imaginary unit is represented with i, and numbers must be floats (either both in decimal or both in hex form, unless one is inf or nan).

To allow for the possibility of complex number literals, and provide the widest possible scope for their definition, three letters are currently reserved as number literal units: i, j, and k. This means that infi, infj, infk, nani, nanj, nank, and all capitalization variants are reserved words. This technically would allow for quaternion literals, although that is not currently planned.

Collections

  • set: Applied to a list to return an unordered collection in which all objects are unique.
  • odict: Ordered dict, in which key order is preserved.

Roadmap

Some possible features are still under consideration, and may be added either as required features or as optional extensions.

  • copy and deepcopy types based on aliases.
  • Support for recursive dict merging.