INTERNET DRAFT                                          Pete Cordell
Internet Engineering Task Force                       Tech-Know-Ware
                                                       June 23, 1999
                                           Expires December 25, 1999
<draft-cordell-messaging-00.txt>
               Structured Message Encoding Using an
                        ASCII Line Format
Status of this Memo
This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
"work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
At some stage in the development of a protocol it is
necessary to consider how the protocol elements will be
conveyed on the wire. When groups from both the IETF and
the ITU-T collaborate on the same protocol, this has
often been a contentious issue.
Described here is a mechanism for encoding messages that
is intended to combine the benefits that both groups seek
when selecting message encoding techniques.
The result is a formal method of specifying messages that
is encoded on the wire using ASCII and is extensible.
Techniques are also discussed that show how such messages
can be easily parsed without complicated software tools.
1. Analysis
When groups from both the IETF and the ITU-T collaborate on the
same protocol, specifying how messages are encoded on the wire has
often been a contentious issue.
The ITU-T preference is for ASN.1. The benefit of ASN.1 is that
it describes messages in a powerful, expressive, high-level way. It
is similar to writing code in Pascal or C as opposed to
Assembler. As such it is amenable to processing by software tools
which can reduce the amount of labour involved in implementing a
protocol, and also reduce the likelihood of making careless
mistakes. The downside is that it does not generally allow
proprietary extensions (although some would claim this as an
advantage!) and that understanding ASN.1 fully can take considerable
effort, thus countering any gains made in ease of implementation.
The IETF community has a preference for encoding application
messages in ASCII. This is because it allows the data to be
generated and read by humans and it allows the protocol to be
easily extended, perhaps for proprietary purposes, while still
maintaining parameter visibility.
The benefits of each method lead the proponents of the two schemes
to be reluctant to adopt the other method wholesale. However, the
two approaches are not entirely irreconcilable (at least at the
level stated above) and this specification describes a scheme
designed to reconcile them by adopting the following principles:
1. The bits on the wire are in the form of ASCII characters,
2. Only a subset of ASN.1 syntax is used to define the messages,
3. Extensions are made to the ASN.1 definition to achieve easy
   protocol extension.
The benefit of this is a common method of expressing high-level
message constructs that, it is hoped, will enhance the
collaboration of the IETF and ITU-T communities.
2. Syntax Sub-set
The first requirement is to select a sub-set of the total ASN.1
capabilities. The approach taken is to adopt the 80-20 principle
in which 80% of the expressiveness is achieved with 20% of the
syntax. To this end, the following keywords are used. All other
keywords are not supported.
   INTEGER             OCTET STRING   IA5String   BMPString
   BOOLEAN             NULL           SEQUENCE    CHOICE
   OBJECT IDENTIFIER   SEQUENCE OF    SIZE        OPTIONAL
   ...
Table 1. Subset of ASN.1 keywords
To provide compatibility with existing ASN.1, some ASN.1 terms
are aliased to the sub-set above. These include:
1. SET is aliased to SEQUENCE,
2. NumericString and GeneralString are aliased to OCTET STRING.
To ease understanding by those not familiar with ASN.1, the
following terms are also adopted:
1. ASCIIString is treated as an alias of IA5String,
2. UnicodeString is treated as an alias of BMPString.
Additional constraints are that ranges cannot be extensible (e.g.
INTEGER( 0..56,... ) is illegal), and that a SEQUENCE OF cannot be
a direct option within a CHOICE. The latter simplifies the line
encoding. (Instead put the SEQUENCE OF within a SEQUENCE which
can then be part of the CHOICE.)
These limitations mean that there is no support for ASN.1 macros,
and restricted alphabets cannot be directly used (although
comments in the syntax can be used to define such restricted
alphabets).
3. Support for Extensibility
To allow for extensibility, the syntax includes the terms
EMBEDDED, AS and PLUGIN.
EMBEDDED is a type that transports a pre-encoded message fragment
using the same encoding rules as the message that it is contained
in. It is the responsibility of the protocol designer to define
syntax that indicates what the embedded type is. An example of
EMBEDDED is:
encapsulated-protocol EMBEDDED,
AS and PLUGIN support direct extensibility, either proprietary or
defined by a standards body, within the message definition.
PLUGIN indicates that the parameter is an extension and thus must
use an ASCII tag. This is principally of significance when the
message is encoded in a binary form. However, it is included here
as a message definition should not preclude particular message
encoding techniques and it makes the definition more generic. The
AS keyword can be used to optionally specify the value of the
ASCII tag to be used. If an AS field is not included with a
PLUGIN, then the value name is used as the tag on the wire. For
example, the following can be used to declare a new parameter which
has an ASCII tag on the wire called "myparam.mycompany.com":
myparam AS myparam.mycompany.com
INTEGER( 0..32767 ) PLUGIN OPTIONAL,
A special use of the AS notation allows the protocol
specification to indicate that no tag should be used for the
parameter. This is indicated by specifying "AS ?" (this should
become clearer in the description below). This has the form:
version AS ? INTEGER( 0..100 ),
This notation cannot be used on SEQUENCE OF parameters, OPTIONAL
parameters, PLUGINs, members of a CHOICE, or any parameter that
follows such a parameter.
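As an illustration of the tagging rules above, the following Python sketch selects the text that precedes a value on the wire and checks where "AS ?" may appear within a SEQUENCE. The function names and the dictionary keys (as_tag, optional, plugin, sequence_of) are hypothetical, chosen only for this example, and the check reads "follows such a parameter" as applying to every later element of the same SEQUENCE.

```python
def wire_prefix(name, as_tag=None):
    """Return the text that precedes a value on the wire."""
    if as_tag == "?":
        return ""                      # "AS ?": the value is sent with no tag
    return f"{as_tag or name} = "      # explicit AS tag, else the value name


def check_untagged(elements):
    """Return True if every "AS ?" element obeys the placement rule.

    Each element is a dict; missing keys count as "not set".
    """
    blocked = False                    # True once a restricted element was seen
    for element in elements:
        restricted = (element.get("optional")
                      or element.get("plugin")
                      or element.get("sequence_of"))
        if element.get("as_tag") == "?" and (restricted or blocked):
            return False
        if restricted:
            blocked = True
    return True
```

For instance, wire_prefix("myparam", "myparam.mycompany.com") gives the tag text for the PLUGIN example above, while wire_prefix("version", "?") gives an empty string.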
4. Examples of the Syntax
Before defining the ASCII line format, an example of the above
syntax rules is presented. This introduces the example that will be
encoded later, and gives someone who is not familiar
with ASN.1 syntax a quick overview.
A definition for a (complicated) ASN.1 message that does not
employ sub-definitions may look as follows:
startup ::= SEQUENCE
{
    sequence_no  AS ? INTEGER( 1..65535 ),
    host-name    AS ? IA5String( SIZE( 1..128 ) ),
    user-name    BMPString( SIZE ( 1..64 ) ),
    gUID         OCTET STRING ( SIZE( 16 ) ),
    activated    BOOLEAN,
    modes        SEQUENCE
    {
        highmode  BOOLEAN,
        lowmode   BOOLEAN,
        ...
    },
    response     CHOICE
    {
        acknowledge  NULL,  -- NULL indicates no further data
        silent       NULL,
        informGroup  INTEGER( 0..65535 ),  -- Address to send
                                           -- group response to
        ...
    },
    id           INTEGER( 1..256 ) OPTIONAL,
    protocol     OBJECT IDENTIFIER,
                 -- Set to { ietf (3) wg (0) newproj (0) }
    node_alerts  SEQUENCE OF INTEGER( 0..65535 ),
    complex      SEQUENCE SIZE (1..4) OF SEQUENCE
    {
        admin_node  INTEGER( 0..256 ),
        user_id     INTEGER( 0..256 ),
        mode        SEQUENCE
        {
            video  BOOLEAN,
            audio  BOOLEAN,
            data   BOOLEAN,
            ...
        } OPTIONAL,
        ...
    },
    my-extension AS mine.bigco.com INTEGER( 1..3 ) PLUGIN OPTIONAL,
    ...
}
From this it can be seen that there are some basic types
including INTEGER, IA5String, OCTET STRING and BOOLEAN.
There are also two compound constructs; these being SEQUENCE and
CHOICE. A SEQUENCE is similar to a structure (struct) in C and a
CHOICE is similar to a C union (however, the chosen option in the
CHOICE is also recorded, which is not the case for a C union).
Additionally, you can have a SEQUENCE OF the above types, and
elements can be OPTIONAL.
In a SEQUENCE OF construct there can be more than one of the
specified component. The number of components may be constrained
or unconstrained (See complex and node_alerts respectively).
Elements that are marked OPTIONAL can be absent in a correctly
formed message. All other elements must be present for the
message to be valid.
The ... tokens are called extension markers. They tell the compiler
that anything that follows has been defined in a subsequent
version, and so should be treated as optional irrespective of
whether it is marked as OPTIONAL. (In the scheme described here,
the extension markers act no more than as a shorthand way to say
that all the following parameters are optional. It is not
necessary to include the extension markers in the first release
of the message definition and it is possible to extend a message
even if there were previously no extension markers.)
Finally, it is not necessary to map each parameter directly to a
base type (e.g. INTEGER, OCTET STRING). For example, the definition
above might be written as:
startup ::= SEQUENCE
{
    sequence_no  AS ? Seq_no,
    host-name    AS ? IA5String(SIZE(1..128)),
    user-name    BMPString( SIZE ( 1..64 ) ),
    gUID         Conference_ID,
    activated    BOOLEAN,
    modes        Modes,
    .
    .
    .
and elsewhere the following definitions would appear:
Seq_no ::= INTEGER(1..65535)
Conference_ID ::= OCTET STRING (SIZE(16))
Modes ::= SEQUENCE
{
    highmode  BOOLEAN,
    lowmode   BOOLEAN,
    ...
}
This is a better way to do the message definition, for all the
reasons that you would do the same in any piece of software. This
does not affect the coding method, as it is a process of macro
expansion to get to the message definition we started with.
5. ASCII Line Encoding
Now let's consider how the message definition can be represented in a
text format.
The basic mechanism is to encode all items as:
<name of item> <optional white space> = <optional white space>
<value> <white space>
By doing this consistently, parsers can skip fields they don't
understand.
An INTEGER can simply be represented as a printable string of the
number. (The range of the number is not important to the line
format when represented in this way, although it is
probably important to the application.) The format is:
id [ "-" ] 1*( "0-9" ) LWS
where:
id = tag "=" / ; Normal case
"=" / ; Option when parameter is following member
; in a SEQUENCE OF
"" ; When "AS ?" is used to request that a tag is not used
; These variations will be discussed further below
LWS = " " / tab / CRLF
e.g. (ignoring the AS ? specifier):
sequence_no = 125
An IA5String is represented as a string in quotes including any
necessary back slash escapes, as in:
id "\"" *( Printable characters with standard C escapes \r, \n,
\t, \l, \\, and \" ) "\"" LWS
e.g. (ignoring the AS ? specifier):
host-name = "Zebedee"
A BMPString is similarly encoded using UTF-7 within single quote
marks, as in:
id "'" *( UTF-7 coded characters ) "'" LWS
e.g.:
user-name = 'Pete Cordell'
The value of an OCTET STRING is represented with a leading
character x, and then each OCTET is coded as the ASCII
representation of two hexadecimal digits that encode the upper 4
bits followed by the lower 4 bits respectively, as in:
id "x" 1*( ( "0-9" / "A-F" / "a-f" ) ( "0-9" / "A-F" / "a-f" ) ) LWS
For example:
gUID = x0f1b6c0dbcad01230f1b6c0dbcad0123
BOOLEANs are coded simply as TRUE and FALSE, as in:
id ( "TRUE" / "FALSE" ) LWS
e.g.:
activated = TRUE
The OBJECT IDENTIFIER type (which is included mainly for
backwards compatibility with previously defined ASN.1) is encoded
as:
id oid-value-list LWS
where:
oid-value-list = oid-value-list "-" 1*("0-9") / 1*("0-9")
e.g.:
protocol = 3-0-0
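The scalar encodings described so far can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical and not part of this specification, only the \\, \", \r, \n and \t escapes are handled, and BMPString (UTF-7) is left out.

```python
def encode_integer(value):
    # [ "-" ] followed by one or more decimal digits
    return str(int(value))


def encode_ia5string(text):
    # Double-quoted, with C-style escapes for the characters handled here
    escapes = {"\\": "\\\\", '"': '\\"', "\r": "\\r", "\n": "\\n", "\t": "\\t"}
    return '"' + "".join(escapes.get(ch, ch) for ch in text) + '"'


def encode_octet_string(data):
    # Leading "x", then two hex digits per octet, upper four bits first
    return "x" + bytes(data).hex()


def encode_boolean(flag):
    return "TRUE" if flag else "FALSE"


def encode_oid(arcs):
    # Arc values joined with "-", for example 3-0-0
    return "-".join(str(int(arc)) for arc in arcs)


def encode_item(tag, encoded_value):
    # Normal case: tag "=" value, followed by white space
    return f"{tag} = {encoded_value} "
```

For example, encode_item("activated", encode_boolean(True)) produces the text "activated = TRUE " with its trailing white space.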
The SEQUENCE is coded by including the elements of the sequence
in brackets (), as:
id "(" *(encoded-type) ")" LWS
for example:
modes = ( highmode = TRUE lowmode = FALSE )
Doing this allows the complete sequence to be skipped if the
parameter is not understood, or it is of no interest. This is
important for backwards compatibility.
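One way a decoder might skip a value it does not understand is sketched below in Python. The function names are hypothetical. The sketch assumes that backslash escapes occur only inside double-quoted strings, that single-quoted strings end at the next single quote, and that comments are absent; it counts bracket depth without checking that "(" pairs with ")" rather than "]".

```python
def skip_string(text, pos):
    """Return the index just past the quoted string that starts at text[pos]."""
    quote = text[pos]
    i = pos + 1
    while i < len(text) and text[i] != quote:
        # A backslash in a double-quoted string protects the next character
        i += 2 if (quote == '"' and text[i] == "\\") else 1
    return i + 1


def skip_value(text, pos):
    """Return the index just past the value that starts at text[pos]."""
    first = text[pos]
    if first in "\"'":
        return skip_string(text, pos)
    if first in "([":
        depth = 0
        i = pos
        while i < len(text):
            ch = text[i]
            if ch in "\"'":
                i = skip_string(text, i)    # brackets inside strings do not count
                continue
            if ch in "([":
                depth += 1
            elif ch in ")]":
                depth -= 1
                if depth == 0:
                    return i + 1
            i += 1
        raise ValueError("unbalanced brackets")
    # Simple value: runs up to the next white space
    i = pos
    while i < len(text) and not text[i].isspace():
        i += 1
    return i
```

A decoder that meets an unknown tag can call skip_value at the start of its value and resume parsing at the returned index.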
A CHOICE is encoded using a similar scheme to the SEQUENCE, as
in:
id "[" (encoded-type) "]" LWS
e.g.:
response = [ acknowledge = NULL ]
... or:
response = [ informGroup = 137 ]
In practice many CHOICE options map to NULL (an example of which
is shown above), which is inefficient in terms of characters sent
and tedious to write by hand. On the other hand, this presents little
problem to a program scanning and generating the text as it
consistently maintains the X=Y format. However, on the whole a
shorthand notation for the above is preferable, and this is
represented as follows:
response = [ acknowledge ]
(Note that the brackets are still important as they highlight
that acknowledge comes from a CHOICE statement. Note also that
this shorthand form of X=NULL cannot be used in a SEQUENCE.) An
implementation must recognise both formats.
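A decoder that accepts both CHOICE forms could look something like the following Python sketch. The name decode_choice is hypothetical, and the value is returned as undecoded text; the option name pattern follows the identifier rule given in the BNF section of this document.

```python
import re

# "[" option [ "=" value ] "]", with optional white space between the parts
CHOICE = re.compile(
    r"\[\s*([A-Za-z][A-Za-z0-9._-]*)\s*(?:=\s*(.*?))?\s*\]", re.S)


def decode_choice(encoded):
    """Return (option, value_text) for a bracketed CHOICE value."""
    match = CHOICE.fullmatch(encoded.strip())
    if match is None:
        raise ValueError("not a CHOICE encoding: %r" % encoded)
    option, value = match.group(1), match.group(2)
    # The shorthand "[ option ]" is treated as "[ option = NULL ]"
    return option, ("NULL" if value is None else value)
```

Both "[ acknowledge ]" and "[ acknowledge = NULL ]" then decode to the same pair.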
OPTIONAL items are either present or not present. An absent item
contributes nothing to the encoding, so no separate example is given.
When multiple items of the same type are included in a message
using the SEQUENCE OF encoding, this can be done simply by
including the item multiple times, as in:
node_alerts = 0 node_alerts = 5000 node_alerts = 12
This is quite wasteful in terms of characters, and so the
following compacted encoding can be used:
node_alerts = 0 = 5000 = 12
The rule that allows this is that if you get the = token when you
expected to retrieve a tag, you use the most recently collected
tag for the current level of SEQUENCE / CHOICE nesting.
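The tag reuse rule can be illustrated with a small Python sketch that reads one nesting level of simple values. The name read_items is hypothetical, and the sketch assumes that every "=" is surrounded by white space and that values contain no white space or brackets.

```python
def read_items(text):
    """Return a list of (tag, value_text) pairs from a flat run of items."""
    tokens = text.split()
    items = []
    last_tag = None
    i = 0
    while i < len(tokens):
        if tokens[i] == "=":
            # "=" where a tag was expected: reuse the most recent tag
            if last_tag is None:
                raise ValueError("'=' found before any tag")
            tag = last_tag
            i += 1
        else:
            tag = tokens[i]
            if i + 1 >= len(tokens) or tokens[i + 1] != "=":
                raise ValueError("expected '=' after tag %r" % tag)
            i += 2
        if i >= len(tokens):
            raise ValueError("missing value for tag %r" % tag)
        items.append((tag, tokens[i]))
        last_tag = tag
        i += 1
    return items
```

Both the long form and the compacted form of node_alerts above yield the same three pairs.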
As a final, extended example, the two 'complex' components shown
above can be encoded as:
complex = ( admin_node = 20
            user_id = 6
            mode = ( video = TRUE audio = TRUE data = FALSE )
          )
        = ( admin_node = 5
            user_id = 5
          )
To form a complete encoding, each of the parameters within the
outermost definition is encoded and then concatenated to the end of
the previous output. Note that the name of the outermost
definition is not encoded as this conveys no information. A
complete example of the message would be (taking into
account the AS ? specifications):
125
"Zebedee"
user-name = 'Pete Cordell'
gUID = x0f1b6c0dbcad01230f1b6c0dbcad0123
activated = TRUE
modes = ( highmode = TRUE lowmode = FALSE )
response = [ informGroup = 137 ]
id = 12
protocol = 3-0-0
node_alerts = 0 = 5000 = 12
complex = (
    admin_node = 20
    user_id = 6
    mode = ( video = TRUE audio = TRUE data = FALSE )
)
= (
    admin_node = 5
    user_id = 5
)
mine.bigco.com = 3
-- See comment in paragraph below about marking the end of a message
In the above message there is no implicit identification of the
end of the message. Therefore an additional closing bracket
(either ")" or "]") must be appended to the message to delimit its end.
The result is that the last lines of the message should in fact be:
mine.bigco.com = 3
)
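A receiver can locate the end of a message by looking for the first closing bracket that has no matching opener. The Python sketch below shows one way to do this; the name find_message_end is hypothetical, and the sketch assumes single-quoted strings use "'" at both ends, that backslash escapes occur only in double-quoted strings, and that the message carries no comments.

```python
def find_message_end(text):
    """Return the index of the unmatched closing bracket, or -1 if none."""
    depth = 0          # number of brackets currently open
    quote = None       # the quote character of the string we are inside, if any
    i = 0
    while i < len(text):
        ch = text[i]
        if quote is not None:
            if ch == "\\" and quote == '"':
                i += 1                 # skip the escaped character
            elif ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch in "([":
            depth += 1
        elif ch in ")]":
            if depth == 0:
                return i               # this bracket delimits the message
            depth -= 1
        i += 1
    return -1
```

A return value of -1 indicates that more input is needed before the message is complete.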
Note that for each tagged parameter the order is not important.
Therefore, the above message could equally be represented as:
125 "Zebedee"
activated = TRUE
node_alerts = 0
user-name = 'Pete Cordell'
sequence_no =
modes = ( highmode = TRUE lowmode = FALSE )
id = 12
protocol = 3-0-0
node_alerts = 5000 = 12
complex = (
    admin_node = 20
    user_id = 6
    mode = ( video = TRUE audio = TRUE data = FALSE )
)
response = [ informGroup = 137 ]
complex = (
    admin_node = 5
    user_id = 5 -- We can have comments too
)
gUID = x0f1b6c0dbcad01230f1b6c0dbcad0123
)
6. Other Forms of Line Encoding
It should be noted that what we really have here is a separation
of the message definition from the way it is represented on the
wire. Therefore, the above line-encoding scheme is not the only
possible form. Indeed, it may be possible to use multiple
encodings for the syntax within the same system. This would allow
a debug mode that uses the scheme described above
with additional comments, and a more processor-efficient or
bandwidth-efficient scheme (if such a scheme can be found) for
normal operation.
7. Forming Complete Messages
There is a tendency within the ASN.1 world to see ASN.1 as the
only way to generate messages. This is not always appropriate,
such as when large amounts of data are streamed from a disk, or
the contents of a message are being digitally signed. By
treating the above scheme as one tool within a message encoding
tool chest, the burdens on this
form of message encoding can be significantly reduced. Thus, when
using the above scheme, especially when authentication might be
required, it is suggested that the encoded message be placed in a
wrapper as one of possibly a number of message fragments. The wrapper
scheme adopted will depend on whether a binary representation is
permissible or whether it is necessary to have a complete
ASCII encoding. Possible schemes are not discussed further here,
but the issue is raised because it is important and must be
considered right at the start of protocol development, as
retrofitting such a feature is rarely possible.
8. BNF Notation for Message Definition Syntax
Below is a summary of the message definition syntax in a BNF
style. Note that, although considerable effort has been put into
getting the definition correct, it still may contain some errors.
Syntax-definition = *( root-def )
root-def          = name "::="
                    [ "SEQUENCE" [ "SIZE" "(" digits ".." digits ")" ]
                      "OF" ]
                    type
name              = identifier
type              = compound-type / simple-type / string-type /
                    referenced-type
compound-type     = seq-type / choice-type
seq-type          = "SEQUENCE" "{" seq-line-list "}"
seq-line-list     = seq-line / seq-line-list "," seq-line
seq-line          = seq-element / extension
seq-element       = name [ "AS" ( tag / "?" ) ] [ numerical-tag ]
                    [ "SEQUENCE" [ "SIZE" "(" digits ".." digits ")" ]
                      "OF" ]
                    type [ "PLUGIN" ] [ "OPTIONAL" ] [ "IGNORE" ]
choice-type       = "CHOICE" "{" choice-line-list "}"
choice-line-list  = choice-line /
                    choice-line-list "," choice-line
choice-line       = choice-element / extension
choice-element    = name [ "AS" tag ] [ numerical-tag ] type [ "PLUGIN" ]
tag               = identifier
numerical-tag     = "[" digits "]"
simple-type       = "BOOLEAN" / "OCTET" / "NULL" / integer-type
integer-type      = "INTEGER" [ "(" digits ".." digits ")" ]
string-type       = char-string / non-char-string
char-string       = ( "IA5String" / "BMPString" ) [ "("
                    "SIZE" "(" digits [ ".." digits ] ")" ")" ]
non-char-string   = ( "OCTET STRING" / "EMBEDDED" )
                    [ "SIZE" "(" digits [ ".." digits ] ")" ]
referenced-type   = identifier
extension         = "..."
identifier        = ( "A-Z" / "a-z" ) *( "A-Z" / "a-z" / "0-9" /
                    "-" / "." / "_" )
                    ; i.e. Alphanumeric string with leading
                    ; alphabetic character
digits            = [ "-" ] ( decimal / hex )
decimal           = *( "0-9" )
hex               = "0x" *( "0-9" / "A-F" / "a-f" )
9. Techniques for Encoding and Decoding Messages
The above message definition syntax is formal enough that message
compilers can be built to convert messages on the wire to and
from programming language constructs such as C structures and
unions.
However, this is not the only way such encoding and decoding can
be done, and may not even be the best way.
When encoding, simple common tools such as sprintf can be used.
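As an illustration of encoding with nothing more than string formatting, the following Python sketch uses the % operator, which plays the role that sprintf plays in C. The function names are hypothetical, and the final ")" follows the end-of-message rule given in the ASCII Line Encoding section.

```python
def encode_modes(highmode, lowmode):
    # SEQUENCE: the elements are wrapped in round brackets
    return "modes = ( highmode = %s lowmode = %s )" % (
        "TRUE" if highmode else "FALSE",
        "TRUE" if lowmode else "FALSE",
    )


def encode_sequence_of(tag, values):
    # Compacted SEQUENCE OF: the tag appears once, later items start with "="
    if not values:
        return ""
    return "%s = %s" % (tag, " = ".join("%d" % value for value in values))


def encode_message(parts):
    # Items are separated by white space; the extra ")" marks the end
    return "\n".join(part for part in parts if part) + "\n)"
```

A caller might then build a message with encode_message([encode_modes(True, False), encode_sequence_of("node_alerts", [0, 5000, 12])]).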
When decoding, the biggest problem is locating a desired
parameter in an efficient manner. One way to do this is to
recognise that the encoded message is a form of tree that can be
readily represented using a multidimensional linked list in which
each element stores the name and location of a parameter in the
message. In most cases one parameter follows another, and a
pointer to the 'next' item can capture this. In the case of
SEQUENCEs and CHOICEs, the structure of the tree forks and so
another pointer is required to point to the forked set of
parameters.
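The linked structure described above might be built as in the following Python sketch, where each node holds a 'next' pointer and a second pointer for the forked set of parameters. All names (Node, tokenize, parse_level) are hypothetical. The sketch assumes every item is tagged (so "AS ?" values are not handled), keeps values as undecoded text rather than storing locations, and ignores comments.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Quoted strings, single bracket or "=" characters, or bare words
TOKEN = re.compile(r'''\s*("(?:\\.|[^"\\])*"|'[^']*'|[()\[\]=]|[^\s()\[\]=]+)''')
OPENERS = {"(": ")", "[": "]"}


@dataclass
class Node:
    name: str
    value: Optional[str] = None     # text of a simple value, if any
    next: Optional["Node"] = None   # following parameter at the same level
    child: Optional["Node"] = None  # first parameter in a SEQUENCE or CHOICE


def tokenize(text):
    return TOKEN.findall(text)


def parse_level(tokens, pos=0, closer=None):
    """Parse items up to `closer`; return (first node or None, stop index)."""
    head = Node("")                 # dummy head keeps the linking simple
    tail = head
    last_name = None
    while pos < len(tokens) and tokens[pos] != closer:
        shorthand = False
        if tokens[pos] == "=" and last_name is not None:
            name, pos = last_name, pos + 1       # compact SEQUENCE OF form
        elif pos + 1 < len(tokens) and tokens[pos + 1] == "=":
            name, pos = tokens[pos], pos + 2     # normal "tag = value"
        elif closer == "]":
            name, shorthand = tokens[pos], True  # "[ x ]" stands for x = NULL
        else:
            raise ValueError("untagged item at token %d" % pos)
        node = Node(name)
        if shorthand:
            node.value, pos = "NULL", pos + 1
        elif pos >= len(tokens):
            raise ValueError("missing value for " + name)
        elif tokens[pos] in OPENERS:
            node.child, pos = parse_level(tokens, pos + 1, OPENERS[tokens[pos]])
            pos += 1                             # step past the closing bracket
        else:
            node.value, pos = tokens[pos], pos + 1
        tail.next = node
        tail = node
        last_name = name
    return head.next, pos
```

Calling parse_level(tokenize(text)) returns the first node of the outermost level together with the index at which parsing stopped.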
Once parsed, the tree can be used to readily locate parameters. A
parameter can be identified by the route through the tree that
has to be traversed to locate it. These names can be concatenated
into a string (using a suitable separator) to indicate the
desired parameter. It should also be noted that, due to SEQUENCE
OF, there might be more than one parameter with the same name.
Therefore it is necessary to specify the instance of the
parameter you desire. Using these principles, a function call for
accessing the value of a boolean in the message above might look
like:
wasFound = getBoolean( parseTreeRootPointer,
                       "complex:mode:video",  // Parameter name
                       0,                     // Instance
                       &myBoolean )           // Where result is placed
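A lookup of this kind might be implemented as in the Python sketch below. The names Node, find_all and get_boolean are hypothetical, and the tree is built by hand to keep the example short. The draft does not say which level the instance number applies to; this sketch assumes it counts matches of the complete path in message order.

```python
class Node:
    def __init__(self, name, value=None, child=None, next=None):
        self.name = name
        self.value = value    # text of a simple value, if any
        self.child = child    # first parameter inside a SEQUENCE or CHOICE
        self.next = next      # following parameter at the same level


def find_all(first, names):
    """Yield every node reached by following `names` from this level."""
    node = first
    while node is not None:
        if node.name == names[0]:
            if len(names) == 1:
                yield node
            else:
                yield from find_all(node.child, names[1:])
        node = node.next


def get_boolean(root, path, instance=0):
    """Return (was_found, value) for the given instance of a BOOLEAN."""
    for index, node in enumerate(find_all(root, path.split(":"))):
        if index == instance:
            if node.value not in ("TRUE", "FALSE"):
                return (False, None)
            return (True, node.value == "TRUE")
    return (False, None)


# The two 'complex' components of the example message, built by hand
mode = Node("mode", child=Node("video", "TRUE",
            next=Node("audio", "TRUE", next=Node("data", "FALSE"))))
second = Node("complex", child=Node("admin_node", "5", next=Node("user_id", "5")))
first = Node("complex", next=second,
             child=Node("admin_node", "20", next=Node("user_id", "6", next=mode)))
root = Node("activated", "TRUE", next=first)
```

With this tree, get_boolean(root, "complex:mode:video", 0) reports that the parameter was found, while instance 1 is not found because the second 'complex' component has no mode.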
Some example code that demonstrates the above principles has been
developed and is located at:
http://www.tech-know-ware.com/messaging.html
10. Alternative Message Definition Syntax
The above message definition syntax is very closely related to
ASN.1. However, this is not the only form such a syntax can take.
Megaco has adopted a more BNF style notation for the preliminary
definition of messages. This section adopts the spirit of the
Megaco message definition scheme and formalises it so that it
becomes more amenable to the techniques described above.
It should be noted that, in defining such a syntax there is a
risk that readers treat it as BNF when in fact it is quite
different. This should be considered when it is decided whether
or not to adopt this scheme.
Any given parameter may appear once in a message, be
optional, or appear multiple times. This can be captured as
follows:
parameter =
"(" parameter-description ")" / ; Parameter always appears once
"[" parameter-description "]" / ; Parameter is optional
"*(" parameter-description ")" / ; Appears zero or more times
"1*(" parameter-description ")" / ; Appears 1 or more times
digits "*" digits "(" parameter-description ")"
; Appears a range number of times
Parameters appear either one after another, or only one of a
possible set of parameters appears. Thus we have the constructs:
parameter-list = name "=" *parameter
and:
parameter-switch = name "=" parameter-switch-list
parameter-switch-list = parameter-switch-list "/" parameter /
parameter
Each parameter has a name, a tag that is used to represent it on
the wire, and a type. By default the tag that represents it on
the wire is the same as its name. Also, a parameter may not be
tagged on the wire. Therefore, a parameter-description is as
follows:
parameter-description =
( name type [PLUGIN] ) / ; Tag on wire is same as name
( name tag type [PLUGIN] ) / ; Use specified tag on wire
( name ? type ) ; Has no tag on wire
The type of a parameter may be as follows:
type = "UNSIGNED8" / "UNSIGNED16" / "UNSIGNED32" / "UNSIGNED64" /
"SIGNED8" / "SIGNED16" / "SIGNED32" / "SIGNED64" /
"BOOLEAN" / "NULL" / "OID" / strings / compound / referenced
strings = string-type / ; Unbounded string
[ digits [ * digits ] ] string-type / ; Bounded string
digits string-type ; Fixed sized string
string-type = "ASCII" / "Unicode" / "OCTETS" / "EMBEDDED"
compound = embedded-parameter-list / embedded-parameter-switch
embedded-parameter-list = "{" *parameter "}"
embedded-parameter-switch = "{" parameter-switch-list "}"
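The string bounds in the type rules above can be read mechanically. The Python sketch below is one possible reading; the name parse_string_type is hypothetical, and only decimal bounds are handled (the hex and negative forms allowed for digits elsewhere in this document are not).

```python
import re

# Optional "low" or "low*high" prefix followed by one of the string type names
STRING_TYPE = re.compile(r"(?:(\d+)(?:\*(\d+))?)?(ASCII|Unicode|OCTETS|EMBEDDED)")


def parse_string_type(text):
    """Return (low, high, kind); low and high are None for unbounded strings."""
    match = STRING_TYPE.fullmatch(text)
    if match is None:
        raise ValueError("not a string type: %r" % text)
    low, high, kind = match.groups()
    if low is None:
        return (None, None, kind)              # unbounded string
    low = int(low)
    # A single number gives a fixed size, so both bounds are equal
    return (low, int(high) if high is not None else low, kind)
```

For example, "1*128ASCII" is read as a string of 1 to 128 characters and "16OCTETS" as a string of exactly 16 octets.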
A complete message definition has the form:
message-definition = 1*( parameter-list / parameter-switch )
and to wrap up:
name = tag = referenced =
( "A-Z" / "a-z" ) * ("A-Z" / "a-z" / "0-9" / "-" / "." / "_" )
Using this notation to express the example message definition
above yields:
startup = ( sequence_no ? UNSIGNED16 )
          ( host-name ? 1*128ASCII )
          ( user-name 1*64Unicode )
          ( gUID 16OCTETS )
          ( activated BOOLEAN )
          ( modes {
              ( highmode BOOLEAN )
              ( lowmode BOOLEAN )
          } )
          ( response {
              ( acknowledge NULL ) /  ; NULL indicates no further data
              ( silent NULL ) /
              ( informGroup UNSIGNED16 )  ; Address to send
                                          ; group response to
          } )
          [ id UNSIGNED8 ]
          ( protocol OID )  ; Set to { ietf (3) wg (0) newproj (0) }
          *( node_alerts UNSIGNED16 )
          1*4( complex {
              ( admin_node UNSIGNED8 )
              ( user_id UNSIGNED8 )
              [ mode {
                  ( video BOOLEAN )
                  ( audio BOOLEAN )
                  ( data BOOLEAN )
              } ]
          } )
          [ my-extension mine.bigco.com UNSIGNED8 PLUGIN ]
11. Security Considerations
This specification defines a method for encoding messages into
clear ASCII text. If such messages need to be authenticated or
encrypted the result of the message encoding performed by this
specification will need some form of post-processing. Readers are
directed to the section "Forming Complete Messages" for further
information on this topic.
12. Conclusions
The above shows that the benefits of ASCII line coding can be
combined with the benefits of a formal syntax definition, thus
simplifying both the definition and implementation of protocols.
13. Author's Address
Pete Cordell
Tech-Know-Ware Ltd
Ipswich
U.K.
e-mail: pete@tech-know-ware.com