Internet DRAFT - draft-almesberger-wajhak-diffserv-linux
draft-almesberger-wajhak-diffserv-linux
HTTP/1.1 200 OK
Date: Mon, 08 Apr 2002 22:29:26 GMT
Server: Apache/1.3.20 (Unix)
Last-Modified: Wed, 30 Jun 1999 15:38:14 GMT
ETag: "2e6c14-cc6f-377a39e6"
Accept-Ranges: bytes
Content-Length: 52335
Connection: close
Content-Type: text/plain
INTERNET-DRAFT Werner Almesberger
EPFL ICA, Switzerland
Jamal Hadi Salim
CTL Nortel Networks
Alexey Kuznetsov
INR Moscow
June 1999
Differentiated Services on Linux
<draft-almesberger-wajhak-diffserv-linux-01.txt>
Status of this Memo
This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
"work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
Recent Linux kernels offer a wide variety of traffic control
functions, which can be combined in a modular way. We have designed
support for Differentiated Services based on the existing traffic
control elements, and we have implemented new components where
necessary. In this document we give a brief overview of the structure
of Linux traffic control, and we describe our prototype
implementation in more detail.
1. Introduction
The Differentiated Services architecture (Diffserv) lays the
foundation for implementing service differentiation in the Internet
in an efficient, scalable way. We assume that readers are familiar
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 1]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
with the concepts and terminology defined in [1]. Furthermore, we
assume familiarity with the packet marking as described in [2].
We have developed a design to support basic classification and DS
field manipulation required by Diffserv nodes. The design enables
configuration of the first PHBs that are being defined in the
Diffserv WG. We have implemented a prototype of this design using the
traffic control framework available in recent Linux kernels. The
source code, configuration examples, and related information can be
obtained from http://icawww1.epfl.ch/linux-diffserv/
The main focus of our work is to allow maximum flexibility for node
configuration and for experiments with PHBs, while still maintaining
a design that does not unnecessarily sacrifice performance.
This document is structured as follows. Section "Linux Traffic
Control" gives a brief overview of traffic control functions in
recent Linux kernels. Section "Diffserv extensions to Linux traffic
control" discusses where the existing model needed to be extended.
Section "New components" describes the new components in more detail.
We conclude with examples of configuration scripts in section
"Building sample configurations".
2. Linux Traffic Control
Figure 1 shows roughly how the kernel processes data received from
the network, and how it generates new data to be sent on the network.
+---------------+
+---->| TCP, UDP, ... |
| +---------------+
| | TRAFFIC CONTROL
| v |
+---------------------+ +------------+ +----------------+
-->|Input de-multiplexing|-->| Forwarding |-->| Output queuing |-->
+---------------------+ +------------+ +----------------+
Figure 1: Processing of network data.
"Forwarding" includes the selection of the output interface, the
selection of the next hop, encapsulation, etc. Once all this is done,
packets are queued on the respective output interface. This is the
point where traffic control comes into play. Traffic control can,
among other things, decide if packets are queued or if they are
dropped (e.g. if the queue has reached some length limit, or if the
traffic exceeds some rate limit), it can decide in which order
packets are sent (e.g. to give priority to certain flows), it can
delay the sending of packets (e.g. to limit the rate of outbound
traffic), etc.
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 2]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
Once traffic control has released a packet for sending, the device
driver picks it up and emits it on the network.
2.1 Components
The traffic control code in the Linux kernel consists of the
following major conceptual components:
- queuing disciplines
- classes (within a queuing discipline)
- filters
- policing
Each network device has a queuing discipline associated with it,
which controls how packets enqueued on that device are treated. A
very simple queuing discipline may just consist of a single queue,
where all packets are stored in the order in which they have been
enqueued, and which is emptied as fast as the respective device
can send. See figure 2 for such a queuing discipline without
externally visible internal structure.
+--------------------+
--->| Queuing discipline |--->
+--------------------+
Figure 2: A simple queuing discipline without classes.
More elaborate queuing disciplines may use filters to distinguish
among different classes of packets and process each class in a
specific way, e.g. by giving one class priority over other
classes.
+-+ +-----+ +------------------+ +-+ +-+
| | +------+ | |-->|Queuing discipline|-->| | | |
| |---->|Filter|--->|Class| +------------------+ | |--+ | |
| | | +------+ | +--------------------------+ | | | |
| | | +----------------------------------+ | | |
| | | +------+ | | |
| | +->|Filter|-_ +-----+ +------------------+ +-+ | | |
-->| | | +------+ ->| |-->|Queuing discipline|-->| | | | |-->
| | | |Class| +------------------+ | |--+-->| |
| | | +------+ _->| +--------------------------+ | | |
| | +->|Filter|- +----------------------------------+ | |
| | +------+ | |
| +-----------------------------------------------------------+ |
| Queuing discipline |
+---------------------------------------------------------------+
Figure 3: A simple queuing discipline with multiple classes.
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 3]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
Figure 3 shows an example of such a queuing discipline. Note that
multiple filters may map to the same class.
Queuing disciplines and classes are intimately tied together: the
presence of classes and their semantics are fundamental properties
of the queuing discipline. In contrast to that, filters can be
combined arbitrarily with queuing disciplines and classes as long
as the queuing discipline has classes to map the packets to. But
flexibility does not end there yet - classes normally do not take
care of storing their packets themselves, but they use another
queuing discipline to take care of that. That queuing discipline
can be arbitrarily chosen from the set of available queuing
disciplines, and it may well have classes, which in turn use
queuing disciplines, etc. The term qdisc would be used
interchangeably to mean queueing discipline in this draft.
+-+ +------+ +---------------+ +-+ +-+
| | +------+ | |-->|TBF, rate=1Mbps|-->| | | |
| |--+->|Filter|--->|"high"| +---------------+ | |--+ | |
| | | +------+ | +-----------------------+ | | | |
| | | +--------------------------------+ | | |
| | | | | |
| | | +------+ +---------------+ +-+ | | |
-->| | | Default | |-->| FIFO |-->| | | | |-->
| | +------------->|"low" | +---------------+ | |--+-->| |
| | | +-----------------------+ | | |
| | +--------------------------------+ | |
| | | |
| +---------------------------------------------------------+ |
| Queuing discipline with two delay priorities |
+-------------------------------------------------------------+
Figure 4: Combination of priority, TBF, and FIFO queuing
disciplines.
Figure 4 shows an example of such a stack: first, there is a
queuing discipline with two delay priorities. Packets which are
selected by the filter go to the high-priority class, while all
other packets go to the low-priority class. Whenever there are
packets in the high-priority queue, they are sent before packets
in the low-priority queue (e.g. the "sch_prio" queuing discipline
works this way). In order to prevent high-priority traffic from
starving low-priority traffic, we use a token bucket filter
(TBF), which enforces a rate of at most 1 Mbps. Finally, the
queuing of low-priority packets is done by a FIFO queuing
discipline. Note that there are other ways to accomplish what we
have done here, e.g. by using class-based queuing (CBQ).
Packets are enqueued as follows: when the "enqueue" function of a
queuing discipline is called, it scans the filters until one of
them indicates a match to a class identifier. It then queues the
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 4]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
packet for the corresponding class, which usually means to invoke
the "enqueue" function of the queuing discipline "owned" by that
class. Packets which do not match any of the filters are typically
attributed to some default class.
Typically, each class "owns" one queue, but it is in principle
also possible that several classes share the same queue or even
that a single queue is used by all classes of the respective
queuing discipline. Note, however, that packets do not carry any
explicit indication of which class they were attributed to.
Queuing disciplines that change per-class information when
dequeuing packets (e.g. CBQ) will therefore not work properly if
the "inner" queues are shared, unless they are able either to
repeat the classification or to pass the classification result
from "enqueue" to "dequeue" by some other means.
Usually when enqueuing packets, the corresponding flow(s) can be
policed, e.g. by discarding packets which exceed a certain rate.
3. Diffserv extensions to Linux traffic control
The traffic control framework available in recent Linux kernels [3]
already offers most of the functionality required for implementing
Diffserv support. We therefore closely followed the existing design
and added new components only where it was deemed strictly necessary.
3.1 Overview
Figure 5 shows the general structure of the forwarding path in a
Diffserv node.
+---------+ +-----+ +---------+
| Classi- |--->| +-----+ | |
--->| fier & |----->| +-----+ | Marking |--->
| Meter |------->| PHB |--->| |
+---------+ +-----+ +---------+
Figure 5: General Diffserv forwarding path.
Depending on the implementation, marking may also occur at
different places, possibly even several times.
The classification result may be used several times in the
Diffserv processing path, and it may also depend on external
factors (e.g. time), so reproducing the classification result may
not only be expensive, but actually impossible.
We therefore added a new field "tc_index" to the packet buffer
descriptor ("struct sk_buff"), where we store the result of the
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 5]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
initial classification. In order to avoid confusing "tc_index"
with the classifier "cls_tcindex", we will call the former
skb->tc_index throughout this document.
skb->tc_index is set using the "sch_dsmark" queuing discipline,
which is also responsible for initially retrieving the DSCP, and
for setting the DS field in packets before they are sent on the
network. "sch_dsmark" provides the framework for all other
operations.
The "cls_tcindex" classifier reads all or part of skb->tc_index
and uses this to select classes.
Finally, we need a queuing discipline to support multiple drop
priorities as required for Assured Forwarding. For this, we
designed GRED, a generalized RED. "sch_gred" provides a
configurable number of drop priorities which are selected by the
lower bits of skb->tc_index.
3.2 Classification and marking
The classifiers "cls_rsvp" and "cls_u32" can handle all micro-flow
classification tasks. Additionally, the "ipchains" firewall is
also capable of tagging microflows into classes. Behavior
aggregate classification could also be done using "cls_u32" and
"ipchains", but since we usually already have "sch_dsmark" at the
top level, we use the simpler "cls_tcindex" and retrieve the DSCP
using "sch_dsmark", which then puts it into skb->tc_index.
When using "sch_dsmark", the class number returned by the
classifier is stored in skb->tc_index. This way, the result can be
re-used during later processing steps.
Nodes in multiple DS domains must also be able to distinguish
packets by the inbound interface in order to translate the DSCP to
the correct PHB. This can be done using the "route" classifier, in
combination with the "ip rule" command interface subset.
Marking is done when a packet is dequeued from "sch_dsmark".
"sch_dsmark" uses skb->tc_index as an index to a table in which
the outbound DSCP is stored and puts this value into the packet's
DS field.
skb->ihp->tos
- - - - - - - - - - - - - - - - - - - - - - - - - - - - ->
Initial DS field marking -- ^
Initial value of tc_index |
+-+ +------+ | +---+-+ +-+ +-|-+
| | | cls_ |--->| | |--> . . . -->| | | | |
| |--->| rsvp |--->| | | | |--->| | |
| | | |--->| | | +---------------+ | | v |
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 6]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
-->| | +------+ +-|-+-------^-----------+ | O |-->
| | | | | ^ |
| +------------------|---------|----------------+ | |
| sch_dsmark | | | |
+--------------------|---------|------------------|-+
| | -- tc_index may |
v v change |
- - - - - - - - - - - - - - - - - - - - - - - - - - - - ->
skb->tc_index
Figure 6: Micro-flow classifier.
Figure 6 shows the use of "sch_dsmark" and skb->tc_index in a
micro-flow classifier based on "cls_rsvp". Figure 7 shows a
behavior aggregate classifier using "cls_tcindex".
skb->ihp->tos
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ->
| -- DS field is used May change DS field -- ^
| for classification |
+-|-+ +------+ +---+-+ +-+ +-|-+
| | | | cls_ |--->| | |--> . . . -->| | | | |
| | |----->| tc |--->| | | | |--->| | |
| | | |index |--->| | | +---------------+ | | v |
-->| O | +------+ +-|-+--------------^----+ | O |-->
| | | ^ | | | ^ |
| | +----------|---------|----------------|---------+ | |
| | sch_dsmark | | | | |
+-|------------|---------|----------------|-----------|-+
| | | -- tc_index -- | |
v | v may change v |
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ->
skb->tc_index
Figure 7: Behaviour aggregate classifier.
3.3 Cascaded meters
Multiple meters are needed if traffic should be assigned to more
than two classes, based on the bandwidth it uses. As an example,
such classes could be for "low", "high", and "excess" traffic.
Our current implementation supports a limited form of cascading at
the level of classifiers. We are testing a cleaner and more
efficient solution at the time of writing.
3.4 Implementing PHBs
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 7]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
PHBs based only on delay priorities, e.g. Expedited Forwarding
[4], can be built using CBQ [5] or the more simple "sch_prio".
(See section "Building sample configurations".)
Besides four delay priorities, which can again be implemented with
already existing components, Assured Forwarding [6] also needs
three drop priorities, which is more than the current
implementation of RED supports. We therefore added a new queuing
discipline which we call "generalized RED" (GRED). GRED uses the
lower bits of skb->tc_index to select the drop class and hence the
corresponding set of RED parameters.
3.5 Shaping
The so-called Token Bucket Filter ("sch_tbf") can be used for
shaping at edge nodes. Unfortunately, the highest rate at which
"sch_tbf" can shape is limited by the system timer, which normally
ticks at 100 Hz, but can be accelerated to 1 kHz or more if the
processor is sufficiently powerful. Note that Linux traffic
control supports more granular clocking for droppers (i.e. shapers
without buffer).
CBQ can also be used to do shaping.
Higher rates can be shaped when using hardware-based solutions,
such as ATM.
3.6 End systems
Diffserv-capable sources use the same functionality as edge
routers, i.e. any classification and traffic conditioning can be
administratively configured.
In addition to that, an application may also choose to mark
packets when they are generated. For IPv4, this can be done using
the "IP_TOS" socket option, which is commonly available on Unix,
and of course also on Linux. Note that Linux follows the [7]
convention of not allowing the lowest bit of the TOS byte to be
different from zero. This restriction is compatible with use for
Diffserv. Furthermore, the use of values corresponding to high
precedences (i.e. DSCP 0x28 and above) is restricted. This can be
avoided either by giving the application the appropriate
capabilities (privileges), or by re-marking (see below).
Setting the DS field with IPv6 is currently very awkward. In the
future, an improved interface is likely to be provided that
unifies the IPv4 and IPv6 usage and that may contain additional
improvements, e.g. selection of services instead of "raw" DS field
values.
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 8]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
An application's choice of DS field values can always be refused
or changed by traffic control (using re-marking) before a packet
actually reaches the network.
4. New components
The prototype implementation of Diffserv support requires the
addition of three new traffic control elements to the kernel: (1) the
queuing discipline "sch_dsmark" to extract and to set the DSCP, (2)
the classifier "cls_tcindex" which uses this information, and (3) the
queuing discipline "sch_gred" which supports multiple drop priorities
and buffer sharing.
Only the queueing discipline to extract and set the DSCP is truly
specific to the differentiated services architecture. The other two
elements can also be used in other contexts.
Figure 6 shows the use of "sch_dsmark" for the initial packet marking
when entering a Diffserv domain. The classification and rate control
metering is performed by a micro-flow classifier, e.g. "cls_rsvp", in
this case.
This classifier determines the initial TC index which is then stored
in skb->tc_index. Afterwards, further processing is performed by an
inner queuing discipline. Note that this queuing discipline may read
and even change skb->tc_index.
When a packet leaves "sch_dsmark", skb->tc_index is examined and the
DS field of the packet is set accordingly.
Figure 7 shows the use of "sch_dsmark" and "cls_tcindex" in a node
which works on a behavior aggregate, i.e. on packets with the DS
field already set. The procedure is quite similar to the previous
scenario, with the exception that "cls_tcindex" takes over the role
of "cls_rsvp" and that the DS field of the incoming packet is copied
to "tc_index" before invoking the classifier.
Note that the value of the outbound DS field can be affected at three
locations: (1) in "sch_dsmark", when classifying based on
skb->tc_index, which contains the original value of the DS field; (2)
by changing skb->tc_index in an inner queuing discipline; and (3) in
"sch_dsmark", when mapping the final value of skb->tc_index back to a
new value of the DS field.
4.1 "sch_dsmark"
As illustrated in figure 8, the "sch_dsmark" queuing discipline
performs three actions based on the scripting invocation:
- If "set_tc_index" is set, it retrieves the content of the DS
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 9]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
field and stores it in skb->tc_index.
- It invokes a classifier and stores the class ID returned in
skb->tc_index. If the classifier finds no match, the value of
"default_index" is used instead. If "default_index" is not
set, the value of skb->tc_index is not changed. Note that this
can yield undefined behaviour if neither "set_tc_index" nor
"default_index" is set.
- After sending the packet through its inner queuing discipline,
it uses the resulting value of skb->tc_index as an index into
a table of (mask,value) pairs. The original value of the DS
field is then replaced using the following formula:
ds_field = (ds_field & mask) | value
skb->ihp->tos
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ->
| -- Optional: DS field is copied to tc_index | ^
| | |
+-|-+ +---------+ tc_index is translated +----|---|---+
| | |-+->|Filter |_ to new DSCP \| | | |
| | | | +-------^-+ -_ -- res.classid contains \ v | |
| | | | | -_ new tc_index |\ (&)>(or) |
| | | | +-------|-+ -->+-------+-------------+ | ^ ^ |
| v | +->|Filter | |------->|classid|Queuing disc.|-->| | | |
-->| O | | +-------^-+ _-->+-------+-------------+ | |___| |-->
| | | | Default | _- | | [__|__] |
| | | +----------|---- | | +>[__|__] |
| | | | default_index provides tc_index | | [__|__] |
| | +------------|--------------|---------------------+ |Mask Value|
| | sch_dsmark | | | |
+-|--------------|--------------|-----------------------|----------+
| | -- classifier|may use |
v | tc_index v |
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ->
skb->tc_index
Figure 8: The "dsmark" queuing discipline.
Table 1 lists the parameters that can be configured in the
"dsmark" queuing discipline. The upper part of the table shows
parameters of the queuing discipline itself. The lower part shows
parameters of each class.
---------------------------------------------------------
| Variable name / tc keyword | Value | Default |
---------------------------------------------------------
| indices | 2^n | none |
| default_index | 0... indices-1 | absent |
| set_tc_index | none (flag) | absent |
---------------------------------------------------------
| mask | 0...0xff | 0xff |
| value | 0...0xff | 0 |
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 10]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
---------------------------------------------------------
Table 1: Configuration parameters of "sch_dsmark".
"indices" is the size of the table of (mask,value) pairs.
4.2 "cls_tcindex"
As shown in figure 9, the "cls_tcindex" classifier uses
skb->tc_index to select classes. It first calculates the lookup
key using the algorithm
key = (skb->tc_index >> shift) & mask
Then it looks for an entry with this handle. If an entry is
found, it may call a meter (if configured), and it will return the
class IDs of the corresponding class.
If no entry is found, the result depends on whether "fall_through"
is set. If set, a class ID is constructed from the lookup key.
Otherwise, it returns a "not found" indication and the search
continues with the next classifier. We call construction of the
class ID an "algorithmic mapping". This can be used to avoid
setting up a large number of classifier elements if there is a
sufficiently simple relation between values of skb->tc_index and
class IDs. An example of this trick is used in the AF scripts on
the web site.
The size of the lookup table can be set using the "hash" option.
"cls_tcindex" automatically uses perfect hashing if the range of
possible choices does not exceed the size of the lookup table. If
the "hash" option is omitted, an implementation-dependent default
value is chosen.
+-----------------+
| mask shift |
| | | |
skb->tc_index ---->(&) ---->(>>) |
| | |
+-------------|---+
|
+-------------------+
| Key
v
+------------------------+
| key class(id) police |
+------------------------+
| key class(id) police |
+------------------------+
| | |
: | +--------> Profile
: |
: +------------------> Class
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 11]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
|
+-----------------> *:Key
if fall_through
Figure 9: The "tcindex" classifier.
Table 2 shows the parameters that can be configured in the
"tcindex" classifier. The upper part of the table shows parameters
of the classifier itself. The lower part shows parameters of each
element.
----------------------------------------------------------------
| Variable | tc keyword | Value | Default |
----------------------------------------------------------------
| hash | hash | 1...0x10000 | implementation- |
| | | | dependent |
| mask | mask | 0...0xffff | 0xffff |
| shift | shift | 0...15 | 0 |
| fall_through | fall_through/ | flag | fall_through |
| | pass_on | | |
----------------------------------------------------------------
| res | classid | major:minor | none |
| police | police | Profile | none |
----------------------------------------------------------------
Table 2: Configuration parameters of "cls_tcindex".
Note that the keyword used by "tc" (the command-line tool used to
manually configure traffic control elements) does not always
correspond to the variable internally used by "cls_tcindex".
4.3 "sch_gred"
Virtual Queue RED Parameters
+-------+
Class virtual queue | VC0 |
selector +---->| VC1 |----+
| | ... | |
skb->tc_index | | VCn | | Class physical
| +-------+ v Queue queue
Packet | +---+ packet ---------+
================================>| |========> | | | |===>
+---+ ---------+
I
I Drop packet
X
Figure 10: Generic RED and the use of "skb->tc_index"
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 12]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
Figure 10 shows how "sch_gred" uses skb->tc_index for the
selection of the right virtual queue (VQ) within a physical queue.
What makes "sch_gred" different from other Multi-RED
implementations is the fact that it is decoupled from any one
specific block along the packet's path such as a header classifier
or meter. For example, CISCO's DWRED [8] is tied to mapping VQ
selection based on the precedence bits classification. On the
other hand, RIO [9] is tied to the IN/OUT metering levels for the
selection of the VQ. In the case of GRED, any classifier, meter,
etc. along the data path can affect the selection of the VQ by
setting the appropriate value of skb->tc_index.
GRED also differs from the two mentioned multiple RED mechanisms
in that it is not limited to a specific number of VQ. The number
of VQs is configurable for each physical class queue. GRED does
not assume certain drop precedences (or priorities). It depends on
the configuration parameters passed on by the user. In essence,
DWRED and RIO are special cases of GRED.
Currently, the number of virtual queues is limited to 16 (the
least significant 4 bits of skb->tc_index). There is a one to one
mapping between the values of skb->tc_index and the virtual queue
number in a class. Buffer sharing is achieved using one of two
ways (selectable via configuration):
- Simple setting of physical queue limits. It is up to the
individual configuring the virtual queues parameters to decide
which one gets preferential treatment. Sharing and
preferential treatment amongst virtual queues is based on
parameter settings such as the per-virtual queue physical
limit, threshold values and drop probabilities. This is the
default setting.
- A similar average queue trick as that is used in [9]. This is
selected by the operator "grio" during the setup. Each VQ
within a class is assigned a priority at configuration time.
Priorities range from 1 to 16 at the moment, with 1 being the
highest. The computation of the average queue value (for a VQ)
involves first summing to the current stored average queue
value all the the other average queue values of the VQs which
are more important than it. This way a relatively higher
priority (lower priority value) gets preferential treatment
because its average queue is always the lowest; the relatively
lower priority will still continue to send when the higher
ones are not dominating the buffer space. A user can still
configure the per-virtual-Queue physical queue limits,
threshold values, and drop probabilities as in the (first)
case when the "grio" option is not defined.
The second scheme is slightly slower than the first one (a few
more per-packet computations).
GRED is configured in two steps. First (see also the upper part of
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 13]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
table 3) the generic parameters are configured to select the
number of virtual queues "DPs" and whether to turn on the RIO-like
buffer sharing scheme ("grio"). Also at this point, a default
virtual queue is selected so that packets with out of range values
of skb->tc_index or misconfigured priorities in the case of "grio"
buffer-sharing setup are directed to it. Normally, the default
virtual queue is the one with the highest likelihood of having a
packet discarded. The operator "setup" identifies that this is a
generic setup for GRED.
--------------------------------------------------
| Variable | tc keyword | Value | Default |
--------------------------------------------------
| DPs | DPs | 1...16 | none |
| def | default | 1...DPs | none |
| grio | grio | none (flag) | absent |
--------------------------------------------------
| limit | limit | bytes | none |
| qth_min | min | bytes | none |
| qth_max | max | bytes | none |
| n/a | avpkt | bytes | none |
| n/a | bandwidth | rate | 10 Mbps |
| n/a | burst | packets | none |
| n/a | probability | [0...1) | 0.02 |
| DP | DP | 1...DPs | 0 |
| prio | prio | 1...DPs | none |
--------------------------------------------------
Table 3: Configuration parameters of "sch_gred".
The second step is to set parameters for individual virtual
queues. (See also the lower part of table 3). These parameters are
equivalent to the traditional RED parameters. In addition, each
RED configuration identifies which virtual queue the parameters
belong to as well as the priority if the "grio" technique is
selected. The mandatory parameters are:
- "limit" defines the virtual queue "physical" limit in bytes.
- "min" defines the minimum threshold value in bytes.
- "max" defines the maximum threshold value in bytes.
- "avpkt" is the average packet size in bytes.
- "bandwidth" is the wire-speed of the interface.
- "burst" is the number of average-sized packets allowed to
burst. The Linux RED implementation attempts to compute an
optimal W value for the user based on the "avpkt", minimum
threshold and allowed burst size. This is based on the
equation: burst + 1 - qmin/avpkt < (1-(1-W)^burst)/W as
described in [10].
- "probability" defines the drop probability in the range
[0...).
- "DP" identifies the virtual queue assigned to these
parameters.
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 14]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
- "prio" identifies the virtual queue priority if "grio" was set
in the general parameters.
5. Building sample configurations
^ ^ ^ ^ ^^
+------------+ ( netlink ) +------------+
| tc |<---> ( cloud ) <--> | kernel |
+------------+ ( ) +------------+
( )
^^^^^^
Figure 11: User space to kernel communication using "tc"
Communication and configuration of the kernel code or modules is
achieved by a user level program tc written by Alexey. The
interaction is shown in figure 11.
Given the flexibility of the code, there are many ways to reach the
same end goal. Depending on the requirement, one could script the
same PHB using a different combinations of qdiscs; e.g. one could
build a core EF capable router using either CBQ to rate limit it and
prioritise its traffic or instead use the PRIO qdisc with a Token
Bucket attached to rate limit it. It is hoped that users of Linux
Diffserv will be able to script their own flavored configurations.
The examples below (as well as those on the Linux Diffserv web site)
are simplistic, in the sense that they only assume one interface per
node. One should easily be able to extend them for more than one
interface
The normal recipe for creating a configuration script is:
- attach "sch_dsmark" to the output interface
- define the structure of the queuing discipline(s) inside
"sch_dsmark"
- number the classes and decide on a numbering scheme to use for
skb->tc_index (the latter may be trivial if skb->tc_index is only
used within "sch_dsmark".)
- identify which packets go to which classes and configure the
classifier(s) of "sch_dsmark" accordingly
The script lines in the next subsections are numbered for clarity of
the accompanying description below.
For clarity, we did not include handling of historical DS field
values in our scripts.
5.1 Edge device: Packet re-marking
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 15]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
Table lookup
+-+ +-----------------+ +---------+------+ +-----------+
| |-+->| u32 | Meter |--->|Class 1:1| | | DSCP 0x2e |
| | | +-----------------+ +---------+ | | |
| | | | Exceeded -- +----->|Class 1:2| | | DSCP 0x18 |
| | | | limit +---------+ FIFO |---->| |-->
-->| | | +-------------------->|Class 1:3| | | DSCP 0x1a |
| | | No match +---------+ | | |
| | +---------------------------------->| | | |
| | +------+ | |
| | | |
| +--------------------------------------------------+ |
| sch_dsmark (1:0) |
+----------------------------------------------------------------+
Figure 12: Packet re-marking at the edge.
1. tc qdisc add dev eth0 handle 1:0 root dsmark indices 64
2. tc class change dev eth0 classid 1:1 dsmark mask 0x3 value 0xb8
3. tc class change dev eth0 classid 1:2 dsmark mask 0x3 value 0x68
4. tc class change dev eth0 classid 1:3 dsmark mask 0x3 value 0x48
5. tc filter add dev eth0 parent 1:0 protocol ip prio 4 handle 1:
u32
divisor 1
6. tc filter add dev eth0 parent 1:0 protocol ip prio 5 handle 2:
u32
divisor 1
7. tc filter add dev eth0 parent 1:0 prio 4 u32
match ip dst 10.0.0.0/24
police rate 1Mbit burst 2K continue
flowid 1:1
8. tc filter add dev eth0 parent 1:0 prio 5 u32
match ip dst 10.0.0.0/24
flowid 1:2
9. tc filter add dev eth0 parent 1:0 prio 4 u32
match ip dst 10.1.0.0/16
match ip src 192.1.0.0/16
match ip protocol 6 0xff
match ip dport 0x17 0xffff
flowid 1:3
The first line attaches a dsmarker to the interface eth0 on the
root node. The second line instructs the dsmarker to remark the
DSCP of classid 1:2 by first masking out bits 6 and 7 then ORing
that with a value of 0xb8. Note that: This is equivalent to
ignoring the ECN bits, and setting the code point value to 0x2e
(which happens to be the DSCP for EF). In a similar manner, the
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 16]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
third line instructs the dsmarker to remark the CP of classid 1:2
to 0x1a (DSCP for AF31). The fourth line adds a remarking the
class 1:3 DSCPs to 0x12 (DSCP for AF21). These three lines in
effect are also registering the classes 1:2, 1:3 and 1:4.
Line 5 adds a u32 classifier with priority of 4. Line 6 adds
another classifier of a lower priority. Line 7 maps all packets
with a source IP address of 10.0.0.0/24 to class 1:1. Line 7 and 8
show how one can attach a meter to a classifier and the reaction
to an exceeding of the rate. Basically, the trick is to define two
filters matching the same headers with a higher priority one
attached with a meter and policing action. The operator "continue"
is used to allow a lookup of the next lower priority matching
filter. In this case, should the metering be exceeded in class
1:1, the flow is reclassified to class 1:2. Line 9 selects all TCP
packets from source subnet 10.1.1.0/16 destined towards subnet
192.1.1.0/16 and sends them to the queue for class 1:3.
The overall effect is: all packets coming in from source subnet
address 10.0.0.0/24 will get their packets marked with a DSCP of
0x2e (EF class/PHB) up to a point where they start exceeding their
allocated rate (of 1Mbps and burst of 2K). In this case, the
packets are demoted to class 1:2 where they will be remarked to
DSCP 0x18 (AF21). Any TCP packets of origin subnet 10.1.1.0/16
destination subnet 192.1.1.0/16 will be remarked to 0x1A (AF22).
It is easy to see that one can build a multi-color marking scheme
of large depths using using such cascading filter/metering
schemes.
5.2 Core device: EF using CBQ
DSCP=0x2E
|
+-+ +--------+ +-+ +--------+ | +---------+-------+ +-+ +-+
| |->|tc_index|->| |--+-->|tc_index|-->|Class 2:1| pFIFO |--+ | | | |
| | +--------+ | | | +--------+ +---------+-------+ | | | | |
| | | | | | | | | |
| | | | | No match +---------+-------+ | | | | |
| | | | +--------------->|Class 2:2| RED |--+-->| |->| |
| | | | +---------+-------+ | | | |
-->| | | | | | |
|-->
| | | +--------------------------------------------+ | | |
| | | CBQ (2:0) | | |
| | +------------------------------------------------+ | |
| | | |
| +------------------------------------------------------------------+ |
| sch_dsmark (1:0) |
+----------------------------------------------------------------------+
Figure 13: Configuring EF using CBQ.
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 17]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
The script below is the output of the EF Perl script on the Linux
Diffserv Web site.
1. tc qdisc add dev eth0 handle 1:0 root dsmark indices 64
set_tc_index
2. tc filter add dev eth0 parent 1:0 protocol ip prio 1
tcindex mask 0xfc shift 2
3. tc qdisc add dev eth0 parent 1:0 handle 2:0 cbq
bandwidth 10Mbit allot 1514 cell 8 avpkt 1000 mpu 64
4. tc class add dev eth0 parent 2:0 classid 2:1 cbq
bandwidth 10Mbit
rate 1500Kbit avpkt 1000 prio 1 bounded isolated
allot 1514 weight 1 maxburst 10 defmap 1
5. tc qdisc add dev eth0 parent 2:1 pfifo limit 5
6. tc filter add dev eth0 parent 2:0 protocol ip prio 1
handle 0x2e tcindex classid 2:1 pass_on
7. tc class add dev eth0 parent 2:0 classid 2:2 cbq
bandwidth 10Mbit rate 5Mbit avpkt 1000 prio 7
allot 1514 weight 1 maxburst 21 borrow
8. tc qdisc add dev eth0 parent 2:2 red limit 60KB min 15KB
max 45KB burst 20 avpkt 1000 bandwidth 10Mbit
probability 0.4
9. tc filter add dev eth0 parent 2:0 protocol ip prio 2
handle 0 tcindex mask 0 classid 2:2 pass_on
Line 1 attaches to the root node on interface eth0 a dsmarker
which copies the TOS byte into skb->tc_index. Line 2 adds a filter
to the root node which exists merely to mask out the ECN bits and
extract the DSCP field by shifting to the right by two bits. A
classful qdisc using CBQ is attached to node 2:0 (2:0 is the child
of the root node 1:0) - this is in line 3. Two child classes are
defined out of the 2:0 node. 2:1 is of type CBQ which is bound to
a rate of 1.5 Mbps (line 4). A packet counting FIFO qdisc
("pfifo") with a maximum queue size of 5 packets is attached to
the CBQ class as the buffer management scheme (line 5). Line 6
adds a "tcindex" classifier which will redirect all packets with a
skb->tc_index 0x2e (the DSCP for EF) to classid 2:1 - non 0x2e are
allowed to fall through so they can be matched by another filter.
Line 7 defines another CBQ class, 2:1, emanating out of node 2:0 -
this is intended to be the Best Effort class. The rate is limited
to 5 Mbps; however, the class is allowed to borrow extra bandwidth
if it is not being used (via the operator "borrow"). Since the EF
class does not lend its bandwidth (operator "isolated" line 4),
the BE can only borrow up to a maximum of an extra 3.5Mbps. Note
that in scenarios where there is no congestion on the wire, this
might not be a very smart provisioning scheme since the BE traffic
will probably get equivalent traffic performance as EF. The major
differentiator in that case will be the priorities. The EF class'
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 18]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
traffic will always be served first as long as there is something
on the queue (prio 1 is higher than prio 8 in comparing line 4 and
7). Line 8 attaches RED as the buffer management scheme to be used
by the BE class. Line 9 then maps the rest of the packets (without
DSCP of 0x2e) to the classid 2:2. The description of the RED and
CBQ parameters are beyond the scope of this document.
6. Conclusion
We have given a brief introduction to the elements of Linux traffic
control in general, and we have explained how the existing
infrastructure can be extended in order to support Diffserv. We have
then shown how we implemented support for the Diffserv architecture
in Linux, using the traffic control framework of recent kernels. We
have also described how nodes can be configured using our work.
Our implementation provides a very flexible platform for experiments
with PHBs already under standardization as well as experiments with
new PHBs. It can also serve as a platform for work in other areas of
Diffserv, such as edge configuration management and policy
management.
Future work will focus on the elimination of a few restrictions that
still exist in our architecture, and also in the simplification of
the configuration procedures.
7. References
[1] RFC2475; Blake, Steven; Black, David; Carlson, Mark; Davies,
Elwyn; Wang, Zheng; Weiss, Walter. An Architecture for
Differentiated Services, IETF, December 1998.
[2] RFC2474; Nichols, Kathleen; Blake, Steven; Baker, Fred; Black,
David. Definition of the Differentiated Services Field (DS
Field) in the IPv4 and IPv6 Headers, IETF, December 1998.
[3] Almesberger, Werner. Linux Traffic Control - Implementation
Overview,
ftp://lrcftp.epfl.ch/pub/people/almesber/pub/tcio-current.ps.gz,
Technical Report SSC/1998/037, EPFL, November 1998.
[4] RFC2598; Jacobson, Van; Nichols, Kathleen; Poduri, Kedarnath.
An Expedited Forwarding PHB, IETF, June 1999.
[5] Floyd, Sally; Jacobson, Van. Link-sharing and Resource
Management Models for Packet Networks, IEEE/ACM Transactions on
Networking, Vol. 3 No. 4, pp. 365-386, August 1995.
[6] RFC2597; Heinanen, Juha; Baker, Fred; Weiss, Walter;
Wroclawski, John. Assured Forwarding PHB Group, IETF, June 1999.
[7] RFC1349; Almquist, Philip. Type of Service in the Internet
Protocol Suite, IETF, July 1992.
[8] CISCO DWRED. Distributed Weighted Random Early Detection,
http://www.cisco.com/univercd/cc/td/doc/product/software/ios111/cc111/wred.htm
[9] Clark, David; Wroclawski, John. An Approach to Service
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 19]
INTERNET-DRAFT draft-almesberger-wajhak-diffserv-linux-01.txt June 1999
Allocation in the Internet, Internet Draft
draft-clark-diff-svc-alloc-00.txt, July 1997.
[10] Floyd, Sally; Jacobson, Van. Random Early Detection Gateways
for Congestion Avoidance, IEEE/ACM Transactions on Networking,
August 1993.
8. Author's address
Werner Almesberger
Institute for computer Communications and Applications
Swiss Federal Institute of Technology (EPFL)
CH-1015 Lausanne
Switzerland
email: Werner.Almesberger@epfl.ch
Jamal Hadi Salim
Computing Technology Labs
Nortel Networks
P.O.BOX 3511, Station C
Ottawa, Ontario
Canada K1Y 4H7
email: hadi@nortelnetworks.com
Alexey Kuznetsov
Institute for Nuclear Research (INR)
60th October Anniversary pr. 7a
Moscow 117312
Russia
email: kuznet@ms2.inr.ac.ru
Almesberger, Hadi & Kuznetsov Expires 12/99 [Page 20]