Network Working Group P. Francis
Internet-Draft Cornell U.
Intended status: Informational X. Xu
Expires: April 29, 2009 Huawei
H. Ballani
Cornell U.
October 26, 2008
Mapped BGP Design
draft-francis-mapped-bgp-design-00.txt
Status of this Memo
By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on April 29, 2009.
Abstract
This draft introduces Mapped-BGP, a routing protocol that uses BGP to
distribute tunnel endpoint-to-prefix mappings. The goals of this
draft are to present preliminary concepts and get feedback. It is
not meant to be a fully-formed proposal. The goals of Mapped-BGP
are: 1) to reduce the processing required to run BGP, 2) to speed up
inter-domain convergence, 3) to improve the cross-ISP load balancing
capabilities of BGP, and where possible, 4) to enable forms of
address aggregation like geographical addressing (i.e. for IPv6).
Improved address aggregation is unlikely to be very useful for IPv4,
Francis, et al. Expires April 29, 2009 [Page 1]
Internet-Draft Mapped BGP October 2008
because most addresses have already been assigned. This design takes
the position that Mapped BGP is useful even without better
aggregation, because 1) FIB size can be reduced through FIB
suppression with Virtual Aggregation, and 2) RIB size per se is not
the growth bottleneck.
Table of Contents
1. Introduction
2. Terms and concepts
3. Description of Mapped-BGP
  3.1. Structure of new attributes
  3.2. Map-RIB data structure
  3.3. Tunnel Endpoints (TE)
  3.4. Rules for advertising maps
    3.4.1. Rules for initiating a map
    3.4.2. Transposing Maps and Routes
    3.4.3. Authenticating updates
    3.4.4. Longest-prefix map selection rules and aggregation
    3.4.5. Changing maps
    3.4.6. Propagating and activating maps
    3.4.7. Changing TE-route
  3.5. Load Balancing in Mapped-BGP
    3.5.1. Incoming Load Balance at Sites
    3.5.2. Incoming Load Balance at Lower-tier ISPs
    3.5.3. Multi-exit discrimination with Mapped-BGP
  3.6. Aggregation in Mapped-BGP
    3.6.1. Geographic or Metro Addressing
    3.6.2. Opportunistic AS aggregation clusters
    3.6.3. Generalized Inter-domain Virtual Aggregation
4. Performance Benefits
5. Normative References
Authors' Addresses
Intellectual Property and Copyright Statements
1. Introduction
The basic idea behind Mapped-BGP is quite simple. Rather than
distribute routes to reachable prefixes, BGP distributes routes to
tunnel endpoints (TE) and distributes maps that associate reachable
prefixes with TEs. Otherwise, BGP runs in much the same way that it
does today. Indeed, in Mapped-BGP it is possible to transpose TE
routes and their associated maps back into routes to prefixes. This
transposition is used to allow ISPs running Mapped-BGP to interface
with legacy ISPs that do not run Mapped-BGP. The transposition also
allows us to reuse the security mechanisms of BGP, especially prefix
filtering.
The maps in Mapped-BGP are, for the most part, policy-free. By this
we mean that the types of policies normally applied to routes (the
seven-step best path computation, the assignment of weights and local
preferences, the addition or deletion of attributes including path
prepending, and decisions about where to advertise routes) are not
applied to maps. Rather, maps are blindly distributed along the
routes traced out by their associated TEs. Since the majority of
prefixes would be distributed by maps rather than by routes, the cost
of processing BGP updates would be significantly decreased. Note
that RIB and FIB size would not be reduced with this approach.
However, FIB size can be reduced with FIB suppression associated with
Virtual Aggregation [I-D.francis-idr-intra-va], and we doubt that RIB
size per se is a serious bottleneck in BGP (this needs to be
validated).
A natural question to ask is, if policies are not being applied to
maps, how are BGP policies applied to prefixes advertised in maps?
Since maps are distributed along the reverse best paths of their
associated TEs, policies that apply to the TE routes are
automatically grandfathered onto the map prefixes. This works well
for policies that are used to control which routes pass through an
ISP, for instance to configure valley-free routing. This does not
work as well, however, for policies used for load balance across
ASes. This is because BGP load balancing mechanisms operate at the
granularity of routes, which in the absence of maps operate at the
granularity of prefixes. With Mapped-BGP, a TE-route originated by
an ISP will apply to all of the ISP's prefixes. In other words, the
ISP only originates a single route, and so there isn't enough route
granularity on which to apply BGP load balancing policies.
To make up for this shortcoming, Mapped-BGP introduces a parameter
called a Tunnel Endpoint Discriminator (TED). This is a
parameterless value that a remote router uses to decide the relative
probability with which it will use the different TEs that apply to a
given prefix. TEDs allow both multi-homed sites and lower-tier
multi-homed ISPs to load balance at relatively fine granularity.
The tunnels in Mapped-BGP provide a simple mechanism to produce
virtual topologies across ASes. If used in concert with aggregatable
address assignment policies like geographical addressing, Mapped-BGP
provides significant new opportunities for aggregation without the
need for careful physical topology management across ISPs (for
instance within a geographical area).
2. Terms and concepts
FIB-install and FIB-suppress: These two terms refer to the act of
installing a route into the FIB, and not installing a route into
the FIB, respectively. Note that the mechanism for not installing
a route into the FIB may be simply not putting it into the routing
table (defined below).
Head-end and tail-end: Head-end generally refers to the start of the
tunnel. For instance, the head-end router is the router that starts
the tunnel, the head-end ISP is the ISP that contains the head-end
router, and so on. Tail-end generally refers to the terminating point
of the tunnel. The term tunnel endpoint (TE) is generally
synonymous with tail-end.
Legacy: Refers to something that does not operate Mapped-BGP (for
instance, a legacy AS or ISP, a legacy router, etc.). Anything
that is not labeled as legacy is assumed to be operating Mapped-
BGP.
Map: The term "map" refers to a single prefix-TE mapping. It may
also refer to the "map attribute" in a BGP update. Note that in
general, however, a map attribute will contain multiple individual
maps.
Routing Table: The term Routing Table is defined here the same way
as in Section 3.2 of RFC4271: "Routing information that the BGP
speaker uses to forward packets (or to construct the forwarding
table used for packet forwarding) is maintained in the Routing
Table." As such, FIB Suppression can be achieved by not
installing a route into the Routing Table.
Tunnel endpoint (TE): The term TE typically refers to the router or
AS that detunnels the packet. The term can also refer to the TE
address. Tunnel endpoints (TE) should be anycasted across some or
all routers in the AS.
TE-route: This is a normal BGP route whose NLRI contains one or more
TEs.
TE-block, TE-subblock, and Path Splitting: Typically a TE will be
defined by a CIDR block of addresses (as opposed to a single
address). This is done to enable upstream load balance through a
mechanism called Path Splitting (see Section 3.5), whereby the
route for the entire TE block is split into multiple routes, each
to a sub-block within the block. These routes are advertised to
different neighbors, giving upstream ASes multiple paths to choose
from to get to a given destination prefix. The term TE-block
refers to the entire block of addresses that comprise a TE, and
the term TE-subblock refers to a sub-block within the block.
Tunnel endpoint discriminator (TED): A map may have a TED associated
with it for the purpose of incoming load balancing. This is used
when an AS is multi-homed to multiple providers, and each provider
serves a TE. A split also has TEDs associated with it, which are
used by an ISP to load balance traffic incoming among its AS
peering links. The TED is a parameterless indication of the
proportion of traffic that should be sent to each TE or AS-link.
Note that head-end ISPs are not required to honor the TED. Note
also that TED info in maps is lost when maps are aggregated.
3. Description of Mapped-BGP
3.1. Structure of new attributes
There are two new attributes associated with Mapped-BGP. One is the
"map", which is used to associate a reachable address prefix with a
Tunnel Endpoint Block (TE-block). The other is the "split", which is
used to associate a TED value with segments of multiple paths to TEs.
The contents of the map attribute are as follows:
   [TE-list]
   List of one or more address targets, consisting of:
      [prefix],
      action,
      [TED],
where:
   TE = a CIDR block of one or more addresses
   TE-list = a list of TEs
   action = add, remove
   prefix = a CIDR block of one or more addresses
   TED = value between 0 and 255 (or a smaller range)
   [] = optional (note that either the TE-list, or the prefix, or
        both must be present)
The format of the split attribute is:
   Downstream AS
   List of two or more Upstream ASes, consisting of:
      Upstream AS
      TED
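As a rough illustration only, the two attributes could be modeled in memory as shown below. This is a sketch, not a wire format: the class and field names (AddressTarget, MapAttribute, SplitAttribute, te_list, and so on) are our own assumptions, and the presence check encodes one possible reading of the rule that the TE-list, the prefix, or both must be present.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network
from typing import List, Optional, Tuple


@dataclass
class AddressTarget:
    """One address target inside a map attribute (hypothetical layout)."""
    action: str = "add"                     # "add" or "remove"
    prefix: Optional[IPv4Network] = None    # optional reachable prefix
    ted: Optional[int] = None               # optional TED, 0..255

    def __post_init__(self):
        if self.action not in ("add", "remove"):
            raise ValueError("action must be 'add' or 'remove'")
        if self.ted is not None and not 0 <= self.ted <= 255:
            raise ValueError("TED must be in 0..255")


@dataclass
class MapAttribute:
    """Optional TE-list plus one or more address targets."""
    targets: List[AddressTarget]
    te_list: Optional[List[IPv4Network]] = None

    def __post_init__(self):
        if not self.targets:
            raise ValueError("at least one address target is required")
        # Assumed reading: without a TE-list, every target needs a prefix.
        if not self.te_list and any(t.prefix is None for t in self.targets):
            raise ValueError("TE-list or prefix (or both) must be present")


@dataclass
class SplitAttribute:
    """A downstream AS and two or more (upstream AS, TED) pairs."""
    downstream_as: str
    upstream: List[Tuple[str, int]]

    def __post_init__(self):
        if len(self.upstream) < 2:
            raise ValueError("a split lists two or more upstream ASes")
```

For example, MapAttribute(targets=[AddressTarget(prefix=IPv4Network("20.1.0.0/16"), ted=20)]) models a map with a NULL TE field and a single prefix carrying a TED.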
3.2. Map-RIB data structure
We assume a new data structure called the map-RIB. For each eBGP
neighbor, there is conceptually a map-RIB-in and a map-RIB-out, which
contains the maps received from and sent to the neighbor
respectively.
Normally the same map (i.e. same TE, TED, and action) will have been
received from each peer and sent to each peer. During a change (a
map going from add to remove, or a change in TED), however, there
will be a brief convergence period during which the maps received
from different peers will differ. The map-RIB data structure can be
substantially compressed to exploit this fact. In other words, most
map-RIB entries can simply have a flag indicating that all received
and sent maps are the same, and avoid listing them explicitly.
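One way to picture this compression is sketched below, under our own assumptions: MapRibEntry and its methods are hypothetical names, and a map value is any hashable summary such as a (TE, TED, action) tuple. The entry stores a single shared value while all peers agree and falls back to explicit per-peer state only during convergence.

```python
from typing import Dict, Hashable, Iterable, Optional


class MapRibEntry:
    """Per-prefix map-RIB-in state, compressed while all peers agree."""

    def __init__(self):
        self.common: Optional[Hashable] = None    # value shared by all peers
        self.per_peer: Dict[str, Hashable] = {}   # used only while peers differ

    def update(self, peer: str, value: Hashable, all_peers: Iterable[str]) -> None:
        peers = set(all_peers)
        # Expand to explicit per-peer state, apply the update, re-compress.
        if self.per_peer:
            state = dict(self.per_peer)
        elif self.common is not None:
            state = {p: self.common for p in peers}
        else:
            state = {}
        state[peer] = value
        if set(state) == peers and len(set(state.values())) == 1:
            self.common, self.per_peer = value, {}     # the "all same" flag
        else:
            self.common, self.per_peer = None, state

    def get(self, peer: str):
        """Latest map received from `peer`, or None if none was received."""
        return self.per_peer.get(peer, self.common)
```

In the steady state per_peer is empty, so the cost per prefix is one stored value regardless of the number of eBGP peers.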
3.3. Tunnel Endpoints (TE)
TEs are typically anycasted across multiple routers for both the sake
of resilience and to allow for aggregation. When a TE is associated
with a single AS, then all routers in the AS will be anycasted with
the TE address. A TE may be associated with multiple ASes (i.e. for
aggregation), in which case all routers in all the ASes will be
anycasted. It may also be possible to assign a TE to a metro or
geographical area. In this case, the TE address is anycasted across
at least all routers within the area, but not necessarily all routers
in all ASes that have a presence in the area.
A "TE" can in fact be composed of a CIDR block. In other words, a
group of addresses can all act as the TE (i.e. all cause the router
to detunnel the packet). From the point of view of the TE router,
all addresses in the block are treated identically; it doesn't
matter which TE address was used to tunnel a received packet. The
purpose of allowing a block of addresses to be a TE is to allow for
load balancing. Different sub-blocks within a TE-block may follow
different paths to the TE (path splitting), thus allowing the head-
end router to select a path by virtue of selecting different TE
addresses within the block. This path selection can be loosely
influenced by downstream ASes through the use of TEDs.
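The following sketch shows how a TE-block might be divided into TE-subblocks using Python's standard ipaddress module. The block 40.1.1.0/28 is taken from the example in Section 3.4.1; the even two-way split, the neighbor names, and the helper te_address_for_path are assumptions made purely for illustration.

```python
from ipaddress import IPv4Network

te_block = IPv4Network("40.1.1.0/28")                   # a TE-block
sub_blocks = list(te_block.subnets(prefixlen_diff=1))   # two /29 TE-subblocks

# Hypothetical split: each TE-subblock is advertised to a different neighbor.
advertised_to = dict(zip(["X", "Y"], sub_blocks))


def te_address_for_path(neighbor: str):
    """Head-end picks any address in the sub-block routed via `neighbor`."""
    return advertised_to[neighbor].network_address + 1
```

Because the TE router treats every address in the block identically, the head-end's choice of address selects only the path, not the destination.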
Because a router may participate in multiple levels of aggregation
(i.e. AS-level and geographical-level), a given router may advertise
multiple TE-blocks in its maps. There should not, however, be more
than two or at most three TE-blocks in a given map.
3.4. Rules for advertising maps
3.4.1. Rules for initiating a map
An ISP will initiate a map on behalf of its stub-AS customers. This
is illustrated in the following example. It shows a network of stub
ASes, A, B, C, and D, and ISPs (all other ASes). The prefixes
associated with the stub ASes are as shown. B, C, and D are single-
homed customers of W, and A is a multihomed customer of W and Z. W is
a customer of X and Y.
                      J
                     / \
                    /   \
            I------X     Y
           /        \   /
          /          \ /
         Z            W   TEw=40.1.1.0/28
          \           |
           \   ----------------------
            \ /       |      |      |
  Pa=20.1/16 A        B      C      D Pd=30.1/16
                 Pb=20.2/16 Pc=20.3/16
Given this configuration, AS W would initiate the following updates
to non-legacy ASes:
Route: AS-path=(W), NLRI=(40.1.1.0/28)
Map: TE=(40.1.1.0/28), AT=(<20.0/14>, <30.1/16>, <20.1/16,TED=20>)
which for the sake of readability we can rewrite as:
Route: AS-path=(W), NLRI=(TEw)
Map: TE=(TEw), AT=(<Pw-agg>, <Pd>, <Pa,TEDaw>)
where Pw-agg is an aggregate consisting of Pa, Pb, and Pc.
The first update (the route) is a normal BGP advertisement with AS-
path = W and NLRI=TEw=40.1.1.0/28 (other attributes left off for
simplicity). We call this update a TE-route, since it is a route to
a TE. The second update is a map. Its TE is TEw, which matches the
NLRI of the TE-route. Also included are three address targets. In each,
the "action" is assumed to be "add", and is not otherwise shown. The
first, Pw-agg=20.0/14, is the aggregate of Pa=20.1/16, Pb=20.2/16,
and Pc=20.3/16. The second is Pd=30.1/16, which is not aggregatable
and so given separately. Note that D is not multihomed, and so has
no need for a TED. The third is for A. Even though A's prefix falls
within the 20.0/14 aggregate, it is also individually listed in order
to convey its TED, value TEDaw=20. AS Z would also advertise a
separate map with another TED value, thus giving A some control over
the volume of incoming traffic on its two access links (see Section 3.5).
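The construction of W's map can be sketched as follows. This is only an illustration under our own assumptions: the helpers expand and build_targets are hypothetical, the shorthand 20.1/16 is taken to mean 20.1.0.0/16, and the ordering of targets simply mirrors the example above.

```python
from ipaddress import IPv4Network


def expand(short: str) -> IPv4Network:
    """Expand the draft's shorthand, e.g. '20.1/16' -> 20.1.0.0/16."""
    addr, length = short.split("/")
    octets = addr.split(".")
    octets += ["0"] * (4 - len(octets))
    return IPv4Network(".".join(octets) + "/" + length)


def build_targets(aggregate, customers):
    """customers: {name: (prefix, ted or None)}; returns (prefix, ted) targets."""
    targets = [(aggregate, None)]
    # Prefixes outside the aggregate must be listed separately.
    for prefix, ted in customers.values():
        if not prefix.subnet_of(aggregate):
            targets.append((prefix, ted))
    # Prefixes inside the aggregate are repeated only to carry a TED.
    for prefix, ted in customers.values():
        if prefix.subnet_of(aggregate) and ted is not None:
            targets.append((prefix, ted))
    return targets


customers_of_w = {
    "A": (expand("20.1/16"), 20),     # multihomed, TEDaw=20
    "B": (expand("20.2/16"), None),
    "C": (expand("20.3/16"), None),
    "D": (expand("30.1/16"), None),
}
te_route = {"as_path": ("W",), "nlri": [expand("40.1.1.0/28")]}
w_map = {"te": expand("40.1.1.0/28"),
         "targets": build_targets(expand("20.0/14"), customers_of_w)}
```

The resulting target list is the aggregate, then Pd, then Pa with its TED, which corresponds to AT=(<Pw-agg>, <Pd>, <Pa,TEDaw>).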
There are three ways that AS W could have learned the value TEDaw.
One is to have statically configured it. A second is for A to convey
it via BGP in an Extended Communities Attribute. This would
especially be useful if A is running legacy BGP. The third would be
for A to advertise a map to W, but keeping the TE field as NULL:
Map: TE=(NULL), AT=(<Pa,TEDaw>)
This would effectively signal to W that it wants to have its prefix
Pa advertised individually with the associated TED.
3.4.1.1. More flexible aggregation
Nominally it appears in the above example that we are doing the same
amount of aggregation as with legacy BGP today. This is because Pa
is advertised individually because of multihoming. Section 3.6
describes how Mapped-BGP provides additional
opportunities for aggregation.
3.4.2. Transposing Maps and Routes
A key aspect of Mapped-BGP is that the combination of route+map can
be transposed into a route. This is important for the simple
pragmatic reason that it allows an AS to speak BGP with legacy ASes.
It is also important because it allows certain existing BGP
mechanisms that operate on routes, like filtering incoming updates,
to be applied to route+map. As an example of this transposition,
the updates advertised by W in the above example can be transposed
into the following route:
Route: AS-path=(W), NLRI=(40.1.1.0/28, 20.0/14, 30.1/16, 20.1/16)
or equivalently:
Route: AS-path=(W), NLRI=(TEw, Pw-agg, Pd, Pa)
This is possible because the prefixes in the map (20.0/14, 30.1/16,
20.1/16) can be associated to the AS-path in the route (W) by virtue
of TEw matching the NLRI of the TE-route.
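A minimal sketch of the transposition, assuming a simplified representation in which a TE-route is an (AS-path, NLRI list) pair and a map is a (TE, prefix list) pair; the function name transpose is ours, and TEDs are simply dropped because a legacy route has nowhere to carry them.

```python
def transpose(te_route, maps):
    """Fold maps whose TE matches the TE-route NLRI into a plain route."""
    as_path, nlri = te_route
    prefixes = list(nlri)
    for te, targets in maps:
        if te in nlri:                       # map is tied to this TE-route
            prefixes += [p for p in targets if p not in prefixes]
    return as_path, prefixes


route = transpose(
    (("W",), ["40.1.1.0/28"]),
    [("40.1.1.0/28", ["20.0/14", "30.1/16", "20.1/16"])],
)
```

Here route carries the AS-path (W) with the TE and all three map prefixes in its NLRI, as in the transposed update shown above.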
To continue the example, the BGP updates advertised by AS X would be:
Route: AS-path=(W,X), NLRI=(TEw)
Map: TE=(TEw), AT=(<Pw-agg>, <Pd>, <Pa,TEDaw>)
AS X adds itself to the route, but does not change the map. Indeed
maps never change as they propagate through the Internet (though they
can be dropped during aggregation). These two updates can be
transposed into:
Route: AS-path=(W,X), NLRI=(TEw, Pw-agg, Pd, Pa)
In fact, if AS I were a legacy AS, then AS X would give this update
to AS I. This allows legacy ASes to coexist with updated ASes. This
document does not address the issue of having legacy routers and
Mapped-BGP routers coexist within the same AS.
3.4.3. Authenticating updates
One of the challenges of any tunneled routing system is that of
authenticating maps. Mapped-BGP exploits the fact that route+map is
transposable to route to achieve authentication equivalent to that of
BGP, and indeed to mostly reuse the authentication mechanisms and
configuration of BGP. Conceptually, authentication can be seen as
operating as follows: when a router receives an eBGP route+map, it
converts it to the equivalent route. It then applies its existing
filtering mechanisms to the route. If the route is acceptable, then
the route+map is also acceptable. If the route is not acceptable,
then the route+map is likewise not accepted.
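The conceptual check can be sketched as below. We assume, for illustration only, that the existing filter is a per-neighbor list of permitted prefixes and that a route passes when every NLRI entry falls within a permitted prefix; real filters are richer, and the function names are hypothetical.

```python
from ipaddress import ip_network


def transposed_nlri(te, prefixes):
    """NLRI of the route equivalent to a route+map (TE plus map prefixes)."""
    return [ip_network(te)] + [ip_network(p) for p in prefixes]


def accept_route_plus_map(te, prefixes, allowed):
    """Accept only if every NLRI of the transposed route passes the filter."""
    permitted = [ip_network(a) for a in allowed]
    return all(any(n.subnet_of(a) for a in permitted)
               for n in transposed_nlri(te, prefixes))


# Hypothetical filter that a neighbor of W might already have configured.
filter_for_w = ["40.1.1.0/28", "20.0.0.0/14", "30.1.0.0/16"]
ok = accept_route_plus_map(
    "40.1.1.0/28", ["20.0.0.0/14", "30.1.0.0/16", "20.1.0.0/16"], filter_for_w)
```

A map that adds a prefix outside the filter causes the whole route+map to be rejected, exactly as the equivalent legacy route would be.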
3.4.4. Longest-prefix map selection rules and aggregation
Mapped-BGP uses longest-prefix selection on maps in much the same way
that legacy BGP uses longest-prefix selection on routes. In the
following discussion, assume that Pl and Ps are two prefixes that
overlap. Pl has a larger (longer) mask, and Ps has a smaller (shorter)
mask (i.e. Pl falls within Ps).
If an AS receives maps for Pl and Ps with different TEs, then the Pl
map must be used to route packets to addresses within Pl. This is
similar to legacy BGP, where if an AS has different routes to Pl and
Ps, the route to Pl must be used. The reason in both legacy BGP and
Mapped-BGP is the same: it is not clear whether addresses in Pl are
reachable in the AS originating the route to Ps.
If the maps for Pl and Ps have the same TE, then either may be used
to route packets within Pl. However, in this case there is an
important difference with legacy BGP. In legacy BGP, if Ps is
selected (i.e. aggregation takes place), then Ps is advertised
upstream and upstream ASes never learn about Pl. With Mapped-BGP,
all maps may, and typically will still be advertised upstream, and
upstream ASes may in fact make a different choice.
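The selection rule can be sketched as follows, assuming a flat list of (prefix, TE) maps and a hypothetical helper usable_tes. The TEs usable for a destination are those mapped to the longest matching prefix; a map for a shorter prefix is interchangeable with one of these only when it names the same TE.

```python
from ipaddress import ip_address, ip_network


def usable_tes(dst, maps):
    """TEs that may be used for `dst` under longest-prefix map selection."""
    addr = ip_address(dst)
    matches = [(net, te) for net, te in maps if addr in net]
    if not matches:
        return set()
    longest = max(net.prefixlen for net, _ in matches)
    return {te for net, te in matches if net.prefixlen == longest}


maps = [
    (ip_network("20.0.0.0/14"), "TEw"),   # Pw-agg via W
    (ip_network("20.1.0.0/16"), "TEw"),   # Pa via W
    (ip_network("20.1.0.0/16"), "TEz"),   # Pa via Z
    (ip_network("30.1.0.0/16"), "TEw"),   # Pd via W
]
```

With these maps, an address in Pa may use either TEw or TEz, while an address in Pb matches only the aggregate and therefore uses TEw.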
Why this matters can be illustrated using the example network above.
With legacy BGP, AS X would receive the following two routes from AS
W: Pa=20.1/16 and Pw-agg=20.0/14. If AS X decides to aggregate these
two into the single route Pw-agg, then AS I will receive Pw-agg from
X, and Pa from Z. Now, AS I has no choice but to accept the route to
Pa via Z, because it does not know that Pa is reachable via W. On the
other hand, if AS X chooses to forward both routes to AS I, then AS I
receives from X Pa=20.1/16 and Pw-agg=20.0/14, and from Z a route to
Pa. Now I may choose between the route via Z and the route via W,
but once the choice is made, ASes upstream of I are forced into the
same choice.
By contrast, with Mapped-BGP, all of the maps (Pa and Pw-agg via W,
and Pa via Z) would be propagated, along with TE-routes to Z and W.
In this way, AS I can choose one route, and ASes upstream of I can
choose different routes. Furthermore, this choice can include
installing only the aggregate prefix Pw-agg into router FIBs if so
desired. With legacy BGP, this choice often doesn't exist. Indeed,
different routers in AS I could use different TEs (TEz or TEw), or
even multipath to both TEs (that is, use both TEs simultaneously).
Of course, the cost of doing this is that both of A's maps must be
propagated everywhere. We defend this with two arguments here.
First, the cost of propagating a map is expected to be
relatively small. If an AS chooses to load only the aggregate in its
FIB, then the cost of the unused maps is limited to receiving them,
deciding to suppress them from the FIB, storing them in the RIB, and
passing them on. Though we need to run benchmarks to measure this
cost, intuitively we believe that this is significantly less
expensive than processing a full-blown route and entering it in the
FIB.
Second, ASes still have the option of dropping the maps altogether if
they can't deal with them. Doing so results in the same sorts of
inflexibility we see today in BGP, but nevertheless the option
exists. For instance, if in the above example AS X decided to simply
drop the map for Pa altogether, then AS I would receive the aggregate
map Pw-agg from X, and Pa map from Z. AS I would have to choose the
route via Z here, because it would not be able to tell that A is
connected to W. The bottom line is that there is considerably more
flexibility with Mapped-BGP in making the overhead versus routing
granularity tradeoff.
More broadly, this example illustrates one of the core design
principles of Mapped-BGP: by both making the processing of
routing information cheaper, and providing considerable flexibility
as to what to do with that routing information, we have the option of
propagating much more detailed information in the global routing
system than we are able to do today. At the same time, individual
ISPs have the option of ignoring the details if they so choose, and
are less constrained by the decisions made by downstream ISPs.
This principle does in fact result in a shift of power among ASes.
Today, upstream ASes are held hostage to the decisions of downstream
ASes. In Mapped-BGP, however, downstream ASes lose some control of
packet forwarding (or, at least, that control becomes more expensive
to achieve). For instance, in the above example topology, let's
imagine that AS X decides that AS B is under attack and wants to drop
(or identify and scrub) those packets. Unfortunately all packets to
B are tunneled to AS W along with packets to A, C, and D. If X wants
to distinguish these, it must look deeper into all packets going to
W. We believe that this shift in power is probably overall good, but
more thought and experimentation is required to understand this.
3.4.5. Changing maps
There are three things that can change on a map: the set of TEs it is
associated with, its prefix, and its TED. There are two actions
associated with maps: add and remove. Changes in TE or prefix are
done using add and remove. For instance, in the above figure, if the
link between A and W goes down, W would advertise:
Map: TE=TEw, AT=(<Pa,action=remove>)
Because A is multihomed, this update causes ASes to use the TE
associated with Z exclusively. In other words, this effectively
disassociates Pa from TEw. The TED does not need to be included with
a remove update, nor does the route to the TE. If the map is
subsequently added again (because the link comes back up), then the
TED would of course have to be included, but the TE-route would still
not have to be repeated.
If the link between D and W goes down, then W would advertise:
Map: TE=TEw, AT=(<Pd,action=remove>)
Since this is the only TE associated with Pd, this update would
effectively remove Pd from routers everywhere. It is worth noting
that W can tell that D is single-homed because W does not receive any
other maps associated with D. Because of this, W might very
reasonably decide not to advertise D's unreachability, thus saving
some control processing overhead on the rest of the Internet.
If the link between W and B goes down, then W does not need to
advertise anything via eBGP, because B's prefix is aggregated and B
is not multihomed.
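The effect of add and remove updates on a receiving router's state can be sketched as below. The table layout ({prefix: {TE: TED}}) and the function apply_map_update are our own assumptions, and the value TEDaz=80 is invented for the example.

```python
def apply_map_update(table, te, prefix, action, ted=None):
    """table: {prefix: {te: ted}}; applies one add or remove in place."""
    if action == "add":
        # A repeated add with a new TED is how a TED change is expressed.
        table.setdefault(prefix, {})[te] = ted
    elif action == "remove":
        tes = table.get(prefix, {})
        tes.pop(te, None)
        if not tes:
            table.pop(prefix, None)    # no TE left: prefix is unreachable
    else:
        raise ValueError("unknown action: " + action)
    return table


table = {}
apply_map_update(table, "TEw", "Pa", "add", ted=20)
apply_map_update(table, "TEz", "Pa", "add", ted=80)   # TEDaz=80 is made up
apply_map_update(table, "TEw", "Pd", "add")
apply_map_update(table, "TEw", "Pa", "remove")        # link A-W goes down
apply_map_update(table, "TEw", "Pd", "remove")        # link D-W goes down
```

After these updates Pa remains reachable through TEz only, and Pd has disappeared from the table.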
TED values may be modified from time to time even though no other
aspects of the map (its TE or add/remove status) change. TED
changes are advertised by simply repeating the complete map with the
new TED value. It is worth noting that if a map advertises only a
TED change, other ASes do not need to process the change right away.
For instance, they could wait until they recompute traffic
engineering.
3.4.6. Propagating and activating maps
Maps are similar to link-state updates in that each effectively
describes a "link" somewhere in the Internet (i.e. that an AS with a
given prefix is attached to an AS with another given prefix). As
such, as with link-state updates, maps have the potential to be
interpreted out of order. For example, an ISP might advertise an
"add" map after a "remove" map, but the "remove" could well be
received after the "add" at some remote ISP, thus installing the
wrong state. OSPF solves this problem using sequence numbers and a
set of rules on how to interpret them. Mapped BGP can exploit the
trees generated by routes, combined with the fact that BGP speakers
send updates in order, to solve the same problem without the need for
sequence numbers.
Specifically, Mapped-BGP requires that maps be
distributed along the trees created by routes. This prevents old
maps from looping around on themselves and incorrectly voiding more
recent updates. While an older map heard from one AS neighbor may
temporarily be used in preference to a newer map heard from another
AS neighbor, the fact that maps must follow the tree (in order) means
that eventually the newer update will overtake the older one. (As of
this writing, we don't have a formal proof of this.) In particular,
maps are distributed according to the following rules:
1. Each router remembers, for every map prefix, the latest map
received from every eBGP peer, and the latest map sent to every
eBGP peer. Note that most of the time these will all be the
same, and so the data structures can be compressed to exploit
this.
2. The map used by an AS, and advertised to other ASes, is that
received from the next-hop AS on the associated TE-route. If the
next-hop AS changes a map, then this changed map is used and
advertised. If a different next-hop AS is selected, then the
maps advertised by that AS are used. If this causes any maps to
change, then the changes are used and advertised.
3. A map is never advertised to the next-hop AS on the TE-route.
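The three rules above can be sketched at AS granularity as follows. This is a simplified model under our own assumptions: MapSpeaker and its methods are hypothetical, a map is reduced to an opaque value keyed by (TE, prefix), and the case where the new next-hop AS lacks a map that the old one had is not handled.

```python
class MapSpeaker:
    """Toy model of one AS applying the map distribution rules."""

    def __init__(self, peers):
        self.peers = set(peers)
        self.rib_in = {p: {} for p in peers}   # rule 1: latest map per peer
        self.next_hop = {}                     # TE -> next-hop AS on TE-route
        self.active = {}                       # (te, prefix) -> map in use
        self.sent = []                         # log of (peer, key, value)

    def _use(self, key, value):
        if self.active.get(key) == value:
            return                             # nothing changed, stay quiet
        self.active[key] = value
        for peer in self.peers - {self.next_hop[key[0]]}:   # rule 3
            self.sent.append((peer, key, value))

    def receive_map(self, peer, te, prefix, value):
        self.rib_in[peer][(te, prefix)] = value             # rule 1
        if self.next_hop.get(te) == peer:                   # rule 2
            self._use((te, prefix), value)

    def select_next_hop(self, te, peer):
        self.next_hop[te] = peer
        for key, value in self.rib_in[peer].items():        # rule 2
            if key[0] == te:
                self._use(key, value)
```

A map heard from a peer that is not the next hop for its TE is remembered but neither used nor advertised, which is what keeps stale maps from circulating off the tree.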
3.4.6.1. Peering sessions and maps
As an optimization to speed up the establishment of a peering session
between eBGP speakers, we exploit the fact that maps are usually the
same for all peers, and "guess" the value of a map before a peer
advertises it. Specifically, when a peering session first comes up,
the peers exchange all routes before exchanging any maps. When a
peer learns a route (probably a TE-route) and selects it as a next-
hop, it immediately uses any maps associated with the TE. In other
words, it continues to use whatever TEs it was already using.
Subsequently, when the peer starts advertising maps, the BGP speaker
responds accordingly.
3.4.7. Changing TE-route
Routes, including TE-routes, are handled as with normal BGP. They
are handled independently of maps. In other words, if a BGP speaker
advertises a change of route to its peer, it does not need to re-
advertise the associated maps. Assume, for instance, that AS J uses
AS X as the next hop to TEw, and the link between X and W goes down.
X will withdraw the route to TEw, but does not need to withdraw the
maps with TE=TEw as the TE. These maps will have been previously
advertised by Y, and so the alternate path through Y can be used
right away. When the link between X and W is restored, then X only
needs to advertise the route to W again; the previously advertised maps
are still valid and can be used immediately.
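The independence of routes and maps can be pictured with two separate tables, as in the sketch below. The table contents reflect AS J's view in the example; the function names and the tie-break rule are assumptions made for illustration.

```python
# TE-routes and maps are kept in separate tables (names are illustrative).
te_routes = {"TEw": {"X": ("W", "X"), "Y": ("W", "Y")}}   # as learned by AS J
maps = {"Pa": "TEw", "Pd": "TEw", "Pw-agg": "TEw"}


def withdraw_te_route(te, peer):
    """A route withdrawal touches only the TE-route table."""
    te_routes.get(te, {}).pop(peer, None)


def next_hop_for(prefix):
    """Resolve prefix -> TE via the maps, then TE -> peer via the routes."""
    paths = te_routes.get(maps.get(prefix), {})
    # Hypothetical tie-break: shortest AS-path, then peer name.
    return min(paths, key=lambda p: (len(paths[p]), p)) if paths else None
```

Calling withdraw_te_route("TEw", "X") moves every prefix mapped to TEw onto the path through Y without a single map being re-advertised.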
3.5. Load Balancing in Mapped-BGP
A cornerstone of the performance benefits of Mapped-BGP is the fact
that there are no policies associated with maps per se (other than
the fact that they need to be source filtered to prevent hijacking,
just as routes are source filtered today). In other words, policies
are applied to routes, but not to maps. This raises the obvious
question of what policies are we giving up because of this, and how
do we get them back?
We can divide policies into two types: policies that act at the
granularity of ASes, and those that act at the granularity of
prefixes. AS-level policies are preserved in Mapped-BGP, because
these policies can be applied to TE-routes (of which there are,
roughly, one per AS). Indeed, AS-granularity policies become cheaper
to enact, because there are fewer routes to deal with. Examples of
AS-based policies are: prefer routes to customers over other routes;
prefer routes to peers over routes to providers; and do not export
routes received from peers to non-customers.
Most policies that act at the granularity of prefixes are for the
purpose of traffic engineering. There are a number of such examples.
For instance, one ISP may give per-prefix MEDs to its neighbor in
order to influence how packets enter each peering point. This may be
done either for the purpose of load balance, or in order to minimize
the distance that packets need to travel within the receiving ISP.
Likewise an ISP might set loc-prefs on a per prefix basis to
influence the outgoing load on each peering point. A multi-homed
site might deaggregate its prefix, and then use community attributes
offered by its provider to do per-prefix path prepending or route
filtering to influence the load on its incoming access links.
Mapped-BGP improves upon existing inter-domain traffic engineering
through two mechanisms: the Tunnel Endpoint Discriminator (TED), and
"path splitting". These mechanisms are simpler, more scalable, and
expected to be more effective than the current set of BGP
mechanisms. It is hoped that the use of this simpler approach would
simplify BGP configuration overall.
Before discussing these mechanisms, we should point out the obvious:
traffic engineering requirements necessarily put ASes
in conflict with each other. A simple example of this is illustrated
below. Say that A wants to send half of its traffic to B and half to
C, and D wants to receive 25% of its traffic from B and 75% from C.
It may not be possible to satisfy both A's and D's requirements.
     A
    / \
   /   \
  B     C
   \   /
    \ /
     D
Mapped-BGP's approach to this conflict is to provide a mechanism
whereby the receiver can convey to the sender what its traffic
engineering needs are, and the sender can honor or ignore the
receiver's wishes. This is similar in spirit to how MEDs work in
legacy BGP. Other legacy mechanisms, however, like path prepending,
attempt to "force" the sender into honoring the receiver's incoming
traffic engineering requirements by manipulating its next-hop
selection algorithm.
3.5.1. Incoming Load Balance at Sites
Mapped-BGP's TED mechanism has already been partially described. In
the previous example, AS A conveys TED values to ASes W and Z, which
in turn attach these TED values to the corresponding maps:
From W: Map: TE=(TEw), AT=(..., <Pa,TEDaw>)
From Z: Map: TE=(TEz), AT=(..., <Pa,TEDaz>)
Assuming that these TEDs aren't suppressed through some aggregation
somewhere, they are conveyed to all ASes in the Internet. The TEDs
are parameterless. They are interpreted at each AS as an indication
that more or less traffic should be directed to the associated TE.
This interpretation, however, is entirely up to the head-end AS. AS
A uses TED values to control the volume of incoming traffic from Z
and W as follows. AS A sets some initial TED values, say TEDaw=50
and TEDaz=50, and over some period of time (days or a couple of weeks)
measures the incoming volume of traffic. If the volume is not as
desired, for instance too much traffic from W and not enough from Z,
then AS A can modify the TED values, say to TEDaw=40 and TEDaz=60.
Over time, AS A can determine what the appropriate values are, as
well as gain a sense of how future changes in TED values are likely
to affect traffic load.
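
The adjustment loop that AS A runs can be sketched as follows. This is
only an illustration: the function name adjust_teds, the gain of 100
TED points per unit of traffic share, and the convention that a larger
TED attracts more traffic are assumptions made here, since TEDs are
parameterless and their interpretation is left to each head-end AS.

```python
def adjust_teds(teds, measured, desired, gain=100):
    """Nudge each TED toward the desired share of incoming traffic.

    teds     -- current TED per upstream, e.g. {"W": 50, "Z": 50}
    measured -- measured fraction of incoming traffic per upstream
    desired  -- desired fraction of incoming traffic per upstream
    gain     -- hypothetical tuning knob: TED points per unit of error
    """
    new_teds = {}
    for upstream, value in teds.items():
        # Positive error means this upstream carries too little traffic.
        error = desired[upstream] - measured[upstream]
        # Assumption: a larger TED asks head-end ASes for more traffic.
        new_teds[upstream] = max(0, round(value + gain * error))
    return new_teds


# W carries 60% of the traffic but A wants an even split.
print(adjust_teds({"W": 50, "Z": 50},
                  {"W": 0.6, "Z": 0.4},
                  {"W": 0.5, "Z": 0.5}))
```

In practice AS A would repeat this step after each measurement period
(days or weeks), since head-end ASes are free to ignore the new values.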
How a head-end AS (one that transmits traffic to addresses in Pa)
interprets the TED values is up to it. The TED values may or may not
have any effect on the AS's traffic engineering decisions. The
basic idea here is that if an AS cannot satisfy its own traffic
engineering requirements while honoring the TED values, then it will
ignore the TED values. If on the other hand the head-end AS can both
satisfy its traffic engineering requirements and honor the TED
values, it will do so. The hope is that enough head-end ASes will be
able to honor the TED values to allow receiving ASes to control their
incoming traffic.
While exactly how an AS determines how to process TEDs is for further
study, we can imagine a sequence of steps whereby the AS first
determines which map+routes must ignore TEDs. For instance, the AS
might be doing hot-potato, and there are simply some destinations
where one TE is more hot-potato than another and therefore preferred.
Assuming that there are remaining destinations for which the choice
of TEs is roughly equivalent, the AS can then look at the TEDs and
select on that basis. For instance, if as in the previous example
the TEDs are TEDaw=40 and TEDaz=60, the AS might choose TEz with
probability 0.6 and TEw with probability 0.4 (or each router could
make its own selection on this basis).
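
One possible reading of this two-step selection is sketched below.
It is not a specified algorithm: the function choose_te and its
"preferred" argument (standing in for a local policy such as hot-
potato that overrides the TEDs) are hypothetical, and treating TEDs as
selection weights is just the interpretation used in the example above.

```python
import random


def choose_te(teds, preferred=None, rng=random):
    """Pick a TE for a destination prefix.

    teds      -- TED value per candidate TE, e.g. {"TEw": 40, "TEz": 60}
    preferred -- TE imposed by local policy (e.g. hot-potato), or None
    rng       -- random source; the random module or a random.Random
    """
    # Step 1: local traffic engineering needs take priority over TEDs.
    if preferred is not None:
        return preferred
    # Step 2: otherwise choose in proportion to the advertised TEDs.
    candidates = list(teds)
    weights = [teds[te] for te in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
picks = [choose_te({"TEw": 40, "TEz": 60}, rng=rng) for _ in range(10000)]
print(picks.count("TEz") / len(picks))  # close to 0.6
```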
3.5.2. Incoming Load Balance at Lower-tier ISPs
The example above shows how a multihomed stub AS site can use TEDs to
do incoming traffic engineering across multiple ISPs. This is not
the only case, however, where incoming traffic engineering across
different ASes is required. For instance, using the same figure,
assume that AS W is a lower-tier ISP that is a customer of provider
ISPs X and Y, and that wishes to balance traffic arriving from X and
Y. This is made possible in Mapped-BGP using path splitting.
Recall that TEs actually consist of a block of addresses. Path
splitting operates by splitting a TE-block into multiple sub-blocks
(as many as there are links to balance over), and advertising each
sub-block to a different neighbor AS. This creates multiple paths
that can be used to reach the same TE. By associating a destination
prefix with one path or another, a head-end AS can influence which
path is used, and therefore the volume of traffic on a given link.
For example, assume that AS W wants to control the incoming volume of
traffic from ASes X and Y. Recall that in the earlier example
(without path splitting), AS W advertised the following:
Route: AS-path=(W), NLRI=(TEw=40.1.1.0/28)
Map: TE=(TEw), AT=(<Pw-agg>, <Pd>, <Pa,TEDaw>)
To do path splitting, what AS W instead advertises is the following:
Route To X: AS=(W), NLRI=(TEwx=40.1.1.0/29)
Route To Y: AS=(W), NLRI=(TEwy=40.1.1.8/29)
Map: TE=(TEw), AT=(<Pw-agg>, <Pd>, <Pa,TEDaw>)
Split: DS-AS=W, US-AS=(<X,TEDwx>, <Y,TEDwy>)
where: DS-AS = Downstream AS
US-AS = Upstream AS
The first thing to note about this is that the original map goes
unchanged. The second thing to note is that AS W has split its TE-
block in half, and is advertising a separate route to each subblock
(shown as TEwx and TEwy). To be clear, the map still associates the
reachable prefixes with the full original TE-block, but here that
block is reachable via two paths. What's more, AS W advertises one
such route to X only, and the other to Y only. This effectively
gives upstream ASes some path control. If they select a TE address
within the TEwx subblock, then packets will get to W via X. Likewise
if they select a TE address within the TEwy subblock, then packets
will get to W via Y.
Finally, W has generated a split attribute in order to convey the
relative volume that should enter via the two neighbors. The
Downstream-AS (DS-AS) is W itself. There are two records for the
Upstream ASes (US-AS), one for X and one for Y, each associated with
a separate TED (TEDwx and TEDwy respectively).
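
The splitting of W's TE-block can be illustrated with Python's
standard ipaddress module. This is a sketch under stated assumptions:
the helper names split_te_block and upstream_for are invented for
illustration, and the draft does not say how sub-blocks are assigned
when the number of upstream ASes is not a power of two (here the
unused sub-blocks are simply left unadvertised).

```python
import ipaddress


def split_te_block(te_block, upstreams):
    """Assign one equal-sized sub-block of the TE-block per upstream AS."""
    if len(upstreams) < 2:
        raise ValueError("path splitting needs at least two upstream ASes")
    block = ipaddress.ip_network(te_block)
    # Extra prefix bits needed to obtain at least len(upstreams) sub-blocks.
    extra_bits = (len(upstreams) - 1).bit_length()
    subblocks = block.subnets(prefixlen_diff=extra_bits)
    return {asn: str(sub) for asn, sub in zip(upstreams, subblocks)}


def upstream_for(te_address, split):
    """Return the upstream AS whose sub-block contains the TE address."""
    addr = ipaddress.ip_address(te_address)
    for asn, sub in split.items():
        if addr in ipaddress.ip_network(sub):
            return asn
    return None


split = split_te_block("40.1.1.0/28", ["X", "Y"])
print(split)                            # X gets 40.1.1.0/29, Y gets 40.1.1.8/29
print(upstream_for("40.1.1.9", split))  # an address in TEwy enters via Y
```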
Given this, consider the behavior of AS J. J will receive a route to
TEwx from X, and to TEwy from Y. J will also propagate these routes
to its neighbors. Now J, as a head-end AS, can choose between these
two routes, on a router-by-router or even packet-by-packet basis if
it wishes, to send packets to destinations in A, B, C, and D.
Mechanistically it makes this choice by selecting a TE address from
either the TEwx or TEwy subblock. Assuming that J is willing to
honor W's TEDs in making the choice, it would send more or less
traffic along each route according to the value of the TEDs. Indeed
in this particular example, J can choose first how to send traffic to
A (i.e. via TEz or TEw). Of the traffic that J chooses to send to W,
it can then choose how to split the traffic between X and Y.
Now consider the behavior of AS I. I will receive two TE-routes from
X, one to TEwx with AS-path X-W, and the other to TEwy with AS-path
X-J-Y-W. AS I, as a head-end AS, can strictly speaking choose
between these two paths based on which TE address it uses to tunnel
packets to A, B, C, and D. In this particular example, the choice
does not affect I's traffic (it all goes to X either way). The
choice does of course affect W's incoming load balance, as well as
the length of the paths and the amount of traffic load at J and Y. In
this case, I should almost certainly favor efficiency (shorter path)
over W's load-balancing needs. This is especially true considering
that W may in any event satisfy its load balancing requirements even
if I does send all packets on the shortest path, because other ASes
can reasonably choose the path via Y.
Note that this approach is frugal in its overhead. W wants to
balance two peering links, and so creates exactly two routes.
Contrast this with today's situation, where an AS may need to
deaggregate a prefix multiple times in order to get the granularity
needed to effectively load balance. What's more, the multiple routes
can be aggregated back together by any AS. This might well be done,
for instance, by an AS that is relatively far from W, and that sends
very little traffic to W. In doing this, the aggregating AS would
also drop the split attribute.
3.5.3. Multi-exit discrimination with Mapped-BGP
The above paragraphs describe how an AS can influence the volume of
traffic entering from different ISPs. It is also important to be
able to influence how traffic enters at multiple peering points
between the same neighbor ISP. Today MEDs are used for this, where
the MED value is set on a per-prefix basis. In Mapped-BGP, an AS L
will send two kinds of packets to its neighbor AS K: packets that are
detunneled at K, and packets that are not detunneled at K. AS K can
of course set MEDs on the TE-routes for packets that it does not
detunnel. Normally, however, an AS K only advertises one TE-route
per neighbor AS for its own TE. As a result, there is no basis for
discriminating packets addressed to K's TE.
One way to approach this problem might be to use TE-route splitting
here as well. However, this approach leads to either a potentially
large number of split TE-routes, or a large number of additional
maps. (Explanation of why left out for now.) In general it seems
inappropriate to burden the rest of the Internet for a routing matter
that is strictly between two neighboring ASes. As such, the solution
is limited between the two neighbors. Specifically, when AS L wishes
to let its neighbor AS K dictate which exit it should use on a per-
prefix basis, AS L must detunnel packets otherwise destined for K. In
other words, the routers in AS L are configured to detunnel packets
with K's TE addresses. Once detunneled, routers in L route packets
to K based on the inner header destination address.
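
A rough sketch of what a router in AS L might do is given below. All
names and addresses are hypothetical (the TE-block 40.2.2.0/28, the
30.1/16 prefixes, and the exit names are invented), and the sketch
assumes L already holds per-prefix MEDs for K's prefixes, with the
lowest MED preferred as in legacy BGP.

```python
import ipaddress


def select_exit(outer_dst, inner_dst, k_te_block, med_table):
    """Choose L's exit toward K for a tunneled packet.

    Returns None if the packet is not addressed to K's TE-block (it
    stays tunneled) or if no prefix of K matches the inner destination.
    med_table maps prefix -> {exit name: MED}.
    """
    if ipaddress.ip_address(outer_dst) not in ipaddress.ip_network(k_te_block):
        return None  # not destined to K's TE: do not detunnel
    inner = ipaddress.ip_address(inner_dst)
    best_net, best_meds = None, None
    for prefix, meds in med_table.items():
        net = ipaddress.ip_network(prefix)
        # Longest-prefix match on the inner (detunneled) destination.
        if inner in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_meds = net, meds
    if best_meds is None:
        return None
    return min(best_meds, key=best_meds.get)  # lowest MED wins


meds = {
    "30.1.0.0/16": {"exit-east": 10, "exit-west": 20},
    "30.1.2.0/24": {"exit-east": 30, "exit-west": 5},
}
print(select_exit("40.2.2.1", "30.1.2.7", "40.2.2.0/28", meds))
```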
There are two ways in which L could learn about the MEDs associated
with K's prefixes. One is for K to simply advertise these prefixes
to L as normal BGP routes with MEDs attached. Another would be for K
to attach the MEDs to the maps it sends to L. L would strip these
MEDs before forwarding the maps onwards. At this time we don't have
a preference for one approach over the other.
Finally, it should be pointed out that Mapped-BGP in general creates
more choices for path selection, and therefore more choices for
traffic engineering (both outgoing and incoming) compared to legacy
BGP. With legacy BGP, an AS makes one next-hop-AS choice per
destination prefix. With Mapped-BGP, an AS can make multiple next-
hop-AS choices per destination prefix.
3.6. Aggregation in Mapped-BGP
Mapped-BGP has all of the aggregation features of legacy BGP
(physical topological aggregation), as well as new opportunities for
aggregation beyond what BGP offers in the form of inter-domain
virtual aggregation. Virtual aggregation can be used in a number of
useful ways. It can be used in conjunction with geographical address
assignment to provide a realistic way to implement geographical
addressing. It can be used opportunistically to allow small groups
of ISPs to aggregate some portion of the address space that is
already mostly assigned to them. But it can also be used generally
as a way of shrinking FIBs in the "core" of the network (i.e. the
cores of tier-1 ISPs) and dramatically shrinking RIBs and FIBs
everywhere else.
Section 3.4.1 describes how an ISP can aggregate the prefixes of its
customers. As with legacy BGP, this is done by an ISP that "owns"
the address space that it is aggregating. It is in this sense that
Mapped-BGP aggregation is similar to BGP.
Mapped-BGP also has a mechanism that allows for inter-domain virtual
aggregation similar in spirit to that described in the intra-domain
virtual aggregation draft [I-D.francis-idr-intra-va]. This
mechanism, especially when used in conjunction with appropriate
address assignment policies, gives Mapped-BGP more opportunities for
aggregation than legacy BGP. The mechanism is this: any router can
become an Aggregation Point Router (APR) for a Virtual Prefix (VP).
A VP is a prefix that is not topologically aggregatable, and must be
bigger (have a smaller mask) than any topological prefix. An APR for
a given VP advertises that VP as a route in BGP. This route must be
tagged with a transitive attribute that indicates that it is a route
for a VP. This allows other routers to know that all subprefixes are
reachable via the VP route. An APR must FIB-install every subprefix
within the VP. These subprefixes may be reachable natively as
routes, or through map tunnels, but they must be reachable.
We can illustrate this using the example topology above (though note
that this example doesn't represent the best usage of this feature).
Imagine that AS Z has a single-homed customer AS E with prefix
Pe=20.0/16. Note in particular that Pe comes out of the aggregate
prefix that W advertises (Pw-agg=20.0/14). With both legacy- and
Mapped-BGP, W could still advertise the aggregate Pw-agg: Pe would
"punch a hole" in the aggregate and routes to Pe would go to Z rather
than W. Note, however, that with Mapped-BGP, W receives Z's map for
Pe:
Map: TE=(TEz), AT=(..., <Pe=20.0/16>)
As a result, W is able to tunnel packets destined to Pe to Z. If AS W
is willing to do that on behalf of other ASes (i.e. act as a transit
for packets to Pe even when W is not on the normal BGP path to Pe),
then it can advertise Pw-agg as a VP. This would allow a remote AS
to suppress loading the finer-grained prefix Pe into its FIB. The
remote AS could forward all packets to Pw-agg towards W. These
packets will either reach W, in which case W would in turn tunnel the
packets to Z, or they would reach a router on the path to W that has
installed Pe, in which case this router will tunnel the packets to Z.
Obviously in this case this results in added latency for packets that
reach W, as well as extra load for W. As such, this mechanism should
not be used willy-nilly, and in fact would probably not be used in
this particular example. Situations where use of Virtual
Aggregation is appropriate are described later on.
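
The effect of FIB-suppressing Pe at a remote AS can be illustrated
with a simple longest-prefix-match lookup. This is a toy model, not a
router implementation: the FIB is a plain dictionary, the action
strings are invented, and the prefixes are written out in full
(Pw-agg=20.0.0.0/14, Pe=20.0.0.0/16).

```python
import ipaddress


def lookup(fib, dst):
    """Return the action of the longest FIB prefix matching dst, or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, action in fib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, action)
    return best[1] if best else None


# Remote AS that FIB-installs the map for Pe.
fib_installed = {"20.0.0.0/14": "forward toward W",
                 "20.0.0.0/16": "tunnel to TEz"}
# Remote AS that FIB-suppresses Pe and relies on the VP route Pw-agg.
fib_suppressed = {"20.0.0.0/14": "forward toward W"}

print(lookup(fib_installed, "20.0.1.1"))   # tunnel to TEz
print(lookup(fib_suppressed, "20.0.1.1"))  # forward toward W
```

In the suppressed case the packet is tunneled to Z only once it
reaches W or some router on the way that has installed Pe.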
In the above example, the route for Pe could be suppressed from the
FIB (or equivalently, the "routing table" as defined by BGP), but it
is still necessary to keep the maps in the map-RIB. Keeping maps in
the map-RIB is unlikely to become a scaling problem. The reason is
that it doesn't take very much processing to distribute a map that is
not loaded into the FIB. A router needs to determine that a map is
valid, and then determine that the map can be FIB-suppressed. But
once that is done the map only needs to be stored and transmitted to
neighbors. It seems reasonable to expect a router to be able to
store and process millions of FIB-suppressed maps.
As an aside, even though it should be possible to distribute a very
large number of FIB-suppressed maps, it is in fact possible to not
require many ASes to store the maps at all. This is because in
principle the only ASes that have to keep the maps are those that
need to distribute the maps to where they need to go. In the above
example, ASes I and X need to keep the map for Pe, because they
convey it from Z to W. But ASes J and Y can in principle ignore the
Pe map altogether. Unfortunately there is no simple way, other than
static configuration, to tell an AS whether or not it needs to
distribute a given set of maps. On the other hand, in many
cases this configuration will be relatively straight-forward, as
discussed later.
Unless protected against, the use of VPs creates a possibility
for transient loops. The problem is illustrated using the figure
below. This figure is a blow-up of AS Z from the example just
described. It shows two border routers in AS Z (z1 and z2). z1 is
connected to the border router e1 in customer AS E, and z2 is
connected to the border router i1 in AS I. Note that, as the router
with the customer interface, z1 is responsible for advertising maps
about Pe (either add or remove). As the router with an ISP
interface, z2 will be the TE for packets tunneled to TEz and destined
for Pe.
     +--------+   +-------+
     |        |   |       |
     |     z2-+---+-i1    |
     |        |   |       |
     |   z1   |   |       |
     |  /     |   |       |
     +-+------+   +-------+
      /  AS Z       AS I
 +---+-+
 |  /  |
 | e1  |
 |     |AS E (with prefix Pe=20.0/16)
 +-----+
Now imagine that the z1-e1 link goes down, and that z1 detects this.
We can divide subsequent behavior into three time periods:
1. Only z1 knows that the link has failed.
2. z1 and z2 know that the link has failed but no routers outside of
AS Z know this.
3. All routers, in particular routers in AS W, know that the link
has failed.
Consider the behavior of z1 during the first two periods. It has a
route to prefix Pw-agg. Therefore, if it receives a packet destined
to Pe, it would be expected to forward the packet towards AS W.
Indeed, routers in AS W might have another route to Pe that z1 is
unaware of. On the other hand, AS W may not have another route, and
so would simply forward any packets received for Pe back through the
tunnel to z2, thus forming a loop. In other words, z1 doesn't know
if a loop has formed or not. If a loop has formed, then z1 would not
want to forward packets to Pw-agg, and if a loop has not formed, then
z1 would want to forward packets to Pw-agg. The same holds for other
routers in AS Z.
The behavior we would like is for z1 and z2 to recognize whether any
given packet destined to Pe has looped or not. If it has, it should
be dropped. If it hasn't, then it should be forwarded to Pw-agg.
During period 2, z2 can tell that packets received via the TEz tunnel
may be looping and must therefore drop those packets. But during the
first period, z2 will forward any received packets towards z1.
Therefore, z1 needs to be able to tell whether a packet it receives
arrived via z2's tunnel or not. The way to do this is to have z2
tunnel packets destined for e1. This could be for instance an MPLS
LSP with e1 as its target, as described in the Intra-domain Virtual
Aggregation draft.
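
The intended forwarding behavior for packets destined to Pe can be
summarized as two small decision functions. This is a sketch only: the
function names and the returned action strings are invented for
illustration, and real routers would derive the "arrived via tunnel"
condition from the encapsulation (for instance the MPLS LSP mentioned
above).

```python
def z2_forward(knows_link_failed, arrived_via_tez_tunnel):
    """Decision at z2, the TE for packets tunneled to TEz."""
    if not knows_link_failed:
        # Period 1: z2 still believes e1 is reachable through z1.
        return "tunnel to e1 via z1"
    if arrived_via_tez_tunnel:
        # Period 2: the packet may be looping back from AS W.
        return "drop"
    return "forward toward Pw-agg"


def z1_forward(link_to_e1_up, arrived_via_z2_tunnel):
    """Decision at z1, the router with the customer interface."""
    if link_to_e1_up:
        return "deliver to e1"
    if arrived_via_z2_tunnel:
        # The packet was already tunneled to TEz once; it may have looped.
        return "drop"
    return "forward toward Pw-agg"


# Period 1: z2 does not yet know, so z1 must drop what z2 tunnels to it.
print(z2_forward(False, True), "->", z1_forward(False, True))
```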
Note also that it is possible to distribute VPs as maps rather than
as routes. We currently see no advantage to this, but leave it for
further study nonetheless.
In the following paragraphs, we outline various ways in which VPs may
be used in Mapped-BGP.
3.6.1. Geographic or Metro Addressing
There have been many proposals in the past to deploy geographic
addressing. The basic idea is simple: if a site accesses the
Internet within a particular geographic area, then it is assigned
addresses from a prefix dedicated to that area. This makes both
multihoming and changing providers easier, because the site is likely
to multihome to providers in the same area, or to switch to another
provider in the same area. This allows ISPs serving that area to
aggregate the area prefix. The criticism of geographic addressing
has always stemmed from the fact that existing routing algorithms
require physical connectivity within the aggregate topology. There
is no regulatory structure in place today to ensure that this
physical connectivity is created and maintained.
With Mapped-BGP, the need for intra-area physical connectivity is not
as critical. Of course, it is still important, because to the extent
that such physical connectivity does not exist, paths will be longer.
Mapped-BGP, however, allows for a great deal more flexibility as to
how much physical connectivity needs to exist, and provides a
scalable re-routing mechanism for when intra-area links do fail.
To operate geographic addressing with Mapped-BGP, of course first an
area needs to be identified, and an address space reserved for it.
Call this address space the area-prefix. It is too late to do this
for IPv4, but it could certainly be done for IPv6 (indeed, such
addresses have already been defined). Of the ISPs that have a
presence in the area, some will be willing to provide general transit
and others will only provide service for their customers. These are
referred to here as transits and non-transits, and the routers in
these ASes are called transit routers and non-transit routers
respectively.
Transit routers within the area (i.e. those that provide access for
sites within the area) are configured to advertise a VP-route for the
area-prefix. These "area routers" must also FIB-install maps for all
sub-prefixes within the area-prefix. Note that a given ISP may span
multiple areas. Only the routers within a given area need advertise
the VP-route and FIB-install the sub-prefixes. Other routers in the
ISP but not in the area may FIB-suppress those sub-prefixes. The
area routers would separately advertise maps for the individual sub-
prefixes, using the TE assigned to the AS (i.e. as normal).
Customers of non-transit routers within the area would still be
assigned area-prefix addresses, but non-transit routers would not
advertise the VP-route. Rather, they would only advertise maps and
TE-routes for their customers' individual prefixes as normal.
The following figure illustrates this.
,------.
/ `.
----'--------- `.
/ / \ \ Area (Y, Z, J, K
Transit ISPs X--;-----Y-------Z `. and L are in
/ \ ; / \ / \ : the area)
/ / / \ / \ |
/ ; \ / \ / \ |
Non-transit I----|--J--------K-------L |
ISPs \ : / ;
----\--------- ,'
`-. ,-'
`----------'
Here we see three transit ISPs (X, Y, and Z), and four non-transit
ISPs (I-L), some of which are in the area and some of which are out
of the area, as shown. The non-transit ISPs are customers of the
transit ISPs as shown, and peer with each other as shown. Assume
that Y and Z span many areas, have multiple peering points with each
other, and peer with other ISPs (X and others not shown).
A remote AS would receive the following maps and routes from area
ASes:
Map: TE=(TEy), AT=(<Py=20.0.0/24>)
Map: TE=(TEz), AT=(<Pz=20.0.1/24>)
Map: TE=(TEj), AT=(<Pj=20.0.2/24>)
Map: TE=(TEk), AT=(<Pk=20.0.3/24>)
Map: TE=(TEl), AT=(<Pl=20.0.4/24>)
(.... TE-routes for all of the above maps ....)
one or both of the following VP routes:
Route: AS-path=(Y...), NLRI=Pa=20.0/16
Route: AS-path=(Z...), NLRI=Pa=20.0/16
What's more, although the routes are not shown, assume that the
routes to TEj and TEk are split for load balance. (Of course there
are likely to be more prefixes advertised from each TE. One is
enough to illustrate the technique.)
Assume for now that there is no RIB suppression of maps: all maps are
distributed to all ASes globally. The first thing to note is that
any AS could choose to FIB-suppress any of the /24 maps and still be
able to deliver packets to the destinations. However, in each of
these cases, FIB-suppression would have a greater or lesser impact on
traffic to the destination.
In the case of Pj, remote ASes that choose to FIB-install Pj can use
the TEDs in the split route (as well as their own traffic engineering
considerations) to decide whether to route via X or Y. Packets from
remote ASes that FIB-suppress Pj will be routed to Y or Z, depending
which is the better route. Packets to Y will reach J through the Y-J
link. Packets to Z may be routed to J via X, but the fact that Y and
Z have POPs in the area, and X doesn't, suggests that a larger
proportion of Z's packets may reach J through Y. What's more, packets
may reach Z via X, only to be tunneled to J back through X. Having
said that, a remote AS whose route to Z or Y is via X can tell that
FIB suppression is likely to result in a longer path, and so may be
less likely to FIB-suppress. Ultimately, FIB-suppressing Pj is
likely to produce significantly more traffic on the Y-J link compared
to the X-J link.
To some extent J might be able to counter this imbalance with TEDs.
Another option, however, could be for X and Y to offer explicit load
balancing services to J. In this case, J could supply a separate pair
of locally advertised TEDs that X and Y use to balance traffic to J.
For instance, if the Y-J load is too heavy, and the X-J load too
light, J could ask Y to divert some of its traffic via X, using the
split route to TEj that traverses X.
Note, however, that if the Y-J link goes down, all traffic will
successfully reach J through X, even if Pj is suppressed. This is
because Y and Z will find that the only route to J is via X, and
tunnel packets accordingly. Ultimately J might easily find that
multihoming to an ISP not in its area is worth doing.
Note also that this scenario creates the possibility of transient
loops, similar to those described in Section 3.6 between ASes Z and
W. For instance, if the X-J link goes down, but AS Y doesn't know it
yet and continues to tunnel packets to X (either for load balancing
or because the Y-J link has gone down), then AS X would just route
the packets back to the VP area aggregate. As described previously,
the solution is for routers in X to drop packets when they know they
have been received via the TEx tunnel, but to forward them to the VP
otherwise.
In the case of Pk, FIB-suppression of its map by remote ASes will
eliminate the ability for them to load balance traffic between Y and
Z. From their perspective, all traffic to K would be routed to Pa,
which reaches either Y or Z depending on which path is shorter. In
this case, as with X and Y above, Y and Z could offer explicit load
balancing services to K. As a result, K's multihoming could be hidden
from the vast majority of routers' FIBs while still providing K with
robust and load-balanced multihoming service.
In the case of Pl, FIB-suppression of its map by remote ASes may
result in some packets taking a longer path than they otherwise
might. For instance, X may choose to route some Pl-destined packets
to Y even though a path to Z would be the shorter path.
Note that in the above example, by not advertising a route to Pa, J,
K, and L avoid becoming transits for other destinations in the area.
Rather, these ASes can control the extent to which they do transit
traffic through control of the routes they propagate. For instance,
if J wants for some reason to transit traffic for its peer K, it
would propagate the TE-route for TEk (as well as the map for TEk) as
appropriate.
Now let's assume, in order to better scale RIBs, that we do not wish
to propagate all maps to all Internet ASes. (Though once again we
point out that due to the relative efficiency of map distribution,
such scaling is unlikely to be necessary for the foreseeable future,
if ever.) The fact that Y and Z can in principle load balance for their
customers makes this option tenable. In this example, Y and Z are
the only transit ASes participating in the geographic areas. For now
let's assume that Y and Z have multiple peering points in multiple
geographic locations (i.e. both are national or multi-national ISPs
with significant territorial overlap). In this case it is highly
unlikely that Y and Z will become partitioned from each other. Given
that, it might be deemed reasonable that Y and Z only need to
distribute area subprefix maps for area ASes to each other. Thus all
other ASes never get maps for subprefixes in the area.
In this example, J would still propagate its map (and TE-route) to X
and I. Although I, as a non-transit, would likely not further
propagate J's map, X most likely would. As a result, J's map would
be propagated Internet-wide, but not K's or L's. As long as most
multi-homing is "in area", most maps could be suppressed, resulting
in greatly reduced RIB size as well as FIB size.
Now let's assume that Y and Z only peer in one place (i.e. between
their POPs in the area). Assume further that they both peer with X
in multiple places. In other words, X serves as a robust backup
route between Y and Z should their single peering point fail. In
this case, X must be willing to propagate area maps between Y and Z
(along with the TE-routes for all area ASes).
More generally, anywhere there is a desire to not propagate maps, the
area ASes would need to evaluate the richness of paths and determine
which additional ASes need to propagate maps. These additional ASes
would need to agree to do so, and would need to be configured as to
where to propagate the maps. It might also be desirable to have a
flag associated with the area maps indicating that they don't need to
be globally propagated. This way, if an AS does accidentally leak the
maps, they don't get distributed everywhere.
3.6.2. Opportunistic AS aggregation clusters
The previous section on geographic addressing assumes that addresses
have been assigned with geographic aggregation in mind, and so
doesn't apply to IPv4. However, IPv4 addresses have been assigned in
regional blocks for some time now. For instance, IANA has assigned
11 prefixes to RIPE (5 /8's, 3 /7's, 2 /6's, and 1 /5), at least
according to a RIPE database document. Presumably most of these have
been assigned by ISPs to customers in Europe (though many of these
may be multi-homed outside of Europe). Given this, there may well be
opportunities for "clusters" of richly inter-connected ISPs to
advertise an aggregate, whereby most members of the aggregate are
within those ISPs. The extent to which these opportunities exist is
for further study.
If they do exist, however, they can be exploited by Mapped-BGP in a
fashion very similar to the geographic addressing described above.
Specifically, a VP route for the prefix would be advertised by
routers in the cluster, and these routers would FIB-install all sub-
prefixes for the VP.
An important difference between engineered geographic addresses and
opportunistic AS clusters is that, in the latter, there will be more
"stray" sites: sites that have an address within the cluster VP but
are not physically attached to any cluster ASes. Because of this, it
typically won't be easy to identify a set of ASes that would be
willing to suppress propagation of maps to ASes outside that set. So
we should assume that maps will be propagated, and that the only
scaling opportunity comes from FIB reduction.
When remote ASes do FIB suppression, they should prefer to suppress
prefixes within the cluster to those outside the cluster, on the
assumption that paths to prefixes outside the cluster will be longer.
To do this, remote ASes obviously need to identify which prefixes
are in and which are out. One way to do this would be to include all
ASes in the cluster in the VP-route. Paths to TE-routes that do not
contain any of the cluster ASes would be considered to be outside the
cluster.
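
The in-cluster test and the resulting suppression preference could
look roughly like this. The sketch assumes the remote AS has learned
the set of cluster ASes from the VP-route and knows, for each prefix,
the AS-path of the TE-route for its TE; the function names and the
sample AS labels are hypothetical.

```python
def is_in_cluster(te_route_as_path, cluster_ases):
    """A TE is inside the cluster if its TE-route path touches a cluster AS."""
    return any(asn in cluster_ases for asn in te_route_as_path)


def suppression_order(te_routes, cluster_ases):
    """Order prefixes so that in-cluster prefixes are FIB-suppressed first.

    te_routes maps each prefix to the AS-path of the TE-route for its TE.
    """
    return sorted(
        te_routes,
        key=lambda prefix: not is_in_cluster(te_routes[prefix], cluster_ases),
    )


cluster = {"Y", "Z"}
routes = {"P1": ["X", "Q"], "P2": ["X", "Y"], "P3": ["Z"]}
print(suppression_order(routes, cluster))  # in-cluster P2 and P3 come first
```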
3.6.3. Generalized Inter-domain Virtual Aggregation
Virtual aggregation could also be deployed in a general fashion,
whereby the global address space is carved up into VPs, and
individual routers are assigned as APRs for different VPs. This is
very much in the spirit of the Intra-domain VA draft, but with a
couple of key differences. First, the extra hop suffered by VA paths
would only occur in one ISP, the first one to tunnel the packet to
the destination TE. As such, the load and latency penalty for Inter-
domain VA is significantly less. Second, VA could be deployed in
such a way that the Tier-1 ISPs maintain full routes (i.e. have APRs
for all VPs), but lower-tier ISPs do not maintain any APRs. Rather,
lower-tier ISPs keep VP-routes and any additional routes or maps that
they wish to install. As a result, RIBs and FIBs in lower-tier ISPs
could be almost arbitrarily small while still having the ability to
load balance both incoming and outgoing traffic.
4. Performance Benefits
This section summarizes the performance benefits of Mapped-BGP. Note
that none of the following stated benefits have been quantified.
1. Mapped-BGP decreases the amount of processing needed to handle a
prefix. This is primarily because most policies currently needed
to compare and select paths and determine how to advertise routes
are not required for processing maps. On the other hand, Mapped-
BGP introduces a new policy decision, namely processing TEDs for
the fraction of prefixes to which they apply. The majority of
prefixes will be distributed as maps rather than routes.
2. Mapped-BGP requires less RIB storage space, primarily because
in steady state the map for a given prefix is the same no matter
which neighbor it is heard from. Storage can be compressed by
exploiting this.
3. Peering sessions initialize faster in Mapped-BGP. This is
because in general only the routes (of which we might expect a
few tens of thousands at most) need to be conveyed before packets
can start flowing (see Section 3.4.6.1).
4. Mapped-BGP will have fewer "big" events. This is because route
changes in Mapped-BGP affect only routes, not maps. What's more,
if the FIB is organized in a tiered fashion (prefix points to a
TE, which points to a next hop), then a change in TE next hop
only requires a single update to the FIB, not one update for each
impacted prefix. On the other hand, Mapped-BGP is likely to have
more "small" events, because each map will be propagated both on a
change in add/remove status and on a change in TED.
Indeed, with virtual aggregation, many or even most map updates
don't even impact the FIB.
5. Global convergence in Mapped-BGP will in general be faster. This
is primarily because changes to maps can be distributed before
any policy decisions are made on those changes. This in turn is
possible because maps don't change as they are propagated through
the Internet. This allows an AS to first quickly distribute a
received map and only afterwards process it. Indeed, map changes
that involve only modifications to the TED can be processed much
later in time (minutes).
6. Load balancing across ASes is both more accurate and more
efficient in Mapped-BGP. This is because TEDs allow for a fine-
grained description of how much load is desired. This is in
contrast to legacy BGP, where granularity is proportional to the
number of prefixes that can be selected over.
7. With virtual aggregation, Mapped-BGP provides significant
opportunities for new aggregation.
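The tiered FIB organization mentioned in item 4 can be sketched as
two tables: one from prefix to TE, and one from TE to next hop. The
class below is an illustrative assumption, not part of the design: it
uses exact-match dictionary lookups in place of longest-prefix match,
and the write counter exists only to make the update cost visible.

```python
from typing import Dict


class TieredFib:
    """Two-level FIB sketch: prefix -> TE, then TE -> next hop."""

    def __init__(self) -> None:
        self.prefix_to_te: Dict[str, str] = {}
        self.te_to_next_hop: Dict[str, str] = {}
        self.writes = 0  # number of FIB entries written so far

    def install_map(self, prefix: str, te: str) -> None:
        """Install or replace a map (one write in the first tier)."""
        self.prefix_to_te[prefix] = te
        self.writes += 1

    def set_te_next_hop(self, te: str, next_hop: str) -> None:
        """Install or replace the route to a TE (one write, second tier)."""
        self.te_to_next_hop[te] = next_hop
        self.writes += 1

    def next_hop(self, prefix: str) -> str:
        """Resolve a prefix through both tiers."""
        return self.te_to_next_hop[self.prefix_to_te[prefix]]


fib = TieredFib()
for prefix in ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"):
    fib.install_map(prefix, "TE-1")
fib.set_te_next_hop("TE-1", "neighbor-A")

writes_before = fib.writes
# A route change toward TE-1 touches one entry, however many prefixes
# are mapped to that TE.
fib.set_te_next_hop("TE-1", "neighbor-B")
writes_for_route_change = fib.writes - writes_before
```

In a flat FIB the same route change would rewrite one entry per
prefix mapped to the TE, three in this example.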
5. Normative References
[I-D.francis-idr-intra-va]
Francis, P., Xu, X., and H. Ballani, "FIB Suppression with
Virtual Aggregation and Default Routes",
draft-francis-idr-intra-va-01 (work in progress),
September 2008.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
Authors' Addresses
Paul Francis
Cornell University
4108 Upson Hall
Ithaca, NY 14853
US
Phone: +1 607 255 9223
Email: francis@cs.cornell.edu
Xiaohu Xu
Huawei Technologies
No.3 Xinxi Rd., Shang-Di Information Industry Base, Hai-Dian District
Beijing, Beijing 100085
P.R.China
Phone: +86 10 82836073
Email: xuxh@huawei.com
Hitesh Ballani
Cornell University
4130 Upson Hall
Ithaca, NY 14853
US
Phone: +1 607 279 6780
Email: hitesh@cs.cornell.edu
Full Copyright Statement
Copyright (C) The IETF Trust (2008).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.