diff --git a/overlay.rst b/overlay.rst index bbdf1eb..0083ba1 100644 --- a/overlay.rst +++ b/overlay.rst @@ -56,3 +56,4 @@ available to applications. .. include:: overlay/design.rst .. include:: overlay/cdn.rst .. include:: overlay/conference.rst +.. include:: overlay/p2p.rst diff --git a/overlay/cdn.rst b/overlay/cdn.rst index 87c59c1..4428a0e 100644 --- a/overlay/cdn.rst +++ b/overlay/cdn.rst @@ -40,8 +40,7 @@ There’s not a lot anyone except you or your local service provider can do about the first-mile problem, but it is possible to use content replication to address the remaining problems. Content distribution networks are the systems that manage the process of replicating -content and delivering it to clients, which has a few more moving -parts than first meet the eye. Akamai was one of the first operators +content and delivering it to clients. Akamai was one of the first operators of a CDN and today there are a number of large CDN operators with global footprints. @@ -51,12 +50,12 @@ of *backend servers*. Thus, rather than having millions of users wait forever to contact when a big news story breaks—such a situation is known as a *flash crowd*—it is possible to spread this load across many servers. Moreover, rather than having to traverse multiple ISPs -to reach ``www.cnn.com``, if these surrogate servers happen to be +to reach a popular site, if these surrogate servers happen to be spread across all the backbone ISPs, then it should be possible to reach one without having to cross a peering point. Clearly, maintaining thousands of surrogate servers all over the Internet is -too expensive for any one site that wants to provide better access to -its web pages. Commercial CDNs provide this service for many sites, +too expensive for most sites that want to provide better access to +their web pages. Commercial CDNs provide this service for many sites, thereby amortizing the cost across many customers. 
Although we call them surrogate servers, in fact, they can just as @@ -68,9 +67,12 @@ also the case that only static pages, as opposed to dynamic content, are distributed across the surrogates. Clients have to go to the backend server for any content that either changes frequently (e.g., sports scores and stock quotes) or is produced as the result of some -computation (e.g., a database query). +computation (e.g., a database query).\ [#]_ -.. TODO -- need to check - can't code run in modern CDN nodes? +.. [#] CDN operators sometimes offer a complementary service that + allows their customers to dynamically generate certain content + at the surrogates, but that raises its own set of technical issues. + We focus here on static content. .. _fig-cdn: .. figure:: overlay/figures/f09-30-9780123850591.png @@ -155,25 +157,26 @@ rather than transparent, proxy). .. sidebar:: Are CDNs Overlays? - Many of the early overlays built on top of the Internet use some + *Many of the early overlays built on top of the Internet use some sort of tunneling to create virtual point-to-point links, and created a virtual topology between the overlay nodes to offer some function not yet implemented in the Internet, such as multicast of IPv6 support. CDNs don't quite conform to this model, since they - don't generally build tunnels between the CDN nodes. We would - argue, however, that they have enough in common with other types of - overlay to qualify. They offer functionality not natively provided - by the Internet—caching—while using the Internet to interconnect - the nodes in the CDN. A redirector makes an application-level - routing decision, much like other types of overlay nodes. Rather - than forward a packet based on an address and its knowledge of the - network topology, it forwards HTTP requests based on a URL and its - knowledge of the location and load of a set of servers. 
The - complete collection of redirectors and surrogate servers that make - up a CDN are effectively an application-specific network that - leverages the underlying connectivity of the Internet bring - additional functionality to the Internet: efficient delivery of - content to clients. + don't generally build tunnels between the CDN nodes.* + + *We would argue, however, that they have enough in common with + other types of overlay to qualify. They offer functionality not + natively provided by the Internet—caching—while using the Internet + to interconnect the nodes in the CDN. A redirector makes an + application-level routing decision, much like other types of + overlay nodes. Rather than forward a packet based on an address + and its knowledge of the network topology, it forwards HTTP + requests based on a URL and its knowledge of the location and load + of a set of servers. The complete collection of redirectors and + surrogate servers that make up a CDN is effectively an + application-specific network that leverages the underlying + connectivity of the Internet to bring additional functionality to the + Internet: efficient delivery of content to clients.* |Overlay|.2.2 Policies @@ -203,13 +206,13 @@ output. So what makes for a good hashing scheme? The classic *modulo* hashing scheme—which hashes each URL modulo the number of servers—is not -suitable for this environment. This is because should the number of -servers change, the modulo calculation will result in a diminishing -fraction of the pages keeping their same server assignments. While we do -not expect frequent changes in the set of servers, the fact that the -addition of new servers into the set will cause massive reassignment is -undesirable. An alternative is to use the a *consistent hashing* -algorithm. +suitable for this environment. 
This is for a simple reason: should the +number of servers change, the modulo calculation will result in a +diminishing fraction of the pages keeping their same server +assignments. While we do not expect frequent changes in the set of +servers, the fact that the addition of new servers into the set will +cause massive reassignment is undesirable. An alternative is to use +a *consistent hashing* algorithm. .. _fig-unitcircle: @@ -301,13 +304,14 @@ measurement as the “server load” parameter in the preceding algorithm. This strategy tends to prefer nearby/lightly loaded servers over distant/heavily loaded servers. A second approach is to factor proximity into the decision at an earlier stage by limiting the -candidate set of servers considered by the above algorithms (*S*) to -only those that are nearby. The harder problem is deciding which of -the potentially many servers are suitably close. One approach would be -to select only those servers that are available on the same ISP as the -client. A slightly more sophisticated approach would be to look at the -map of autonomous systems produced by BGP and select only those -servers within some number of hops from the client as candidate -servers. Finding the right balance between network proximity and -server load has been the subject of considerable research and we -assume that the CDN operators continue to fine-tune their algorithms. +candidate set of servers considered by the above algorithms (set *S* +in the pseudocode) to only those that are nearby. The harder problem +is deciding which of the potentially many servers are suitably +close. One approach would be to select only those servers that are +available on the same ISP as the client. A slightly more sophisticated +approach would be to look at the map of autonomous systems produced by +BGP and select only those servers within some number of hops from the +client as candidate servers. 
Finding the right balance between +network proximity and server load has been the subject of considerable +research and we assume that the CDN operators continue to fine-tune +their algorithms. diff --git a/overlay/conference.rst b/overlay/conference.rst index 3ca44ea..914093f 100644 --- a/overlay/conference.rst +++ b/overlay/conference.rst @@ -15,17 +15,55 @@ terms of the overlay is that rather than being a generic IP multicast overlay, each video conferencing application has its own application-specific overlay. +.. sidebar:: IP Multicast: A Case Study + + *We have mentioned IP Multicast and the MBone multiple times in + this chapter. That they are primarily of historical interest makes + for an interesting case study of how the Internet has evolved. + IP Multicast is the core feature, and as explained in + Section |Overlay|.1, a block of the IPv4 address space was set aside for + multicast addresses. The idea was that you could assign one of + these addresses to a multicast group, users could request to join + that group (technically, they added their host to the group), and + then any IP packet sent to that multicast address would be + delivered to every host in the group.* + + *The data plane part of multicast IP is easily solved: switch + forwarding pipelines are able to send an incoming packet to multiple + outgoing queues. What proved hard is the control plane part of + the problem, that is, propagating "join requests" to those routers + that needed to know about any particular multicast address. This + is effectively a routing problem, and when you take both scale + and AS autonomy into account, the resulting protocol turned out to + be as complex as BGP, if not more so.* + + *The MBone was an overlay used to gain experience with multicast, + with the goal of eventually pushing the solution down into + commercial routers so it could be widely deployed. 
IP Multicast + was never widely deployed in the Internet (with one main + exception), but this wasn't a problem that could be overcome, even + by limiting the solution to an overlay. A general-purpose + multicast mechanism just does not return enough value to offset + the corresponding complexity (at least for video + conferencing). One exception where IP Multicast pays off is when + delivering live TV over the last hop, from the "video head" to + cable set-top boxes (or Smart TVs) in homes. Last-hop multicast + easily beats N-way unicast, and the control overhead is + manageable.* + +.. TODO -- Could say more, but maybe this is enough of a summary. + An overlay is an optimization in the sense that you can run a video conferencing application without one. Indeed, if there are only two participants in the conference, no optimization is needed. Each participant can just send video and audio streams to the other participant directly. There are still some issues to be solved, including what to do if one or both participants is behind a firewall -or NAT, as we discussed in Section |Virt|.3.3. The broader issues that -multimedia applications face are the topic of Chapter |Stream|. But -optimizing the delivery of media to a large -number of conference participants is a job usually performed by an -overlay. +or NAT, as we discussed in Section |Virt|.3.3. The real-time +challenges that multimedia applications face are the topic of Chapter +|Stream|, but optimizing the delivery of media to *multiple* +conference participants—i.e., the multicast part of the solution—is +usually performed by an overlay. Consider the simple case of a 3-party video conference. You can obviously treat this as 3 pair-wise connections: each participant @@ -38,19 +76,12 @@ multicast at the IP layer, some way to replicate the traffic upstream from the participants is needed, and this is the role normally performed by an application-specific overlay. -.. _fig-sfu: -.. 
figure:: overlay/figures/sfu.png - :width: 600px - :align: center - - Media Distribution Using Selective Forwarding Unit. - The typical solution to media replication is for each participant to send their media to a replication device, which is likely to be located in some cloud datacenter and operated by the company running the conferencing service. One class of replication devices are known as "selective forwarding units" (SFUs). In :numref:`Figure %s -<fig-sfu>` we've shown the very simple case where there is a single +<fig-sfu>` we've shown the simplest case where there is a single SFU. All participants send their media to the SFU, and it sends a set of media streams out to all the participants. The "selective" part means that the SFU does not necessarily send every stream to every @@ -60,6 +91,13 @@ participants, while other clients could receive all streams and render them appropriately on their screens. SFUs also forward media streams rather than modifying them. +.. _fig-sfu: +.. figure:: overlay/figures/sfu.png + :width: 600px + :align: center + + Media Distribution Using Selective Forwarding Unit. + There is an alternative approach, in which a *Multipoint Control Unit* (MCU) combines the incoming streams into a single outgoing stream, such as by tiling four videos into a 2x2 grid. This is more @@ -69,8 +107,8 @@ Of course, a single replication device is both a bottleneck from the perspective of scaling, and a potential single point of failure. So what we typically see in practice is a complete overlay network of SFUs, with some SFUs capable of picking up the load from others in the -event of a failure, and SFUs arranged in a mesh so that replication -can be distributed amount the nodes in the mesh rather than all the +event of a failure. The SFUs are arranged in a mesh so that replication +can be distributed among the nodes in the mesh rather than all the work falling on a single node. 
A simple example of this using three SFUs in a tree structure is shown in :numref:`Figure %s `. @@ -90,11 +128,3 @@ the overlay. This is how most video conferencing solutions on today's Internet work, using overlays to scale the performance of the application without any assistance beyond basic packet transport from the core of the Internet. - -.. sidebar:: IP Multicast: A Case Study - - There's an opportunity to say a bit more about IP multicast, - including what made it hard: the routing protocol, which provided - a means to join a multicast group. Then link 6E section for more - info. Also note where it is used today: last mile into the home - (for live streams). diff --git a/overlay/figures/f09-29-9780123850591.png b/overlay/figures/f09-29-9780123850591.png new file mode 100644 index 0000000..3205ed8 Binary files /dev/null and b/overlay/figures/f09-29-9780123850591.png differ diff --git a/overlay/p2p.rst b/overlay/p2p.rst new file mode 100644 index 0000000..b527b9b --- /dev/null +++ b/overlay/p2p.rst @@ -0,0 +1,237 @@ +|Overlay|.4 Peer-to-Peer Networks +---------------------------------------- + +Music-sharing applications like Napster and KaZaA introduced the term +“peer-to-peer” into the popular vernacular. But what exactly does it +mean for a system to be “peer-to-peer”? Certainly in the context of +sharing files it means not having to download music from a central +site, but instead being able to access music files directly from whoever +in the Internet happens to have a copy stored on their computer. More +generally then, we could say that a peer-to-peer network allows a +community of users to pool their resources (content, storage, network +bandwidth, disk bandwidth, CPU), thereby providing access to a larger +archival store, larger video/audio conferences, more complex searches +and computations, and so on than any one user could afford individually. 
+ +Quite often, attributes like *decentralized* and *self-organizing* are +mentioned when discussing peer-to-peer networks, meaning that individual +nodes organize themselves into a network without any centralized +coordination. If you think about it, terms like these could be used to +describe the Internet itself. Ironically, Napster was not a +true peer-to-peer system by this definition since it depended on a +central registry of known files, and users had to search this directory +to find what machine offered a particular file. It was only the last +step—actually downloading the file—that took place between machines that +belong to two users, but this is little more than a traditional +client/server transaction. The only difference is that the server is +owned by some other Internet user rather than a large corporation. + +So we are back to the original question: What’s interesting about +peer-to-peer networks? One answer is that both the process of locating +an object of interest and the process of downloading that object onto +your local machine happen without your having to contact a centralized +authority, and at the same time the system is able to scale to millions +of nodes. A peer-to-peer system that can accomplish these two tasks in a +decentralized manner turns out to be an overlay network, where the nodes +are those hosts that are willing to share objects of interest (e.g., +music and other assorted files), and the links (tunnels) connecting +these nodes represent the sequence of machines that you have to visit to +track down the object you want. This description will become clearer +after we look at a popular example: BitTorrent. + +BitTorrent is a file-sharing protocol devised by Bram Cohen. It is +based on replicating the file or, rather, replicating segments of the +file, which are called *pieces*. Any particular piece can usually be +downloaded from multiple peers, even if only one peer has the entire +file. 
The primary benefit of BitTorrent’s replication is avoiding the +bottleneck of having only one source for a file. This is particularly +useful when you consider that any given computer has a limited speed +at which it can serve files over its uplink to the Internet, often +quite a low limit due to the asymmetric nature of most broadband +networks. The beauty of BitTorrent is that replication is a natural +side effect of the downloading process: as soon as a peer downloads a +particular piece, it becomes another source for that piece. The more +peers downloading pieces of the file, the more piece replication +occurs, distributing the load proportionately, and the more total +bandwidth is available to share the file with others. Pieces are +downloaded in random order to avoid a situation where peers find +themselves lacking the same set of pieces. + +Each file is shared via its own independent BitTorrent network, called +a *swarm*. (A swarm could potentially share a set of files, but we +describe the single file case for simplicity.) The lifecycle of a +typical swarm is as follows. The swarm starts as a singleton peer with +a complete copy of the file. A node that wants to download the file +joins the swarm, becoming its second member, and begins downloading +pieces of the file from the original peer. In doing so, it becomes +another source for the pieces it has downloaded, even if it has not +yet downloaded the entire file. (In fact, it is common for peers to +leave the swarm once they have completed their downloads, although +they are encouraged to stay longer.) Other nodes join the swarm and +begin downloading pieces from multiple peers, not just the original +peer. See :numref:`Figure %s <fig-bitTorrentSwarm>`. + +.. _fig-bitTorrentSwarm: +.. figure:: overlay/figures/f09-29-9780123850591.png + :width: 500px + :align: center + + Peers in a BitTorrent swarm download from other peers that may not yet + have the complete file. 
+ +If the file remains in high demand, with a stream of new peers +replacing those who leave the swarm, the swarm could remain active +indefinitely; if not, it could shrink back to include only the +original peer until new peers join the swarm. + +Now that we have an overview of BitTorrent, we can ask how requests +are routed to the peers that have a given piece. To make requests, a +would-be downloader must first join the swarm. It starts by +downloading a file containing meta-information about the file and +swarm. The file, which may be easily replicated, is typically +downloaded from a web server and discovered by following links from +Web pages. It contains: + + * Target file’s size + * Piece size + * SHA-1 hash values precomputed from each piece + * URL of the swarm’s *tracker* + +A tracker is a server that tracks a swarm’s current membership. We’ll +see in a moment that BitTorrent can be extended to eliminate this +point of centralization, with its attendant potential for bottleneck +or failure. + +The would-be downloader then joins the swarm, becoming a peer, by +sending a message to the tracker giving its network address and a peer +ID that it has generated randomly for itself. The message also carries +a SHA-1 hash of the main part of the file, which is used as a +swarm ID. + +Let’s call the new peer P. The tracker replies to P with a partial +list of peers giving their IDs and network addresses, and P +establishes connections, over TCP, with some of these peers. Note that +P is directly connected to just a subset of the swarm, although it may +decide to contact additional peers or even request more peers from the +tracker. To establish a BitTorrent connection with a particular peer +after their TCP connection has been established, P sends P’s own peer +ID and swarm ID, and the peer replies with its peer ID and +swarm ID. If the swarm IDs don’t match, or the reply peer ID is not +what P expects, the connection is aborted. 
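The meta-information file and the connection handshake described above can be illustrated with a small sketch. It is not BitTorrent's wire format: the ``Metainfo`` class, ``make_metainfo``, ``piece_is_valid``, and ``handshake_ok`` are hypothetical names chosen for this example, and the tracker URL is a placeholder. The sketch only shows how per-piece SHA-1 hashes let a peer check a downloaded piece, and how a peer might decide to abort a connection when the swarm ID or peer ID in the reply is not what it expects.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Metainfo:
    """Hypothetical in-memory view of the meta-information file."""
    file_size: int       # target file's size in bytes
    piece_size: int      # size of every piece except possibly the last
    piece_hashes: list   # one 20-byte SHA-1 digest per piece
    tracker_url: str     # URL of the swarm's tracker


def make_metainfo(content, piece_size, tracker_url):
    """Split the file into pieces and precompute a SHA-1 hash for each."""
    pieces = [content[i:i + piece_size]
              for i in range(0, len(content), piece_size)]
    hashes = [hashlib.sha1(p).digest() for p in pieces]
    return Metainfo(len(content), piece_size, hashes, tracker_url)


def piece_is_valid(meta, index, data):
    """Check a downloaded piece against its precomputed SHA-1 hash."""
    return hashlib.sha1(data).digest() == meta.piece_hashes[index]


def handshake_ok(my_swarm_id, expected_peer_id, reply_swarm_id, reply_peer_id):
    """Keep the connection only if the swarm IDs match and the replying
    peer is the one the tracker told us about; otherwise abort."""
    return reply_swarm_id == my_swarm_id and reply_peer_id == expected_peer_id
```

Because the hashes are published ahead of time in the meta-information file, a peer in this sketch can accept pieces from any other peer and still detect a corrupted or forged piece before passing it on.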
+ +The resulting BitTorrent connection is symmetric: Each end can +download from the other. Each end begins by sending the other a bitmap +reporting which pieces it has, so each peer knows the other’s initial +state. Whenever a downloader (D) finishes downloading another piece, +it sends a message identifying that piece to each of its directly +connected peers, so those peers can update their internal +representation of D’s state. This, finally, is the answer to the +question of how a download request for a piece is routed to a peer +that has the piece, because it means that each peer knows which +directly connected peers have the piece. If D needs a piece that none +of its connections has, it could connect to more or different peers +(it can get more from the tracker) or occupy itself with other pieces +in hopes that some of its connections will obtain the piece from their +connections. + +How are objects—in this case, pieces—mapped onto peer nodes? Of course +each peer eventually obtains all the pieces, so the question is really +about which pieces a peer has at a given time before it has all the +pieces or, equivalently, about the order in which a peer downloads +pieces. The answer is that they download pieces in random order, to +keep them from having a strict subset or superset of the pieces of any +of their peers. + +The BitTorrent described so far utilizes a central tracker that +constitutes a single point of failure for the swarm and could +potentially be a performance bottleneck. Also, providing a tracker can +be a nuisance for someone who would like to make a file available via +BitTorrent. Newer versions of BitTorrent additionally support +“trackerless” swarms that use consistent hashing, as described in +Section |Overlay|.2. BitTorrent client software that is +trackerless-capable implements not just a BitTorrent peer but also +what we’ll call a *peer finder* (the BitTorrent terminology is simply +*node*), which the peer uses to find peers. 
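The bitmap exchange and random piece ordering described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the behavior of any particular client: ``PeerState`` and ``choose_request`` are hypothetical names, peers are identified by plain strings, and the policy shown is purely random selection among the pieces that some directly connected peer is known to have.

```python
import random


class PeerState:
    """What we know about one directly connected peer (hypothetical)."""

    def __init__(self, num_pieces):
        self.has = [False] * num_pieces

    def on_bitmap(self, bitmap):
        # Initial bitmap sent when the BitTorrent connection is set up.
        self.has = list(bitmap)

    def on_piece_announced(self, index):
        # The peer told us it finished downloading another piece.
        self.has[index] = True


def choose_request(mine, peers, rng=random):
    """Pick a random missing piece that some connected peer has.

    Returns (piece_index, peer_name), or None if no connection currently
    offers a piece we still need.
    """
    available = [i for i in range(len(mine))
                 if not mine[i] and any(p.has[i] for p in peers.values())]
    if not available:
        return None
    piece = rng.choice(available)
    holders = sorted(name for name, p in peers.items() if p.has[piece])
    return piece, rng.choice(holders)
```

A return value of ``None`` corresponds to the case in the text where none of the downloader's connections has a needed piece, so it must either contact more peers or wait for its current peers to obtain that piece.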
+ +Peer finders form their own overlay network, using their own protocol +over UDP to implement the consistent hash. Furthermore, a peer finder +network includes peer finders whose associated peers belong to +different swarms. In other words, while each swarm forms a distinct +network of BitTorrent peers, a peer finder network instead spans +swarms. + +Peer finders randomly generate their own finder IDs, which are the +same size (160 bits) as swarm IDs. Each finder maintains a modest +table containing primarily finders (and their associated peers) whose +IDs are close to its own, plus some finders whose IDs are more +distant. The following algorithm ensures that finders whose IDs are +close to a given swarm ID are likely to know of peers from that swarm; +the algorithm simultaneously provides a way to look them up. When a +finder F needs to find peers from a particular swarm, it sends a +request to the finders in its table whose IDs are close to that +swarm’s ID. If a contacted finder knows of any peers for that swarm, +it replies with their contact information. Otherwise, it replies with +the contact information of the finders in its table that are close to +the swarm, so that F can iteratively query those finders. + +After the search is exhausted, because there are no finders closer to +the swarm, F inserts the contact information for itself and its +associated peer into the finders closest to the swarm. The net effect +is that peers for a particular swarm get entered in the tables of the +finders that are close to that swarm. + +The above scheme assumes that F is already part of the finder network, +that it already knows how to contact some other finders. This +assumption is true for finder installations that have run previously, +because they are supposed to save information about other finders, +even across executions. 
If a swarm uses a tracker, its peers are able +to tell their finders about other finders (in a reversal of the peer +and finder roles) because the BitTorrent peer protocol has been +extended to exchange finder contact information. But, how can a newly +installed finder discover other finders? The files for trackerless +swarms include contact information for one or a few finders, instead +of a tracker URL, for just that situation. + +An unusual aspect of BitTorrent is that it deals head-on with the +issue of fairness, or good “network citizenship.” Protocols often +depend on the good behavior of individual peers without being able to +enforce it. For example, an unscrupulous Ethernet peer could get +better performance by using a backoff algorithm that is more +aggressive than exponential backoff, or an unscrupulous TCP peer could +get better performance by not cooperating in congestion control. + +The good behavior that BitTorrent depends on is peers uploading pieces +to other peers. Since the typical BitTorrent user just wants to +download the file as quickly as possible, there is a temptation to +implement a peer that tries to download all the pieces while doing as +little uploading as possible—this is a bad peer. To discourage bad +behavior, the BitTorrent protocol includes mechanisms that allow peers +to reward or punish each other. If a peer is misbehaving by not nicely +uploading to another peer, the second peer can *choke* the bad peer: It +can decide to stop uploading to the bad peer, at least temporarily, +and send it a message saying so. There is also a message type for +telling a peer that it has been unchoked. The choking mechanism is +also used by a peer to limit the number of its active BitTorrent +connections, to maintain good TCP performance. There are many possible +choking algorithms, and devising a good one is an art. 
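As one illustration of the choking mechanism, the sketch below keeps a fixed number of upload slots and gives them to the peers that have recently uploaded the most to us. This is only one of many possible policies and is not claimed to be the algorithm any real client uses; ``choose_unchoked``, ``choke_messages``, the ``slots`` parameter, and the byte counters are all assumptions made for the example.

```python
def choose_unchoked(bytes_received, slots):
    """Return the set of peers to leave unchoked.

    bytes_received maps a peer name to the number of bytes that peer
    recently uploaded to us. The `slots` best uploaders stay unchoked;
    ties are broken by peer name so the result is deterministic.
    """
    ranked = sorted(bytes_received,
                    key=lambda peer: (-bytes_received[peer], peer))
    return set(ranked[:slots])


def choke_messages(previously_unchoked, now_unchoked):
    """Messages to send after re-evaluating: peer -> 'choke' or 'unchoke'."""
    msgs = {p: "choke" for p in previously_unchoked - now_unchoked}
    msgs.update({p: "unchoke" for p in now_unchoked - previously_unchoked})
    return msgs
```

Capping the number of unchoked peers also reflects the second use of choking mentioned above, which is limiting the number of active connections so that TCP performs well.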
+ +Because BitTorrent traffic can consume a lot of the uplink bandwidth +on a typical residential Internet connection, it provided inspiration +for work on congestion control algorithms that are less +aggressive than those of TCP, yielding to competing traffic when delay +increases. This led to an experimental RFC on "Low Extra Delay +Background Transport (LEDBAT)" and ongoing research on similar +algorithms. + +We conclude by noting that while centrally managed (commercial) music +streaming services are now commonplace, BitTorrent remains in wide use, +delivering everything from music to video, game updates, and software +releases. This is, at least in part, due to its decentralized design, +which is both self-sustaining (more demand naturally results in +more resources) and difficult to shut down. + +.. admonition:: Further Reading + + `BitTorrent Statistics in 2026 + `__. + EarthWeb, 2026.