2005-07-21
Although network traffic records are still fairly coarse, they contain valuable information, and useful patterns can be uncovered. Host UDP sessions are one good example of this.
Identifying P2P users
The author of this paper has found that P2P applications exhibit a distinctive UDP connection pattern. This pattern can be used to process network traffic and find out which hosts are running P2P applications in a decentralized network structure. All that is needed is the network traffic records.
What exactly does it mean to look at a UDP connection pattern, and how can it help us? Before answering these questions, let's review the first popular P2P application, Napster.
Centralized, decentralized and hybrid P2P networks
Napster, written by Shawn Fanning, was first launched in May 1999 and was the first generation of a P2P network. Napster's network structure was centralized, which means it was made up of two elements: central index servers and peers. The central index servers were set up by Napster and maintained the shared music file information of every online peer. When an active peer wanted to download a music file, it sent an inquiry to Napster's central index server, which looked up the request in its database and sent back a list of the peers that had the desired music file. The peer could then connect directly to the peers in the list to get the file.
The network structure of Napster had an Achilles heel: it was highly dependent on the static central server. If the central server went down, the network would collapse. This was shown by the actions of the recording industry, which forced the original Napster to be shut down.
The Napster case illustrated the vulnerability of a centralized network structure and greatly influenced subsequent P2P applications. For legal, security, scalability, anonymity and other reasons, more and more P2P applications now work in a totally or partially decentralized network structure, or are moving in that direction. Major P2P file-sharing networks and protocols, such as Edonkey2k, FastTrack, Gnutella, Gnutella2, Overnet and Kad, all use this concept.
Here the author must make it clear that Bittorrent, although a popular P2P application, is not a general-purpose P2P network. It still needs tracker servers. Although the network structure of Bittorrent is partially decentralized, the technique discussed in this article can't be used to identify Bittorrent users.
Decentralized means a network structure with no dedicated central index servers, and it is the trend in P2P evolution. Today there are many P2P camps, each using its own network and protocol, but their network structures are normally totally or partially decentralized. Some P2P applications, such as EMule and Edonkey, support fully decentralized protocols such as Kademlia, which needs no servers at all. Hybrid decentralized networks, a partially decentralized model, have won broad support from various P2P applications and are thus recognized as the most popular P2P network model.
In a hybrid decentralized network, there are still central servers, but they are no longer dedicated and static. Instead, some peers with more resources (CPU, disk, bandwidth and active time) automatically take over the central indexing server functions. These peers are called ultrapeers (or supernodes). Each of them is elected from the normal peers and serves a group of normal peers. Ultrapeers communicate with each other to form the backbone of the hybrid decentralized network. New ultrapeers are continuously added as appropriate peers join the network, and ultrapeers are removed when they leave it.
In order to join the network, a peer must find a way to connect to one or a few of the live ultrapeers. It gets the ultrapeer list by some means, such as a bootstrap list stored in the program or a download from a special web site. After connecting to a suitable ultrapeer, and apart from the normal file transfer work, the P2P application must keep interacting with the P2P network to stay connected and healthy: uploading information to the server, checking the status of the ultrapeer to which it is connected, getting the most current list of available ultrapeers, comparing the condition of the available ultrapeers, actively switching to a better ultrapeer, searching for files, probing the status of file suppliers, storing available ultrapeers for future use, and so on. In short, besides the real file transfer traffic itself, peers need to send many control packets (probe, inform and other packets) to many different hosts to keep up with the changing network environment in real time. This is the first key element of our traffic behavior identification: peers need to send out many control packets to interact with the decentralized network during their lifetime.
UDP connection patterns
Today almost all P2P applications using a decentralized structure have a built-in module to carry out this interaction work. Because many control packets need to be sent to many destinations, a great many modern P2P networks and protocols select UDP as the carrying protocol.
Why do they select UDP? UDP is simple, efficient and low-cost. It does not need to guarantee packet delivery, establish a connection, or maintain connection state. All these features make UDP a good fit for fast delivery of data to many destinations, which is just what P2P applications need. If you inspect different P2P applications carefully, you will find that most modern decentralized P2P applications adopt a similar network behavior. When they start up, they create one or several UDP sockets to listen on, and then use these UDP ports to communicate with a large number of outside addresses throughout their lifetime to assist their interaction in the P2P world. This is the second key element of our traffic behavior identification: peers keep using one or several UDP ports to make connections for the control work.
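The two key elements can be combined into a simple check over traffic records. The following Python sketch is one possible way to do this, not the author's actual tool: the function name find_p2p_candidates, the record layout (source address, source port, destination address) and both thresholds are assumptions chosen for illustration, and the addresses are synthetic documentation addresses.

```python
from collections import defaultdict


def find_p2p_candidates(records, min_destinations=50, max_ports=3):
    """Flag hosts whose UDP traffic fans out from only a few source ports.

    records: iterable of (src_ip, src_port, dst_ip) tuples, one per
             outgoing UDP packet or flow (hypothetical layout).
    min_destinations, max_ports: illustrative thresholds, not values
             taken from the article.
    Returns {src_ip: {src_port: distinct_destination_count}}.
    """
    # src_ip -> src_port -> set of distinct destination addresses
    fanout = defaultdict(lambda: defaultdict(set))
    for src_ip, src_port, dst_ip in records:
        fanout[src_ip][src_port].add(dst_ip)

    candidates = {}
    for src_ip, ports in fanout.items():
        # Keep only the source ports that reached many distinct destinations.
        busy = {port: len(dsts) for port, dsts in ports.items()
                if len(dsts) >= min_destinations}
        # A P2P-like host has one or several such ports, not dozens.
        if 1 <= len(busy) <= max_ports:
            candidates[src_ip] = busy
    return candidates


# Synthetic data: one host reuses a single UDP port for 60 destinations,
# another uses 60 different source ports to reach a single destination.
records = [("10.0.0.5", 10810, f"198.51.100.{i}") for i in range(60)]
records += [("10.0.0.9", 40000 + i, "192.0.2.53") for i in range(60)]

print(find_p2p_candidates(records))
```

With this synthetic input only the first host is reported, because it alone keeps one source port busy with many distinct destinations. On real traffic the thresholds and the observation window would need tuning.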
Now, let's turn to a popular P2P application, Edonkey2000, to see how it can be identified.
Edonkey2000 UDP traffic example
The following is a trace of Edonkey's outgoing UDP traffic. The output shown here is sanitized and truncated, so it is only a fraction of the captured traffic. In fact, this example contained 390 records in just two minutes. For the purposes of the example, the source address is replaced with x and the first octet of each destination address is replaced with y.
11:24:19.650034 IP x.10810 > y.34.233.22.8613: UDP, length: 25
11:24:19.666047 IP x.2587 > y.138.230.251.4246: UDP, length: 6
11:24:19.666091 IP x.10810 > y.127.115.17.4197: UDP, length: 25
11:24:19.681433 IP x.10810 > y.76.27.4.4175: UDP, length: 25
11:24:19.681473 IP x.2587 > y.28.31.240.4865: UDP, length: 6
11:24:19.696907 IP x.2587 > y.162.178.102.4265: UDP, length: 6
......
11:24:20.946921 IP x.2587 > y.250.47.34.4665: UDP, length: 6
11:24:20.962509 IP x.2587 > y.152.93.254.4665: UDP, length: 6
11:24:20.978275 IP x.2587 > y.28.31.241.5065: UDP, length: 6
11:24:20.993871 IP x.2587 > y.135.32.97.580: UDP, length: 6
11:24:21.009621 IP x.2587 > y.149.102.1.4246: UDP, length: 6
11:24:29.681224 IP x.10810 > y.32.97.189.5312: UDP, length: 4
11:24:29.696903 IP x.10810 > y.10.34.181.7638: UDP, length: 4
11:24:29.716503 IP x.10810 > y.26.234.251.12632: UDP, length: 4
......
11:26:20.291874 IP x.10810 > y.19.149.0.21438: UDP, length: 19
From the output, we can see that all traffic comes from two source ports, UDP 2587 and UDP 10810. (These ports are randomly selected by Edonkey, so the port numbers will differ from host to host.) The destination IP addresses are diverse. In fact, Edonkey uses one port to send server status requests to the Edonkey servers, and uses the other port for connections, IP queries, searches, publishing and other work.
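The same observation can be made mechanically. The sketch below, in Python, parses lines in the format of the trace above and counts packets and distinct destinations per source port. The regular expression and the name summarize_udp_trace are assumptions made for this illustration, and the sample reuses only a few of the sanitized records shown above.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 11:24:19.650034 IP x.10810 > y.34.233.22.8613: UDP, length: 25
LINE_RE = re.compile(
    r"^(?P<time>\S+) IP (?P<src>\S+)\.(?P<sport>\d+) > "
    r"(?P<dst>\S+)\.(?P<dport>\d+): UDP, length:? (?P<length>\d+)$"
)


def summarize_udp_trace(lines):
    """Return {source_port: {"packets": n, "destinations": set of (host, port)}}."""
    summary = defaultdict(lambda: {"packets": 0, "destinations": set()})
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip elisions ("......") and anything that is not UDP
        entry = summary[int(match["sport"])]
        entry["packets"] += 1
        entry["destinations"].add((match["dst"], int(match["dport"])))
    return dict(summary)


SAMPLE = """\
11:24:19.650034 IP x.10810 > y.34.233.22.8613: UDP, length: 25
11:24:19.666047 IP x.2587 > y.138.230.251.4246: UDP, length: 6
11:24:19.666091 IP x.10810 > y.127.115.17.4197: UDP, length: 25
......
11:24:20.946921 IP x.2587 > y.250.47.34.4665: UDP, length: 6
""".splitlines()

# Print: source port, packet count, number of distinct destinations.
for port, info in sorted(summarize_udp_trace(SAMPLE).items()):
    print(port, info["packets"], len(info["destinations"]))
```

Run over a full capture, a summary like this makes the pattern easy to spot: very few source ports, each paired with a large and varied set of destinations.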
