2005-07-21
Finding the pattern
A study of other decentralized P2P applications, such as BearShare, Skype, Kazaa, EMule, Limewire, Shareaza, Xolox, MLDonkey, Gnucleus, Sancho, and Morpheus, leads to a similar result. All of these applications share the same connection pattern: they use one or several UDP ports to communicate with many outside hosts during their lifetime. At the network layer, this pattern can be summarized as:
For a period of time (x), from one single IP, fixed UDP port -> many destination IPs (y), fixed or random UDP ports
Experience shows that when x equals five and y equals three, administrators scanning for P2P applications will get a satisfactory result. Administrators can change the x and y values to get a more precise or a rougher result, according to their requirements.
In practice, we can export network connection records from the corresponding equipment and use a database and shell scripts to process them. For any given minute, if the records show that a host sends UDP packets from a fixed source port to a number of different hosts, it is highly probable that the host is a P2P host.
The author of this article set up a test environment on one of China's largest ISP nodes. The network connection records were exported from the router as Netflow data and stored in a MySQL database. With the help of a small script to process all the data, many hosts were identified as P2P peers, and some interesting, locally developed new P2P applications were also discovered.
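The fan-out check described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's script: the Flow record layout and the function name are hypothetical, the window length is a parameter because the article does not state the unit of x, and the default of three destinations simply mirrors the y value mentioned above.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set, Tuple


class Flow(NamedTuple):
    """One exported connection record (hypothetical field layout)."""
    timestamp: float   # seconds since the epoch
    protocol: str      # "udp" or "tcp"
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def find_p2p_candidates(flows: Iterable[Flow],
                        window_seconds: int = 60,
                        min_destinations: int = 3) -> Set[Tuple[str, int]]:
    """Return (src_ip, src_port) pairs that reach at least
    min_destinations distinct destination IPs over UDP in one window."""
    # (window index, src_ip, src_port) -> set of destination IPs seen
    fanout = defaultdict(set)
    for flow in flows:
        if flow.protocol.lower() != "udp":
            continue  # the pattern only concerns UDP traffic
        window = int(flow.timestamp // window_seconds)
        fanout[(window, flow.src_ip, flow.src_port)].add(flow.dst_ip)

    return {
        (src_ip, src_port)
        for (_, src_ip, src_port), destinations in fanout.items()
        if len(destinations) >= min_destinations
    }
```

Keeping the source port in the grouping key is what captures the "fixed UDP port" part of the pattern: a host that talks to many destinations from many different source ports is not flagged by this sketch.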
Dealing with false positives
This sounds like a good method for P2P host identification, but what about false positives? Fortunately, this kind of network traffic behavior is seldom seen in other types of usage around the Internet. An exception would be a host that is a traditional game server, DNS server or media server. Such a server will also produce traffic records in which many UDP packets are sent out to many different IP addresses from a single source. But administrators can easily tell whether a host is a traditional server, because a server normally will not send any kind of traffic on ports other than its functional port, which is not the model used by a P2P host.
The value of this UDP connection pattern is obvious: this approach does not need any kind of application layer information, yet the result is still quite satisfactory. It does not rely on any kind of signature, so newly developed P2P applications can still be identified quickly in large networks. Meanwhile, analyzing the network layer information requires almost no extra software or hardware, and dramatically reduces the pressure that might otherwise be put on the corresponding equipment.
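The server check described in this section can be expressed as a rough filter over the same kind of connection records. The sketch below is an interpretation, not a rule taken from the article: the Flow layout, the function names and the max_service_ports threshold are all hypothetical, and a real server may occasionally send other traffic (for example, software updates), so the result should be treated as a hint only.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Flow(NamedTuple):
    """One exported connection record (hypothetical field layout)."""
    protocol: str      # "udp" or "tcp"
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def source_ports_by_host(flows: Iterable[Flow]) -> Dict[str, Set[Tuple[str, int]]]:
    """Map each source IP to the set of (protocol, source port) it sent from."""
    ports = defaultdict(set)
    for flow in flows:
        ports[flow.src_ip].add((flow.protocol.lower(), flow.src_port))
    return dict(ports)


def likely_servers(flows: Iterable[Flow],
                   candidates: Set[str],
                   max_service_ports: int = 1) -> Set[str]:
    """Return the candidate hosts whose outgoing traffic uses no more than
    max_service_ports distinct (protocol, port) pairs."""
    ports = source_ports_by_host(flows)
    return {
        host for host in candidates
        if 0 < len(ports.get(host, ())) <= max_service_ports
    }
```

Hosts returned by likely_servers could then be removed from the list of P2P candidates, or reviewed by hand before removal.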
Disadvantages of this approach
To be sure, this UDP session method also has two disadvantages. First, it can only be used to identify P2P applications that use a decentralized structure (although most modern P2P applications are indeed decentralized). Second, if the P2P application uses TCP rather than UDP to perform its control function, our identification work will fail.
Identifying P2P applications
Up to this point we have identified P2P users by relying on network connection records. We now go one step further and identify exactly which P2P application a host is running, without the help of any higher layer data.
If you examine the UDP traffic of different P2P applications more carefully, you will find even more interesting patterns. It has been mentioned that a decentralized network structure needs control packets, and it is not difficult to understand that a given P2P application uses many kinds of control packets. Packets that serve the same control purpose are very often identical in size. Therefore, UDP packet sizes can even help us identify exactly which P2P application is running, in the absence of any higher level information.
Most P2P applications do not have complete documentation of their implementation details, and some of them are closed source, so the exact makeup of most applications' UDP packets is still unclear. Therefore, the author of this article randomly selected seven popular decentralized P2P applications and observed their traffic. The results confirm the hypothesis: all of these applications use some fixed length packets to contact outside hosts.
- Edonkey2000
Edonkey2000 uses many 6 byte UDP packets to send out 'server status request'. These packets are mostly seen when Edonkey launches. Additionally, the packet that performs the search function is almost always seen, and it has a length of 25 bytes.
- BearShare
When BearShare launches, it first sends out UDP packets with a length of 28 bytes to many different destinations. Every time BearShare launches a file transfer task, many UDP packets, each with a length of 23 bytes, are sent out to file suppliers.
- Limewire
Limewire uses many 35 byte and 23 byte UDP packets, sent out when Limewire starts. Every time a download task starts, many 23 byte UDP packets are exchanged with the outside.
- Skype
Skype starts up with many 18 byte UDP packets to communicate with the outside.
- Kazaa
When Kazaa launches, it sends out UDP packets with a length of 12 bytes to many different destinations.
- EMule
When you start EMule and select a server to connect to, many 6 byte UDP packets are sent out continuously to perform 'server status request' and 'get server info'. If you choose to connect to a Kad network in EMule, 27 byte and 35 byte UDP packets appear continuously in the connection traffic.
- Shareaza
During Shareaza's lifetime, 19 byte UDP packets are found continuously in the traffic.
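The observations above can be collected into a small lookup table and matched against the packet lengths seen from a suspected peer. The following Python sketch is only illustrative: the function name and the scoring rule (count how many observed packets match each application's listed sizes) are assumptions, and the article does not state whether the lengths refer to the UDP payload or to the whole packet, so the caller has to measure lengths the same way the table was built. Several sizes are shared between applications (6, 23 and 35 bytes), which is why the sketch returns a ranked list of candidates rather than a single answer.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Packet lengths in bytes, taken from the observations listed above.
SIGNATURE_SIZES = {
    "Edonkey2000": {6, 25},
    "BearShare": {23, 28},
    "Limewire": {23, 35},
    "Skype": {18},
    "Kazaa": {12},
    "EMule": {6, 27, 35},
    "Shareaza": {19},
}


def rank_applications(observed_sizes: Iterable[int]) -> List[Tuple[str, int]]:
    """Rank applications by how many observed UDP packets have a length
    listed for that application. Applications with no match are omitted."""
    counts = Counter(observed_sizes)
    scores = {}
    for app, sizes in SIGNATURE_SIZES.items():
        hits = sum(counts[size] for size in sizes)
        if hits:
            scores[app] = hits
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For example, a host that sends only 18 byte packets yields Skype as the sole candidate, while a mix of 6, 27 and 35 byte packets ranks EMule first and still lists Edonkey2000 and Limewire as weaker candidates.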
The results of these simple tests are quite interesting. They mean that after identifying the peers in the network records, we could use this technique in the future to determine exactly which application a peer uses. However, research on the size of different P2P applications' control packets is still in its infancy and there are many things left to do. For a detailed and accurate result, each application may need special focus, and a lot of research work is still needed.
Furthermore, there are other means that can be combined with the methods discussed in this article to better identify P2P users and P2P applications. Some P2P applications make connections to fixed outside IP addresses to perform such functions as version checks, authentication, downloading bootstrap data, or even advertising. For example, Kazaa will connect to ssa.Kazaa.com, desktop.Kazaa.com and some other sites when it operates. Skype will make a TCP connection to ui.skype.com whenever it starts up.
Other aspects of traffic behavior, such as the amount of data transferred and the connection duration, may also be used in P2P identification, but this adds another level of complexity.
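As a rough illustration of the fixed-host idea, the Python sketch below matches host names against the examples given in this article. How the host names are obtained (for instance from DNS logs, or by resolving destination addresses found in the connection records) is left open and is an assumption, as are the table and function names; the table contains only the three hosts named above.

```python
from typing import Iterable, Set

# Host names mentioned in this article, stored in lower case.
KNOWN_HOSTS = {
    "ssa.kazaa.com": "Kazaa",
    "desktop.kazaa.com": "Kazaa",
    "ui.skype.com": "Skype",
}


def applications_from_hostnames(hostnames: Iterable[str]) -> Set[str]:
    """Return the applications whose well-known hosts appear in hostnames."""
    found = set()
    for name in hostnames:
        # DNS names are case-insensitive and may carry a trailing dot.
        normalized = name.strip().rstrip(".").lower()
        app = KNOWN_HOSTS.get(normalized)
        if app is not None:
            found.add(app)
    return found
```

A match of this kind is best used as supporting evidence alongside the UDP pattern, since a single lookup of one of these names does not prove that the application is running.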
Conclusion
As always, there is no one-size-fits-all solution for P2P identification work. Although port based analysis and protocol analysis are currently the most important and commonly used technologies, we should not be content with them. With some brainstorming, another method may crop up to reinforce the P2P identification solution.
Acknowledgement
My special thanks to Kelly Martin for his careful review and suggestions!
About the author
Yiming Gong worked for China Telecom for more than 5 years as a senior systems administrator, and he now works as a researcher at the Research Department, NSFocus Information Technology Co. Ltd.
