{{{ #!forumlinks }}} = Distributed Tracker = ''Status: algorithm design done, re-PEX code merged in mainbranch''[[BR]] svn.tribler.org/abc/branches/mainbranch/Tribler/Core/DecentralizedTracking/repex.py Motivation: Traditional television is moving online [http://www.marketingcharts.com/television/internet-surpasses-tv-in-australia-mobile-approaches-saturation-point-3976/nielsen-australia-media-consumption-internet-vs-tv-2005-2007jpg/ according] to [http://www.imediaconnection.com/images/content/chart_hitwise_060525_2_web.gif numerous studies]. This means every point of failure needs to be removed from the architecture, resulting in a fully self-organising system with unbounded scalability called [wiki:4thGenerationP2P 4th Generation P2P]. Tribler is striving for unbounded scalability. Tribler already addresses the problem of decentralized content discovery using BuddyCast, but the current client still lacks a solution for decentralized tracking of swarms. [#LittleBird Previous work] on decentralized tracking has been done by [JelleRoozenburg Jelle Roozenburg], but further research is required. Current research is being conducted by MSc student Raynor Vliegendhart. == Master thesis work == Implement and deploy a fully distributed tracker in Tribler. Used algorithm is the 2-hop torrentsmell algorithm, described in the research assignment document below. ToDo: * (done) basic cmdline tool for Re-pex handshake * (done) expand tool into swarm measurement crawler * Measure for a few swarms the average response time per client type - (done) 1min max listen time - determine liars about PEX capability * (done) Parallel crawling & measurement - Uniform event format '''.crawl''' file - Timestamp, SHA1, Peer-IP, peer-Port, EVENT, DATA * CONNECTION_ESTABLISHED, PeerID * CONNECTION_TIMEOUT * CONNECTION_CLOSED, bytes-downloaded, bytes-uploaded * CONNECTION_LOST (in case of network error), bytes-downloaded, bytes-uploaded * EXTEND_SEND * EXTEND_RECEIVED, ut_pex=1 (msg#), Transmission 1.34 (client version) * PEX_RECEIVED, peer-IP, peer-port, ..., DROPPED:, peer-IP, peer-port, ... * (first time) ONLINE_CHECK_SEND * (after 1hour,2h,...) ONLINE_CHECK_SEND * ONLINE_CHECK_SUCCESS * ONLINE_CHECK_TIMEOUT (connection never accepted: offline '''or busy''') * ONLINE_CHECK_ERR (network error) - Also included, but currently only relevant for human readers: * TRACKER_PEERS_INJECTED, announce-url, #reported-seeds, #reported-downloaders, peer-IP, peer-port, ... * Experiment around churn rate and decay of PEX info - Crawl several swarms for 1? hour, collect 10000 peers - Store all collected IPv4 addresses - Some peers were heard about multiple times - check online status and connectability of discovered peers for 1 week (drop if 10x connection unsuccessful) * Still online * Still seeding - Determine relation with swarm age (1hour old vs. 3 months old) * Measure bytes used for PEX traffic * Are seeders-only swarms breaking our algorithms? Seeders have no established connections without leechers {{{ #!comment * Early Tribler inclusion for 5.2, August 26th: go/no go. Consider Crawler.py or new overlay message or new extended message type (PEX 2.0) }}} == Thesis Chapter Layout == * Intro * Problem Description [[BR]] speed, scalability, security, etc * Design of 2-hop TorrentSmell[[BR]] PEX graphs, remote search details * Implementation[[BR]] cost per re-pex, interval settings, etc. * Deployment?[[BR]] Deployment crawl results, etc * Conclusions == Swarm crawling measurement == '''Popular torrent, reportedly 2897 seeders and 826 downloaders, on 2009-09-04, from 10:36 to 11:38.''' [[BR]] The graphs below show the bandwidth usage during the crawl, the number of the peers actually PEXed, statistics about the first PEX message, and the likelihood of reconnecting to a previously known peer in the future. [[Image(wiki:DistributedTracker:crawl1-bandwidth-pex-zoom.png, 982px)]] [[Image(wiki:DistributedTracker:crawl1-pexers.png)]] [[Image(wiki:DistributedTracker:crawl1-pexsize-zoom.png)]] [[Image(wiki:DistributedTracker:crawl1-pexresponse-combined.png, 982px)]] [[Image(wiki:DistributedTracker:crawl1-online.png, 490px)]][[Image(wiki:DistributedTracker:crawl1-online-percentage.png, 490px)]] {{{ #!comment Todo: perhaps pick a single graph and make the other links instead? }}} '''3 Linux ISO swarms''' [[BR]] Below is a graph showing the number of peers still connectable after we've last seen them. Note that the big drop at the end can be explained through the lack of data points: only 203 of the 475 initial peers have been checked for their online status. [[Image(wiki:DistributedTracker:linux-online.png, 490px)]][[Image(wiki:DistributedTracker:linux-online-percentage.png, 490px)]] If we crawl the swarm while trying to stay connected to PEX-capable peers until the connection is dropped by the other party, the graph looks quite different (see below). Peers seem to be less likely connectable after they have closed the connection. Also, while less peers get sent an online check, there is an increasing trend of connectability. This is in line with the results of the "Understanding Churn in Peer-to-Peer Networks" paper, where the authors found the uptime of a peer to be a good indicator of the peer's remaining uptime. [[Image(wiki:DistributedTracker:linux-keepalive-online.png, 490px)]][[Image(wiki:DistributedTracker:linux-keepalive-online-percentage.png, 490px)]] == Research Assignment == Ongoing work by Raynor. Large measurements show that The Pirate Bay is the most dominant central tracker. For each tracker we computed the number of total peers that it tracks. This was done by summing up the swarm sizes of all torrents in which a tracker was mentioned in the announce fields. The .torrent files of these swarms were obtained by running the [http://svn.tribler.org/abc/branches/release-4.5/Tribler/Core/BuddyCast/TorrentCollecting.py Tribler TorrentCollecting] software for roughly a year. Swarm sizes were constantly updated during this time. The top 20 trackers are depicted in the bar chart below: [[Image(wiki:DistributedTracker:Top20Trackers-TPB.PNG)]] [[BR]] '''Dataset:''' 283,031 swarms, total of 52,634,778 peers. In this chart, peers that are tracked solely by a tracker are marked as "Single Tracker". Hence, these peers come from a swarm which are only tracked by a single tracker. The remaining peers come from swarms that are tracked by multiple trackers. "Primary Tracker" indicates the peer came from a multi-tracker swarm which is primarily tracked by this tracker, i.e. the tracker is mentioned in the 'announce' field of the torrent file. If the tracker was not mentioned in the 'announce' field (and hence, it must have been mentioned in the 'announce-list' field), we label the peers as "Foreign Tracker". The swarms presented in this chart where carefully filtered for spam, fakes, and misreportings. == DHT Papers == * '''An Analysis of BitTorrent's Two Kademlia-Based DHTs''' ([http://www.cs.rice.edu/~scrosby/tr/BTMeasure-Main.pdf PDF], [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.76.8338 CiteSeer]) [[BR]] Describes the DHT implementations of Mainline and Azureus and their shortcomings. * '''The Index Poisoning Attack in P2P File Sharing Systems''' ([http://cis.poly.edu/~ross/papers/poison.pdf PDF], [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.81.3640 CiteSeer]) [[BR]] Describes index poisoning in Overnet, which is a Kademlia-based DHT. * '''DDoS Attacks by Subverting Membership Management in P2P Systems''' ([http://cobweb.ecn.purdue.edu/~isl/NPSec07.pdf PDF], [http://scholar.google.nl/scholar.bib?hl=nl&lr=&q=info:qed-p8JXGA8J:scholar.google.com/&output=citation&scfhb=1&oe=ASCII&oi=citation BibTeX]) [[BR]] Discusses the vulnerabilities of DHTs and gossip algorithms and argue that pull-based designs are preferred over push-based designs. * '''Handling Churn in a DHT''' ([http://oceanstore.cs.berkeley.edu/publications/papers/pdf/bamboo-usenix.pdf PDF], [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.5979 CiteSeer], [http://sahara.cs.berkeley.edu/jan2004-retreat/slides/bamboo-tr.pdf TR]) [[BR]] Contains a table on observed session times in various P2P systems. * '''The Impact of DHT Routing Geometry on Resilience and Proximity''' ([http://www.mpi-sws.org/~gummadi/papers/p1101-gummadi.pdf PDF], [http://scholar.google.nl/scholar.bib?hl=nl&lr=&q=info:QOwTgp5uHQcJ:scholar.google.com/&output=citation&scfhb=1&oe=ASCII&oi=citation BibTeX]) [[BR]] Comparison of different DHT routing ''geometries'', including the XOR geometry of Kademlia. * '''POND: The Power of Zone Overlapping in DHT Networks''' ([http://ieeexplore.ieee.org/stampPDF/getPDF.jsp?arnumber=04579564 PDF], [http://scholar.google.nl/scholar.bib?hl=nl&lr=&q=info:SVsdcw8xihcJ:scholar.google.com/&output=citation&oe=ASCII&oi=citation BibTeX]) [[BR]] Discusses the problem of load balancing in DHTs. * '''Profiling a Million User DHT''' ([http://www.imconf.net/imc-2007/papers/imc150.pdf PDF], [http://scholar.google.nl/scholar.bib?hl=nl&lr=&q=info:28Cy8sHIpJUJ:scholar.google.com/&output=citation&oe=ASCII&oi=citation BibTeX]) [[BR]] Studies the Azureus DHT. * '''Secure routing for structured peer-to-peer overlay networks''' ([http://www.cs.rice.edu/~dwallach/pub/osdi2002.pdf PDF], [http://portal.acm.org/popBibTex.cfm?id=844156&ids=J597.844128.844153.844156&types=periodical.issue.section.article&reqtype=article&coll=GUIDE&dl=GUIDE&CFID=15386518&CFTOKEN=41994084 BibTeX]) [[BR]] Discusses secure nodeIds, secure maintenance of routing tables and secure message forwarding in DHTs. == PEX, Gossips, Random/Ant Walks == * '''Understanding the Properties of the BitTorrent Overlay''' ([http://arxiv.org/pdf/0707.1820 PDF], [http://scholar.google.nl/scholar.bib?hl=nl&lr=&q=info:yu-yLeY_e0sJ:scholar.google.com/&output=citation&oe=ASCII&oi=citation BibTeX]) [[BR]] Describes the effect of PEX on the overlay network. * '''Evolution and Enhancement of BitTorrent Network Topologies''' ([http://ieeexplore.ieee.org/stampPDF/getPDF.jsp?arnumber=4539662 PDF], [http://scholar.google.nl/scholar.bib?hl=nl&lr=&q=info:QsghLFB65BcJ:scholar.google.com/&output=citation&oe=ASCII&oi=citation BibTeX]) [[BR]] Slightly related to the previous paper, but does not cover PEX. Paper proposes modified central tracking algorithm for Small World effect. * '''Really Truly Trackerless BitTorrent''' ([http://reports-archive.adm.cs.cmu.edu/anon/2006/CMU-CS-06-148.pdf PDF], [http://scholar.google.nl/scholar.bib?hl=nl&lr=&q=info:uO6T5XJVl4cJ:scholar.google.com/&output=citation&oe=ASCII&oi=citation BibTeX]) [[BR]] Proposes random walks to discover peers. * '''Random Walks in Peer-to-Peer Networks''' ([http://www.cc.gatech.edu/~mihail/rwp2p04.pdf PDF], [http://scholar.google.nl/scholar.bib?hl=nl&lr=&q=info:IJ9nWHksId4J:scholar.google.com/&output=citation&oe=ASCII&oi=citation BibTeX]) [[BR]] Argues random walks are superior to flooding in case of clustered overlay topologies. Discusses a weakly decentralized construction algorithm using random walks. * '''Gossip-based peer sampling''' ([http://www.irisa.fr/asap/intranet/gossip-based-peer-sampling.pdf/attachment_download/file PDF], [http://scholar.google.nl/scholar.bib?hl=nl&lr=&q=info:H5yQf3olY4YJ:scholar.google.com/&output=citation&oe=ASCII&oi=citation BibTeX]) [[BR]] Explores the design space of gossip protocols for peer sampling. * '''A pheromone-based coordination mechanism applied in P2P''' ([http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3.2172&rep=rep1&type=pdf PDF], [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.3.2172 CiteSeer]) [[BR]] Discusses real and synthetic pheromones and studies how the latter can be used to build a P2P system. * '''Exploring the Interdisciplinary Connections of Gossip-based Systems''' ([http://www.cs.vu.nl/~costa/papers/costa07exploring.pdf PDF], [http://scholar.google.nl/scholar.bib?hl=nl&lr=&q=info:NCRXTvOn4RAJ:scholar.google.com/&output=citation&oe=ASCII&oi=citation BibTeX]) [[BR]] Gives a definition of gossip protocols and shows several other algorithms share similarities. Also of interest: [wiki:PEXCrawl] == BitTorrent Trackers == * '''Availability in BitTorrent Systems''' ([http://www.cs.umass.edu/~arun/papers/BT_availability.pdf PDF], [http://scholar.google.nl/scholar.bib?hl=nl&lr=&q=info:hoonBR-9VwAJ:scholar.google.com/&output=citation&scfhb=1&oe=ASCII&oi=citation BibTeX], [http://www-net.cs.umass.edu/~honggang/pub/Neglia06availability_bt06-41.pdf TR]) [[BR]] Discusses central single and multi-tracker swarms and the availability of the trackers. == Unsorted Papers == * '''A Survey and Comparison of Peer-to-Peer Overlay Network Schemes''' ([http://scholar.google.nl/scholar.bib?hl=nl&lr=&q=info:00YOHLoibNcJ:scholar.google.com/&output=citation&scfhb=1&oe=ASCII&oi=citation BibTeX]) [[BR]] Survey and comparison of various structured and unstructured P2P networks. * '''Peer-to-Peer Overlay Networks: A Survey''' ([http://www.cse.ust.hk/~gabriel/Group/P2Psurvey/TR-P2P.pdf PDF], [http://scholar.google.nl/scholar.bib?hl=nl&lr=&q=info:UJXa5XunfeMJ:scholar.google.com/&output=citation&oe=ASCII&oi=citation BibTeX]) [[BR]] A less extensive survey. (Can be removed since the previous paper is more extensive?) * '''Debunking some myths about structured and unstructured overlays''' ([http://research.microsoft.com/en-us/um/people/antr/ms/myths.pdf PDF], [http://scholar.google.nl/scholar.bib?hl=nl&lr=&q=info:quix86-EUpYJ:scholar.google.com/&output=citation&scfhb=1&oe=ASCII&oi=citation BibTeX]) [[BR]] ''Debunks'' the myths that structured overlays are expensive to maintain, their topology constraints make it harder to exploit heterogeneity, etc. * '''Understanding Churn in Peer-to-Peer Networks''' ([http://www.imconf.net/imc-2006/papers/p19-stutzbach2.pdf PDF], [http://scholar.google.nl/scholar.bib?q=info:tBT4DC3nxJgJ:scholar.google.com/&output=citation&hl=en&ct=citation&cd=0 BibTeX]) [[BR]] Contains extensive study of churn. = Previous Work: Little Bird = #LittleBird ''This section should be merged after research has been completed'' == Removing the Bittorrent tracker == ''JelleRoozenburg'' In the BitTorrent p2p system all users that download a certain file profit from each other by bartering file pieces. Using this principle, files can be downloaded faster. A group of users that download the same file on the same moment are called a download swarm. The function of the BitTorrent tracker is to find out which users are in the download swarm of a certain file. A secondary task of the tracker is to administrate statistical information about the swarm, for instance which part of the file is downloaded by which users. == Tracker protocol == These tasks are currently implemented in a centralized fashion. A BitTorrent tracker is a server to which users can send a getPeers() request. This is request will be answered by the tracker with a response containing a list of IP addresses and port numbers of peers that are currently in this swarm. Communication with the tracker is done using the HTTP protocol, just like a webserver. == Disadvantages of a centralized tracker == The decentralized design of p2p systems is one of their major advantages. It makes the BitTorrent system flexible, scalable and reliable. The BitTorrent tracker is one of the parts of the system that still has a centralized design. This makes its simple implementation possible, but is formost a disadvantage for BitTorrent. A centralized tracker is a single point of failure in the BitTorrent system. This means that its failure will cause an interruption of the BitTorrent service, namely: to enable people to join and use download swarms. While the system gives a reliable download environment through thousands of users going online and offline at each moment, a disfunctional tracker immediately stops the download possibility of all new users. A BitTorrent tracker is also not scalable because of its centralized design. On this moment it is already noticable that trackers have long response delays and peers often have to do more tracker requests due to time-outs. When the number of BitTorrent users will grow in the future this problem will even grow in such a way that building a reliable centralized tracker will be undoable or at least very expensive. A final disadvantage of the current centralized trackers is that they are an easy target for an attack at the BitTorrent system. A simple distributed denial of service (DOS) attack can stop a tracker and with it thousands of users. With the current use of BitTorrent for the exchange of copyrighted materials, it is obvious that certain parties have motives for such attacks. == Distributed tracker == For the reasons given in the previous section, it shows that the centralized BitTorrent tracker should be replaced by a distributed follow-up. In this section we will explain how this is done and which issues have to be considered when distributing the tracker's functionality over all peers in the swarm. Find initial tracker peers The most important functionality of a tracker is to give any peer a list of addresses through which other peers in the swarm can be contacted. With a centralized tracker, the tracker location is saved in the .torrent-file so that each peer that starts a download can immediately contact it. When such a central point is not present, then we first have to find the tracker. In the distributed case, the tracker consists of all peers (or more commonly: all peers in the swarm in question). So the problem is: find initial peers that are in the swarm and are part of the distributed tracker. This problem will be solved using peer gossiping. Using this protocol, each peer that starts to seed a new file over the network or joins an existing swarm, pushes a message to the peers in its neighborhood. This messages contains his permanent identifier, a timestamp and a datastructure (probably a Bloom filter) with all the info hashes of the files it is currently seeding. These messages are forwarded by other peers according to the probabilistic gossiping protocol. In this way, most peers in the network receive the messages. When a swarm of reasonable size has formed, most peers should have a collection of messages with information about the current members of this swarm. If a peer wants to download the file (so join the swarm), it can simply connect to the peers in its list, check if they are still active swarm members and join in. Get more swarm peers In the current BitTorrent system, all peers in a swarm periodically reconnect to the tracker to add more swarm members to their peer lists. By doing so, they have knowledge of more peers and the probability of finding a peer with the pieces of the file you need grows. In the distributed tracker case, when a peer has found some peers using the method of section 'Find initial tracker peers', it will use these peers to enlarge its list of known peers in the swarm. This can be done by asking these peers for their swarm list (all peers in the swarm). Although this sounds very simple, there are different approaches to do this. === Track just seeders? === Would it make sense to gossip only about seeders? A popular file will have fewer seeders than leechers, and the set of seeders will be more stable. In principle, you only need to find one peer, since you can random walk from there to find others. Bittorrent is normally supposed to work even when the swarm includes only leechers, but would it make sense to give up on that and instead try to get more seeders? Actually, the distinction between seeders and leechers is blurry to me. In practice, a peer could advertise itself when 1) its policy has been configured to seed files for a sufficiently long period of time after download completion ''and'' 2) it has detected that the user leaves it running all the time and not just when he is downloading a file. ''--rimey@hiit.fi'' === DHT and alternatives === Numerous DHT proposals exist to decentralise peer discovery. An advanced proposal uses [http://arxiv.org/abs/cs.DS/0306043 Skip Graphs] to discover content and can be adapted to discover peers. However, security is not fully taken into account and a trivial pollution attack can bring the system down.