W3C Team Project Review: P2P
Outline - Goals
- Introduction, thoughts on a definition
- A (slightly) technical history
- Applications and challenges
- Usability? Some more technical points
- Conclusion? Questions!
P2P, a buzzword??
Yes
abusive use of the term for every kind of distributed computing environment:
Grid, multi-agent systems... very similar concepts.
Content Distribution Networks... Open Hypermedia Systems (OHS)
new wording = hype
No
deserved a word of its own as an application for end users; it did not start as a research topic, but sneaked into the academic world after its public success
The non-geek view:
- file sharing
- internet-web application
- piracy/illegal content
- domain = mainly entertainment (but on a collaborative & on-demand mode) [music, videos, usually big files]
P2P Architectures
Network of peers as opposed to server-based exchanges (e.g. IM vs IRC)
- Centralized
- Distributed
- Hybrid
- Unstructured or Structured
Once upon a time: Napster (1999)
- Centralized (napster.com as a central directory)
- each peer sends its list of shared files and its file queries to the central server
- then pings the returned peers to evaluate the best transfer rates
- Efficient search (and fast in theory, but a bottleneck on the server), the network is well known, BUT a single point of failure on the server + the cost of the server.
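The centralized model can be sketched in a few lines (a toy illustration, not the real Napster protocol; the class, method names, and addresses are invented):

```python
# Napster-style central directory: peers register their shared files with
# one server, and all searches are answered from that single index.
from collections import defaultdict

class CentralDirectory:
    def __init__(self):
        self.index = defaultdict(set)   # filename -> set of peer addresses

    def register(self, peer, filenames):
        """A peer announces its list of shared files."""
        for name in filenames:
            self.index[name].add(peer)

    def unregister(self, peer):
        """Remove a departing peer from every entry."""
        for peers in self.index.values():
            peers.discard(peer)

    def search(self, filename):
        """O(1) lookup -- efficient, but every query hits this one server,
        and if the server dies the whole index is gone."""
        return sorted(self.index.get(filename, set()))

directory = CentralDirectory()
directory.register("10.0.0.1:6699", ["song.mp3", "demo.mp3"])
directory.register("10.0.0.2:6699", ["song.mp3"])
print(directory.search("song.mp3"))  # → ['10.0.0.1:6699', '10.0.0.2:6699']
```

The single `directory` object is exactly the single point of failure noted above.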
Need to hide: Gnutella (2000)
- Totally distributed, all peers are equal
- Every peer routes queries and serves files; queries flood the network until the query "radius" is reached
- Bootstrap nodes needed
- Inefficient search, generates a lot of traffic, limited by the radius, BUT it is impossible to take the whole network down. Open source
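The flooding mechanism can be modeled as a breadth-first walk bounded by a TTL (a toy model with an invented data layout, not the actual Gnutella wire protocol):

```python
# Gnutella-style query flooding: each peer forwards the query to its
# neighbours until the TTL ("radius") runs out. Shows both why traffic
# grows quickly and why content outside the radius is unreachable.
def flood_query(peers, start, filename, ttl):
    """peers: {node: {"files": set, "neighbours": list}}. Returns the set
    of nodes within the radius that hold the file."""
    hits, seen = set(), {start}
    frontier = [start]
    while frontier and ttl > 0:
        nxt = []
        for node in frontier:
            for nb in peers[node]["neighbours"]:
                if nb in seen:
                    continue          # de-duplicate, as servents do via query ids
                seen.add(nb)
                if filename in peers[nb]["files"]:
                    hits.add(nb)
                nxt.append(nb)
        frontier = nxt
        ttl -= 1                      # each hop decrements the TTL
    return hits

# A chain A-B-C-D where only D has the file: with radius 2 from A the
# query dies before reaching D; with radius 3 it gets there.
chain = {
    "A": {"files": set(), "neighbours": ["B"]},
    "B": {"files": set(), "neighbours": ["A", "C"]},
    "C": {"files": set(), "neighbours": ["B", "D"]},
    "D": {"files": {"file.mp3"}, "neighbours": ["C"]},
}
assert flood_query(chain, "A", "file.mp3", ttl=2) == set()
assert flood_query(chain, "A", "file.mp3", ttl=3) == {"D"}
```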
Hybridization: KaZaA/FastTrack (2001)
- SuperNodes and Ordinary nodes
- In theory, better queries, BUT still no overall search, so only very popular content can be easily found
- maybe a bit easier to attack
- Other improvements in parallel downloads, queuing, etc.
Hybridization
- Deployment parameters: about 30,000 SNs with 100-150 ON children each, each SN connected to 30-50 other SNs
- HTTP-based requests/responses, HTTP Byte-Range to get chunks of a file from different places
- Content Hash
- Bootstrap SuperNodes only
Similar architecture adopted in 2002-2003 by Gnutella2 (hubs and leaves)
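The Byte-Range/content-hash combination can be illustrated as follows (the chunk size is an illustrative assumption, and SHA-1 stands in for whatever content hash the real network used):

```python
# Multi-source downloading in the FastTrack spirit: split the file into
# chunks, request each chunk from a different peer with an HTTP Range
# header, then verify the reassembled data against a content hash.
import hashlib

CHUNK = 512 * 1024   # illustrative chunk size, not FastTrack's actual value

def chunk_ranges(length, chunk=CHUNK):
    """Yield inclusive (start, end) byte ranges, one per chunk."""
    for start in range(0, length, chunk):
        yield start, min(start + chunk, length) - 1

def range_header(start, end):
    """HTTP/1.1 header asking one peer for just this chunk."""
    return {"Range": f"bytes={start}-{end}"}

def verify(data, expected_hex):
    """A content hash lets the client detect polluted/fake downloads."""
    return hashlib.sha1(data).hexdigest() == expected_hex

# Each range could be requested from a different peer in parallel:
assert list(chunk_ranges(1000, chunk=400)) == [(0, 399), (400, 799), (800, 999)]
assert range_header(0, 399) == {"Range": "bytes=0-399"}
```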
And next?
P2P Balkanization?
Lots of different P2P software and servers, some with open source implementations (WinMX/Winny, eDonkey, eMule, etc.), forks implementing variants, etc.
Why? What made some success and others fail?
Here comes the **AA...
- RIAA and MPAA, Entertainment companies
- Tried collecting information about peers
- Sharing fake files (pollution)
- Taking servers down (e.g. Razorback2) or networks (Napster, WinMX)
- DDoS attacks on SN
- development of protection software, mostly IP-level filtering (BlueTack, PeerGuardian, ... roughly the same as RBLs against spam: lists of nasty lawyers' IPs ;) )
... and the ISPs
A 2002 study evaluated the average user's traffic as 43% P2P and 14% Web surfing
- Filtering common ports
- packet sniffing
- actively closing TCP connections (Sandvine)
Countermeasures: encryption, tunneling into UDP packets
Actual Usability (2003)
- Lots of different software for different networks
- Popularity of objects is short-lived
- 30% of objects smaller than 10MB take more than 1 hr to download
- 50% of objects bigger than 100MB take more than 1 day
- NAT problems
Usability not that good but huge success of "free" content
BitTorrent (2001)
- New context: sharing legal content
- Much like a hybrid network (trackers)
- torrent file = name, length, hash, URL for the tracker
- Chunked files (~256KB), heuristic to replicate rare chunks more than common ones, no "starvation" risk as with KaZaA and previous techniques
- Efficient distribution, no real "search" since the (meta)data is in the
torrent (and bootstrapping is data-based, not network-based)
- Much faster than KaZaA and similar networks
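The rare-chunk heuristic is essentially "rarest first" piece selection; a minimal sketch (the data layout is invented for illustration, not BitTorrent's actual bitfield format):

```python
# Rarest-first piece selection: count how many known peers hold each chunk
# and download the least-replicated chunks first, so no chunk "starves"
# and rare chunks get replicated before their last holder leaves.
from collections import Counter

def rarest_first(peer_bitfields, have):
    """peer_bitfields: iterable of sets of chunk indices each peer holds.
    have: set of chunk indices we already have. Returns the chunks we
    still need, rarest first (ties broken by index for determinism)."""
    counts = Counter()
    for bits in peer_bitfields:
        counts.update(bits)           # each set contributes 1 per chunk held
    wanted = [c for c in counts if c not in have]
    return sorted(wanted, key=lambda c: (counts[c], c))

# Three peers: chunk 2 is held by only one of them, so it is fetched first.
peers = [{0, 1}, {0, 1, 2}, {1}]
assert rarest_first(peers, have=set()) == [2, 0, 1]
```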
In the meanwhile...
- Distribution of huge files is more and more common
- Better domestic internet connections (mainly DSL)
Content Distribution Networks (CDNs) become more interesting (less
cost for the original distributor, in particular bandwidth) and
more feasible. Large distributed CPU networks were already a reality (rc5, SETI, climate prediction, etc.)
BitTorrent's success?
A 2004 study estimated BitTorrent at as much as 30% of all Internet traffic, with P2P as a whole at 60%
Many extensions for security, improvements on distribution of the data and the metadata (multiple trackers)
Legal problems still there, so are the consequences: poisoning chunks, filtering, etc.
and P2P applications were born
- VoIP (Skype)
- More IM
- Joost, Vuze, ...
- Virtual World (Solipsis...)
- boxes with BitTorrent included (routers, set-top-box...)
Same problems:
Efficiency should be improved, the NAT problem remains, ISPs are not playing nice,
despite efforts on caching extensions (e.g. Comcast & Azureus/Vuze)
Overlay Network: Routing between the Nodes
Distributed Hashtables (DHT)
Comes from the distributed caching area (e.g. CARP)
- Kademlia
- Chord
- Pastry/Tapestry
- CAN
- ...
Circle topology (Kademlia, Chord)
- Node takes a random key id in some id space (typically a 128- or 160-bit integer)
- Ask a node (bootstrap needed: list of default nodes) to find the predecessor and successor
- when connected to successor (next id in circle), ask the list (>=2) of successors, in case the first next one leaves
- tell predecessors to update their list of successors
(so local routing is changed)
- "Chords" on the circle (successors + predecessors)
- search is O(N) at worst, variants in IDs and search techniques improve this (e.g. groups/subgroups of 16 lead to log16(N) steps in Pastry)
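The circle topology above can be sketched as a toy successor lookup (the linear scan is the O(N) worst case; the "chords"/finger tables are what cut it down to logarithmic steps):

```python
# Chord-style ring, heavily simplified: node ids live on a circle of 2**M
# points, and a key belongs to its successor -- the first node id met when
# walking clockwise from the key.
M = 8                               # 2**8 = 256 points (real systems: 128/160 bits)

def successor(nodes, key):
    """First node id clockwise from key on the ring of node ids."""
    nodes = sorted(nodes)
    key = key % (2 ** M)            # keys are hashed into the same id space
    for n in nodes:
        if n >= key:
            return n
    return nodes[0]                 # wrap around the circle

ring = [10, 60, 140, 200]
assert successor(ring, 61) == 140
assert successor(ring, 250) == 10   # wraps past the largest id
```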
More structured P2P approaches
- D-dimensional Cartesian space (CAN)
- one more layer (e.g. connecting a chord with a pastry and a kademlia space,
more hierarchical, closer to Internet's topology)
- BGP-like routing tables
Kademlia is already in use (Overnet, Kad networks, BitTorrent extension)
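Kademlia organizes a similar id space with an XOR metric instead of a circle: the node responsible for a key is the one at minimal XOR distance to it, and each routing step moves to a node roughly halving that distance, giving O(log N) hops. A minimal sketch:

```python
# Kademlia's XOR distance: d(a, b) = a ^ b. It is symmetric and obeys the
# triangle inequality, so "closeness" is well defined for both storage
# (which node keeps a key) and routing (which contact to query next).
def xor_distance(a, b):
    return a ^ b

def closest(nodes, target):
    """The node responsible for a key is the one at minimal XOR distance."""
    return min(nodes, key=lambda n: xor_distance(n, target))

# 0b1000 ^ 0b1010 = 0b0010 (2), smaller than the other two distances.
assert closest([0b0001, 0b1000, 0b1100], 0b1010) == 0b1000
```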
Challenges of Overlay nets
Could address lots of issues at the same time:
- Node addressing, ID, not always IP-based (mobility problems, for IM, skype, etc.)
- optimize query routing (hash-based organization of the resources)
- reorganization when nodes join or leave (but bootstrap still needed)
- Malicious updates in (Chord) routing tables are not critical, but optimized routing is more fragile than random routing
- Security and trust: important for bootstrap nodes, cross-checks of routing tables between trusted nodes, re-routing if inactive nodes, block DoS of join/leaves
Lower level: Network issues
The overlay network does not solve lower-level issues; e.g. JXTA assumes a complete and secure network layer under its own architecture
- IP-based keys
- Hole Punching, NAT Traversal. See IETF specifications (TURN, STUN), ongoing work
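The hole-punching idea behind STUN-style NAT traversal: once a rendezvous server has told each peer the other's public (ip, port), both send UDP datagrams simultaneously, so each NAT sees an outbound packet and opens a mapping that lets the inbound one through. A sketch on localhost (no real NAT involved, so both "punches" succeed immediately):

```python
# UDP hole-punching, demonstrated on the loopback interface. On a real
# path the first packets may be dropped by the remote NAT, but they still
# create the outbound mappings that later packets traverse.
import socket

def make_peer():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("127.0.0.1", 0))     # OS picks a port, as a NAT would map one
    s.settimeout(2)
    return s

a, b = make_peer(), make_peer()
addr_a, addr_b = a.getsockname(), b.getsockname()
# (In reality addr_a/addr_b come from the rendezvous server, not getsockname.)

# Both sides punch at the same time:
a.sendto(b"punch", addr_b)
b.sendto(b"punch", addr_a)

data, _ = a.recvfrom(1024)       # b's datagram arrives: the "hole" is open
print(data)                      # → b'punch'
a.close(); b.close()
```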
Conclusions? Questions!
P2P revealed much bigger questions on the use of Internet
- Invisible Internet Project (I2P)
- The Onion Router (TOR)
In 2007, after 4 years behind P2P, HTTP traffic rose to almost 50% of all Internet traffic
- "youtube" et al. effect
- P2P still 37%
More P2P applications and Commercial networks to come
- CBC broadcast an entire show over P2P
- P2PNext project gets €14M in funding
- More Bittorrent-enabled devices
BitTorrent forks?
Links - Partial & mostly unordered bibliography