Merkle Hashes

When you use a P2P system to download software, photographs, or other content, you want that content to be correct. Using Bittorrent you can download content from various peers, each of which could possibly corrupt this content. Normally, information in the torrent file prevents corruption, but this approach has two problems:

  • A .torrent file contains a hash of every block in the file set being distributed. Because of the resulting size of the torrent files, Web servers hosting popular or many torrents sometimes get overloaded.
  • Many (large) torrents use large piece sizes to keep the torrent small. Large piece sizes cause slow starts for new clients as they first have to obtain a e.g. 2 MB piece in its entirety (to check its hash) before they are allowed to exchange it for new data with others.

Tribler solves these problems by constructing a hash tree of the content and using just the root hash as data integrity protection in the torrent file. The simple root hash value also allows for smaller piece sizes to be used. A common form of hash trees is the Merkle hash tree, hence the name.

We have chosen a minimalistic design that does not affect the existing BitTorrent protocol and clients very much. The design is backwards compatible in the sense that clients supporting the Simple Merkle Hash extension can still be made to process regular torrent files easily.

From the content we construct as hash tree as follows. Given a piece size, we calculate the hashes of all the pieces in the file set. Next, we create a binary tree of sufficient height. Sufficient height means that the lowest level in the tree has enough nodes to hold all piece hashes in the set. We place all piece hashes in the tree, starting at the left-most leaf. The remaining leaves in the tree are assigned a hash value of 0. Finally, we calculate the hash values of the higher levels in the tree, by concatenating the hash values of the two children (again left to right) and computing the hash of that aggregate. This process ends in a hash value for the root node, which we call the root hash.

Merkle tree

The root hash along with the total size of the file set and the piece size are now the only information in the system that needs to come from a trusted source. A client that has only the root hash of a file set can check any piece as follows. It first calculates the hash of the piece it received. Along with this piece it should have received the hashes of the piece's sibling and of its uncles, that is the sibling Y of its parent X, and the uncle of that Y until the root is reached. Using this information the client recalculates the root hash of the tree, and compares it to the root hash it received from the trusted source.

a follow-up: SerpinskyNumbering

Attachments