Metadata Enrichment in Tribler (MET) Notes

Miscellaneous notes moved here from the main page to reduce clutter, while still keeping them archived.

MTurk Experiment

First Experiment

Data:

  • before_open_size_07_Batch_357268_result.csv
  • before_open_size_12_Batch_355973_result.csv
  • before_open_size_37_Batch_357607_result.csv (orig. size 40)
  • after_open_size_11_Batch_357999_result.csv
  • after_open_size_45_Batch_358463_result.csv

Results Before (each row: two counts, their total in parentheses, and the counts as fractions of that total; the "after" table uses the same format):

Real 157 / 124 (281) 0.5587 / 0.4413
Fake 149 / 130 (279) 0.5341 / 0.4659
--
TV 43 / 17 (60) 0.7167 / 0.2833
Movies 37 / 22 (59) 0.6271 / 0.3729
Books 18 / 25 (43) 0.4186 / 0.5814
Music 33 / 29 (62) 0.5323 / 0.4677
Software 26 / 31 (57) 0.4561 / 0.5439
--
Cur news 38 / 19 (57) 0.6667 / 0.3333
Commercial 41 / 31 (72) 0.5694 / 0.4306
Sport 18 / 33 (51) 0.3529 / 0.6471
Howto/Talk 23 / 25 (48) 0.4792 / 0.5208
Home video 29 / 22 (51) 0.5686 / 0.4314

Results After:

Real 151 / 130 (281) 0.5374 / 0.4626
Fake 157 / 122 (279) 0.5627 / 0.4373
--
TV 26 / 19 (45) 0.5778 / 0.4222
Movies 31 / 24 (55) 0.5636 / 0.4364
Books 39 / 31 (70) 0.5571 / 0.4429
Music 29 / 28 (57) 0.5088 / 0.4912
Software 26 / 28 (54) 0.4815 / 0.5185
--
Cur news 40 / 28 (68) 0.5882 / 0.4118
Commercial 23 / 21 (44) 0.5227 / 0.4773
Sport 38 / 28 (66) 0.5758 / 0.4242
Howto/Talk 35 / 22 (57) 0.6140 / 0.3860
Home video 21 / 23 (44) 0.4773 / 0.5227
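
The fractions can be reproduced from the counts; a quick sketch (the function name is ours):

        # Reproduce the fractions in a results row from its two counts,
        # e.g. "Real 157 / 124 (281) 0.5587 / 0.4413".
        def row_fractions(a, b):
            total = a + b
            return total, round(a / float(total), 4), round(b / float(total), 4)

        print(row_fractions(157, 124))   # (281, 0.5587, 0.4413)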

Mockups

Mockup images (attachments not reproduced here):

  • Search recommendation
  • Smart search bar
  • Tag cloud / list
  • Multilevel tag clouds

Preliminary findings from studying some data from the ClickLog crawl

  • Sometimes a hit was found by matching keywords from the query against a file name in a multi-file torrent.
    • Question: Is there a significant number of users that search for file names?
  • Based on a corpus of more than 800k swarm names, we have identified that some users include stopwords in their queries.
    • Remark by Martha: certain users probably expect exact-match behaviour. Example query: "the hill productions".
  • Further directions: Group search queries by users (anonymized) and study search behaviour.
  • ClickLog bug: Propagation of ClickLog information is incorrect when a user has searched for and downloaded the same torrent multiple times. Consequence: reconstructing the query yields repeated words; a sketch below illustrates this.
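
A minimal sketch of the bug's effect, assuming one click record is stored per search-and-download event (the data layout here is hypothetical):

        # Hypothetical click records: the same query stored once per
        # search-and-download event for the same torrent.
        records = [["ubuntu", "iso"], ["ubuntu", "iso"]]

        # Naive reconstruction concatenates the keyword lists, so the
        # rebuilt query contains repeated words.
        query = []
        for keywords in records:
            query.extend(keywords)
        print(query)     # ['ubuntu', 'iso', 'ubuntu', 'iso']

        # Deduplicating while preserving order restores the query.
        seen = set()
        deduped = []
        for kw in query:
            if kw not in seen:
                seen.add(kw)
                deduped.append(kw)
        print(deduped)   # ['ubuntu', 'iso']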

Current Tribler code for determining keywords of a swarm:

        # keywords from the swarm (torrent) name
        keywords = set(split_into_keywords(torrent_name))

        # search through the .torrent file for potential keywords in
        # the filenames
        for filename in torrentdef.get_files_as_unicode():
            keywords.update(split_into_keywords(filename))
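
For reference, a minimal sketch of what split_into_keywords could look like; the actual Tribler helper may differ, and the regex and lowercasing here are assumptions:

        import re

        def split_into_keywords(name):
            # Assumed behaviour: lowercase, then split on runs of
            # non-alphanumeric characters, dropping empty fragments.
            return [kw for kw in re.split(r"[^a-z0-9]+", name.lower()) if kw]

        # split_into_keywords("The.Hill_Productions-2010")
        #     -> ['the', 'hill', 'productions', '2010']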

After 20 days of crawling:

  • 501 unique queries were found.
  • 48 peers were contacted.
  • Queries were issued by 227 peers, of which 26 were crawled directly.
  • 1467 different swarms were clicked on (only swarms whose names are known are counted).

Discovery of torrents/terms

The following three plots show how many torrents are discovered over time and how many terms are extracted from them. We also show how many terms remain after simple filtering (word length > 2, frequency > 1).
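
A minimal sketch of this filter, assuming the extracted terms are available as one flat list (the names are ours):

        from collections import Counter

        def filter_terms(all_terms):
            # Keep terms longer than 2 characters that occur more than once.
            freq = Counter(all_terms)
            return set(t for t, n in freq.items() if len(t) > 2 and n > 1)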

Note that the act of searching causes torrents to be discovered at a higher rate (metadata for unknown infohashes in the search results gets requested).

This raises the question of whether active peers will tend to get a "narrower" view of the network than idling peers. When an active peer is shown a term cloud and clicks a term in it, it performs a search on neighbouring peers. Due to Tribler's semantic overlay, these peers tend to have a taste similar to that of the searching peer. If a term cloud only showed the most popular terms (according to the peer's own MegaCache), clicking on these terms may reinforce their popularity.

Frequency of discovered terms

800k+ dataset

Term-frequency plots (images not reproduced here) for Dataset 1 and Dataset 2, each shown unfiltered and filtered.


Clustering

Status: research put on hold for now, as it proved to be quite complex.

In a multilevel tag cloud, we cannot simply display the top N terms ranked on document frequency (df) and, when the user selects one of these terms, display the top M co-occurring terms, again ranked on df. The graph below illustrates why. In this graph, the red nodes represent the top N=100 level-1 tags; for each level-1 tag, the top M=100 co-occurring tags are connected to it by an edge. "Pure" level-2 tags, i.e. tags that do not also occur at level 1, are colored green. As you can see, most level-1 tags also show up at the second level.

The following situation illustrates why we do not want this ranking. Suppose that, among the global top N terms, "x" and "y" are the two most frequent terms, but that they have a co-frequency of merely 1. If the user selects term "x", we do not want "y" to be ranked high at the second level at all.

A better approach seems to be to rank level-2 term candidates by co-df, i.e. the number of documents in which a candidate co-occurs with the selected term. This results in a graph with 2275 nodes, as opposed to only 165 nodes.

The above images show how terms co-occur.
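
A minimal sketch of the co-df ranking described above, assuming each swarm is represented as a set of its terms (the names are ours):

        from collections import Counter

        def level2_tags(selected, swarms, top_m=100):
            # Rank level-2 candidates for a selected level-1 tag by co-df:
            # the number of swarms in which the candidate occurs together
            # with the selected tag.
            co_df = Counter()
            for terms in swarms:          # each swarm: a set of terms
                if selected in terms:
                    co_df.update(terms - set([selected]))
            return [t for t, _ in co_df.most_common(top_m)]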