Metadata Enrichment in Tribler (MET) Notes

Miscellaneous notes moved here from the main page to reduce clutter, while still keeping them archived.

MTurk Experiment

First Experiment

Data:

  • before_open_size_07_Batch_357268_result.csv
  • before_open_size_12_Batch_355973_result.csv
  • before_open_size_37_Batch_357607_result.csv (orig. size 40)
  • after_open_size_11_Batch_357999_result.csv
  • after_open_size_45_Batch_358463_result.csv

Results Before (each row: two counts, their total in parentheses, and the counts as fractions of that total; the "after" table uses the same format):

Real 157 / 124 (281) 0.5587 / 0.4413
Fake 149 / 130 (279) 0.5341 / 0.4659
--
TV 43 / 17 (60) 0.7167 / 0.2833
Movies 37 / 22 (59) 0.6271 / 0.3729
Books 18 / 25 (43) 0.4186 / 0.5814
Music 33 / 29 (62) 0.5323 / 0.4677
Software 26 / 31 (57) 0.4561 / 0.5439
--
Cur news 38 / 19 (57) 0.6667 / 0.3333
Commercial 41 / 31 (72) 0.5694 / 0.4306
Sport 18 / 33 (51) 0.3529 / 0.6471
Howto/Talk 23 / 25 (48) 0.4792 / 0.5208
Home video 29 / 22 (51) 0.5686 / 0.4314

Results After:

Real 151 / 130 (281) 0.5374 / 0.4626
Fake 157 / 122 (279) 0.5627 / 0.4373
--
TV 26 / 19 (45) 0.5778 / 0.4222
Movies 31 / 24 (55) 0.5636 / 0.4364
Books 39 / 31 (70) 0.5571 / 0.4429
Music 29 / 28 (57) 0.5088 / 0.4912
Software 26 / 28 (54) 0.4815 / 0.5185
--
Cur news 40 / 28 (68) 0.5882 / 0.4118
Commercial 23 / 21 (44) 0.5227 / 0.4773
Sport 38 / 28 (66) 0.5758 / 0.4242
Howto/Talk 35 / 22 (57) 0.6140 / 0.3860
Home video 21 / 23 (44) 0.4773 / 0.5227
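
The fractions can be reproduced from the counts; a quick sketch (the function name is ours):

        # Reproduce the fractions in a results row from its two counts,
        # e.g. "Real 157 / 124 (281) 0.5587 / 0.4413".
        def row_fractions(a, b):
            total = a + b
            return total, round(a / float(total), 4), round(b / float(total), 4)

        print(row_fractions(157, 124))   # (281, 0.5587, 0.4413)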

Mockups

Mockup images (attachments not reproduced here):

  • Search recommendation
  • Smart search bar
  • Tag cloud / list
  • Multilevel tag clouds

Preliminary findings from studying some data from the ClickLog crawl

  • Sometimes a hit was found by matching keywords from the query against a file name in a multi-file torrent.
    • Question: Is there a significant number of users that search for file names?
  • Based on a corpus of more than 800k swarm names, we have identified that some users include stopwords in their queries.
    • Remark by Martha: certain users probably expect exact-match behaviour. Example query: "the hill productions".
  • Further directions: Group search queries by users (anonymized) and study search behaviour.
  • ClickLog bug: Propagation of ClickLog information is incorrect when a user has searched for and downloaded the same torrent multiple times. Consequence: reconstructing the query yields repeated words; a sketch below illustrates this.
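
A minimal sketch of the bug's effect, assuming one click record is stored per search-and-download event (the data layout here is hypothetical):

        # Hypothetical click records: the same query stored once per
        # search-and-download event for the same torrent.
        records = [["ubuntu", "iso"], ["ubuntu", "iso"]]

        # Naive reconstruction concatenates the keyword lists, so the
        # rebuilt query contains repeated words.
        query = []
        for keywords in records:
            query.extend(keywords)
        print(query)     # ['ubuntu', 'iso', 'ubuntu', 'iso']

        # Deduplicating while preserving order restores the query.
        seen = set()
        deduped = []
        for kw in query:
            if kw not in seen:
                seen.add(kw)
                deduped.append(kw)
        print(deduped)   # ['ubuntu', 'iso']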

Current Tribler code for determining keywords of a swarm:

        # keywords from the swarm (torrent) name
        keywords = set(split_into_keywords(torrent_name))

        # search through the .torrent file for potential keywords in
        # the filenames
        for filename in torrentdef.get_files_as_unicode():
            keywords.update(split_into_keywords(filename))
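
For reference, a minimal sketch of what split_into_keywords could look like; the actual Tribler helper may differ, and the regex and lowercasing here are assumptions:

        import re

        def split_into_keywords(name):
            # Assumed behaviour: lowercase, then split on runs of
            # non-alphanumeric characters, dropping empty fragments.
            return [kw for kw in re.split(r"[^a-z0-9]+", name.lower()) if kw]

        # split_into_keywords("The.Hill_Productions-2010")
        #     -> ['the', 'hill', 'productions', '2010']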

After 20 days of crawling:

  • 501 unique queries were found.
  • 48 peers were contacted.
  • Queries were issued by 227 peers, of which 26 were crawled directly.
  • 1467 different swarms were clicked on (only swarms whose names are known are counted).

Discovery of torrents/terms

The following three plots show how many torrents are discovered over time and how many terms are extracted from them. We also show how many terms remain after simple filtering (word length > 2, frequency > 1).
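
A minimal sketch of this filter, assuming the extracted terms are available as one flat list (the names are ours):

        from collections import Counter

        def filter_terms(all_terms):
            # Keep terms longer than 2 characters that occur more than once.
            freq = Counter(all_terms)
            return set(t for t, n in freq.items() if len(t) > 2 and n > 1)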

Note that the act of searching causes torrents to be discovered at a higher rate (metadata for unknown infohashes in the search results gets requested).

This raises the question of whether active peers will tend to get a "narrower" view of the network than idling peers. When an active peer is shown a term cloud and clicks a term in it, it performs a search on neighbouring peers. Due to Tribler's semantic overlay, these peers tend to have a taste similar to that of the searching peer. If a term cloud only showed the most popular terms (according to the peer's own MegaCache), clicking on these terms may reinforce their popularity.

Frequency of discovered terms

800k+ dataset

Term-frequency plots (images not reproduced here) for Dataset 1 and Dataset 2, each shown unfiltered and filtered.


Clustering

Status: research put on hold for now, as it proved to be quite complex.

In a multilevel tag cloud, we cannot simply display the top N terms ranked on document frequency (df) and, when the user selects one of these terms, display the top M co-occurring terms, again ranked on df. The graph below illustrates why. In this graph, the red nodes represent the top N=100 level-1 tags; for each level-1 tag, the top M=100 co-occurring tags are connected to it by an edge. "Pure" level-2 tags, i.e. tags that do not also occur at level 1, are colored green. As you can see, most level-1 tags also show up at the second level.

The following situation illustrates why we do not want this ranking. Suppose that, among the global top N terms, "x" and "y" are the two most frequent terms, but that they have a co-frequency of merely 1. If the user selects term "x", we do not want "y" to be ranked high at the second level at all.

A better approach seems to be to rank level-2 term candidates by co-df, i.e. the number of documents in which a candidate co-occurs with the selected term. This results in a graph with 2275 nodes, as opposed to only 165 nodes.

The above images show how terms co-occur.
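
A minimal sketch of the co-df ranking described above, assuming each swarm is represented as a set of its terms (the names are ours):

        from collections import Counter

        def level2_tags(selected, swarms, top_m=100):
            # Rank level-2 candidates for a selected level-1 tag by co-df:
            # the number of swarms in which the candidate occurs together
            # with the selected tag.
            co_df = Counter()
            for terms in swarms:          # each swarm: a set of terms
                if selected in terms:
                    co_df.update(terms - set([selected]))
            return [t for t, _ in co_df.most_common(top_m)]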