Metadata Enrichment in Tribler (MET) Notes
Miscellaneous notes moved here from the main page to reduce clutter there while keeping them archived.
MTurk Experiment
First Experiment
Data:
- before_open_size_07_Batch_357268_result.csv
- before_open_size_12_Batch_355973_result.csv
- before_open_size_37_Batch_357607_result.csv (orig. size 40)
- after_open_size_11_Batch_357999_result.csv
- after_open_size_45_Batch_358463_result.csv
Results Before:
| Real | 157 / 124 | (281) | 0.5587 / 0.4413 |
| Fake | 149 / 130 | (279) | 0.5341 / 0.4659 |
| -- | |||
| TV | 43 / 17 | (60) | 0.7167 / 0.2833 |
| Movies | 37 / 22 | (59) | 0.6271 / 0.3729 |
| Books | 18 / 25 | (43) | 0.4186 / 0.5814 |
| Music | 33 / 29 | (62) | 0.5323 / 0.4677 |
| Software | 26 / 31 | (57) | 0.4561 / 0.5439 |
| -- | |||
| Cur news | 38 / 19 | (57) | 0.6667 / 0.3333 |
| Commercial | 41 / 31 | (72) | 0.5694 / 0.4306 |
| Sport | 18 / 33 | (51) | 0.3529 / 0.6471 |
| Howto/Talk? | 23 / 25 | (48) | 0.4792 / 0.5208 |
| Home video | 29 / 22 | (51) | 0.5686 / 0.4314 |
Results After:
| Real | 151 / 130 | (281) | 0.5374 / 0.4626 |
| Fake | 157 / 122 | (279) | 0.5627 / 0.4373 |
| -- | |||
| TV | 26 / 19 | (45) | 0.5778 / 0.4222 |
| Movies | 31 / 24 | (55) | 0.5636 / 0.4364 |
| Books | 39 / 31 | (70) | 0.5571 / 0.4429 |
| Music | 29 / 28 | (57) | 0.5088 / 0.4912 |
| Software | 26 / 28 | (54) | 0.4815 / 0.5185 |
| -- | |||
| Cur news | 40 / 28 | (68) | 0.5882 / 0.4118 |
| Commercial | 23 / 21 | (44) | 0.5227 / 0.4773 |
| Sport | 38 / 28 | (66) | 0.5758 / 0.4242 |
| Howto/Talk? | 35 / 22 | (57) | 0.6140 / 0.3860 |
| Home video | 21 / 23 | (44) | 0.4773 / 0.5227 |
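In both tables, the last column appears to be each of the two counts divided by the row total shown in parentheses. A minimal Python sketch of that arithmetic, using the "Real" row of the Before table (the helper name `proportions` is illustrative and not taken from the experiment scripts; the meaning of the two counts is not stated in these notes):

```python
def proportions(first, second):
    """Return (total, first / total, second / total), fractions rounded to 4 decimals."""
    total = first + second
    return total, round(first / total, 4), round(second / total, 4)


# "Real" row of the Before table: 157 / 124
print(proportions(157, 124))
```

Running the same function on any other row reproduces the fractions listed in its last column.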
Mockups
Preliminary findings studying some data from the ClickLog crawl
- Sometimes a hit was found by matching keywords from the query against a file name in a multi-file torrent.
- Question: Is there a significant number of users that search for file names?
- Based on a corpus of more than 800k swarm names, we found that some users include stopwords in their queries.
- Remark by Martha: certain users probably expect exact-match behaviour. Example query: "the hill productions".
- Further directions: Group search queries by users (anonymized) and study search behaviour.
- ClickLog bug: Propagation of ClickLog information is incorrect when a user has searched and downloaded the same torrent multiple times. Consequence: Reconstructing the query results in repeated words.
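Two of the points above lend themselves to a small sketch for analysing the crawled queries: flagging queries that contain stopwords, and collapsing the repeated words that the ClickLog bug introduces when a query is reconstructed. This is only an illustration; the stopword list and both function names are assumptions, not part of Tribler or of the analysis scripts.

```python
# Hypothetical stopword list; the list used in the actual analysis is not given in these notes.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to"}


def contains_stopword(query):
    """True if any whitespace-separated term of the query is a stopword."""
    return any(term in STOPWORDS for term in query.lower().split())


def drop_repeated_terms(terms):
    """Remove repeated terms, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for term in terms:
        if term not in seen:
            seen.add(term)
            unique.append(term)
    return unique


print(contains_stopword("the hill productions"))
print(drop_repeated_terms(["hill", "productions", "hill", "productions"]))
```

Note that dropping repeated terms also collapses queries in which the user genuinely typed the same word twice, so it is a workaround for analysis rather than a fix for the propagation bug.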
Current Tribler code for determining keywords of a swarm:
keywords = Set(split_into_keywords(torrent_name))
# search through the .torrent file for potential keywords in
# the filenames
for filename in torrentdef.get_files_as_unicode():
    keywords.update(split_into_keywords(filename))
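For experimenting outside Tribler, the same logic can be sketched as a self-contained function. The `split_into_keywords` below is a hypothetical stand-in (lowercase, split on non-alphanumeric characters); Tribler's own implementation may tokenize differently, and the list of file names is passed in directly instead of coming from `torrentdef.get_files_as_unicode()`.

```python
import re


def split_into_keywords(text):
    """Hypothetical stand-in: lowercase and split on non-alphanumeric characters."""
    return [kw for kw in re.split(r"[\W_]+", text.lower()) if kw]


def swarm_keywords(torrent_name, filenames):
    """Collect keywords from the swarm name and from every file name in the torrent."""
    keywords = set(split_into_keywords(torrent_name))
    for filename in filenames:
        keywords.update(split_into_keywords(filename))
    return keywords


print(sorted(swarm_keywords("Some Album (2009)", ["01 - intro.mp3", "02 - Second_Track.mp3"])))
```

Because file names contribute keywords, a query can match a multi-file torrent even when none of its terms appear in the swarm name, which is the behaviour noted in the findings above.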
After 20 days of crawling:
- 501 unique queries found.
- Contacted 48 peers.
- Queries were issued by 227 peers, of which 26 were directly crawled.
- 1467 different swarms were clicked on.
(Only swarms whose names are known are counted.)
Discovery of torrents/terms
The following three plots show how many torrents are discovered over time, and how many terms are extracted from them. We also show how many terms remain after simple filtering (word length > 2, freq > 1).
Note that the act of searching causes torrents to be discovered at a higher rate (metadata for unknown infohashes in the search results gets requested).
This raises the question of whether active peers will tend to get a "narrower" view of the network than idle peers. When an active peer is shown a term cloud and clicks a term in it, it performs a search on neighbouring peers. Because of Tribler's semantic overlay, these peers tend to have tastes similar to those of the searching peer. If a term cloud showed only the most popular terms (according to the peer's own MegaCache), clicking on these terms could reinforce the popularity of the clicked terms.
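The simple filtering mentioned above (word length > 2, freq > 1) can be sketched as follows. This assumes that "freq" is the total number of times a term was extracted across all discovered torrents; the function name and the sample data are made up for illustration.

```python
from collections import Counter


def filter_terms(term_lists, min_length=3, min_freq=2):
    """Keep terms longer than 2 characters that occur more than once overall."""
    freq = Counter(term for terms in term_lists for term in terms)
    return {t: n for t, n in freq.items() if len(t) >= min_length and n >= min_freq}


# One list of extracted terms per discovered torrent (made-up data).
torrents = [
    ["ubuntu", "9", "10", "iso"],
    ["ubuntu", "server", "iso"],
    ["tv", "show", "s01"],
]
print(filter_terms(torrents))
```

Short tokens such as "tv" and terms seen only once are dropped, which is why the filtered curves in the plots stay below the unfiltered ones.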
Frequency of discovered terms
800k+ dataset
Dataset 1
| Unfiltered | Filtered |
| [histogram and log-log plots not preserved; see Attachments] | [histogram and log-log plots not preserved; see Attachments] |
Dataset 2
| Unfiltered | Filtered |
| [histogram and log-log plots not preserved; see Attachments] | [histogram and log-log plots not preserved; see Attachments] |
Clustering
Status: Research put on hold for now, as it proved to be quite complex.
In a multilevel tag cloud, we cannot simply display the top N terms ranked by df and, when the user selects one of them, display the top M terms that co-occur with the selected term, again ranked by df. The graph below illustrates why. In this graph, the red nodes represent the top N=100 level 1 tags. Each level 1 tag is connected by an edge to its top M=100 co-occurring tags. "Pure" level 2 tags are colored green in the graph. As you can see, most level 1 tags also show up at the second level.
The following situation shows why we don't want this ranking. Suppose that "x" and "y" are the two most frequent terms in the global top N, but that they have a co-frequency of merely 1. If the user selects term "x", we do not want "y" to be ranked high at the second level at all.
A better approach seems to be to rank level 2 term candidates using the co-df. This results in a graph with 2275 nodes, as opposed to only 165 nodes.
The above images show how terms co-occur.
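The difference between the two rankings can be made concrete with a small sketch. It assumes that df denotes document frequency (the number of swarms a term occurs in) and that co-df is the number of swarms in which two terms occur together; the function names and the toy data are illustrative only.

```python
from collections import Counter


def document_frequency(docs):
    """Number of documents (term sets) each term occurs in."""
    return Counter(term for doc in docs for term in set(doc))


def rank_by_co_df(docs, selected, top_m=100):
    """Rank terms by how many documents they share with the selected term."""
    co_df = Counter()
    for doc in docs:
        terms = set(doc)
        if selected in terms:
            co_df.update(terms - {selected})
    return co_df.most_common(top_m)


docs = (
    [{"x", "a"}] * 4      # "x" co-occurs with "a" four times
    + [{"y", "b"}] * 3    # "y" is globally frequent ...
    + [{"y", "c"}] * 2    # ... but almost never appears with "x"
    + [{"x", "y"}]        # the single document in which "x" and "y" co-occur
)
print(document_frequency(docs).most_common(2))
print(rank_by_co_df(docs, "x"))
```

In this toy data "x" and "y" are the two most frequent terms overall, yet after selecting "x" the co-df ranking places "a" ahead of "y", which is the behaviour argued for above.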
Attachments (all added by vliegendhart@…)
- discovery_1.png (59.1 KB): Discovery of torrents at peer 1
- discovery_martha.png (52.4 KB): Discovery of torrents at peer 3 with periods of downtime
- sparsity.png (127.1 KB): Sparsity pattern of co-occurrence matrix for the top 30K frequent terms of 800k+ dataset
- sparsity-local.png (39.9 KB): Sparsity pattern of co-occurrence matrix for terms of a small local db (ranked by frequency)
- SearchRecommendation1.png (69.0 KB): Search Recommendation overview
- SearchRecommendation2.png (115.8 KB): Search Recommendation - mouseover item
- SmartSearchbar.png (82.1 KB): Smart search bar mockup
- tagmockup_alphabetical.png (275.3 KB): Alphabetical tag cloud mockup
- tagmockup_list.png (274.9 KB): Tag list mockup
- tagmockup_multi1.png (139.9 KB): Multilevel tag cloud - Level 1
- tagmockup_multi2.png (316.0 KB): Multilevel tag cloud - Level 2
- tagmockup_multi3.png (321.7 KB): Multilevel tag cloud - Level 3
- large800k.png (3.1 KB): Histogram 800k dataset
- large800k_loglog.png (4.4 KB): Loglog 800k dataset
- local1_1hr.png (3.0 KB): Histogram dataset 1 (1st hour)
- local1.png (2.9 KB): Histogram dataset 1
- local1_1hr_filtered.png (3.1 KB): Histogram dataset 1 (1st hour) - filtered
- local1_filtered.png (2.8 KB): Histogram dataset 1 - filtered
- local1_1hr_loglog.png (3.2 KB): Loglog dataset 1 (1st hour)
- local1_loglog.png (3.7 KB): Loglog dataset 1
- local1_1hr_filtered_loglog.png (3.3 KB): Loglog dataset 1 (1st hour) - filtered
- local1_filtered_loglog.png (3.8 KB): Loglog dataset 1 - filtered
- local2_1hr.png (3.1 KB): Histogram dataset 2 (1st hour)
- local2.png (2.9 KB): Histogram dataset 2
- local2_1hr_filtered.png (2.9 KB): Histogram dataset 2 (1st hour) - filtered
- local2_filtered.png (3.3 KB): Histogram dataset 2 - filtered
- local2_1hr_loglog.png (3.1 KB): Loglog dataset 2 (1st hour)
- local2_loglog.png (3.7 KB): Loglog dataset 2
- local2_1hr_filtered_loglog.png (3.2 KB): Loglog dataset 2 (1st hour) - filtered
- local2_filtered_loglog.png (3.8 KB): Loglog dataset 2 - filtered
- discovery_2.png (60.9 KB): Discovery of torrents at peer 2































