Metadata Enrichment in Tribler (MET)

Status: Phase I implemented in Tribler's main branch, awaiting the 5.3 release
(archived notes)

Research Questions

P: We conjecture that the user has no idea what is available in the Tribler network.
S: Show a term cloud.
(TODO: make this more elaborate. Include things like: we extract terms from swarm names, tf-idf is not useful, highly frequent terms may not be very informative but should still be included, the need for sampling, allowing the cloud to be noisy.)
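
A minimal sketch of the term-cloud idea, assuming we build the cloud from a random sample of locally known swarm names (the sample size, cloud size, and the simple tokenizer are illustrative; the actual splitting and filtering rules are listed under "Current filtering and extraction rules" below):

```python
import random
import re
from collections import Counter

def term_cloud(swarm_names, sample_size=5000, cloud_size=50, seed=None):
    """Build a (deliberately noisy) term cloud from a sample of swarm names."""
    rng = random.Random(seed)
    sample = rng.sample(swarm_names, min(sample_size, len(swarm_names)))
    counts = Counter()
    for name in sample:
        # Placeholder tokenizer: split on whitespace, periods, underscores, hyphens.
        counts.update(t for t in re.split(r"[\s._-]+", name.lower()) if len(t) >= 3)
    # The frequencies can be mapped to font sizes in the GUI; here we simply
    # return the most frequent terms with their raw counts.
    return counts.most_common(cloud_size)

# Example call:
# term_cloud(["Some.Show.S01E02.HDTV", "Some Show Season 1"], seed=0)
```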

  • Will a tag cloud of popular content be useful? Will the user find the information he or she is looking for?
    • If we want to measure this, do we have to annotate clicklogs whether the click was performed after a normal search or after clicking on a tag? (Assuming that clicking on a tag is effectively the same as performing a keyword search on that tag.)
    • Also note that we have to spread metadata for tag clouds to work. We cannot rely on ClickLog data since it is not propagated.
  • Is displaying a "pimped" swarm name instead of its original name useful for search?
    • (Assuming we are still going to implement this? Metadata can be spread via channels.)
    • Open question: how are we going to measure the effect of a "pimped" swarm name?
  • How can we detect near-duplicates and related swarms?
    • Can we only rely on techniques like tf-idf for this?
    • Do we need additional user-provided metadata? Or implicit metadata obtained through click behaviour?
    • Can we use techniques like multimodal video copy detection (paper linked below)?
      • If so, who will be responsible for generating the video and audio descriptors?
      • How are these video/audio descriptors to be distributed?
      • How do we check the integrity of these descriptors?
      Johan's new idea (02-09-2010): Triple Synergy Near Duplicates Detection
      A/V match + bitrate + duration + metadata + network connections.

(TODO: merge the following questions/remarks with the list above)

Are there certain users whose queries are "more reliable" -- i.e. should make a larger contribution to the network cloud?

  1. Skill level?
  2. Use of queries that others also use?
  3. Combo with matched swarm names?
  4. Future channel owners?
  5. (Does the query word order reflect the order of elements in swarm title?)

Bottlenecks

  1. Few queries are used by more than one person --> need a bigger sample.
  2. How to propagate MegaCache info throughout the network!
    (Blindly propagating is not spam-resilient)
  3. Lack of channel owners.
  4. Hierarchical clouds are interesting but difficult (need to contact Christian Wartena).

More questions and issues (added 04-08-2010):

  • Ranking terms extracted from swarm titles is difficult:
    • Term frequency of a term is nearly equal to document frequency (where title=single document).
    • Ranking terms based on frequency-popularity results in stopwords getting the highest ranks. One possible solution is to use the Snowball stopword list as a filter.
  • Some torrent titles use "." as a space, and it is currently treated as such by the term extraction test implementation. A side effect is that domains in titles (e.g. www.somereleasegroup.org) generate useless terms such as "www", "org", "com", "net", etc. Possibly, domains should be detected and extracted as a single term.
  • Relying on the Snowball stopword list is not sufficient when finding co-occurring term pairs. Terms with a high document frequency should be considered as stopwords when trying to find interesting term pairs.
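
A minimal sketch of the last point, assuming term lists have already been extracted per swarm title; the document-frequency cutoff and the minimum pair count are illustrative values, not ones taken from the Tribler code:

```python
from collections import Counter
from itertools import combinations

def cooccurring_pairs(term_lists, max_df_ratio=0.05, min_pair_count=3):
    """Find term pairs that co-occur across swarm titles, treating terms with a
    high document frequency as additional stopwords (on top of the Snowball
    list, which is assumed to have been applied during term extraction)."""
    docs = [set(terms) for terms in term_lists]
    df = Counter(term for doc in docs for term in doc)
    cutoff = max_df_ratio * len(docs)   # terms above this DF act as stopwords
    pair_counts = Counter()
    for doc in docs:
        kept = sorted(t for t in doc if df[t] <= cutoff)
        pair_counts.update(combinations(kept, 2))
    return [(pair, n) for pair, n in pair_counts.most_common() if n >= min_pair_count]
```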

Current filtering and extraction rules

  • The full swarm title is fed into Tribler's FamilyFilter to determine whether the swarm title is safe. Only safe titles are processed further.
  • Titles are split on: whitespace, periods, underscores, hyphens. Single quotes are ignored, as are brackets and parentheses.
  • Each extracted term is subject to the following filter rules (a sketch implementation follows this list):
    1. Terms in the Snowball stopwords list are ignored.
    2. All-digit terms are ignored, unless they have the form "19##" or "20##".
    3. Terms of the form "s##e##" are ignored.
    4. Terms of length less than 3 are ignored.
    5. The terms "www", "net", "com", and "org" are ignored.
  • We may later reconsider some of these rules.
  • We could consider using an exception list that is based on ClickLog information. If a certain search term is used to find data, but is ignored according to the rules above, we can add this term to the exception list.
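
A minimal sketch of the rules above (the stopword set is an illustrative stand-in for the Snowball list, the FamilyFilter step is omitted, and lowercasing is an assumption):

```python
import re

STOPWORDS = {"the", "and", "of", "a", "to", "in"}   # stand-in for the Snowball list
DOMAIN_TERMS = {"www", "net", "com", "org"}

SPLIT_RE = re.compile(r"[\s._-]+")            # whitespace, periods, underscores, hyphens
IGNORED_CHARS_RE = re.compile(r"['()\[\]]")   # single quotes, brackets, parentheses
YEAR_RE = re.compile(r"^(19|20)\d\d$")
EPISODE_RE = re.compile(r"^s\d+e\d+$")

def extract_terms(swarm_title):
    """Apply the filtering/extraction rules to a single (FamilyFilter-safe) title."""
    title = IGNORED_CHARS_RE.sub("", swarm_title.lower())
    terms = []
    for term in SPLIT_RE.split(title):
        if not term or term in STOPWORDS:               # rule 1
            continue
        if term.isdigit() and not YEAR_RE.match(term):  # rule 2
            continue
        if EPISODE_RE.match(term):                      # rule 3
            continue
        if len(term) < 3:                               # rule 4
            continue
        if term in DOMAIN_TERMS:                        # rule 5
            continue
        terms.append(term)
    return terms

# extract_terms("Some.Show.S01E02.2010.720p.www.somereleasegroup.org")
# -> ['some', 'show', '2010', '720p', 'somereleasegroup']
```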

TODO: Update this for the "network buzz" prototype.

Literature

Baumer et al., "Smarter Blogroll: An Exploration of Social Topic Extraction for Manageable Blogrolls," Hawaii International Conference on System Sciences, vol. 0, pp. 155+, 2008.
Begelman et al., "Automated Tag Clustering: Improving search and exploration in the tag space," in Proc. of the Collaborative Web Tagging Workshop at WWW'06, 2006.
Bordino et al., "Query Similarity by Projecting the Query-Flow Graph," ACM SIGIR conference on Research and development in information retrieval, ACM Press, July 2010.
Carman et al., "A Statistical Comparison of Tag and Query Logs," SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA: ACM, 2009.
Fokker et al., "Tag-Based Navigation for Peer-to-Peer Wikipedia" (I-Share).
Frantzi et al., "Automatic recognition of multi-word terms: the C-value/NC-value method," International Journal on Digital Libraries, vol. 3, no. 2, pp. 115-130, August 2000.
Golder et al., "Usage patterns of collaborative tagging systems," Journal of Information Science, vol. 32, no. 2, pp. 198-208, April 2006.
Hassan-Montero et al., "Improving Tag-Clouds as Visual Information Retrieval Interfaces," InSciT2006 conference, Mérida, 2006.
Kipp et al., "Patterns and Inconsistencies in Collaborative Tagging Systems: An Examination of Tagging Practices," Annual General Meeting of the American Society for Information Science and Technology, November 2006.
Trant, "Studying Social Tagging and Folksonomy: A Review and Framework," Journal of Digital Information, vol. 10, no. 1, 2009.
Viégas et al., "Tag clouds and the case for vernacular visualization," ACM interactions, vol. 15, no. 4, pp. 49-52, 2008.
Zeng et al., "Learning to cluster web search results," SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA: ACM, 2004, pp. 210-217.


Meeting 30-03-2010 @ 1:30pm in Shannon Room

Attendees: Martha, Raynor, Johan, Niels
Topic: Bundling issues and future ranking issues

Top Three Current Issues for Bundling

  1. Currently underway: We need to identify simple combinations of bundling functions + bundling conditions that are consistent with user-perceived similarities; we think two will do, one that targets "series bundling" functionality and one that targets "spam filtering". (Martha's top risk point)

    • Conservative choices
    • Raynor + Niels will figure out the representation in the GUI.
  2. How does bundling interact with the order-by buttons? Maybe there is a "representative" item that determines the ordering. Feasibility depends on the GUI. (Raynor's top risk point)

    • The order-by button reorders bundles, not individual results (see the sketch after this list).
  3. We need to elegantly handle incoming results, which can destroy bundles -- the destruction is potentially global, i.e., it goes beyond the results of the moment. (Raynor's second-to-top risk point)

    • Niels: Can be solved using a "Click here to show new results" approach similar to Twitter.
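
A minimal sketch of the "representative item" idea from point 2, showing how an order-by button could reorder whole bundles instead of individual results (the result-dict field names and the choice of "most seeders" as the representative are assumptions):

```python
def order_bundles(bundles, sort_key, reverse=True):
    """Sort bundles by a representative item, keeping each bundle intact.

    `bundles` is a list of lists of result dicts; the representative of a
    bundle is taken here to be its item with the most seeders.
    """
    def representative(bundle):
        return max(bundle, key=lambda hit: hit.get("num_seeders", 0))

    return sorted(bundles, key=lambda b: sort_key(representative(b)), reverse=reverse)

# Example: an "order by seeders" button would reorder bundles like this:
# ordered = order_bundles(bundles, sort_key=lambda hit: hit.get("num_seeders", 0))
```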

Further current issues for bundling

  1. How to crawl usage statistics for bundles that will allow us to assess effectiveness?
  2. We need to represent the bundles in the interface in a way that does not lead to semantic overload for the user.
    • Johan: Hide results only when your level of confidence is high.
  3. We need to give the user a simple, intuitive way of changing between bundles (Raynor has a nice idea).
    • Raynor: use simple circle/bullet icons to switch between bundled views. Example of this interface can be found on channel9.msdn.com.
  4. We need to decide how many of the results from the ranked list should be considered for bundling (do we go beyond the top 250)?
    • Niels: Performance issue. Should discuss for example with Boudewijn if we can use psyco to speed up certain grouping functions.
  5. How to choose and combine bundling conditions.
  6. How to use the user-perceived similarity data gathered on MTurk to quantitatively demonstrate that our bundles group near-duplicates (and demote spam).
  7. How to propagate condition+function pairs through the network (is this even useful)?
    • Niels: At the very least, override locally when the user has selected a non-default bundling in the past for a certain query.
  8. How to exploit MCA information? Can we integrate it?

Future Issues for Ranking

  1. Developing a high quality combination of bundling and ranking
  2. Are reranker keywords feasible? (i.e., automatically extractable and useful to the user)
  3. How to take open2edit information into account in the ranking?
  4. If reranking modalities (reranker keywords, bundling function+condition) are helpful, can/should they be propagated through the network?

Current issues with ranking in Tribler:

  1. The ranking function in the code seems to add the number of negative votes to the ranking score, but it might actually be OK and just coded in a non-obvious way.
  2. Incoming results reset "ordered-by" option selected by users to the default ranking.
    • Niels: this particular behaviour should have been fixed.
  3. There is no way to return from the "order-by" option to the original ranking unless a new result comes in.
  4. The ranking algorithm doesn't take into account where the keyword is mentioned.
  5. The surrogates displayed in the ranked list don't contain evidence that would help the user assess their relevance. In the snippets used by mainstream search engines, query words are highlighted for this purpose.
    • Niels: cannot easily highlight search terms using wxPython.

Additional notes:

  • 2010 --> maybe if everything has a number, you don't actually use the function?
  • How to make the decision about when to apply weighting in the Levenshtein function --> discussion about a decaying penalty (see the sketch after this list).
  • Johan: make it conservative to handle the season case.
  • Niels: harm of incorrect bundling depends on visibility in GUI.
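
A minimal sketch of the decaying-penalty idea from the Levenshtein note above: an edit near the start of a swarm title costs more than one near the end, which is conservative for the season/episode case where titles differ only in their final characters (the decay factor and the character-level formulation are assumptions, not the actual Tribler implementation):

```python
def decaying_levenshtein(a, b, decay=0.9):
    """Levenshtein distance where an edit at position i costs decay**i."""
    n, m = len(a), len(b)
    # dp[i][j] = weighted distance between a[:i] and b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + decay ** (i - 1)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + decay ** (j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                cost = decay ** (min(i, j) - 1)
                dp[i][j] = cost + min(dp[i - 1][j],      # deletion
                                      dp[i][j - 1],      # insertion
                                      dp[i - 1][j - 1])  # substitution
    return dp[n][m]

# decaying_levenshtein("series x s01e01", "series x s01e02") is smaller than
# decaying_levenshtein("aeries x s01e01", "series x s01e01"): the same single
# edit is penalised less at the end of the title than at the beginning.
```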

Meeting 20-01-2011 @ 2pm in HB10.230

Attendees: Martha, Raynor
Topic: Near-Duplicate Detection

  • Possible new algorithm: pairwise similarity.
  • Concerns for available real estate in UI.
  • Need to know relevance ranking and how bundling would interact.

Meeting 10-01-2011 @ 3pm in HB10.230

Attendees: Niels, Nitin, Martha, Raynor
Topic: Near-Duplicate Detection

Raw meeting notes from Martha:

  • Card sort of responses from the first HIT: user-derived dimensions of similarity. The goal is to classify the HIT responses into reasons why people think torrent names are similar.
  • Sorted all "high agreement" blocks (triples of swarm names). High-agreement blocks are blocks in which all three workers decided that there was a two-one split and picked the same item as the odd one out (see the sketch below).
  • Lots of types of blocks are left to sort: validation blocks, mid-agreement blocks (two workers agree), and low-agreement blocks.
  • General observations about the process:
    • Niels: Are we identifying reasons or methods by which people categorize things?
      (we seem to be getting information on both)
    • Nitin: They write what is different, and we are trying to identify on what basis they are trying to group.
  • A set of similarity dimensions is derived (recorded on green sheets).
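
A minimal sketch of the agreement levels used above, given each worker's odd-one-out pick for a block (the encoding, with None meaning the worker saw no two-one split, is an assumption):

```python
from collections import Counter

def agreement_level(picks):
    """Classify a block (triple of swarm names) by worker agreement.

    `picks` holds the three workers' odd-one-out choices; None means the
    worker did not see a two-one split in the block.
    """
    votes = Counter(p for p in picks if p is not None)
    if not votes:
        return "low"
    _, top_count = votes.most_common(1)[0]
    if top_count == 3:
        return "high"   # all three workers picked the same odd one out
    if top_count == 2:
        return "mid"    # two workers agree
    return "low"

# agreement_level(["B", "B", "B"]) -> "high"
# agreement_level(["B", "B", "A"]) -> "mid"
# agreement_level(["A", "B", None]) -> "low"
```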

  • Other observations about the results:
    • People don't seem to differentiate episodes from series in the film domain (there seem to be two understandings of "episodes").
    • It makes the most sense to count "mistakes" (for example, confusing a game and a soundtrack) as part of user-perceived similarities.
    • Users are quite able to identify episodes (Niels).
    • Is it important to differentiate items associated with a film (e.g. the film's soundtrack) from other items that happen to have the same name? Where do we draw the line for this understanding of corresponding items?
  • Bundle Functions
    Which light-weight algorithms are anticipated to be most useful to create bundlings that would reflect the user-derived dimensions of similarity:
    • Extensions (reflect the types of files)
    • Filesize
    • Integer ranking
    • Modified Levenshtein
      (the modified Levenshtein should be based on Niels's past experience, e.g. aspects like it being difficult to use stopwords appropriately)
  • Most interesting innovative aspect of bundling:
    • Condition the bundling functions on properties of results list
    • Application conditions for the bundle function are then propagated with the bundle function through the network (see the sketch after this list)
      (NB: for bundling there should be a local override.)
  • Next steps:
    • set up a list of characteristics of result lists that can be used for conditioning bundling
    • set up a plan for implementation (in particular, what are the expected risks and constraints)
    • finish the card sort on the responses: in the end we'll have a list of user-derived similarity dimensions that lightly motivates each bundling function
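
A minimal sketch of the condition + bundle function pairing discussed above, using the file-extension bundle function as the example (the condition threshold, the result-dict field names, and the fallback are assumptions):

```python
from collections import defaultdict

def extension_condition(results, min_distinct=3):
    """Illustrative application condition: only bundle by extension when the
    result list actually contains a variety of file types."""
    return len({hit.get("extension", "") for hit in results}) >= min_distinct

def bundle_by_extension(results):
    """Light-weight bundle function: group results by their file extension."""
    bundles = defaultdict(list)
    for hit in results:
        bundles[hit.get("extension", "")].append(hit)
    return list(bundles.values())

def bundle(results, pairs=((extension_condition, bundle_by_extension),)):
    """Apply the first bundle function whose application condition holds.

    A (condition, function) pair like this is what would be propagated through
    the network, with a local override when the user picked a different
    bundling for the same query in the past."""
    for condition, function in pairs:
        if condition(results):
            return function(results)
    return [[hit] for hit in results]   # fall back to singleton bundles
```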

Meeting 29-12-2010 @ 4pm in HB10.320

Attendees: Martha, Raynor
Topic: Near-Duplicate Detection

  • Concerns about Near-Duplicates include:
    • Different concepts of near-duplicates are applicable in different situations. Example: do you group all "Bronze Boy" soundtracks together and all "Bronze Boy" movies (i.e. Bronze Boy 1, Bronze Boy 2.5, Bronze Boy 33.333333...) together, or do you put the "Bronze Boy 1" soundtrack together with the "Bronze Boy 1" movie?
    • There is "noise" in the results list, which makes it harder to do "noise-free" near-duplicate clustering. This means we need a way of presenting clustered results to the user such that the "noise" is not bothersome.
    • Practically speaking, in the interface there is already one level of collapsing to give details of a given item, so it would also be difficult to visually group near-duplicates in a similar way (nested collapsing).
  • More precise definitions of similarities:
    • Content similarity:
      One will do for everyone.
    • Context similarity:
      One will do for a certain class of users (e.g. Portuguese speakers) or situations (e.g. mobile devices).

Meeting 16-12-2010 @ 4:00pm in HB09.290

Attendees: Johan, Niels, Nitin, Raynor.
Topic: Near-Duplicate Detection

  • HIT design:
    • Verifiable results are important.
    • Niels mentioned a MTurk experiment about rewriting paragraphs that consisted of 3 successive HITs.
      1. Highlight a candidate sentence in a given paragraph to be rewritten.
      2. Rewrite the highlighted sentence (from HIT 1) in a given paragraph.
      3. Given a highlighted sentence in a paragraph, choose a sentence (from HIT 2) to replace it.
    • Similarly, we can split our experiment into multiple HITs:
      1. Let workers group search results into 2 or 3 groups.
      2. Let workers, given groupings from HIT 1, pick the best grouping.
  • Nitin mentioned that our NDD and grouping should be generally applicable if we want to write a paper (in order to avoid legal issues).
    • We could draw comparisons with YouTube and their near-duplicates problem.
    • Niels remarked that we then might have to focus on YouTube, using search results from YouTube and not public BitTorrent trackers.
    • For a paper, focus should be on user-generated content?

Meeting 09-12-2010 @ 4:00pm in HB09.290

Attendees: Johan, Martha, Niels, Nitin, Raynor.
Topic: Near-Duplicate Detection

  • HIT design:
    • Display as much information as possible in a HIT, such that the user can group the data.
  • Dataset:
    • Don't use Tribler search results (not enough different sources, hence fewer duplicates).
    • Fetch more hits and metadata from the search results page.
    • Sample from all hits.

Meeting 24-11-2010 @ 3:00pm in HB10.320

Attendees: Carsten, Martha, Raynor.
Topic: CSDM2011 paper and MTurk experiments.

  • Run 10 HITs for a reward of $0.10:
    • 5 "Before HITs".
    • 5 "After HITs".
  • Run these HITs under 5 different conditions:
    • 3 different titles.
    • 2 question conditions.
  • Have each HIT be completed by 10 workers: (5+5)*(3+2)*10 = 500 assignments.
  • Run HITs in succession (not concurrently). Start with 5 Before HITs under 1 condition, monitor the uptake and predict the total time required to complete the full experiment.

Meeting 24-11-2010 @ 2:00pm in HB09 Small Meeting Room

Attendees: Niels, Nitin, Martha, Raynor.
Topic: Near-Duplicate Detection

What is a near-duplicate? Several possible candidates:

  • Contains "the same" content.
  • One person would not watch both.
    • One will do for everyone.
    • Functional duplication.
  • Non-junkie would want only *one* of these.

Do we want to detect serial media?

  • Tractable?
  • Tolerable?
  • Niels: might be hindering rather than helping if several "Series X" episodes are grouped+collapsed.

Concerns and restrictions:

  • Niels: computational complexity is important. Needs to be real-time when presenting the search results.

Current plan of attack:

  1. How to visualize search results?
    • Go to a public tracker e.g. TPB and fetch the top 100 popular torrents.
    • Use Tribler to get lists of search results.
    • Make a HIT (crowdsource term, "Human Intelligence Task") requiring users to cluster the lists.
    • Pilot the HIT.
    • Run the HIT.
  2. Use Turk (or Crowdflower) to understand the hard issues related to noise. Need to get typical examples of ND+SI.
    • Get Top 100 from TPB.
    • Use Tribler to search (try several obvious searches).
    • Create ND+SI groups from results.
    • Create plausible corruptions.
    • Test for user sensitivity on Turk.
    • Present the user with three or four different groupings + compare.
  3. Algorithm for identifying near-duplicates (ND) and serial items (SI):
    • Two-step approach: first use category information.
    • Rule based?
    • Machine learning?
  4. Create a distributed version.
  5. Simulate how this leads to network health.

Meeting 18-11-2010 @ 10:00am in HB10.320

  • CSDM2011: aim to get a short paper (<=4 pages) about crowdsourcing.
  • The tag cloud user study is not a piecemeal task; rather, it requires cognitive processing and an opinion from the worker. It is also different because P2P networks are outside the daily experience of the average Internet user. Finally, there is a risk that workers apply a matching strategy to substrings in the filenames to classify them (i.e. a literal or superficial strategy rather than one that reflects complete comprehension of the task).
    • We study the following research questions:
      • The impact of the title on the uptake and work quality.
      • The impact of sketch-like mockup on the uptake and work quality.
      • The impact of the free-text justification questions on the uptake and work quality.
    • We are also interested in the following aspects:
      • Optimal award level.
      • Difference between open and closed.
      • How many workers in the open condition do both the before and after HIT (this will answer whether recruitment is necessary).

Meeting 02-09-2010 @ 2:00pm in HB10.320

  • "Single glimps of understanding of what's in the system"
  • "Feeling that there's a network behind it"
  • "Shortens the time to understanding what you can find"
  • "Surrogate for experience with the system"
  • "Overspecification: right level of specific. of the query"

Meeting 19-08-2010 @ 5:00pm in HB09.030 / HB10.320

Current plan:

  • Work on a first version of "content overview"/"network buzz". Build a small prototype and then merge it into the native GUI main branch of Tribler.
  • Continue to work on mockups.
  • Short paper at ECIR 2011? Tag clouds in distributed systems: peer-by-peer variability.
    • Difference between peers
      • Idea: compare difference with searching/non-searching peers.
    • How close do we approximate global state?
      • What is global state? E.g. what to do with dead torrents?
    • "Fake idf"
    • Spam issues
    • Unbounded scalability

Meeting 26-07-2010 @ 3:30pm in HB09.030

By Johan:

  • Take all crawled ClickLog data.
  • Simply take existing visualisation techniques.
  • Focus: extract "tags" (terms) from ClickLog; spam prevention; tf-idf...

Current plan:

  1. Compare different interfaces using mockups:
    • Alphabetical term clouds
    • Term list versus term cloud
    • Visual weighting on/off
    • Single terms versus multi-terms (e.g., 2-term phrases like in DeeperWeb Search)
    • Multi-level navigation
  2. Algorithms:
    • Alpha peers
    • Channels
    • Reputation
    • Query patterns
  3. Identification of alpha peers.

Meeting 14-07-2010 @ 2pm in Shannon Room

Agenda: Discuss the feasibility of different routes for adding metadata enrichment to Tribler. Talk about the content of deliverable 7.1 and field trials.

  • Discussion of 7.1 deliverable postponed to 15-07.
  • Focus on the question whether tag clouds enhance user experience.
  • Get the documentation of ClickLog up to date and describe the setup of the ClickLog crawl experiment.
  • (Low priority) Check the duplicate term bug.
    • I found two peers P and Q. P logged a few of its own clicks. Peer Q stored the same information about these clicks performed by P, but with different term_order values for certain entries. The bug appears to be in the propagation of ClickLog data.

Meeting 01-07-2010 @ 11am in HB 09.030

Agenda: Outline of Tribler x PetaMedia project

Metadata enrichment approaches:

  • Use ClickLog for evaluation of tags. Task: collect ClickLog information and perform a manual assessment. Generate keywords/tags from a swarm name (using techniques like tf-idf and TPCW) and compare these tags with ClickLog terms (see the sketch after this list).

    • Note: Today (02-07), Johan suggested to also consider term position. For example, certain terms often appear at the beginning or at the end of a swarm name.
    • Question: How exactly are we going to use this information?
  • How and where can we use tag clouds? Current consensus is to use tag clouds to represent a global view of popular content, e.g. search queries discovered through ClickLog.

    • Question: Previously the idea was to have a tag cloud/pond for a selected swarm. Will it still be useful to display a list of (possibly frequently used?) tags for a selected swarm?
  • Duplicate filtering: Improvement of search results. Collapse 100+ dupes and find 1 representative item name. Related paper: Multimodal video copy detection applied to social media.
    • Question: The paper discusses video and audio analysis. *If* we are going to apply this to Tribler, who would be responsible for generating the video/audio descriptors and how would they be spread? Can we enforce the integrity of these descriptors?
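
A minimal sketch of the first approach above: rank a swarm's extracted terms by tf-idf over the locally known swarm names and compare the top terms against the ClickLog query terms that led to clicks on that swarm (the overlap measure and all names here are assumptions, not existing Tribler code):

```python
import math
from collections import Counter

def tfidf_tags(terms, all_term_lists, k=5):
    """Rank one swarm's extracted terms by tf-idf over all known swarm names."""
    docs = [set(t) for t in all_term_lists]
    df = Counter(term for doc in docs for term in doc)
    tf = Counter(terms)
    scores = {t: tf[t] * math.log(float(len(docs)) / (1 + df[t])) for t in tf}
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

def clicklog_overlap(generated_tags, clicklog_queries):
    """Fraction of ClickLog query terms (for this swarm) that the generated
    tags cover -- a crude stand-in for the proposed manual assessment."""
    query_terms = {term for query in clicklog_queries for term in query.split()}
    if not query_terms:
        return 0.0
    return len(query_terms & set(generated_tags)) / float(len(query_terms))
```

Because swarm titles are short, the term frequency is almost always 1, so in practice the ranking is driven by idf, which echoes the term-ranking concern noted earlier in this page.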
