ClickLogs: Goals and Research questions
How relevant is a document for a given search term? This is the central question when we try to answer a user's search query. In a decentralized setting like when searching torrents, we mainly have to work with metadata like titles: The actual content is not available at search-time, so it cannot be analyzed to better predict relevance. However, user feedback is a precious resource in estimating relevance. Particularly implicit feedback, i.e. feedback which the user provides simply by using a system, can be a great benefit because it can be collected cheaply and at a large scale. This package aims to improve search quality in Triber by collecting and processing a particular kind of implicit feedback: click log data.
Click logs contain information about which document was selected after searching for which keyword, and at which position in the result list the selected document was positioned. This data can be used to improve future search result lists; the easiest scenario would be a case in which we assume that for a given keyword, it's always the same torrent that's selected although it's displayed at rank #16. A simple way to improve search results would simply be to always put this torrent on the top of the list for the particular keyword. Of course, such situations will turn out more complicated in reality, and how to optimally address them is one of the questions of this research project.
The goals of this project are three-fold:
- Improve the search experience for Tribler users by reranking search results using clicklog data.
- Explore and compare reranking algorithms using the obtained data.
- Gain experience with the dissemination and processing of user-generated metadata; lay foundations for a distributed tagging system.
1. Improvement of Tribler Search
Find what you're looking for, and have a pleasant experience while you're at it - this should be the offer of Tribler's search engine, particularly towards average users which do not want to browse many pages of results or become particularly smart about search terms. The aim of the clicklog approach is to support both search experience and quality:
1.1. Integrating reranking mechanisms into Tribler Search
The main step is to rerank search results based on latent information in the clicklog data, as explained above.
1.2. "Suggest" Feature using database of frequent keywords
Another use of clicklog data is a search field offering "auto-complete" or rather "suggest" functionality; while the user it typing, the program offers possible extensions to the already typed letters.
2. Exploring Reranking Algorithms
We are aiming to examine the influence of different reranking algorithms / of using reranking at all. Therefore, it is planned to randomly pick a reranking algorithm at search time and save it with the clicklog data. If we can accumulate enough data, we should be able to make a statement about the use of reranking by regarding, in the simplest case, the average position of clicked results; if clicked results are significantly higher for reranked results than for un-reranked results, reranking should be considered successful.
Different reranking algorithms inlude
- No reranking
- Rerank by number of cooccurrences
- Rerank by TfIdf
3. User-Generated Meta-Data
3.1 Spam Issues
Tribler's server-free infrastructure creates a different situation than for most other clicklog applications; normally, clicklog analysis is performed by the owner of a centralized search engine; with Tribler, the only clicks really observed are those of the local user; other clicks are only received via the BuddyCast protocol. Since these transmissions might be faked, it has to be examined in how far faked clicklogs could be used to, e.g., promote spam, and what counter-measures are appropriate. * A very simple approach for boosting a torrent's search position might be to transmit information about how it was selected for just about any keyword the spammer would like it to be associated with. However, the TfIdf weighting scheme introduced in the previous section counters such an attack because torrent/keyword cooccurrences are linearly weighted by the number of keywords associated with a torrent.
How do these issues scale, and which additional counter-measures are required when we are facing explicitly given tags in the future?
3.2 Privacy Issues
In the current phase, the goal is to overcome the technical difficulties in distributing and using clicklog data, i.e., a technological one. However, on a conceptual level, we have to ask ourselves under which situations clicklog data might be harmful in terms of privacy, and which countermeasures could be taken. Such questions will be explicitly addressed in a later step of the project.
Contact: Nicolas Neubauer, Neural Information Processing Group, TU Berlin [neubauer@cs.tu-berlin.de]
Overall Todos
- Create database structure to store relevant data (search terms used for and position of own and peers' downloads, from here on "clicklog data")
- done
- Extend interface to store own clicklog data
- first version running
- Extend BuddyCast Protocol
- Active: Attach clicklog to each transmitted preference hash
- Passive: Receive and store clicklog data
- Use clicklog data
- AutoComplete in search dialog
- Rerank search results
- Track distributed clicklog data
1. Database
Database Structure
ALTER TABLE MyPreference ADD COLUMN click_position INTEGER DEFAULT -1;
ALTER TABLE MyPreference ADD COLUMN reranking_strategy INTEGER DEFAULT -1;
ALTER TABLE Preference ADD COLUMN click_position INTEGER DEFAULT -1;
ALTER TABLE Preference ADD COLUMN reranking_strategy INTEGER DEFAULT -1;
CREATE TABLE Search (
peer_id INTEGER DEFAULT 0,
torrent_id INTEGER DEFAULT 0,
term_id INTEGER DEFAULT 0,
term_order INTEGER DEFAULT 0
);
CREATE INDEX idx_search_term ON Search (term_id);
CREATE INDEX idx_search_torrent ON Search (torrent_id);
CREATE TABLE Term (
term_id INTEGER DEFAULT 0,
term VARCHAR(255),
PRIMARY KEY (term_id)
);
CREATE INDEX idx_terms_term ON Term(term);
Notes & questions
- A note on indices: Maybe I'm a bit too generous with those, but I think we need them unless we want to keep the tables in memory...
- Both own and others' search terms are stored in Search. I use a peer_id of 0 for identifying the actual user's entries.
- This structure is now automatically created when CURRENT_DB_VERSION<2 *. Torrents are removed from MyPreference when the torrent itself is removed,- this means the corresponding search terms and position info should also be removed? Or might it make sense to keep this info anyway, somewhere else?
2. Store clicks
- On starting a download, store now additionally click_position and the currently used reranking strategy into MyPreference.
- Also get the current query terms and store them into Search and Terms
Notes
- It just seems that sometimes updateSelection is not called and the panels don't know their click position. Is it possible for a panel to be doubleclicked without first being selected?
- By the way: By storing clicklogs in Preferences, we're throwing away any clicklog data leading to YouTube, as these dont show up there, right? But if I got it right, this is analogous to BuddyCast which works on MyPreference?, only taking into account torrent downloads, right?
Short overview of code changes
- package Main
- file vwxGUI.standardGrid:
- standardGrid.updateSelection
- tell panel where it stands
pan.select(rowIndex,colIndex, self.standardPager.currentPage, self.cols, self.currentRows)
- tell panel where it stands
- standardGrid.updateSelection
- file vwxGUI.filesItemPanel
- FilesItemPanel.select
- have panel store its position in torrent data
if pageIndex>-1: panelsPerPage = panelsPerRow * rowsPerPage self.data["click_position"] = pageIndex * panelsPerPage + rowIndex * panelsPerRow + colIndex
- have panel store its position in torrent data
- FilesItemPanel.select
- file vwxGUI.standardDetails
- standardDetails.download
- create clicklog dict to pass on to MainFrame.startDownload
clicklog={"keywords": self.guiUtility.torrentsearch_manager.searchkeywords[self.mode]} if "click_position" in torrent: clicklog["click_position"] = torrent["click_position"] # Api download d = self.utility.frame.startDownload(torrent_filename,destdir=dest, clicklog=clicklog)
- create clicklog dict to pass on to MainFrame.startDownload
- standardDetails.download
- file vwxGUI.MainFrame
- startDownload
- pass on clicklog data to Core by getting the instance of MyPreferenceHandler and calling, after starting the download, the new method addClicklogToMyPreference(self, infohash, clicklog_data, commit=True)
- startDownload
- file vwxGUI.standardGrid:
- package Core
- file CacheDB.SqliteCacheDBHandler
- MyPreferenceDBHandler
- new method addClicklogToMyPreference(self, infohash, clicklog_data, commit=True)
- PreferenceDBHandler
- addPreference:
- use click_position and search terms if available;
- i.e., change signature, handle presence/absence of these values, if present, use different sql
- addPreferences
- allow for prefs to contain click_position and search terms as well
- addPreference:
- New class TermDBHandler
def getTermID(self, term): """returns the ID of term in table Term; creates a new entry if necessary""" def insertTerm(self, term, commit=True): """creates a new entry for term in table Term""" def getTerm(self, term_id): """returns the term for a given term_id""" def getTermsStartingWith(self, beginning): """returns all terms starting with beginning""" - New class SearchDBHandler
class SearchDBHandler(BasicDBHandler): def storeKeywords(self, peer_id, torrent_id, terms, commit=True): "creates a single entry in Search with peer_id and torrent_id for every term in terms" def getNumTermsPerTorrent(self, torrent_id): """returns the number of terms associated with a given torrent""" def getNumTorrentsPerTerm(self, term_id): """returns the number of torrents stored with a given term.""" def getNumTorrentTermCooccurrences(self, term_id, torrent_id): """returns the number of times a torrent has been associated with a term""" def getRelativeTermFrequency(self, term_id, torrent_id): """returns the relative importance of a term for a torrent This is basically tf/idf term frequency tf = # keyword used per torrent/# keywords used with torrent at all inverse document frequency = # of torrents associated with term at all normalization in tf ensures that a torrent cannot get most important for all keywords just by, e.g., poisoning the db with a lot of keywords for this torrent idf normalization ensures that returned values are meaningful across several keywords """
- MyPreferenceDBHandler
- file CacheDB.SqliteCacheDBHandler
3. Extend BuddyCast Protocol
Active
Passive
TODO (I'm aware that this is where the actual work lies...)
4. Using the data
AutoComplete
- Implement minimum edit distance search
- Integrate into interface (me?)
- Either simply propose one keyword (completing it in the text field with the completion marked)
- Or replace text field with a drop down list, proposing more than one possible completion
Reranking
Implement heuristics for improving rankings. Very generally, rank documents higher if we know peers found them using the current search terms.
- By absolute times found using the current search term by peers, or
- By times found using current search terms by peers, normalized by number of peers who have this torrent at all to avoid "Harry Potter phenomena" and discount spam attacks claiming their file to be relevant for any search term
- How would position of previous clicks come into play here? Would it?
- Could we use the information about the user search terms here as well?
In order to be able to make any statement about the effect and quality of reranking algorithms, we should swap them randomly (even if we only use one actual reranking strategy, we should swap it with strategy 0, the current, non-existing one).
We could do more elaborate learning if, for a click on position 3, we would also store the non-clicked torrents. I guess this is out of the question traffic-wise..?
Code-wise, this means writing alternative implementations of Core.Search.KeywordSearch and pass them on to Core.Search.SearchManager. Preferably, SearchManager would receive a couple of instances to choose from, each with a "getRerankingStrategyID()" or so, choose one at random at every search (while evaluating) and make the choice readable for later logging.
5. Tracking the data
- There's a lot to be said about
- which keywords were used for which torrent,
- which keywords were used most frequently (this brings to my attention that the family filter would probably have to check the autocomplete...)
- Of course, we would like to see if the reranking improves position. This is where the reranking_strategy comes into play. There should be some significant differences in average click position between reranked and non-reranked results (or between the different reranking strategies). Besides the possibility for fancy graphs, this allows for an informed decision which reranking strategy to finally use in a later version of Tribler.
- Actually, any analysis we might perform on aggregated clicklog data could also be a funky analysis in the regular Tribler client, performed on the peer data available to the user. Just as a side node.
