TREC Microblog 2015 competition

Authors: Anh Dang, Raheleh Makki Niri, Aminul Islam, Abidalrahman Moh’d

TREC Competition: https://github.com/lintool/twitter-tools/wiki/TREC-2015-Track-Guidelines

Overview

In this work, we explore different strategies for matching tweets to 250 user interest profiles so that users receive tweets they may be interested in. Observing that a potentially interesting topic may appear directly or indirectly in a user profile, we propose a novel approach for determining whether two tweets are semantically duplicated, using Wikipedia as external knowledge together with corpus-based word semantic relatedness, and we use this approach to assign a tweet to a profile. Two tweets are semantically duplicated if their contents are both lexically (TF-IDF) and semantically similar [1]. For data processing, we filter out stop words, non-English tweets, retweets, and semantically duplicated tweets. For semantic similarity, each tweet and each profile is represented as a bag of concepts obtained through Wikipedia entity linking, and we compute the semantic similarity between the two bags of concepts [2].

For the automatic run of Scenario A, we compute the similarity score between a tweet's content and each of the 250 profiles, and assign the tweet to the profile with the highest score, provided that the tweet is not semantically duplicated with any tweet previously chosen for that profile. For the automatic run of Scenario B, we collect, for each profile, all tweets related to that profile. At the end of the day, we cluster these tweets into 100 clusters and select a representative tweet from each cluster as the result.

For the manual runs, we manually constructed a list of discriminative features for each profile before the evaluation started. Each feature in this list has a weight that indicates its importance for that profile. For the Scenario A manual run, we use these weights to bias our semantic-duplicate calculation. For the Scenario B manual run, at the end of each day of the evaluation, we use Lucene to index the tweets posted in the last 24 hours.
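As a rough illustration of the Scenario A automatic run described above, the sketch below assigns a tweet to its best-matching profile unless it duplicates a tweet already chosen for that profile. It is not the authors' implementation: plain bag-of-words cosine similarity stands in for both the TF-IDF lexical score and the Wikipedia-concept semantic score, and the names `bag`, `cosine`, `assign`, and the value of `DUP_THRESHOLD` are assumptions made for this example.

```python
import math
from collections import Counter

DUP_THRESHOLD = 0.8  # hypothetical cut-off; the source does not state a value


def bag(text):
    """Stand-in representation: lowercase whitespace tokens with counts.
    (The source uses TF-IDF terms and Wikipedia concepts instead.)"""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign(tweet, profiles, chosen):
    """Return the id of the highest-scoring profile, or None when the tweet
    duplicates one already chosen for that profile.

    profiles: dict mapping profile id -> profile text (must be non-empty)
    chosen:   dict mapping profile id -> list of tweets already selected;
              updated in place when the tweet is accepted
    """
    vec = bag(tweet)
    best = max(profiles, key=lambda pid: cosine(vec, bag(profiles[pid])))
    for old in chosen.get(best, []):
        if cosine(vec, bag(old)) >= DUP_THRESHOLD:
            return None  # duplicate of a tweet already pushed to this profile
    chosen.setdefault(best, []).append(tweet)
    return best
```

For example, with `profiles = {"p1": "electric cars battery", "p2": "football world cup"}` and an empty `chosen` dict, calling `assign("new electric cars battery tech", profiles, chosen)` selects `"p1"`, and calling it again with the same tweet returns `None` because the tweet is now a duplicate for that profile.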
To retrieve relevant tweets for each profile from this index, we create a query from that profile's list of features and use query-level boosting to set a boost for each feature based on its weight. We then use the Dirichlet prior function to retrieve a list of tweets ranked by relevance score. Starting from the top of the ranked list, we include a tweet in the result only if it meets both of the following conditions: (1) it is not a near-duplicate of a tweet already included in the result list, and (2) its score is higher than a predefined threshold, which prevents retrieving non-relevant tweets that merely share some key terms with the profile. If a tweet fails either condition, we ignore it and move to the next tweet in the ranked list. This continues until the result contains 100 tweets or the ranked list is exhausted.
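The selection loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the Jaccard-based near-duplicate test, and its `cutoff` value are hypothetical, and the ranked list is assumed to arrive as `(tweet, score)` pairs already sorted by descending score (for instance, from the Lucene retrieval step).

```python
def jaccard_near_duplicate(a, b, cutoff=0.7):
    """Hypothetical near-duplicate test: token-set Jaccard overlap >= cutoff."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return bool(union) and len(ta & tb) / len(union) >= cutoff


def select_tweets(ranked, is_near_duplicate, threshold, limit=100):
    """Walk the ranked list top-down and keep a tweet only if its score is
    above the threshold and it is not a near-duplicate of a kept tweet.

    ranked: list of (tweet_text, score) pairs, highest score first
    """
    selected = []
    for tweet, score in ranked:
        if len(selected) >= limit:
            break  # result list is full
        if score <= threshold:
            continue  # score not higher than the threshold
        if any(is_near_duplicate(tweet, kept) for kept in selected):
            continue  # too similar to something already included
        selected.append(tweet)
    return selected
```

Because the list is sorted by score, the loop could also stop at the first tweet that falls below the threshold; the sketch keeps the "skip and move on" behaviour to mirror the description.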

[1] Islam, A., Milios, E.E., Keselj, V.: Text similarity using Google tri-grams. In: Kosseim, L., Inkpen, D. (eds.): Canadian Conference on AI. Volume 7310 of Lecture Notes in Computer Science, Springer (2012) 312-317.

[2] Trani, S., Ceccarelli, D., Lucchese, C., Perego, R.: Dexter 2.0 – an open source tool for semantically enriching data. In: Proceedings of the 13th International Semantic Web Conference, Riva del Garda, Italy (October 2014).