TY - GEN
T1 - Efficient clustering of short messages into general domains
AU - Tsur, Oren
AU - Littman, Adi
AU - Rappoport, Ari
PY - 2013
Y1 - 2013
N2 - The ever increasing activity in social networks is mainly manifested by a growing stream of status updating or microblogging. The massive stream of updates emphasizes the need for accurate and efficient clustering of short messages on a large scale. Applying traditional clustering techniques is both inaccurate and inefficient due to sparseness. This paper presents an accurate and efficient algorithm for clustering Twitter tweets. We break the clustering task into two distinctive tasks/stages: (1) batch clustering of user annotated data, and (2) online clustering of a stream of tweets. In the first stage we rely on the habit of 'tagging', common in social media streams (e.g. hashtags), thus the algorithm can bootstrap on the tags for clustering of a large pool of hashtagged tweets. The stable clusters achieved in the first stage lend themselves for online clustering of a stream of (mostly) tagless messages. We evaluate our results against gold-standard classification and validate the results by employing multiple clustering evaluation measures (information theoretic, paired, F and greedy). We compare our algorithm to a number of other clustering algorithms and various types of feature sets. Results show that the algorithm presented is both accurate and efficient and can be easily used for large scale clustering of sparse messages as the heavy lifting is achieved on a sublinear number of documents.
AB - The ever increasing activity in social networks is mainly manifested by a growing stream of status updating or microblogging. The massive stream of updates emphasizes the need for accurate and efficient clustering of short messages on a large scale. Applying traditional clustering techniques is both inaccurate and inefficient due to sparseness. This paper presents an accurate and efficient algorithm for clustering Twitter tweets. We break the clustering task into two distinctive tasks/stages: (1) batch clustering of user annotated data, and (2) online clustering of a stream of tweets. In the first stage we rely on the habit of 'tagging', common in social media streams (e.g. hashtags), thus the algorithm can bootstrap on the tags for clustering of a large pool of hashtagged tweets. The stable clusters achieved in the first stage lend themselves for online clustering of a stream of (mostly) tagless messages. We evaluate our results against gold-standard classification and validate the results by employing multiple clustering evaluation measures (information theoretic, paired, F and greedy). We compare our algorithm to a number of other clustering algorithms and various types of feature sets. Results show that the algorithm presented is both accurate and efficient and can be easily used for large scale clustering of sparse messages as the heavy lifting is achieved on a sublinear number of documents.
UR - http://www.scopus.com/inward/record.url?scp=84900417810&partnerID=8YFLogxK
U2 - 10.1609/icwsm.v7i1.14420
DO - 10.1609/icwsm.v7i1.14420
M3 - Conference contribution
AN - SCOPUS:84900417810
SN - 9781577356103
T3 - Proceedings of the 7th International Conference on Weblogs and Social Media, ICWSM 2013
SP - 621
EP - 630
BT - Proceedings of the 7th International Conference on Weblogs and Social Media, ICWSM 2013
PB - Association for the Advancement of Artificial Intelligence
T2 - 7th International AAAI Conference on Weblogs and Social Media, ICWSM 2013
Y2 - 8 July 2013 through 11 July 2013
ER -