The Bag of Communities: Identifying Abusive Behavior Online with Preexisting Internet Data


論文アブストラクト:Since its earliest days, harassment and abuse have plagued the Internet. Recent research has focused on in-domain methods to detect abusive content and faces several challenges, most notably the need to obtain large training corpora. In this paper, we introduce a novel computational approach to address this problem called Bag of Communities (BoC)---a technique that leverages large-scale, preexisting data from other Internet communities. We then apply BoC toward identifying abusive behavior within a major Internet community. Specifically, we compute a post's similarity to 9 other communities from 4chan, Reddit, Voat and MetaFilter. We show that a BoC model can be used on communities "off the shelf" with roughly 75% accuracy---no training examples are needed from the target community. A dynamic BoC model achieves 91.18% accuracy after seeing 100,000 human-moderated posts, and uniformly outperforms in-domain methods. Using this conceptual and empirical work, we argue that the BoC approach may allow communities to deal with a range of common problems, like abusive behavior, faster and with fewer engineering resources.


既存のコミュニティ内にある不適切表現のラベル付きデータ を用いて別のコミュニティの不適切表現を識別できる か実験. コミュニティ間の発言の類似度を求めたり、新規投稿データを随時取り込むことで精度向上.