The internet is a messy place, and that is especially true of the blogosphere, which is full of spam blogs, link farms, comment spam, and other data that our users do not want. At Scout Labs we do quite a bit of analysis on the data we index and use a number of tools to keep the quality of text content high. Our main tool in fighting spam is machine learning, which we use to identify “spammy” documents and suppress them in our results.
Our first goal has been to identify and suppress keyword spam. This narrow focus has allowed us to make rapid progress, and in initial tests our spam model catches over 70% of keyword spam (recall) while misclassifying less than 5% of non-spam as spam (that is, a false-positive rate below 5%). We now suppress results that appear as real blog posts on real blogs but look like this:

How do we do it? We use a set of processes and algorithms called machine learning to build a spam predictor. We start with a large set of documents that people have identified as spam or non-spam. We then use machine learning to create a program (or model) that predicts whether a person would identify a document as spam or non-spam. We created this initial set of documents either by judging documents ourselves or by using Amazon’s Mechanical Turk.
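To make the idea concrete, here is a minimal sketch of training a model from human-labeled documents and then measuring recall and false-positive rate on a few held-out examples. It is not our production classifier: the choice of naive Bayes, the whitespace tokenizer, the class and function names, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately naive tokenizer)."""
    return text.lower().split()


class NaiveBayesSpamModel:
    """Multinomial naive Bayes with add-one smoothing. Illustrative only."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, labeled_docs):
        """labeled_docs: iterable of (text, label) with label 'spam' or 'ham'."""
        for text, label in labeled_docs:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def _log_score(self, tokens, label):
        # log P(label) + sum of log P(token | label), smoothed so unseen words
        # do not zero out the probability.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for tok in tokens:
            score += math.log((self.word_counts[label][tok] + 1) / denom)
        return score

    def predict(self, text):
        tokens = tokenize(text)
        return max(("spam", "ham"), key=lambda lbl: self._log_score(tokens, lbl))


def recall_and_false_positive_rate(model, labeled_docs):
    """Recall = spam caught / all spam; FPR = non-spam flagged / all non-spam.

    Assumes labeled_docs contains at least one document of each class.
    """
    spam_total = spam_caught = ham_total = ham_flagged = 0
    for text, label in labeled_docs:
        predicted = model.predict(text)
        if label == "spam":
            spam_total += 1
            spam_caught += predicted == "spam"
        else:
            ham_total += 1
            ham_flagged += predicted == "spam"
    return spam_caught / spam_total, ham_flagged / ham_total


# Toy, made-up judgments standing in for human-labeled documents.
training = [
    ("cheap pills cheap pills buy now cheap", "spam"),
    ("buy cheap watches buy now discount discount", "spam"),
    ("we visited the farmers market this weekend", "ham"),
    ("my review of the new camera after a week of use", "ham"),
]
held_out = [
    ("discount pills buy now", "spam"),
    ("the camera review", "ham"),
]

model = NaiveBayesSpamModel()
model.train(training)
recall, fpr = recall_and_false_positive_rate(model, held_out)
print(model.predict("buy cheap pills now"), recall, fpr)
```

With a real data set the held-out documents would number in the thousands, and the two figures would be the kind of recall and false-positive numbers quoted above.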
When we import documents, we save the judgment of the machine together with the document in our system, which allows us to do things like filter them from our search results or simply rank them lower than non-spam results, graph the number of spam vs. non-spam results, and do any number of other interesting things. For instance, we might discover new spam blogs by looking at what sources spammy documents come from.
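As a rough sketch of what a stored judgment makes possible, the snippet below filters spam out of a result list, ranks spam below non-spam, and computes the share of spam per source. The `IndexedDocument` record, its fields, and the example URLs are hypothetical and are not our actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class IndexedDocument:
    url: str
    source: str       # hypothetical: the blog the document came from
    relevance: float  # hypothetical search relevance score
    is_spam: bool     # the model's judgment, saved at import time


def visible_results(docs):
    """Filter: drop anything the model judged to be spam."""
    return [d for d in docs if not d.is_spam]


def demote_spam(docs):
    """Rank: non-spam first, then by descending relevance within each group."""
    return sorted(docs, key=lambda d: (d.is_spam, -d.relevance))


def spam_share_by_source(docs):
    """Fraction of each source's documents judged spam; high values hint at spam blogs."""
    totals, spam = Counter(), Counter()
    for d in docs:
        totals[d.source] += 1
        spam[d.source] += d.is_spam
    return {source: spam[source] / totals[source] for source in totals}


docs = [
    IndexedDocument("http://a.example/1", "a.example", 0.9, True),
    IndexedDocument("http://a.example/2", "a.example", 0.4, True),
    IndexedDocument("http://b.example/1", "b.example", 0.7, False),
    IndexedDocument("http://b.example/2", "b.example", 0.8, False),
]

print([d.url for d in visible_results(docs)])
print([d.url for d in demote_spam(docs)])
print(spam_share_by_source(docs))
```

Because the judgment is saved once at import, each of these views is a cheap query over stored data rather than a fresh run of the model.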
At Scout Labs we are working hard to make our algorithms better and get closer to human quality, but we understand that machines get it wrong at times. With the release of our new “show spam” feature, users can see which text results we classified as spam and re-classify them as legitimate for their account, if they so choose. With the concurrent release of the “mark as spam” feature, we likewise enable users to make spam disappear from results, so users can improve the data everyone on their team is seeing.
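One simple way to picture these two features is as a per-account override layered on top of the model's judgment. The sketch below assumes that design; the function name, the dictionaries, and the document IDs are invented for illustration and do not describe how our system stores this.

```python
def effective_label(doc_id, model_labels, account_overrides):
    """A user's choice for their account wins over the model's judgment."""
    return account_overrides.get(doc_id, model_labels[doc_id])


# What the model decided at import time (hypothetical document IDs).
model_labels = {"d1": "spam", "d2": "ham", "d3": "ham"}

# One account's corrections: d1 rescued via "show spam", d3 hidden via "mark as spam".
account_overrides = {"d1": "ham", "d3": "spam"}

shown = [
    doc_id
    for doc_id in model_labels
    if effective_label(doc_id, model_labels, account_overrides) == "ham"
]
print(shown)
```

Keeping the overrides separate from the model's labels means one account's corrections never change what another account sees.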

Best of all, these features help us create a focused set of training data that improves our spam classifier in the next round of training. What we especially like about this approach is that, over time, our system will be increasingly trained to identify as spam exactly what Scout Labs users think is spam, because we are learning directly from them. So if you’re debating whether it’s worth it to mark a result as spam, just think of the benefits you and every other user will reap when you do, and click that button!
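Here is a hedged sketch of how user feedback could be folded into the next training set. The majority-vote rule, the tie handling, and every name below are assumptions chosen for the example, not a description of our pipeline.

```python
from collections import Counter, defaultdict


def labels_from_feedback(feedback_events):
    """Resolve (doc_id, label) votes to one label per document by majority.

    Documents whose top two labels are tied are skipped as ambiguous.
    """
    votes = defaultdict(Counter)
    for doc_id, label in feedback_events:
        votes[doc_id][label] += 1

    resolved = {}
    for doc_id, counts in votes.items():
        (top_label, top_n), *rest = counts.most_common()
        if rest and rest[0][1] == top_n:
            continue  # tie between labels: leave this document out
        resolved[doc_id] = top_label
    return resolved


def next_training_set(current_labels, feedback_events):
    """User-derived labels replace or extend the existing labels."""
    updated = dict(current_labels)
    updated.update(labels_from_feedback(feedback_events))
    return updated


# Hypothetical existing labels and feedback clicks.
current_labels = {"d1": "spam", "d3": "ham", "d4": "ham"}
feedback_events = [
    ("d1", "ham"), ("d1", "ham"), ("d1", "spam"),  # majority says legitimate
    ("d2", "spam"),                                # a newly reported document
    ("d3", "spam"), ("d3", "ham"),                 # tie, so ignored
]

print(next_training_set(current_labels, feedback_events))
```

The updated labels would then be used to retrain the model in the next round, so every click nudges the classifier toward what users themselves consider spam.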
Spammers are always devising new tricks, so the task of suppressing spam is never finished, but we are getting some really promising feedback on how the quality of our data stacks up against the competition (“I’ve been trying to get good clean results out of ‘competitive product’ for months now, and with Scout Labs the right stuff just pops out.”). However, we really want to hear how you think we are doing, so please drop us a line, and be sure to hit that “mark as spam” button liberally!