Blog

March 2009 Archives

At the end of last week we released a new feature that allows users to show or hide spam results for a given search. Now you can either see your blog & news search results with spam suppressed (the clean view), or opt to see the ones Scout Labs or other members of your team have marked as spam (the dirt). Whether you remove the result entirely or mark it as spam, the mention will not be included in graph data.

Picture 5.png

Depending on your search, spam volume could be either a trickle (California Bureau of Land Reclamation) or a flood (Britney). Warning: A lot of the spam content for ANY search is pretty “unsafe” for work!

Of course most of the spam we find is just annoying. Take a look at this safe-to-blog-about spam example. For a general Nike search, there are ~2800 legitimate results and ~370 spam results. Check out the spam results. They look pretty darn spammy to us:

Picture 9.png

This feature is especially useful for:

  • Marketers who need to know everywhere their brand is being mentioned, including the dubious mentions, but don’t need the spam mixed in with their “real” results

  • Search marketers trying to figure out when their search traffic is being hijacked by sploggers, so they can bid down campaigns

  • Analysts trying to clean up results to make their graph numbers as accurate as possible (put all those sketchy results in the “spam” bucket and you won’t see them in graph data)

  • PR people who need to know every place their client or client’s brand is being mentioned, particularly the dubious mentions, in case these mentions require action to address

Of course, part of the reason we released this functionality is that we have received a lot of positive feedback about our spam reduction measures, so we really want to keep improving them. The more results our users mark as spam, the better our training data will be.

In next week’s release expect to see:

  • 6 months of data available in the application (sentiment processing will still cover only the last 3 months, alas). If you want more, let us know at support <at> scoutlabs <dot> com or via Twitter, so we have an idea how important more sentiment history is to you

  • Agency cobranding for workspaces: the ability to put up two logos, a main one in the top left of the header, a secondary one in the top right of the header, and add some custom text to the homepage. This is for all you agencies out there who are using the application on behalf of a client and want to share the workspace with them

In general we release code with new features every week. It’s not always a major feature, and over these past few weeks since launch our releases have mostly been commerce and account features that work through launch kinks (More of you want to pay via invoice than we thought! Who knew credit cards issued in Australia wouldn’t like a 1 cent validation!). But there’s a bunch of more interesting content and features coming up soon. More on our roadmap forthcoming.

The internet is a messy place, and that is especially true for the blogosphere, which has plenty of spam blogs, link farms, comment spam, and other data that our users do not want. At Scout Labs we do quite a bit of analysis on the data we are indexing in our system and use a number of tools to keep the quality of text content high. Our main tool in fighting spam is machine learning, which we use to identify “spammy” documents and suppress them in our results.

Our first goal has been to identify and suppress keyword spam. This narrow focus has allowed us to make rapid progress, and in initial tests our spam model catches over 70% of keyword spam (recall) while misclassifying less than 5% of non-spam as spam (that is, a false positive rate under 5%). We now suppress results that appear as real blog posts on real blogs but look like this:

KeywordSpam.png
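For readers who want to see how the two numbers above are calculated, here is a minimal sketch. The counts are made up for illustration and are not Scout Labs test data; the function names are our own.

```python
def recall(spam_caught, spam_missed):
    """Share of all real spam that the model flagged as spam."""
    return spam_caught / (spam_caught + spam_missed)


def false_positive_rate(legit_flagged, legit_passed):
    """Share of all legitimate documents that the model wrongly flagged as spam."""
    return legit_flagged / (legit_flagged + legit_passed)


# Hypothetical test set: 1,000 spam documents and 1,000 legitimate ones.
print(recall(spam_caught=720, spam_missed=280))
print(false_positive_rate(legit_flagged=40, legit_passed=960))
```

A model with these counts would meet both targets: it catches more than 70% of the spam and flags fewer than 5% of the legitimate posts.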

How do we do it? We use a set of processes and algorithms called machine learning to build a spam predictor. We start out with a large set of documents that people have identified as spam or non-spam. We then use machine learning to create a program (or model) that predicts whether a person would identify a document as spam or non-spam. We created this initial set of documents either by judging documents ourselves or by using Amazon’s Mechanical Turk.
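To make the idea concrete, here is a toy sketch of learning a spam predictor from human-labeled documents. It uses a tiny Naive Bayes classifier, which is our choice for illustration only; this post does not say which algorithm the production system uses, and the training texts and labels below are invented.

```python
import math
from collections import Counter

LABELS = ("spam", "not_spam")


def tokenize(text):
    return text.lower().split()


def train(labeled_docs):
    """Build a model from (text, label) pairs judged by people.

    Assumes at least one example of each label.
    """
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["not_spam"])
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def predict(model, text):
    """Return the label a person would most likely give this text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label in LABELS:
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        # Start from how common the label is, then add evidence word by word.
        score = math.log(model["doc_counts"][label] / total_docs)
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)


training_set = [
    ("cheap nike shoes cheap nike outlet buy cheap nike", "spam"),
    ("buy cheap watches cheap pills buy now", "spam"),
    ("i ran my first marathon in my new nike shoes", "not_spam"),
    ("our review of the new running shoes after a month of use", "not_spam"),
]
model = train(training_set)
print(predict(model, "buy cheap nike cheap shoes"))
print(predict(model, "my review of running a marathon"))
```

A real training set would contain many thousands of judged documents and richer features than single words, but the shape is the same: labeled examples in, a predictor out.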

When we import documents, we save the judgment of the machine together with the document in our system, which allows us to do things like filter them from our search results or simply rank them lower than non-spam results, graph the number of spam vs. non-spam results, and do any number of other interesting things. For instance, we might discover new spam blogs by looking at what sources spammy documents come from.
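Here is a rough sketch of what storing the judgment alongside the document makes possible. The `Document` fields and function names are hypothetical and chosen for illustration; they are not our actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Document:
    url: str
    source: str        # the blog or site the document came from
    relevance: float   # search relevance score
    is_spam: bool      # the machine's judgment, saved at import time


def clean_view(docs):
    """Filter spam out of the results entirely."""
    return [d for d in docs if not d.is_spam]


def demoted_view(docs):
    """Keep everything, but rank spam below all non-spam results."""
    return sorted(docs, key=lambda d: (d.is_spam, -d.relevance))


def spam_counts(docs):
    """Numbers for a spam vs. non-spam graph."""
    n_spam = sum(d.is_spam for d in docs)
    return {"spam": n_spam, "non_spam": len(docs) - n_spam}


def suspicious_sources(docs, min_spam_docs=2):
    """Sources that produced several spammy documents may be spam blogs."""
    counts = Counter(d.source for d in docs if d.is_spam)
    return [source for source, n in counts.most_common() if n >= min_spam_docs]
```

The threshold `min_spam_docs` is an arbitrary assumption; in practice one would also look at what share of a source's documents are spam, not just the raw count.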

At Scout Labs we are working hard to make our algorithms better and get closer to human quality, but we understand that machines get it wrong at times. With the release of our new “show spam” feature, users can see what text results we classified as spam and re-classify them as legitimate for their account, if they so choose. With the concurrent release of the “mark as spam” feature, we likewise enable users to make spam disappear from results, so users can improve the data everyone on their team is seeing.
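One simple way to picture how the machine's judgment and a team's own decisions fit together is shown below. This is a sketch under our own assumptions (plain dictionaries keyed by a document id, with the user's decision always winning for that account), not a description of the actual implementation.

```python
def effective_is_spam(doc_id, machine_is_spam, account_overrides):
    """A decision made by the account's users wins over the machine's judgment."""
    if doc_id in account_overrides:
        return account_overrides[doc_id]
    return machine_is_spam[doc_id]


def results_for_account(doc_ids, machine_is_spam, account_overrides, show_spam=False):
    """Return the clean view by default, or only the spam when show_spam is True."""
    return [
        doc_id
        for doc_id in doc_ids
        if effective_is_spam(doc_id, machine_is_spam, account_overrides) == show_spam
    ]


# Hypothetical data: the machine flagged post-2, and the team disagreed on two posts.
machine_is_spam = {"post-1": False, "post-2": True, "post-3": False}
account_overrides = {"post-2": False, "post-3": True}

print(results_for_account(machine_is_spam.keys(), machine_is_spam, account_overrides))
print(results_for_account(machine_is_spam.keys(), machine_is_spam, account_overrides, show_spam=True))
```

Because the overrides are kept per account, one team re-classifying a result does not change what another team sees.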

Best of all, these features help us to create a focused set of training data that improves our spam classifier in our next round of training. What we especially like about this approach is that, over time, our system will be increasingly trained to identify as spam exactly what Scout Labs users think is spam because we are learning directly from them. So if you’re debating whether or not it’s worth it to mark a result as spam, just think of the benefits you and every other user will reap when you do, and click that button!

Spammers are always devising new tricks, so the task of suppressing spam is never finished, but we are getting some really promising feedback on how the quality of our data stacks up against the competition (“I’ve been trying to get good clean results out of ‘competitive product’ for months now, and with Scout Labs the right stuff just pops out.”). However, we really want to hear how you think we are doing, so please drop us a line, and be sure to hit that “mark as spam” button liberally!

There is a large, fabulous agency that has a large, fabulous toymaker as a client. Together they are using Scout Labs to monitor various toy and doll brands, and some really interesting insights are coming back! Inspired by those findings, I thought I’d dig into Scout Labs and see if I can find any other juicy doll and toy news to share with you all while demonstrating how Scout Labs works.

I started with the Bratz brand, as I keep hearing news updates about the epic copyright infringement battle between the Bratz parent company, MGA Entertainment, and Mattel, the maker of Barbie.

For those of you who do not have nine-year-old daughters, the Bratz dolls (Cloe, Jade, Yasmin, and Sasha) are BFFs and are sassy / trashy younger versions of the Barbie doll. Way more bling. Way more color streaks in their hair. Lips that are way more plump. Way into fashion.

car_bratz.jpg

In Scout Labs, if you click on the Sentiment tab within a Bratz search, you actually see a mix of positive and negative views expressed.

Picture 44.png

But I wanted to know what moms, or parents, think about Bratz dolls, guessing that it might be a bit different. To do that, I created a search for “Bratz” and also made the phrase “my daughter” required. The results are pretty damning. Parents clearly don’t like these dolls. The thong underwear on the Bratz baby doll seems to be particularly offensive:

I have never been a fan of the Bratz doll line and we haven’t had them in this house since the first time my daughter received one as a gift — the baby Bratz doll was wearing thong underwear — eewww. So the recent news the Bratz manufacturer, MGA Entertainment, must stop producing the dolls because of a copyright infringement lawsuit from Mattel is music to my ears.

Even The Onion chimed in with this riveting piece that exposes how the Bratz dolls are warping the self-image of today’s young girls by making them want to have giant heads too:

As I scanned the Bratz-hating posts, I noticed some equally strong words about Dora the Explorer. Wait, Dora, the happy backpack-toting multi-lingual traveler beloved by preschoolers?

Picture 47.png

Yep. Another scandal!

Mattel has announced that it is “updating” Dora to create a hip junior-high girl (whose silhouette does really look an awful lot like a Bratz doll).

Picture 46.png

From their press release: “As tweenage Dora, our heroine has moved to the big city, attends middle school and has a whole new fashionable look.” OK, is that really necessary?

In case you were wondering what parents think of the change, you can look to Scout Labs’ Frequent Words module, which summarizes the top conversations, and you quickly see that both “skank” and “skanky” made it into the top 20 this week!

Picture 42.png
Good to also note those are NEW terms (in orange) this week, meaning that we haven’t seen ‘skank’ or ‘skanky’ associated with Dora the Explorer in past weeks. Go figure.
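If you are curious how a "new this week" flag could work in principle, here is a toy sketch: count the most frequent words in this week's posts, then flag the ones that never appeared in earlier weeks. This is only our illustration; the stopword list, the function names, and the sample posts are made up, and the real Frequent Words module may work differently.

```python
from collections import Counter

# A tiny, illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}


def words(posts):
    """Yield the lowercase, non-stopword words from a list of post texts."""
    for post in posts:
        for word in post.lower().split():
            if word not in STOPWORDS:
                yield word


def top_terms(posts, n=20):
    """The n most frequent words across the given posts."""
    return [word for word, _ in Counter(words(posts)).most_common(n)]


def new_terms(this_week_posts, earlier_posts, n=20):
    """Top terms this week that did not appear at all in earlier weeks."""
    seen_before = set(words(earlier_posts))
    return [word for word in top_terms(this_week_posts, n) if word not in seen_before]


earlier = ["dora explores the jungle", "dora and boots read the map"]
this_week = ["dora makeover outrage", "dora makeover petition"]
print(new_terms(this_week, earlier, n=2))
```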

So, yes, parents are outraged over this one, inspiring some very clever blog post titles, such as:
Dora the Slutty Explorer
Dora the Sexplorer
Don’t Bratz Dora!

You can sign the petition here.