With my request to use Google’s black-box Prediction APIs finally approved and a little time available in my schedule, I set out to see how well Google’s racks of CPUs would do against a few training sets I had in mind.
Ultimately, I was hoping to gain more insight into the question: Can software algorithms (with help from the crowd) predict what I’ll like in books, movies, web sites, and food?
To make this a manageable project, I limited the scope of my exercise to the modest problem of predicting amusing movie titles.
Wait, don’t laugh! I have some definite ideas on this subject, which I was able to compress into simple rules. For example, a number or date followed by an exclamation point: funny! I’m tickled by these somewhat hypothetical movie titles: “Ten!”, “1941!”, or this real knee slapper, “22!”
I’m similarly affected by titles with a man’s or woman’s name that ends in a vowel followed by an exclamation point or question mark. “Ralphie?” Hilarious. “Albert.” Not funny. And titles with “Being”, as in “Being Ralphie”, are funny in a knowing, ironic way.
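In fact, the rules are simple enough to fit in a few regular expressions. Here’s a minimal JavaScript sketch of them; the patterns and test titles are illustrative, not my actual code:

```javascript
// A minimal sketch of my title rules as regular expressions.
function myVerdict(title) {
  // Rule 1: a number or date followed by an exclamation point -- funny.
  if (/^\d+!$/.test(title)) return "funny";
  // Rule 2: a name ending in a vowel (or a y), plus "!" or "?" -- funny.
  if (/^[A-Z][a-z]*[aeiouy][!?]$/.test(title)) return "funny";
  // Rule 3: "Being <name>" -- funny in a knowing, ironic way.
  if (/^Being [A-Z][a-z]+$/.test(title)) return "ironic";
  return "not funny";
}

console.log(myVerdict("22!"));           // funny
console.log(myVerdict("Ralphie?"));      // funny
console.log(myVerdict("Being Ralphie")); // ironic
console.log(myVerdict("Albert."));       // not funny
```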
So how did Google’s mysterious Prediction oracle do?
According to its home page, the goal of Google’s Prediction APIs is to make your apps smarter. The engineers behind Google Prediction list as possible uses of their toolkit “upsell opportunities”, “recommendation systems”, and “customer sentiments.”
Since this beta software is coming out of Google Labs, it would seem we’re being granted intimate access to the mysterious Googleplex computing mind and some of those same techniques that make Google search so uncannily accurate.
To use Google Prediction, you first come up with a set of training data. The data formats are quite flexible, allowing numeric, categorical, and what I’ll call “bag of words” (BoW) data, which was my choice.
In Google’s BoW format, you give the Prediction APIs a sentence or, perhaps, a document-length piece of text, and then tag it with a category. The movie titles I generated (from a pretty simple finite automaton I hacked out in Javascript) were categorized as either “funny”, “not funny”, or “ironic.” You can see a few sample rows in the sketch below.
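Here is a stripped-down sketch of that kind of generator, emitting rows in the label-first, CSV-style format that Prediction trains on. The name lists are illustrative and much shorter than what I actually fed it:

```javascript
// A stripped-down sketch of the title generator: a tiny state machine
// that walks name -> punctuation and prints labeled training rows.
var funnyNames = ["Ralphie", "Joey", "Janey", "Mikey"];
var plainNames = ["Albert", "Michael", "Robert"];

function emit(label, title) {
  // Label first, then the text, one example per line.
  console.log('"' + label + '","' + title + '"');
}

funnyNames.forEach(function (name) {
  emit("funny", name + "!");
  emit("funny", name + "?");
  emit("ironic", "Being " + name);
});

plainNames.forEach(function (name) {
  emit("not funny", name + "!");
  emit("not funny", name);
});
```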
The key assumption a BoW model makes is that word order is unimportant. By the way, many spam filters, document classifiers, and recommendation engines treat text in exactly this fashion: as an unordered blob of verbiage.
In a BoW approach, there are a few standard techniques for predicting the proper categorization of a new arrangement of words that’s not in the original training samples. Depending on your level of insomnia, you can read more about Naive Bayes classifiers, term vectors, Latent Semantic Analysis, etc. in the reference section below.
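To give you the flavor of the simplest of these: a Naive Bayes classifier just counts how often each token (here, words plus “!” and “?”) appears under each label, then multiplies the per-token probabilities for a new title. What follows is a minimal, generic sketch of the technique, not a claim about what Google actually runs:

```javascript
// A minimal Naive Bayes bag-of-words classifier with add-one smoothing.
// Generic illustration only -- not Google's code.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+|[!?]/g) || [];
}

function train(samples) {
  var model = { labels: {}, vocab: {}, totalDocs: 0 };
  samples.forEach(function (s) {
    var label = s[0];
    var stats = model.labels[label] ||
      (model.labels[label] = { docs: 0, words: 0, counts: {} });
    stats.docs += 1;
    model.totalDocs += 1;
    tokenize(s[1]).forEach(function (w) {
      stats.counts[w] = (stats.counts[w] || 0) + 1;
      stats.words += 1;
      model.vocab[w] = true;
    });
  });
  return model;
}

function classify(model, title) {
  var words = tokenize(title);
  var vocabSize = Object.keys(model.vocab).length;
  var best = null, bestScore = -Infinity;
  Object.keys(model.labels).forEach(function (label) {
    var stats = model.labels[label];
    // Log-prior plus smoothed log-likelihood of each token.
    var score = Math.log(stats.docs / model.totalDocs);
    words.forEach(function (w) {
      var count = stats.counts[w] || 0;
      score += Math.log((count + 1) / (stats.words + vocabSize));
    });
    if (score > bestScore) { bestScore = score; best = label; }
  });
  return best;
}

var model = train([
  ["funny", "Ralphie!"], ["funny", "Joey!"], ["funny", "Janey?"],
  ["not funny", "Albert"], ["not funny", "Michael"],
  ["ironic", "Being Ralphie"]
]);
console.log(classify(model, "Kumar!"));  // "funny" -- the "!" carries the weight
console.log(classify(model, "Laptop!")); // also "funny", for the same reason
```

Notice that the unseen words “kumar” and “laptop” contribute nothing; the shared “!” does all the work, which foreshadows what I found below.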
Back to my test. I used curl to send the appropriate commands to Google’s RESTful Prediction APIs. Overall, Google Prediction did a fine job, up to a point.
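For the curious, the predict call looked roughly like the following. I made the requests with curl, but a JavaScript sketch shows the same shape; the endpoint path, payload format, and auth header below are my recollection of the v1.1 beta docs, so treat them as assumptions and check the current documentation:

```javascript
// A sketch of a Prediction API predict request in Node-style JavaScript.
// Endpoint, payload shape, and auth scheme are assumptions based on the
// v1.1 beta docs; "mybucket/titles" is a placeholder training object.
var https = require("https");

var payload = JSON.stringify({
  data: { input: { text: ["Ralphie!"] } }  // the title to classify
});

var req = https.request({
  host: "www.googleapis.com",
  path: "/prediction/v1.1/training/mybucket%2Ftitles/predict",
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "GoogleLogin auth=YOUR_AUTH_TOKEN"  // placeholder
  }
}, function (res) {
  var body = "";
  res.on("data", function (chunk) { body += chunk; });
  // Expect something like {"data":{"output":{"output_label":"funny"}}}
  res.on("end", function () { console.log(body); });
});

req.write(payload);
req.end();
```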
It’s clear that the cloud-based oracle is not leveraging any deep semantic information or finding, let’s say, hidden or latent conceptual categories.
For example, I trained Google to view “Ralphie!”, “Joey!”, and “Janey?” as funny. There’s something about a name with a diminutive ending followed by an exclamation or question mark that tickles my funny bone. I also think “Kumar!” would make for a funny movie title.
However, replace these same names with the more conventional “Michael” or “Robert”, or perhaps use an inanimate object (say, “Toaster!”), and in my opinion you’ve tipped the funny meter into turkey territory. So I made sure to add these not-funny titles to the training set to give Prediction a better sense of my preferences.
After sending Prediction a few potentially hilarious movie titles to rate, it became clear that it doesn’t really care too much for these subtleties. And it certainly doesn’t distinguish between people’s names and the rest of noun-dom. In my testing, it found “Albert!” and “Laptop!” to be funny. I don’t. It also found “Visiting with John” to be ironic. OK, maybe that’s a little ironic.
(To be fair, one of the footnotes I stumbled on makes clear that Google Prediction really can’t draw a relationship between a new word and existing words in the training set.)
I strongly suspect that Google is using traditional approaches to prediction and ratings. This means that the machinery of mathematics, and specifically, gulp, linear algebra, is being enlisted to process the training samples.
By the way, the reason that startups such as Hunch, Miso, and others have a few PhDs from <fill in name of fancy university> on their staffs is to try to tame these numeric models.
The less numerically inclined can skip the next few sentences.
Basically, if you go down this route, you’re forced to convert a bunch of words into, oh my, a numeric vector, based on word frequency or some other weighting metric, and then use standard formulas to compare vectors for similarity.
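Here’s that idea in miniature. Each title becomes a term-frequency vector, and the standard cosine formula, dot(a, b) / (|a| * |b|), measures how closely two vectors point in the same direction:

```javascript
// Turn a title into a term-frequency vector, then compare two vectors
// with cosine similarity: dot(a, b) / (|a| * |b|).
function termVector(text) {
  var vec = {};
  (text.toLowerCase().match(/[a-z0-9]+|[!?]/g) || []).forEach(function (w) {
    vec[w] = (vec[w] || 0) + 1;
  });
  return vec;
}

function cosine(a, b) {
  var dot = 0, normA = 0, normB = 0, w;
  for (w in a) { normA += a[w] * a[w]; if (b[w]) dot += a[w] * b[w]; }
  for (w in b) { normB += b[w] * b[w]; }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

console.log(cosine(termVector("Ralphie!"), termVector("Joey!")));  // 0.5, via the shared "!"
console.log(cosine(termVector("Ralphie!"), termVector("Albert"))); // 0, nothing shared
```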
Or perhaps Google is using basic text pattern matching. I’m not sure. Whatever it is using, the weight of the “!” added to a movie title generally tipped my queries into funny territory.
Bottom line: Google’s Prediction APIs offer IT-less and PhD-less companies the chance to do some serious number crunching, and for many purposes they will be able to upsell and measure customer sentiment on web sites, just as Google promised.
I realize I was pushing the envelope of Google Prediction, but only to make a point.
For kicks, I wanted to see what would happen if I applied a deeper semantic analysis to Joey, Ralphie, Mikey, and Phoebe.
I revisited Google Sets with this list, and was pleasantly surprised: Sets seemed to grok my underlying sense of the funny, and came up with a few names ending with a vowel that I thought would make for a good movie title.
I especially liked Manny!, which would make a boffo name for a movie starring, say, Mark Ruffalo and Cameron Diaz.
Get my agent ASAP!
Related articles
- Google’s Pretty Good Recommendation Service (technoverseblog.com)
- An R wrapper for the Google Prediction API (r-bloggers.com)
- Statistical Data Mining Tutorials (autonlab.org)