A Google Correlate View of July

Google Correlate is yet another R&D project that can be found in Sergey and Larry’s basement: the Google Labs area of the site.

I discovered Correlate on my last Labs visit, and though it has gotten some press, it’s another one of those Googley science projects that deserves more attention.

The proposition is simple: you feed Correlate a time series of your own making, and it searches a galactic-size database of keyword search frequencies to find a matching pattern. Or, in math speak, it finds the search terms whose frequency over time correlates with your series at a fairly high R².

For my experiment, I was interested in finding Google searches that remained level during the year but peaked in July.

What keywords would match this seasonal variation? ‘air conditioners’, ‘pool supplies’, ‘vacation rental’?

Nope. The answer is …

The Orange County Fair.

Google Correlate came about when Google engineers posed the reasonable question “can Google search words tell us something about the real world?”  The answer was Google Flu Trends, which captured certain keyword activity associated with flu symptoms (‘fever’, ‘aches’, ‘congestion’) and then mapped it to geographic regions.

Flu Trends in theory should be of great help to disease experts in tracking the course of an epidemic in real-time. That’s assuming the victims have the strength to enter the search terms.

Google Correlate builds on Google Flu’s real-time processing with the additional power to match arbitrary time series.

Back to the Orange County Fair. This annual farm gathering takes place in Costa Mesa, California, between July and August. There are other Orange counties in other states with their own fairs, but this is the biggest: livestock, food, farm equipment, more food, crafts, and of course lots of competitions.

And we now have validation that this fair is searched quite heavily in July, not so much during the rest of the year.

My summery time series was quite simple: I generated dates starting in 2003 and arbitrarily assigned each day of the year a value of .3. Except for the days in July, where I assigned a whopping .7.
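For anyone who wants to build a similar series, here is a minimal Python sketch of the July spike described above. The function name and the end date are my own assumptions for illustration (the series started in 2003, but the sketch only covers that one year), and it makes no claim about the file format Correlate expects for uploads.

```python
from datetime import date, timedelta

def july_spike_series(start, end, base=0.3, peak=0.7):
    """Return (date, value) pairs: `peak` for days in July, `base` otherwise.

    Hypothetical helper for illustration only.
    """
    series = []
    day = start
    while day <= end:
        value = peak if day.month == 7 else base
        series.append((day, value))
        day += timedelta(days=1)
    return series

# Assumed range: calendar year 2003 only.
series = july_spike_series(date(2003, 1, 1), date(2003, 12, 31))
```

Each entry pairs a calendar date with .3 or .7, so the only feature in the data is the month-long plateau in July.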

In effect, I created a July spike (see below), and fed this data stream to Correlate. Google’s app then searched for keywords that had a similar pattern of usage—not searched for during most of the year, except in July.

“orange county fair” topped out with an R² of .9157, followed by “the orange county fair” at .9078.
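To make those scores a little less mysterious, here is a rough sketch of how a match between two series could be scored. It assumes R² here means the square of the ordinary Pearson correlation coefficient, which I have not confirmed is exactly what Correlate computes. The monthly "candidate" numbers are made up for illustration and are not real search data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Monthly version of the July spike: .3 everywhere, .7 in July.
target = [0.3] * 6 + [0.7] + [0.3] * 5

# Made-up search volumes (not real data): one peaks in summer, one in winter.
summer_term = [1, 1, 1, 1, 2, 4, 10, 5, 1, 1, 1, 1]
winter_term = [10, 5, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4]

r2_summer = pearson_r(target, summer_term) ** 2
r2_winter = pearson_r(target, winter_term) ** 2
```

A term whose volume rises and falls with the spike scores close to 1, while one that peaks at the wrong time of year scores close to 0. A tool like Correlate presumably runs a comparison of this kind against every series in its database and returns the top scorers.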

By the way, Correlate tells me that “tour de france bicycles”, “ussa world series”, and “washington state little league” fill out the next three slots.

The Orange County Fair takes place in Costa Mesa in July, and Americans are searching for it.
