And The Intercept Discovers Homophily or the “Birds of a Feather” Principle

This is important. The Intercept has done some very good investigative reporting showing yet another example of Facebook virtually f&@$ing the public in the ear. Read it now. Can you believe Facebook was caught taking users’ data and sharing it with its corporate friends?

In this case it’s cellular carriers and ISPs, who already know a lot about you thanks to some deregulation work by the current regime. Yeah, Comcast, Verizon, and the rest are allowed to scoop up our web browsing histories. With “Actionable Intelligence”—sounds like a CIA-NSA program, with the consumer as the enemy—Facebook shares even more details about the person using the phone with the digital Draculas running our telecom companies. Sure, Facebook can kind of claim, cough, that it doesn’t sell its users’ personal data: Actionable Intelligence is a small sweetener added to its existing advertising relationships with these companies.

The more interesting point is that The Intercept references the obscure technical term homophily, which is the idea that we connect to similar people. Known to sociologists, who have studied who we “friend” as a serious academic topic, and to the rest of us as “birds of a feather flock together,” homophily allows carriers and ISPs (and the NSA) to make good guesses about the behaviors and preferences of those in a Facebook friend network, or in your real neighborhood, or in organizations you belong to. Scary. By the way, if you want to read a very good science fiction take on this topic, pick up a copy of Tom Purdom’s Barons of Behavior.

Once upon a time I wrote about homophily for a certain data security company, and though the post was taken down, it’s still a good read. I also wrote another piece about Social-Attribute Networks, or SANs, the computer science technique used to study homophily. It has benign applications in corporate environments, helping admins understand who’s likely to look at which files and folders. Courtesy of Facebook and other information cartel members, the same idea has far more menacing consequences. I’ve dug the two posts out of the archives and republished them below for your reading pleasure …

Homophily and Metadata

True or false: metadata can be used to discover important patterns and behaviors in social networks? This is as true now as it has been since the start of electronic communications: those whom you telegraphed and called, and more recently, emailed and “liked,” can reveal important, non-trivial facts about both sender and receiver, even without analyzing the actual content exchanged. But more significantly, a single piece of metadata also provides details about the surrounding network neighborhood.

Let’s leave aside current headlines about metadata for a moment. E-retailers have understood since the first online purchase that metadata is a powerful way to cluster customers into like-minded communities. What are book and movie suggestions on Amazon and Netflix, but metadata that’s been used to slot members into virtual neighborhoods with uniform preferences?

That’s the key point: the metadata of those in a virtual neighborhood is very likely to be similar.

These principles have been validated by empirical sociology research going back to at least the 1950s. The rest of us intuitively understand this as “birds of a feather flock together”: we connect up with people who have similar education, income, age, religion, ethnicity, aspirations, and on and on. So it’s not a surprise that we recreate these flocking patterns in 21st-century cyber neighborhoods.

It’s easier to see these ideas play out in social networking services, where users choose their neighbors. Recently, Facebook started to directly exploit its metadata network with Graph Search. Based on my brief testing, it’s still in an early stage of development. But the intention behind Graph Search is clear: use the metadata—i.e., likes and dislikes—of friends and friends of friends to provide more directed answers to queries. Nearby connections are far more predictive of users’ preferences and other attributes than connections further down the chain.
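
To make that “nearby connections predict best” idea concrete, here’s a minimal sketch of homophily-style inference in Python: guess a user’s interests from weighted votes of friends and friends of friends. Every name, interest, and hop weight below is invented for illustration; this is emphatically not Facebook’s actual algorithm.

```python
from collections import defaultdict

# Toy friend graph and known "likes"; all names and weights are made up.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}
likes = {
    "bob": {"jazz", "cycling"},
    "carol": {"jazz"},
    "dave": {"chess"},
}

def guess_likes(user, hop1_weight=1.0, hop2_weight=0.5):
    """Score interests by weighted votes from friends and friends of friends."""
    hop1 = friends.get(user, set())
    hop2 = set().union(*(friends.get(f, set()) for f in hop1)) - hop1 - {user}
    scores = defaultdict(float)
    for neighbor, weight in [(f, hop1_weight) for f in hop1] + \
                            [(f, hop2_weight) for f in hop2]:
        for interest in likes.get(neighbor, set()):
            scores[interest] += weight  # closer neighbors vote louder
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(guess_likes("alice"))  # jazz (2.0) > cycling (1.0) > chess (0.5)
```

Halving the weight per hop is an arbitrary choice; the point is only that direct friends count more than friends of friends.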

For Facebook wonks, there’s also an informative post from their development team about how close “birds of a feather” connections—friends of friends—go on to make new connections.

This is an issue in Facebook—by no means the only service with this loophole—wherein friends of friends can always view their friends’ “friends lists.” In other words, someone two hops away from me can see my connection to our common friend, and then surmise much about me from my neighbor. While this does encourage new connections, making friends lists so freely available leaves the door wide open for stalkers and identity thieves to gain information about their targets.

Back to some of the recent headlines about metadata and phone calls. Compared to deregulated Internet services, regulated voice carriers give metadata tighter privacy protections. Who would be surprised (and outraged) if our cell phone companies, in their monthly bills, published the numbers of people with similar interests or suggested local businesses to visit based on mining our call records?

We all would. But this, of course, is common practice in social networks.

I have two takeaways to put all this into context. One, social network graphs combined with metadata are essentially inference engines. Two, the surprise for many of us is that metadata inference can work quite well even in old-fashioned phone systems with quaint 10-digit URLs.


Social-Attribute Networks and Big Data (Brother)

Like many others, I think of Big Data as enormous data sets that are worthy of distributed processing, say in the multi-petabyte range. A petabyte, for those who need a quick refresher, is over 1 million gigabytes—a warehouse full of thumb drives. Typically, organizations enter the Big Data zone by collecting transactional data from tens of millions of customers or, if you’re a social media company, by storing status posts, images, and videos from a huge subscriber base.

But there’s another way to cross the Big Data threshold, and it’s right under our noses. The internal file systems of large companies can easily pass the 1 petabyte line. We chatted with an IT exec who manages a 1.5 petabyte file system built solely from the human-generated data of the company’s 40,000 employees. Your company’s file system may not be that large, but if it’s in the enterprise range (over 1,000 employees), you likely have multiple terabytes or more of storage. That’s short of a petabyte but still very significant.

But Is Enterprise Search Big Data?

The Big Data problem space is somewhat fuzzy, and there’s no agreement on many of its parameters. But there are other considerations for deciding whether something is Big Data—the complexity of the computation coupled with high performance requirements. If you have difficult calculations or algorithms that need to be speedily carried out on a large data set, then you’re in the realm of Big Data.

What type of Big Data problem do I have in mind for internal file systems? Similar to web-based search, enterprise search lets employees query their company’s file systems, generating results—a la Google—as an ordered list based on relevancy, while also honoring file permissions. That last requirement means that, unlike web search, the search app has to figure out whether a user is allowed to view each result, based on the permissions (ACLs) on the underlying content.
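
To make the permissions requirement concrete, here’s a toy sketch in Python. The index layout, the relevance scores, and the acl_allows() check are all stand-ins I made up; a real enterprise search engine evaluates ACLs far more elaborately.

```python
# A toy sketch of permission-aware ranking; not any real search product's API.
def acl_allows(user, acl):
    """Pretend ACL check: the ACL is just a set of user/group names."""
    return user in acl

def search(user, keyword, index):
    """index maps file path -> (text, acl set, relevance score for ranking)."""
    hits = [
        (path, score)
        for path, (text, acl, score) in index.items()
        if keyword in text and acl_allows(user, acl)  # prune unreadable files
    ]
    return sorted(hits, key=lambda hit: -hit[1])  # most relevant first

index = {
    "/finance/q3.xlsx": ("quarterly revenue detail", {"cfo", "finance"}, 0.9),
    "/marketing/plan.docx": ("quarterly roadmap", {"all-staff"}, 0.7),
}
# A rank-and-file employee only sees the file they're permitted to read:
print(search("all-staff", "quarterly", index))  # [('/marketing/plan.docx', 0.7)]
```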

And an enterprise search app should deliver lightning-fast performance similar to the search engines of the web world, but using far fewer computing resources.

Taken altogether, enterprise search starts to look worthy of the Big Data classification. And in case you’re wondering, there is a metadata connection to both web and enterprise search.

Ranking the Results

If we dive a little more deeply into enterprise search, we can get a sense of its bigness and how metadata plays an important role. As in the consumer search world, the “Page 1” results returned from a query should be the most relevant. Formally, this is known as the ranking problem, and it was most dramatically solved by Google’s founders, who developed the PageRank algorithm. While Google has long since moved on to other ways to calculate rank, the underlying idea is instructive: PageRank essentially uses core metadata—in this case, incoming links to a web page—as a kind of vote.

In other words, the more popular pages—the ones that will show up higher in the ranked list of pages matching the keyword query—have more incoming links. For the wonky, the original paper by Sergey and Larry can be read here. By the way, there are other algorithms in the ranking genre, but they generally depend on this voting idea and on link-count metadata.
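
Here’s a bare-bones version of that voting idea in Python: a standard power-iteration PageRank on a three-page toy graph. The graph, the damping factor, and the iteration count are illustrative defaults, not Google’s production settings.

```python
# Incoming links act as votes; rank flows around the graph until it settles.
def pagerank(links, damping=0.85, iters=50):
    """links maps page -> list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            if not outs:  # dangling page: spread its rank evenly
                for p in pages:
                    new[p] += damping * rank[page] / len(pages)
            else:
                for out in outs:
                    new[out] += damping * rank[page] / len(outs)
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))
# "c" wins: it collects incoming-link votes from both "a" and "b".
```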

The Big Question: is there an equivalent to this voting metaphor for enterprise search—something that sorts the file results matching a keyword query by a popularity metric?

Social Searching in Enterprise Search

It turns out there’s a nice analogue to page-link voting. One can think of the metadata related to file access—the number of users viewing or modifying a file—as a proxy for popularity. As with Internet search, additional metadata is an advantage in enterprise search, and we can apply similar popularity algorithms to these plain old files. With file activity metadata now in the mix, this definitely becomes a business-class Big Data challenge—the type your boss at work would love to see solved.
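
The crudest version of this popularity proxy is simply counting distinct users per file. The access log below is invented, but the tallying is the real point:

```python
from collections import defaultdict

# Invented access log: (user, file) pairs, a stand-in for real audit data.
access_log = [
    ("alice", "/docs/roadmap.docx"),
    ("bob",   "/docs/roadmap.docx"),
    ("bob",   "/docs/budget.xlsx"),
    ("alice", "/docs/roadmap.docx"),   # repeat view: counted once below
]

readers = defaultdict(set)
for user, path in access_log:
    readers[path].add(user)            # distinct users per file

by_popularity = sorted(readers, key=lambda p: -len(readers[p]))
print([(p, len(readers[p])) for p in by_popularity])
# [('/docs/roadmap.docx', 2), ('/docs/budget.xlsx', 1)]
```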

There are lots of ways to slice this problem, but a while back I wrote about the “birds of a feather” principle. It says that if we’re both attracted to the same general class of things, we’re likely to have other things in common—I will like what you like. You can also describe this as herding behavior: we follow each other. This phenomenon is exploited in social-based searching by many of the usual suspects in the social networking world—see Facebook’s Graph Search for more details.

We can do something similar for enterprise search by fine-tuning the voting. For example, suppose user A accesses a file, say “Marketing Roadmap for Product X,” which is also accessed by user B. User B has also accessed a file, “Sales Data for Product X,” which A has not. By the birds of a feather principle, you would want to allocate a bit of A’s vote to the sales file, even though A never touched it directly. Now suppose user A searches on some keywords that appear in the “Sales Data for Product X” file—let’s say “metadata software.” Thanks to its SAN weighting, that file would appear higher in A’s results than it would if A and B were not “birds of a feather.”
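
Here’s that exact A-and-B scenario in code. The 0.3 “leak” factor is an arbitrary knob I picked for illustration; the point is only that shared accesses let a slice of one user’s vote flow to files they’ve never opened.

```python
# Invented access data reproducing the A/B example above.
accesses = {
    "A": {"Marketing Roadmap for Product X"},
    "B": {"Marketing Roadmap for Product X", "Sales Data for Product X"},
}

def personal_scores(user, leak=0.3):
    """Direct accesses get a full vote; overlapping users' other files get a partial one."""
    scores = {f: 1.0 for f in accesses[user]}
    for other, files in accesses.items():
        if other != user and accesses[user] & files:   # a "bird of a feather"
            for f in files - accesses[user]:
                scores[f] = scores.get(f, 0.0) + leak  # indirect, partial vote
    return scores

print(personal_scores("A"))
# {'Marketing Roadmap for Product X': 1.0, 'Sales Data for Product X': 0.3}
```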

Briefly, SANs

No, not Storage Area Networks. I’ve just described a ranking model more formally known as a Social-Attribute Network (SAN), which takes into account two types of metadata: the users—the social part—and their relationships, along with the actual data and its relationships. PageRank, unlike a SAN, doesn’t directly account for social metadata, since its ranking algorithm is based solely on data or content relationships.

There are a few great survey papers on SANs, but all roads lead back to the godfather of these models and the inventor of a ranking algorithm that preceded PageRank, Cornell University’s amazing Jon Kleinberg.

The actual computation of a SAN ranking for enterprise search—and I promise to be brief—often involves a giant table, which, by the way, is also used in PageRank. Think of each row as representing a file and each column a user. The initial entry records whether a user has accessed the file—say, give it a 1. The SAN algorithm is iterative, adjusting the votes by following chains of likes. Eventually you get a number—technically it’s a probability, but never mind—that ranks a file’s relevance for each user. In other words, unlike PageRank, a SAN provides user-specific rankings.

This table is huge—perhaps several thousand users by a hundred thousand files—the calculations are complex, and they have to be carried out until the vote rankings converge.
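
For the curious, here’s a toy version of that iteration in Python. A real SAN model is richer, but this random walk with restart over the file co-access graph captures the flavor: each user’s votes spread along shared-access chains until the scores converge. The table, the 0.85 damping factor, and the tolerance are all illustrative choices of mine.

```python
import numpy as np

# Tiny files-x-users access table: entry 1 means "this user accessed this file".
A = np.array([          # columns: users A, B, C (all data invented)
    [1.0, 1.0, 0.0],    # marketing roadmap: accessed by A and B
    [0.0, 1.0, 0.0],    # sales data:        accessed by B only
    [0.0, 0.0, 1.0],    # HR handbook:       accessed by C only
])

co = A @ A.T                               # file-file co-access counts
P = co / co.sum(axis=1, keepdims=True)     # row-normalize into transition probs

def san_rank(user_col, damping=0.85, tol=1e-9):
    """Per-user file scores via a random walk with restart over co-access links."""
    restart = A[:, user_col] / A[:, user_col].sum()   # the user's own accesses
    v = restart.copy()
    while True:
        nxt = (1 - damping) * restart + damping * (P.T @ v)
        if np.abs(nxt - v).sum() < tol:               # votes have converged
            return nxt
        v = nxt

print(san_rank(0))  # user A: the sales file earns a nonzero score via user B
```

Scale that toy table up to a hundred thousand files and a few thousand users, with the iteration run to convergence, and you can see why this lands squarely in Big Data territory.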