Link Prediction Using Friends Common ++

Imagine the following scenario: You are provided a list of people in a conference, and you can see all of their Facebook friends. You want to know which people probably know each other. How can you do that in a way that is efficient?

Idea 0: If they are FB friends, then they are probably real life friends

Right off the bat, you know that if somebody at the conference is Facebook friends with another person at the conference, then they're very likely to know each other. This is pretty quick to compute. You just build a map of conference goers to Facebook Friends, then go through all their Facebook Friends and check if they are a conference goer.

However, now that we know who probably directly knows each other based on their Facebook friends, how can we figure out who might know each other at the conference but have not added each other on Facebook?

Idea 1: Count the number of common Facebook friends

One easy way to do this would be to say, for any two people, the more Facebook friends they have in common, the more likely they know each other in real life. This would be simple to implement as well. We set some threshold N, and if two people have more than N friends in common, we say that they probably know each other.

Our equation for whether or not two people know each other is as follows:

$N > | \Gamma(x) \cap \Gamma(y) |$

(A bit of notation, x and y are the two conference goers in question, Γ(x) denotes the set of Facebook friends of x, Γ(y) denotes the set of Facebook friends of y.)

However, what if there are folks who are serial networkers? The folks who have 500+ LinkedIn connections, 5000 Facebook friends? They could have tons of friends on common with everybody at the conference, but may not necessarily know in real life the conference goers who they have many friends in common with.

Idea 3: Use Jaccard Similarity

Instead of simply counting the number of common Facebook friends, we can apply a very nice concept called Jaccard similarity. Jaccard similarity is where we take a look at the intersection of the two conference goers' Facebook friends divided by the union of the two conference goers' Facebook friends. The higher the Jaccard similarity, the more likely it is that the two conference goers know each other. Intuitively, you can think of it like this. Let's say that Mark Zuckerberg attended the conference, and for the sake of this scenario, he is Facebook friends with everybody in the world. Now if we use our naive algorithm where we check how many friends a random conference goer and Mark Zuckerberg have in common, we will come to the conclusion that they very likely do know each other in real life, which is probably false, but we think this because they have so many Facebook friends in common. If we use Jaccard similarity instead, the intersection of their friend groups will be the size of the random conference goer's entire friend group, but the union of their friend groups will be the size of every single Facebook user in the world. This means that a different conference goer with a smaller friend group and but a decently large friend group overlap with the first conference goer will be able to outrank a celebrity like Mark Zuckerberg in Jaccard similarity. Again, we set some threshold N, where if two people have a Jaccard similarity of more than N, we say that they probably know each other.

Our equation for whether or not two people know each other is as follows:

$N > \dfrac{| \Gamma(x) \cap \Gamma(y) |} {| \Gamma(x) \cup \Gamma(y) |}$

However, in this case we run into an issue where imagine if there are a few very popular people that almost everyone is Facebook friends with, such as Mark Zuckerberg. Being friends with these folks should be weighted to give a reduced signal. For example, if two people follow Bill Gates, Mark Zuckerberg, but one of them follows a random person named Alice and the other follows a random person named Bob, they likely do not know each other. However, if two people follow Alice, Bob, but one of them follows Mark Zuckerberg while the other follows Bill Gates, they probably do know each other. Both pairs would have the same Jaccard similarity, but it is clear that the second pair of people is more likely to know each other than the first, since the in the first pair the Facebook friends they have in common are celebrities.

Idea 3: Use Adamic/Adar Similarity

To address this, we might try something like Adamic/Adar similarity. Take each Facebook friend and determine how many conference goers have that Facebook friend. Let's call this property of a Facebook friend the "centrality". For each pair of conference goers, find the intersection of their Facebook friends (which ones they have in common) and sum inverse logs of the centralities of that intersection of their Facebook friends. This is the Adamic/Adar similarity. Again, we set some threshold N, where if two people have a Adamic/Adar similarity of more than N, we say that they probably know each other.

Our equation for whether or not two people know each other is as follows:

$N > \sum_{f \in \Gamma(x) \cap \Gamma(y)} 2^{-n} \dfrac{1} {\log C(f)}$

(A bit more notation, C(f) we denote as the centrality of Facebook friend f.)

The centrality of each Facebook friend is effectively a measure of how much of a "celebrity" the Facebook friend is. By using the inverse log of the number of neighbors each Facebook friend has, you are effectively reducing the weight of a highly friended Facebook user, such as Mark Zuckerberg, so that having a celebrity Facebook friend in common with another person is treated as less indicative of knowing that person in real life than having a non-celebrity Facebook friend in common.

There are a lot more ways to guess whether or not two people know each other, and you can read more about it in the paper The Link Prediction Problem for Social Networks by Liben-Nowell and Kleinberg. It turns out from their experimental evidence, Adamic/Adar similarity seemed to work best, but Jaccard also offers considerable improvement over a simple common neighbors approach and requires very little extra complexity. Thank you for reading!

Search This Blog

👨‍💻