14  Link Prediction and Feedback Loops

Open the live notebook in Google Colab here.

In this set of lectures, we’ll study an important task in network data science: link prediction. Given a current network and possibly some additional data, the link prediction task is to predict which edges will form in the future. This task has many applications: social platforms, for example, use link prediction to recommend new connections to their users.

Link prediction was popularized as a task in network analysis and machine learning by Liben-Nowell and Kleinberg (2003).

In the first part of these lecture notes, we’ll implement a simple link prediction model. In the second part, we’ll do a simple simulation to learn about how link prediction models can change the structure of social networks.

We mostly won’t write our own code in these notes: much of it involves data frame manipulation and machine learning workflows that we don’t assume you’ve learned yet. The Python Data Science Handbook by Jake VanderPlas is an outstanding resource for learning these tools.

Impact of Algorithmic Recommendations on Social Networks

Link prediction algorithms are often used by apps and platforms to make recommendations. When Twitter suggests a new profile for you to follow, for example, it often does so on the basis of a link prediction algorithm: users like you have often followed profiles like that one in the past, and so Twitter predicts that you might like to follow it now. From the perspective of the company making these recommendations, the overall purpose is to increase “engagement” on the platform. More engagement leads to more time spent scrolling, which leads to more time watching money-making ads.

But what happens to the structure of social networks under the influence of link prediction algorithms? The details here of course depend on the algorithm, but let’s use a version of the model we built in the previous section. We’re going to wrap the whole thing up in a Python class. The idea is that we are going to repeatedly:

  1. Train a link prediction model on the current state of the network.
  2. Update the network by allowing some edges to decay and replacing them with new edges that the model predicts are likely to form.
# imports for the simulator; the helper functions add_negative_examples, compute_features,
# louvain_communities, and louvain_plot are defined in the previous section
from itertools import product

import networkx as nx
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

class LinkPredictionSimulator:

    def __init__(self, edge_df, **kwargs):
    
        self.edge_df = edge_df.copy()
        self.G = nx.from_pandas_edgelist(self.edge_df)
        self.kwargs = kwargs     # extra arguments passed to compute_features
        self.node_list = list(self.G.nodes)
        # community labels are computed once on the initial graph and held fixed
        self.comm_dict, self.comms = louvain_communities(self.G, True)

    # ---------------------------------------------
    # SIMULATION FUNCTIONS
    # ---------------------------------------------

    def prep_data(self):
        """
        add negative examples and compute features on the current data frame of edges, using stored community labels for community features. 
        """
        self.train_df = add_negative_examples(self.edge_df)
        self.train_df = compute_features(self.train_df, comm_dict = self.comm_dict, **self.kwargs)

        # store the names of the feature columns for later
        self.feature_cols = [col for col in self.train_df.columns if col not in ["source", "target", "link"]]
        
    def train_model(self):
        """
        Train a logistic classifier on the current data after features have been added. 
        """
        X = self.train_df[self.feature_cols]
        y = self.train_df["link"]

        self.model = LogisticRegression(solver = "liblinear")
        self.model.fit(X, y)
        
    def get_predicted_edges(self, m_replace):
        """
        Return a data frame containing the m_replace most likely new edges that are not already present in the graph. 
        """
        
        # data frame of candidate pairs: all unordered pairs of nodes (source < target)
        pairs = pd.DataFrame(product(self.node_list, self.node_list), columns = ["source", "target"])
        pairs = pairs[pairs["source"] < pairs["target"]]

        # add features to the candidate pairs
        pairs = compute_features(pairs, comm_dict = self.comm_dict, G = self.G, **self.kwargs)

        # add the model predictions
        pairs["edge_score"] = self.model.predict_proba(pairs[self.feature_cols])[:,1]

        # remove pairs that are already present in the graph
        pairs = pd.merge(pairs, self.edge_df, on = ["source", "target"], indicator = True, how = "outer")
        pairs = pairs[pairs._merge == "left_only"]

        # get the m_replace pairs with the highest predicted probability
        # and return them
        pairs = pairs.sort_values("edge_score", ascending = False).head(m_replace)
        return pairs[["source", "target"]]

    def update_edges(self, m_replace):
        """
        Remove m_replace randomly chosen edges from the current graph and replace them with the m_replace predicted edges from get_predicted_edges. 
        """

        # remove m_replace random edges
        self.edge_df = self.edge_df.sample(len(self.edge_df) - m_replace)

        # add m_replace recommended edges 
        new_edges = self.get_predicted_edges(m_replace)
        self.edge_df = pd.concat([self.edge_df, new_edges])


    def step(self, m_replace = 1, train = True):
        """
        main simulation function. In each step, we optionally redo the data preparation and model training on the current graph, and then update the graph by replacing m_replace edges. 
        """
        if train:
            self.prep_data()
            self.train_model()
        self.update_edges(m_replace)
        self.G = nx.from_pandas_edgelist(self.edge_df)
        self.G.add_nodes_from(self.node_list)

    # ---------------------------------------------
    # MEASUREMENT FUNCTIONS
    # ---------------------------------------------

    def degree_gini(self):
        """
        The Gini coefficient is a measure of inequality. We are going to use it to measure the extent of inequality in the degree distribution. Higher Gini = more inequality in the degree distribution. 

        code from https://stackoverflow.com/questions/39512260/calculating-gini-coefficient-in-python-numpy
        """
        
        degs = np.array([self.G.degree[i] for i in self.G.nodes])
        mad = np.abs(np.subtract.outer(degs, degs)).mean()
        rmad = mad/np.mean(degs)
        g = 0.5 * rmad
        return g

    def modularity(self):
        """
        modularity of the stored partition
        """
        return nx.algorithms.community.modularity(self.G, self.comms)

The code cell above defines a class that simulates the impact of a link prediction algorithm on a social network. The class has methods for preparing training data, fitting a link prediction model, swapping decayed edges for predicted ones, and measuring the modularity and degree Gini coefficient of the network.
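
The class relies on two helper functions from the previous section that aren’t reproduced here: add_negative_examples, which augments the observed edges with sampled non-edges and adds a binary link column, and compute_features, which attaches predictor columns to each pair of nodes. The snippet below is only a minimal sketch of what such helpers might look like, assuming just two features (a common-neighbor count and a shared-community indicator); the versions used in the previous section may differ.

# minimal sketch of the assumed helper functions -- an illustration of the
# expected inputs and outputs, not the versions from the previous section
import numpy as np
import pandas as pd
import networkx as nx

def add_negative_examples(edge_df):
    """
    Return a data frame with a binary "link" column: the observed edges
    (link = 1) plus an equal number of randomly sampled non-edges (link = 0).
    """
    nodes    = pd.unique(edge_df[["source", "target"]].values.ravel())
    existing = set(map(tuple, edge_df[["source", "target"]].values))
    rng      = np.random.default_rng()

    negatives = set()
    while len(negatives) < len(edge_df):   # assumes the graph is sparse enough
        u, v = rng.choice(nodes, size = 2, replace = False)
        if u > v:
            u, v = v, u
        if (u, v) not in existing:
            negatives.add((u, v))

    pos_df = edge_df[["source", "target"]].copy()
    neg_df = pd.DataFrame(list(negatives), columns = ["source", "target"])
    pos_df["link"], neg_df["link"] = 1, 0
    return pd.concat([pos_df, neg_df], ignore_index = True)

def compute_features(df, comm_dict = None, G = None, **kwargs):
    """
    Add simple link prediction features to each candidate pair: the number of
    common neighbors and an indicator of whether the pair shares a community.
    """
    if G is None:
        # if no graph is supplied, build one from the positive examples
        G = nx.from_pandas_edgelist(df[df["link"] == 1])

    def common_neighbors(u, v):
        if (u not in G) or (v not in G):
            return 0
        return len(set(G[u]) & set(G[v]))

    df = df.copy()
    df["common_neighbors"] = [common_neighbors(u, v) for u, v in zip(df["source"], df["target"])]
    df["same_community"]   = [int(comm_dict[u] == comm_dict[v]) for u, v in zip(df["source"], df["target"])]
    return df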

Let’s now instantiate the simulator, using the entire contact network.

# aggregate the contact data: one row for each pair of students who were ever in contact
edges = contact.groupby(["source", "target"]).count().reset_index()
LPS = LinkPredictionSimulator(edges)

Now we’re going to conduct our simulation. Along the way, we’ve set up code so that we can see the graph (and its community partition) before and after the simulation. While the simulation runs, we’ll also track the modularity of the original community partition and the degree Gini coefficient, which measures how unequal the degrees in the graph are.
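
As a quick intuition check for the Gini coefficient, here is the same formula used in the degree_gini method applied to two made-up degree sequences: a perfectly equal sequence gets a Gini of 0, while a sequence in which most edges attach to a single node gets a much larger value.

import numpy as np

def gini(degs):
    # relative mean absolute difference between all pairs of values, halved --
    # the same computation as in LinkPredictionSimulator.degree_gini
    degs = np.array(degs)
    mad  = np.abs(np.subtract.outer(degs, degs)).mean()
    return 0.5 * mad / degs.mean()

gini([5, 5, 5, 5])   # 0.0 : every node has the same degree
gini([1, 1, 1, 17])  # 0.6 : one node holds most of the degree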

# set up the plot
fig, axarr = plt.subplots(1, 2, figsize = (9, 5))
pos = nx.fruchterman_reingold_layout(LPS.G)

# initial state of the graph
louvain_plot(LPS.G, clusters = LPS.comm_dict, ax = axarr[0], pos = pos)

# initialize tracking the modularity and gini coefficients
Q    = []
gini = []

# main loop
LPS.step(0, train = True)
for i in range(50):
    LPS.step(100, train = True)
    Q.append(LPS.modularity())
    gini.append(LPS.degree_gini())

# visualize final state of the graph
louvain_plot(LPS.G, clusters = LPS.comm_dict, ax = axarr[1], pos = pos)

Before-and-after visualizations of the high school contact network after 50 iterations of the link prediction recommender system simulation.

Here’s what happened to the modularity and the degree Gini inequality as the simulation progressed:

fig, ax = plt.subplots(1, 1)

ax.plot(Q, label = "Modularity of original partition", color = plt.cm.Set3(4))
ax.plot(gini, label = "Gini inequality in degrees", color = plt.cm.Set3(5))
ax.legend()

ax.set(xlabel = "Timestep", ylabel = "Value")

Modularity and degree Gini coefficient of the high school contact network across iterations of the link prediction recommender system simulation.

In this specific, simple model, algorithmic recommendations caused the network to change considerably in its structure.

  • There are more closed-off, insular communities, as indicated by the higher modularity score.
  • There is increased inequality of influence, at least as measured by node degree.

It’s important to note that these results admit multiple interpretations. Tighter communities could just mean that the platform is better at helping people connect to their interests, and in some cases this might be harmless. On the other hand, such tight communities also smack of echo chambers; when the network is used for opinion exchange or debate, it might be difficult for people to actually encounter contrary opinions. Similarly, equality of influence might seem like a good thing, but it could also indicate that people with extreme or repugnant viewpoints have become mainstreamed. So, while it’s clear that the algorithm has significantly changed the overall structure of the social network, it’s important to think critically, in context, about whether that change is truly a bad thing.

Overall, our findings suggest that automated recommendation algorithms have the potential to change the overall shape of social networks in ways that may be harmful or helpful. For some perspectives on how algorithmic influence shapes collective behavior, and what this might imply, see Bak-Coleman et al. (2021).

References

Bak-Coleman, Joseph B, Mark Alfano, Wolfram Barfuss, Carl T Bergstrom, Miguel A Centeno, Iain D Couzin, Jonathan F Donges, et al. 2021. “Stewardship of Global Collective Behavior.” Proceedings of the National Academy of Sciences 118 (27): e2025764118.
Fournet, Julie, and Alain Barrat. 2014. “Contact Patterns Among High School Students.” PloS One 9 (9): e107878.
Liben-Nowell, David, and Jon Kleinberg. 2003. “The Link Prediction Problem for Social Networks.” In Proceedings of the Twelfth International Conference on Information and Knowledge Management, 556–59.



© Heather Zinn Brooks and Phil Chodrow, 2024