In our previous lecture we introduced the fundamentals of different representations of networks: mathematically through graphs and their matrix representations, and computationally through the Python package NetworkX. We’re now ready to start defining and measuring some more complex properties of networks.
The fundamental point of networks is that they are connected – we can “move along the edges” in order to understand the network structure. Our unifying theme in this set of lecture notes is how the idea of “taking steps in the network” unifies some important ideas: node degrees, walks between nodes, and paths between nodes.
Degree of a Node
Definition 2.1 The degree of a node in an undirected network is the number of edges connected to it: \[
k_i = \sum_{j=1}^n A_{ij} \,.
\]
We can collect all the node degrees together into the degree sequence\(\mathbf{k} \in \mathbb{R}^n\) whose \(i\)th entry is \(k_i\), the degree of node \(i\).
A small complication in this definition is that, if there is a self-loop on node \(i\), then \(A_{ii} = 2\). So, when we say that the degree is the number of connected edges, we need to remember that self-loops count twice!
There is an important relationship between the degree sequence and the total number of edges \(m\) in the network:
Exercise
Give a formula for \(m\) in terms of the degree sequence \(\mathbf{k}\).
Exercise
The mean degree of a network is the average of the degree sequence: \(c = \frac{1}{n} \sum_{i = 1}^n k_i\). Give a formula for the mean degree in terms of \(m\) and \(n\) alone.
Computing with Degrees
Let’s grab some sample data. We’ll use the Les Miserables network, which is a network of coappearances of characters in the book Les Miserables by Victor Hugo. Nodes represent characters and edges represent characters who appear within the same chapter. This data set is supplied as a built-in example in NetworkX, and for this reason we’ll use it several times throughout these notes.
This hidden code cell imports several packages and defines an unweight function which we’ll use to convert the network from its native weighted format to an unweighted format.
Code
from matplotlib import pyplot as pltimport networkx as nximport numpy as npplt.style.use('seaborn-v0_8-whitegrid')plot_kwargs = {"node_size" : 100, "edgecolors" : 'white', "node_color" : "steelblue", "width" : 0.5, "edge_color" : "darkgrey"}# This is a coding exercise in the live notes.def unweight(G):for source, target in G.edges(): G[source][target]['weight'] =1return G
Figure 2.1: A network of coappearances of characters in the book Les Miserables by Victor Hugo. Nodes represent characters and edges represent characters who appear within the same chapter.
Now let’s compute the degree of each node. There are multiple ways to achieve this task: we can work directly using the adjacency matrix, or we can use the built-in NetworkX function nx.degree().
One way to compute the degrees of the nodes in a graph is to use the adjacency matrix, as directly described by Definition 2.1. With convenient functions from NetworkX and NumPy, this is a two-liner:
For undirected graphs, we could equally do np.sum(A, axis = 0) because \(\mathbf{A}\) is a symmetric matrix. When we discuss directed graphs soon, it will become necessary to be careful!
# computing degrees directly from the adjacency matrixA = nx.adjacency_matrix(G_LesMis) # extract the adjacency matrixdegree_vector = np.sum(A, axis =1) # sum the adjacency matrix over rowsprint(f"The degree vector has shape {degree_vector.shape}")
The degree vector has shape (77,)
Networkx also has a helpful built-in function to calculate the complete set of degrees:
degree_view = G_LesMis.degree()
The result is a DegreeView object which behaves much like a Python dictionary (and which can be easily converted to a dictionary using the dict constructor).
Let’s take a moment to compare the first few nodes to make sure that our two methods agree:
i =0for name, degree indict(degree_view).items():if i <5:print(f"{name} has degree: {degree} (networkx) and {degree_vector[i]} (adjacency matrix)") i +=1
Napoleon has degree: 1 (networkx) and 1 (adjacency matrix)
Myriel has degree: 10 (networkx) and 10 (adjacency matrix)
MlleBaptistine has degree: 3 (networkx) and 3 (adjacency matrix)
MmeMagloire has degree: 3 (networkx) and 3 (adjacency matrix)
CountessDeLo has degree: 1 (networkx) and 1 (adjacency matrix)
Degree in directed graphs
We have to be a little more subtle in how we define degree in a directed network because there is a distinction between in-edges and out-edges in these networks.
Definition 2.2 In a directed network, the in-degree is the number of ingoing edges to a node and the out-degree is the number of outgoing edges. These are defined by the formulas \[
k_i^{\text{in}} = \sum_{j=1}^n A_{ij}
\] and the out-degree is \[
k_j^{\text{out}} = \sum_{i=1}^n A_{ij}
\]
Just as before, we can also define the in-degree sequence \(\mathbf{k}^{\mathrm{in}}\) and out-degree sequence \(\mathbf{k}^{\mathrm{out}}\).
We will repeat the exercises above for directed networks.
Exercise
Express the total number of edges \(m\) in a directed graph in terms of the degree sequences \(\mathbf{k}^{\text{in}}\) and \(\mathbf{k}^{\text{out}}\).
Determine the mean in-degree and mean out-degree of a directed graph in terms of \(m\) and \(n\).
Let’s see these definitions in action on a directed graph. This time, we’ll use the mention graph from the musical Hamilton. Each node in this graph is a character. There is a directed edge from character \(i\) to character \(j\) if character \(i\) mentions character \(j\) in a song.
Code
import pandas as pd# import data for our directed networkdf = pd.read_csv("https://philchodrow.github.io/PIC16A/homework/HW3-hamilton-data.csv", names = ["mentioner", "mentioned"])df = df[df["mentioner"] != df["mentioned"]]G_Hamilton = nx.from_pandas_edgelist(df, source ="mentioner", target ="mentioned", edge_attr=None, create_using=nx.DiGraph())layout = nx.arf_layout(G_Hamilton)fig, ax = plt.subplots(1, 1, figsize = (10, 7))nx.draw(G_Hamilton, pos = layout, with_labels =True, edge_color ="darkgrey", width =0.5, node_color ="lavender", edgecolors ="white")# This is a coding exercise in the live notes.# Create and print a numpy array of node in-degrees and out-degrees directly from the adjacency matrix
Figure 2.2: A network of mentions in the musical Hamilton. Nodes represent characters; there is an edge from character \(i\) to character \(j\) if \(i\) mentions \(j\). This data set was collected and publicly shared by The Hamilton Project.
Let’s calculate the in-degrees of the nodes in the Hamilton network using the adjacency matrix and compare to the networkx built-ins.
Code
A = nx.adjacency_matrix(G_Hamilton) # grab the adjacency matrixin_degree_vector = np.sum(A, axis =0) # manual in-degrees from Ain_degree_view = G_Hamilton.in_degree() # networkx builtini =0for name, degree indict(in_degree_view).items():if i <5:print(f"{name} has degree: {degree} (networkx) and {in_degree_vector[i]} (adjacency matrix)") i +=1
burr has degree: 13 (networkx) and 13 (adjacency matrix)
hamilton has degree: 14 (networkx) and 14 (adjacency matrix)
weeks has degree: 1 (networkx) and 1 (adjacency matrix)
madison has degree: 4 (networkx) and 4 (adjacency matrix)
jay has degree: 1 (networkx) and 1 (adjacency matrix)
Exercise
Repeat this calculation and check that you can also compute out-degrees, with a matching result.
Regular Graphs
The degree sequence can contain a lot of information about the graph structure. One important class of graphs has constant degree:
Definition 2.3 A network in which all nodes have the same degree is called a regular graph or regular network. A regular graph where all nodes have degree \(k\) is called \(k\)-regular.
Some special cases of regular graphs are lattices (e.g., a square lattice is 4-regular) and the complete graph where every node is connected to every other node (which is \((n-1)\)-regular).
A complete graph on 5 nodes. Each node has degree exactly equal to \(5-1 = 4\).
Regular graphs are of great interst in theoretical mathematics and computer science. We don’t usually spend a lot of time studying them in network science because most interesting systems and data sets are not regular.
Density and Sparsity
Definition 2.4 The density or connectance\(\rho\) of a simple network is the fraction of possible edges that are actually present. That is,
One way to interpret density is to think of it as a probability that a pair of nodes picked uniformly at random is connected by an edge.
We can rewrite density in terms of expected degree using our earlier exercises :
This simplification comes from the binomial coefficient formula \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\). Also, we can notice the cool fact that \(n \choose 2\) is equivalent to the sum of the first \(n-1\) integers!
If a network is sufficiently large, you can approximate the density as \(\rho \approx \frac{c}{n}.\)
Let’s compute density in network below using three different strategies:
Calculating directly using number of edges and number of nodes;
Calculating directly using mean degree and number of nodes;
Using the built-in NetworkX function nx.density().
Code
from matplotlib import pyplot as pltimport networkx as nximport numpy as npplt.style.use('seaborn-v0_8-whitegrid')# Create an unweighted version of the Les Mis networkG = G_LesMis# This is a coding exercise in the live notes.# Calculate density directly using number of edges and number of nodesn = G.number_of_nodes()m = G.number_of_edges()print(f"Density using nodes and edges: {2*m/(n*(n-1)):.4f}")# Calculate density directly using mean degree and number of nodes# Hint: you may want to calculate degree from the adjacency matrix so that you can calculate mean using numpyA = nx.adjacency_matrix(G)degree = np.sum(A, axis =1)c = np.mean(degree)print(f"Calculating using mean degree {c/(n-1):.4f}")#----# Use the built-in NetworkX function nx.density()density = nx.density(G)print(f"Density with NetworkX built-in: {density:.4f}")
Density using nodes and edges: 0.0868
Calculating using mean degree 0.0868
Density with NetworkX built-in: 0.0868
Informally, we call a network sparse when it has low density. This is pretty tricky to formally define: how low is “low enough?” There isn’t a universally agreed upon threshold for density below which a real-world network would be considered sparse. However, we can create a definition which applies for certain theoretical models of networks. Many models are defined for graphs on \(n\) nodes. If we have a model where we can take a formal limit as \(n\) grows large, then such a network is sparse if \(\rho \to 0\) as \(n \to \infty\). In this scenario, the mean degree grows (much) more slowly than the number of nodes. We’ll study limiting arguments about sparse graphs much more when we discuss random graphs.
Exercise
Let \(\{G_n\}\) be a sequence of networks on \(n\) nodes which are \(k\)-regular for some fixed constant \(k\). Show that this sequence is sparse in the sense defined above.
There is a small cheat in this problem statement, since if \(k\) and \(n\) are both odd then it is not possible to form a valid network (since \(kn = 2m\), \(kn\) must always ben even). This kind of detail does not usually pose difficulties in the study of sparse network.
Walks and Paths
We may like to know if it it is possible to reach one node from another by traversing edges. For this task, we introduce the notion of a walk.
Definition 2.5 A walk of length \(k \geq 2\) is a set of edges \(\{ (i_1,j_1), (i_2, j_2), \dots, (i_k, j_k)\}\) with the property that \(i_l = j_{l-1}\) for each \(2 \leq l \leq k\). We say this is a walk from node \(i_1\) to node \(j_k.\)
The length of a walk is the number of edges it contains.
A single edge \((i,j)\) is always considered a walk of length 1 from \(i\) to \(j\).
Figure 2.3: Not all pairs of nodes in this network have a walk between them.
Connected Graphs
It is not always the case that there exists a walk between two specified nodes, as shown in Figure 2.3. In many networks, it’s very important that there exist a walk between any two nodes (think, for example, of a power grid!), and we have a special definition to reflect this:
Definition 2.6 A connected graph is a graph with the property that, between every pair of nodes \(i\) and \(j\), there exists a walk from \(i\) to \(j\).
There are several ways to determine computationally whether a graph is connected.
Exercise
Outline an algorithm for determining whether a simple undirected graph is connected by progressively querying node neighbors. Assume that you know ahead of time how many total nodes exist in the network.
Counting Walks
A question that pops up a lot in network analysis is “How many walks of length \(r\) exist between nodes \(i\) and \(j\)?” The adjacency matrix gives a concise way to address this question. First, let’s consider \(r=1\). That’s just the number of edges from node \(j\) to \(i\), which is exactly \(A_{ij}\). Said in a slightly more cumbersome way,
The \(ij\)th entry of \({\bf A}^1\) counts the number of walks of length 1 from node \(j\) to node \(i\).
This observation generalizes by induction.
Theorem 2.1 (Counting Walks with Powers of \(\mathbf{A}\)) The \(ij\)th entry of the matrix \({\bf A}^r\) contains the number of walks of length \(r\) from \(j\) to \(i\).
There are many applications of this idea:
Exercise
What is \(\frac{1}{2}\mathrm{trace} \; \mathbf{A}^2\), in terms of quantities we’ve seen before?
Exercise
Give a mathematical criterion for a graph to be connected by considering the matrix
Paths are special walks which do not repeat any edges.
Definition 2.7 A path is a walk that is not self-intersecting. That is, any edge \((i,j)\) shows up in a path at most once.
A geodesic path or shortest path is from \(i\) to \(j\) is a path fom \(i\) to \(j\) of minimum length; i.e. a path such that no other path has shorter length.
The length of a geodesic path is called the (geodesic) distance between \(i\) and \(j\). If two nodes are not path-connected, their geodesic distance is undefined.
Here is an example of a geodesic path between two nodes in Zachary’s Karate Club network. The geodesic distance between these two nodes is 5.
Code
# code example adapted from: https://stackoverflow.com/questions/24024411/highlighting-the-shortest-path-in-a-networkx-graphnp.random.seed(123)G = nx.karate_club_graph()pos = nx.spring_layout(G)nx.draw(G,pos, **plot_kwargs)path = nx.shortest_path(G, source =14, target =16)pos_path = {i: pos[i] for i in path}path_edges =list(zip(path,path[1:]))nx.draw_networkx_nodes(G,pos_path,nodelist=path,node_color='steelblue', node_size =100, edgecolors ="black")nx.draw_networkx_edges(G,pos_path,edgelist=path_edges,edge_color='k',width=1)
A geodesic path between two nodes in the Karate Club network.
Shortest paths are not necessarily unique; there may be more than one path of the same length between two nodes. There exist several algorithms for computing the shortest path between a pair of nodes or between every pair of nodes in a graph.
Dijkstra’s algorithm (Dijkstra 1956) is frequently used to compute shortest paths with a single fixed node as the point of origin.
The Floyd-Warshall algorithm (Rosen 2011) is a common choice for computing the shortest path between all pairs of nodes in a graph.
Explain briefly why shortest paths are self-avoiding: the same node cannot appear twice in a shortest path.
Cyclic and Acyclic graphs
Many algorithms and measures of graph structure reflect the existence of cycles within a graph.
Definition 2.8 (Cycles) A cycle is a path from a node \(i\) to itself.
A network with no cycles is acyclic.
By definition, self-edges are cycles of length 1. Multigraphs without self-edges can have cycles of length 2. Cycles in simple graphs are always of length 3 or more.
Exercise
How could we use the adjacency matrix \(\mathbf{A}\) to compute the number of cycles of length \(r\) in a network starting and ending at node \(i\)?
In some cases, it is useful to check a network to determine whether it contains any cycles. While the adjacency matrix can be used to count cycles of a specific length, it could be quite inefficient to use to detect whether we have a cycle of any length (because we may have to check the diagonal entries of \({\bf A}^r\) for all possible cycle lengths \(r\)). We can construct a simple algorithm to determine computationally whether a network is cyclic or acyclic. For the purposes of this algorithm, we’ll consider directed graphs; the algorithm can be easily adapted to undirected graphs. Here it is:
While \(n > 0\):
Find a node \(i\) with no out-edges (i.e. \(k_i^{\mathrm{out}} = 0\)).
If no such node exists, then the network is cyclic.
Remove node \(i\) and all its edges.
If all nodes can be removed this way, then the network is acyclic.
Exercise
Give a simple argument for why the algorithm above works. In particular:
Why is this algorithm guaranteed to terminate?
Why is this algorithm guaranteed to give the correct answer upon termination?
Now that we’re convinced that the algorithm works, let’s go ahead and implement it in Python. Our implementation accepts a NetworkX DiGraph object as an argument, returning True if the network is cyclic and False if the network is acyclic.
def is_cyclic(G):while G.number_of_nodes() >0: zero_out_degree = [node for node in G.nodes if G.out_degree(node) ==0]iflen(zero_out_degree) ==0:returnTrue G.remove_nodes_from(zero_out_degree)else: returnFalse
Now let’s test our algorithm on a few examples. First, we’ll construct a random toy directed graph that we know does not possess any cycles by its construction. Indeed, our implemented cycle detector indicates that no cycles are present.
Figure 2.5: The Hamilton mention network, a directed network which contains many cycles.
This algorithm has a nice mathematical consequence. If we label the nodes in an acyclic network according to the order we removed them, we will end up with an adjacency matrix that is strictly upper triangular. This is because each node that is removed could only have out-edges that were already removed previously, i.e., nonzero entries of the \(i\)th column could only occur between columns 1 and \(i-1\). There exists at least one such labeling for any acyclic network.
Trees
Definition 2.9 A tree is a connected, acyclic network. By “connected”, we mean every node is reachable from every other node by traversing a sequence of edges, i.e., there exists a walk between any two nodes.
Trees can be both directed and undirected. When we refer to trees with no further descriptors, we usually mean undirected trees.
All trees are necessarily simple graphs, because self- and multiedges would create cycles. Trees always have exactly \(n-1\) edges, as can be proven via induction. Furthermore, any connected graph with \(n-1\) edges is a tree.
Trees are often drawn as rooted trees with a root node at the top and leaf nodes below. Any node can be chosen as the root of a tree.
Trees play important roles, especially in math and computer science. Trees have many useful properties that we can exploit for network analysis. One of the most important ones is the uniqueness of paths: there is exactly one path between any pair of nodes (as long as we don’t allow “backtracking”). Many calculations on networks with this property are simple(r).
References
Dijkstra, Edsger W. 1956. “A Note on Two Problems in Connexion with Graphs.” In Edsger Wybe Dijkstra: His Life, Work, and Legacy, 287–90.
Rosen, Kenneth H. 2011. Discrete Mathematics and Its Applications (7th Edition). William C Brown Publishing.