Learning in Graphs from the Netzschleuder Repository
Prerequisites
First, we need to set up our Python environment that has PyTorch, PyTorch Geometric and PathpyG installed. Depending on where you are executing this notebook, this might already be (partially) done. E.g. Google Colab has PyTorch installed by default so we only need to install the remaining dependencies. The DevContainer that is part of our GitHub Repository on the other hand already has all of the necessary dependencies installed.
In the following, we install the packages for usage in Google Colab using Jupyter magic commands. For other environments, comment the commands in or out as necessary. For more details on how to install pathpyG, especially if you want to install it with GPU support, we refer to our documentation. Note that %%capture discards the full output of the cell so that this tutorial is not cluttered with unnecessary installation details. If you want to see the output, you can comment out %%capture.
%%capture
# !pip install torch
!pip install torch_geometric
!pip install git+https://github.com/pathpy/pathpyG.git
Motivation and Learning Objectives
Access to a large number of graphs with different topological characteristics and from different domains is crucial for the development and evaluation of graph learning methods. Thousands of graph data sets are scattered throughout the web, possibly using different data formats and lacking information on their actual origin. Addressing this issue, the Netzschleuder Online Repository by Tiago Peixoto provides a single repository of graphs in a single format, including descriptions, citations, and node-, edge- or graph-level meta-data. To facilitate the development of graph learning techniques, pathpyG provides a feature that allows us to directly read networks from the Netzschleuder repository via an API.
In this brief unit, we will learn how to retrieve network records and graph data from the Netzschleuder repository. We will further demonstrate how we can conveniently apply a Graph Neural Network to predict node-level categories contained in the meta-data.
We first need to import a few modules.
import numpy as np
from matplotlib import pyplot as plt
from sklearn import metrics
from sklearn.decomposition import TruncatedSVD
import torch
from torch.nn import Linear, ReLU, Sigmoid, Parameter
import torch_geometric
from torch_geometric.nn import Sequential, GCNConv, SimpleConv, MessagePassing
import pathpyG as pp
pp.config['torch']['device'] = 'cpu'
# pp.config['torch']['device'] = 'cuda'
Reading graphs from the Netzschleuder repository
In the pathpyG.io module, there is a function that allows us to read graph data from the API.
We can read a given network from the Netzschleuder database using its record name. Just browse the Netzschleuder Online Repository to find the record names. As an example, we use a graph capturing co-purchase relationships between political books.
g = pp.io.read_netzschleuder_graph('polbooks')
print(g)
Mapping node attributes based on node indices in column `index` Undirected graph with 105 nodes and 882 (directed) edges { 'Edge Attributes': {}, 'Graph Attributes': { 'analyses_average_degree': "<class 'float'>", 'analyses_degree_assortativity': "<class 'float'>", 'analyses_degree_std_dev': "<class 'float'>", 'analyses_diameter': "<class 'int'>", 'analyses_edge_properties': "<class 'list'>", 'analyses_edge_reciprocity': "<class 'float'>", 'analyses_global_clustering': "<class 'float'>", 'analyses_hashimoto_radius': "<class 'float'>", 'analyses_is_bipartite': "<class 'bool'>", 'analyses_is_directed': "<class 'bool'>", 'analyses_knn_proj_1': "<class 'float'>", 'analyses_knn_proj_2': "<class 'float'>", 'analyses_largest_component_fraction': "<class 'float'>", 'analyses_mixing_time': "<class 'float'>", 'analyses_num_edges': "<class 'int'>", 'analyses_num_vertices': "<class 'int'>", 'analyses_transition_gap': "<class 'float'>", 'analyses_vertex_properties': "<class 'list'>", 'num_nodes': "<class 'int'>"}, 'Node Attributes': {'node__pos': "<class 'numpy.ndarray'>", 'node_label': "<class 'numpy.ndarray'>", 'node_value': "<class 'numpy.ndarray'>"}}
We can plot this graph in an interactive way:
pp.plot(g, edge_color='lightgray', edge_size=5);
To see how we can apply GNNs to attributed graphs, let us read the famous karate club network. The record karate actually contains two networks with labels 77 and 78, which refer to two different versions of the data with different numbers of edges. If multiple graph data sets exist in the same record, we can specify the name of the network as the second argument.
g = pp.io.read_netzschleuder_graph('karate', '78')
print(g)
Mapping node attributes based on node indices in column `index` Undirected graph with 34 nodes and 156 (directed) edges { 'Edge Attributes': {}, 'Graph Attributes': { 'analyses_average_degree': "<class 'float'>", 'analyses_degree_assortativity': "<class 'float'>", 'analyses_degree_std_dev': "<class 'float'>", 'analyses_diameter': "<class 'int'>", 'analyses_edge_properties': "<class 'list'>", 'analyses_edge_reciprocity': "<class 'float'>", 'analyses_global_clustering': "<class 'float'>", 'analyses_hashimoto_radius': "<class 'float'>", 'analyses_is_bipartite': "<class 'bool'>", 'analyses_is_directed': "<class 'bool'>", 'analyses_knn_proj_1': "<class 'float'>", 'analyses_knn_proj_2': "<class 'float'>", 'analyses_largest_component_fraction': "<class 'float'>", 'analyses_mixing_time': "<class 'float'>", 'analyses_num_edges': "<class 'int'>", 'analyses_num_vertices': "<class 'int'>", 'analyses_transition_gap': "<class 'float'>", 'analyses_vertex_properties': "<class 'list'>", 'num_nodes': "<class 'int'>"}, 'Node Attributes': { 'node__pos': "<class 'numpy.ndarray'>", 'node_groups': "<class 'torch.Tensor'> -> torch.Size([34])", 'node_name': "<class 'torch.Tensor'> -> torch.Size([34])"}}
pp.plot(g, edge_color='gray');
We see that the nodes actually have a node_groups property, which maps the nodes to two groups. Those groups are often used as ground truth for communities in this simple illustrative graph. We will instead use them as ground-truth categorical node labels for a node classification experiment based on a Graph Neural Network.
Conveniently, numerical node attributes (either scalar or vector values) are automatically converted to torch tensors, so we can directly use them for a GNN.
print(g['node_groups'])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
For convenience, let us shift the group labels to binary values 0 and 1:
g['node_groups'] -= 1
print(g['node_groups'])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
We can plot categorical labels by passing node colors in the plot function.
pp.plot(g, node_color = [g['node_groups',v].item() for v in g.nodes])
<pathpyG.visualisations.network_plots.StaticNetworkPlot at 0x7f8b38110ee0>
Alternatively, we can map the group labels to explicit color names:
color_map = {0: 'red', 1: 'blue'}
colors = [ color_map[g['node_groups',v].item()] for v in g.nodes ]
pp.plot(g, node_color = colors);
Applying Graph Neural Networks to Netzschleuder Data
To simplify the application of deep learning models, we can retrieve a data object that contains the graph and its attributes:
print(g.data)
Data(edge_index=[2, 156], num_nodes=34, node_sequence=[34, 1], node_name=[34], node_groups=[34], node__pos=[34], analyses_average_degree=4.588235294117647, analyses_degree_assortativity=-0.47561309768461424, analyses_degree_std_dev=3.820360677912828, analyses_diameter=5, analyses_edge_properties=[0], analyses_edge_reciprocity=1.0, analyses_global_clustering=0.2556818181818182, analyses_hashimoto_radius=5.292780644548693, analyses_is_bipartite=False, analyses_is_directed=False, analyses_knn_proj_1=3.6123615105719784, analyses_knn_proj_2=1.4566019942625823, analyses_largest_component_fraction=1.0, analyses_mixing_time=7.04834107126513, analyses_num_edges=78, analyses_num_vertices=34, analyses_transition_gap=0.8677276709836416, analyses_vertex_properties=[3])
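The entry edge_index=[2, 156] alongside analyses_num_edges=78 reflects a common storage convention for undirected graphs: each undirected edge appears once per direction. A minimal NumPy sketch of that convention, using a hypothetical three-edge toy graph rather than the karate club data:

```python
import numpy as np

# Hypothetical toy graph: 3 undirected edges on 4 nodes (not the karate data).
undirected_edges = np.array([[0, 1], [1, 2], [2, 3]])

# Store every edge in both directions, giving an array of shape [2, 2 * num_edges].
sources = np.concatenate([undirected_edges[:, 0], undirected_edges[:, 1]])
targets = np.concatenate([undirected_edges[:, 1], undirected_edges[:, 0]])
edge_index = np.stack([sources, targets])

print(edge_index.shape)  # (2, 6)
```

By the same logic, 78 undirected edges yield the 156 directed entries shown above.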
Let's use a one-hot encoding of nodes as a simple additional node feature x, and let's use the node groups as target label y.
data = g.data
g["node_feature"] = torch.eye(g.n)
data['x'] = data['node_feature']
data['y'] = data['node_groups'].reshape(-1, 1).float()
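With an identity matrix as the feature matrix, multiplying the features by a weight matrix simply returns that weight matrix, so each node effectively receives its own learnable embedding row. The following NumPy sketch illustrates this; the sizes and the random weight matrix are assumptions chosen for illustration only:

```python
import numpy as np

num_nodes, hidden_dim = 5, 3
x = np.eye(num_nodes)                          # one-hot features, analogous to torch.eye(g.n)
rng = np.random.default_rng(0)
w = rng.normal(size=(num_nodes, hidden_dim))   # hypothetical layer weights

# Multiplying one-hot features by W returns W itself: row i is the embedding of node i.
projected = x @ w
print(np.allclose(projected, w))  # True
```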
It is easy to define a Graph Convolutional Network that uses the one-hot encodings of nodes and the topology to predict binary node labels:
model = Sequential('node_ohe, edge_index', [
    (GCNConv(in_channels=data.num_node_features, out_channels=8), 'node_ohe, edge_index -> hidden'),
    ReLU(inplace=True),
    (GCNConv(in_channels=8, out_channels=1), 'hidden, edge_index -> output'),
    Sigmoid(),
])
model.to(pp.config['torch']['device'])
Sequential( (0) - GCNConv(34, 8): node_ohe, edge_index -> hidden (1) - ReLU(inplace=True): hidden -> hidden (2) - GCNConv(8, 1): hidden, edge_index -> output (3) - Sigmoid(): output -> output )
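For intuition, a graph convolution layer in the sense of Kipf and Welling propagates features as X' = D^-1/2 (A + I) D^-1/2 X W, plus a bias. The following NumPy sketch spells out one such step followed by a ReLU. The toy path graph and the constant weights are hypothetical, the bias is omitted, and this is not the PyTorch Geometric implementation:

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One GCN propagation step: D^-1/2 (A + I) D^-1/2 X W (bias omitted)."""
    a_hat = adj + np.eye(adj.shape[0])   # add self-loops
    deg = a_hat.sum(axis=1)              # node degrees including self-loops
    d_inv_sqrt = np.diag(deg ** -0.5)    # symmetric normalization factors
    return d_inv_sqrt @ a_hat @ d_inv_sqrt @ features @ weight

# Hypothetical path graph 0 - 1 - 2 with one-hot node features.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
features = np.eye(3)
weight = np.ones((3, 2))                 # illustrative weights, not trained

hidden = np.maximum(gcn_layer(adj, features, weight), 0.0)  # ReLU activation
print(hidden.shape)  # (3, 2)
```

Each output row mixes a node's own features with those of its neighbors, weighted by the inverse square roots of the degrees involved.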
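We next apply a RandomNodeSplit transformation to split the nodes into a training and a validation set. With split='train_rest', num_val=0.5 and num_test=0, half of the nodes are held out for validation, the remaining nodes are used for training, and no separate test set is created.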
transform = torch_geometric.transforms.RandomNodeSplit(split='train_rest', num_val=0.5, num_test=0)
data = transform(data)
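The effect of such a split can be sketched with boolean masks in NumPy. This is a simplified illustration under the assumption that a fraction of randomly chosen nodes goes to validation and all remaining nodes go to training; the helper train_rest_split is hypothetical and not part of PyTorch Geometric:

```python
import numpy as np

def train_rest_split(num_nodes, val_fraction, seed=0):
    """Randomly mark a fraction of nodes for validation; the rest are used for training."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)
    num_val = round(num_nodes * val_fraction)
    val_mask = np.zeros(num_nodes, dtype=bool)
    val_mask[perm[:num_val]] = True
    train_mask = ~val_mask
    test_mask = np.zeros(num_nodes, dtype=bool)  # stays empty, mirroring num_test=0
    return train_mask, val_mask, test_mask

train_mask, val_mask, test_mask = train_rest_split(num_nodes=34, val_fraction=0.5)
print(train_mask.sum(), val_mask.sum(), test_mask.sum())  # 17 17 0
```

Note that the test mask selects no nodes at all in this configuration, so any evaluation has to use the validation mask.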
We then train our model for 1000 epochs on the training set.
epochs = 1000
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
losses = []
model.train()
for epoch in range(epochs):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = torch.nn.functional.binary_cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    losses.append(loss.cpu().detach().numpy())
plt.plot(range(epochs), losses)
plt.grid()
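The loss minimized in the loop above is the mean binary cross-entropy between the sigmoid outputs and the 0/1 labels of the training nodes. A minimal NumPy sketch of that formula follows; the predictions and targets are made-up values, and the clipping constant is an assumption that differs from how PyTorch guards against log(0):

```python
import numpy as np

def binary_cross_entropy(pred, target):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    pred = np.clip(pred, 1e-12, 1 - 1e-12)  # avoid log(0)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

# Hypothetical sigmoid outputs for four training nodes and their labels.
pred = np.array([0.9, 0.2, 0.8, 0.4])
target = np.array([1.0, 0.0, 1.0, 0.0])
loss = binary_cross_entropy(pred, target)
print(loss)
```

Confident correct predictions contribute little to the loss, while the uncertain prediction of 0.4 for a node with label 0 contributes the most.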
We evaluate the model on the held-out validation nodes and calculate the adjusted mutual information between the predicted groups and the ground truth. Since the split above uses num_test=0, data.test_mask selects no nodes, so we use data.val_mask instead.
model.eval()
predicted_groups = model(data.x, data.edge_index).round().long()
metrics.adjusted_mutual_info_score(data.y[data.val_mask].squeeze().cpu().numpy(), predicted_groups[data.val_mask].squeeze().cpu().numpy())
We visualize node representations learned by the model. The validation nodes are colored by group, while training nodes are greyed out.
# get activations in first-layer
embedding = model[0].forward(data.x, data.edge_index)
# dimensionality reduction
svd = TruncatedSVD()
low_dim = svd.fit_transform(embedding.cpu().detach().numpy())
# plot with colors corresponding to groups in validation set
colors = {}
for v in range(g.n):
    if not data.val_mask[v]:
        colors[v] = 'grey'
    else:
        if data.y[v].item() == 0.0:
            colors[v] = 'blue'
        else:
            colors[v] = 'orange'
plt.scatter(low_dim[:,0], low_dim[:,1], c=list(colors.values()));
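The dimensionality reduction step can also be understood directly in terms of a singular value decomposition. The sketch below uses plain NumPy with random numbers as a stand-in for the 34 x 8 hidden activations (an assumption for illustration) and keeps the two leading components, which matches a two-component truncated SVD up to the sign of each component:

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=(34, 8))   # stand-in for the 34 x 8 hidden activations

# Thin SVD: embedding = U @ diag(S) @ Vt, singular values sorted in descending order.
u, s, vt = np.linalg.svd(embedding, full_matrices=False)

# Keep the two leading components, i.e. project onto the top two right-singular vectors.
low_dim = u[:, :2] * s[:2]
print(low_dim.shape)  # (34, 2)
```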
This simple code puts thousands of networks with various meta-information at your fingertips, to which you can directly apply graph learning models provided in PyG or deep graph learning architectures that you define yourself.