Accessing the Netzschleuder Repository¶
Prerequisites¶
First, we need to set up a Python environment with PyTorch, PyTorch Geometric, and PathpyG installed. Depending on where you execute this notebook, this may already be (partially) done. For example, Google Colab has PyTorch installed by default, so we only need to install the remaining dependencies. The DevContainer that is part of our GitHub repository, on the other hand, already has all necessary dependencies installed.
In the following, we install the packages for use in Google Colab using Jupyter magic commands. For other environments, comment the commands in or out as necessary. For more details on how to install pathpyG, especially if you want to install it with GPU support, we refer to our documentation. Note that %%capture discards the full output of the cell so that this tutorial is not cluttered with installation details. If you want to see the output, comment out %%capture.
%%capture
# !pip install torch
!pip install torch_geometric
!pip install git+https://github.com/pathpy/pathpyG.git
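To decide which of the install commands above are actually needed in a given environment, one can first check which packages are already importable. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical; the module names are the import names used later in this tutorial.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the names from module_names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Import names of the packages this tutorial relies on.
required = ["torch", "torch_geometric", "pathpyG"]
missing = missing_packages(required)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Only the packages reported as missing need their `!pip install` line enabled.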
Motivation and Learning Objectives¶
Access to a large number of graphs with different topological characteristics and from different domains is crucial for the development and evaluation of graph learning methods. Thousands of graph data sets are scattered throughout the web, often in different data formats and with missing information on their actual origin. Addressing this issue, the Netzschleuder Online Repository by Tiago Peixoto provides a single repository of graphs in a single format, including descriptions, citations, and node-, edge-, or graph-level meta-data. To facilitate the development of graph learning techniques, pathpyG provides a feature that allows you to read networks directly from the Netzschleuder repository via an API.
In this brief unit, we will learn how we can retrieve network records and graph data from the netzschleuder repository. We will further demonstrate how we can conveniently apply a Graph Neural Network to predict node-level categories contained in the meta-data.
We first need to import a few modules.
import numpy as np
from matplotlib import pyplot as plt
from sklearn import metrics
from sklearn.decomposition import TruncatedSVD
import torch
from torch.nn import Linear, ReLU, Sigmoid, Parameter
import torch_geometric
from torch_geometric.nn import Sequential, GCNConv, SimpleConv, MessagePassing
import pathpyG as pp
pp.config['torch']['device'] = 'cpu'
Reading graphs from the netzschleuder repository¶
In the pathpyG.io module, there is a function that reads graph data from the API.
We can read a given network from the Netzschleuder database using its record name. Browse the Netzschleuder Online Repository to find the record names. In the following, we use a graph capturing co-purchase relationships between political books.
g = pp.io.read_netzschleuder_network('polbooks')
print(g)
Undirected graph with 105 nodes and 882 edges
Node attributes
    node_label <class 'list'>
    node__pos <class 'list'>
    node_value <class 'list'>
Graph attributes
    num_nodes <class 'int'>
    tags <class 'list'>
    citation <class 'str'>
    directed <class 'float'>
    name <class 'str'>
    url <class 'str'>
    description <class 'str'>
If we print the resulting Graph instance, we find that the meta information at the node and graph level is automatically retrieved and added to the graph.
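To give an idea of what happens behind such an API call, the sketch below builds the URL of the JSON record for a given record name and shows how it could be downloaded with the standard library. The base URL `https://networks.skewed.de` and the `/api/net/<name>` path are assumptions about the public Netzschleuder API, and the helpers `record_url` and `fetch_record` are hypothetical; this is not how pathpyG is implemented internally.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed public base URL of the Netzschleuder repository.
BASE_URL = "https://networks.skewed.de"


def record_url(name, base_url=BASE_URL):
    """Build the (assumed) URL of the JSON record for a record name."""
    return f"{base_url}/api/net/{quote(name)}"


def fetch_record(name, base_url=BASE_URL):
    """Download and parse the JSON record. Requires network access."""
    with urlopen(record_url(name, base_url)) as response:
        return json.load(response)


print(record_url("polbooks"))
# Uncomment to query the repository and inspect the available fields:
# record = fetch_record("polbooks")
# print(sorted(record.keys()))
```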
Let us read the famous karate club network. The record karate actually contains two networks with labels 77 and 78, which refer to two different versions of the graph data. If multiple graph data sets exist in the same record, we need to specify the name of the graph as the second argument.
g = pp.io.read_netzschleuder_network('karate', '77')
print(g)
Undirected graph with 34 nodes and 154 edges
Node attributes
    node_name <class 'list'>
    node__pos <class 'list'>
    node_groups <class 'list'>
Graph attributes
    num_nodes <class 'int'>
    tags <class 'list'>
    url <class 'str'>
    name <class 'str'>
    num_nodes <class 'int'>
    citation <class 'str'>
    name <class 'str'>
    url <class 'str'>
    description <class 'str'>
pp.plot(g, edge_color='gray');
We see that the nodes have a node_groups property, which maps each node to one of two groups. Those groups are often used as ground truth for communities in this simple illustrative graph. We will instead use them as ground-truth categorical node labels for a node classification experiment based on a Graph Neural Network.
print(g['node_groups'])
[[1], [1], [1], [1], [1], [1], [1], [1], [1], [2], [1], [1], [1], [1], [2], [2], [1], [1], [2], [1], [2], [1], [2], [2], [2], [2], [2], [2], [2], [2], [2], [2], [2], [2]]
We can plot categorical labels by passing them as node colors to the pathpyG plot function.
pp.plot(g, node_color = [g['node_groups',v][0] for v in g.nodes])
<pathpyG.visualisations.network_plots.StaticNetworkPlot at 0x7fab2bf61720>
For convenience, let us shift the group labels to binary values 0 and 1:
g['node_groups'] = torch.tensor(g['node_groups']).float()
g['node_groups'] -= 1
print(g['node_groups'])
tensor([[0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.], [1.], [0.], [0.], [0.], [0.], [1.], [1.], [0.], [0.], [1.], [0.], [1.], [0.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.]])
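The same label shift can be illustrated without tensors. The following sketch copies the nested group labels printed above into a plain Python list, flattens them to binary values, and counts how many nodes fall into each class. The variable names are chosen for illustration only.

```python
from collections import Counter

# Nested labels as printed above: one single-element list per node.
node_groups = [
    [1], [1], [1], [1], [1], [1], [1], [1], [1], [2],
    [1], [1], [1], [1], [2], [2], [1], [1], [2], [1],
    [2], [1], [2], [2], [2], [2], [2], [2], [2], [2],
    [2], [2], [2], [2],
]

# Flatten and shift the labels {1, 2} to {0, 1}.
binary_labels = [group[0] - 1 for group in node_groups]

# Class balance matters when interpreting classification scores later on.
class_counts = Counter(binary_labels)
print(len(binary_labels), dict(class_counts))
```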
Applying Graph Neural Networks to Netzschleuder Data¶
We can retrieve a data object that contains the graph and its attributes:
print(g.data)
Data(edge_index=[2, 154], num_nodes=34, node_name=[34], node_groups=[34, 1], node__pos=[34], name='karate (77)', description='Network of friendships among members of a university karate club. Includes metadata for faction membership after a social partition. Note: there are two versions of this network, one with 77 edges and one with 78, due to an ambiguous typo in the original study. (The most commonly used is the one with 78 edges.)[^icon] [^icon]: Description obtained from the [ICON](https://icon.colorado.edu) project.', citation='['W. W. Zachary, "An information flow model for conflict and fission in small groups." Journal of Anthropological Research 33, 452-473 (1977)., https://doi.org/10.1086/jar.33.4.3629752']', url='https://aaronclauset.github.io/datacode.htm', tags=[3], node_feature=[34, 34], x=[34, 34], y=[34, 1])
Let's use a one-hot encoding of nodes as a simple feature x, and let's use the node groups as target label y.
data = g.data
g.add_node_ohe('node_feature')
data['x'] = data['node_feature']
data['y'] = data['node_groups']
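For intuition, a one-hot encoding of nodes assigns node v a vector that is zero everywhere except at position v, so the feature matrix is an identity matrix. The sketch below builds such a matrix with NumPy. It assumes that this matches the encoding produced by `add_node_ohe`, which is consistent with the `[34, 34]` shape of `node_feature` printed above, but it is an illustration rather than pathpyG's implementation.

```python
import numpy as np

num_nodes = 34  # number of nodes in the karate club graph above

# Row v is the one-hot vector of node v, i.e. the identity matrix.
x = np.eye(num_nodes, dtype=np.float32)

print(x.shape)
print(x[3].argmax())  # position of the single non-zero entry in row 3
```

Such features carry no information beyond node identity, so the model has to rely on the graph topology to separate the two groups.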
It is easy to define a Graph Convolutional Network that uses the one-hot encodings of nodes and the topology to predict binary node labels:
model = Sequential('node_ohe, edge_index', [
    (GCNConv(in_channels=data.num_node_features, out_channels=8), 'node_ohe, edge_index -> hidden'),
    ReLU(inplace=True),
    (GCNConv(in_channels=8, out_channels=1), 'hidden, edge_index -> output'),
    Sigmoid(),
])
model.to(pp.config['torch']['device'])
Sequential( (0) - GCNConv(34, 8): node_ohe, edge_index -> hidden (1) - ReLU(inplace=True): hidden -> hidden (2) - GCNConv(8, 1): hidden, edge_index -> output (3) - Sigmoid(): output -> output )
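To make explicit what a single graph convolution computes, the following NumPy sketch implements the propagation rule of Kipf and Welling, H = D^{-1/2} (A + I) D^{-1/2} X W, on a toy graph. It is a simplified illustration under stated assumptions: it uses a dense adjacency matrix, omits the bias term, and the function name `gcn_layer` is hypothetical.

```python
import numpy as np


def gcn_layer(adjacency, features, weights):
    """One GCN propagation step: D^{-1/2} (A + I) D^{-1/2} X W (no bias)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])   # add self-loops
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # degrees are >= 1 due to self-loops
    norm_adj = deg_inv_sqrt[:, None] * a_hat * deg_inv_sqrt[None, :]
    return norm_adj @ features @ weights


# Toy path graph 0 - 1 - 2 with one-hot node features.
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=float)
features = np.eye(3)

# Random weights stand in for the learned parameters (3 inputs, 2 hidden units).
rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 2))

hidden = np.maximum(gcn_layer(adjacency, features, weights), 0.0)  # ReLU
print(hidden.shape)  # (3, 2)
```

Each row of the result mixes the (transformed) features of a node with those of its neighbors, which is why one-hot inputs are enough for the model to pick up community structure.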
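We next apply a RandomNodeSplit transformation to split the nodes into a training and a validation set. With num_val=0.5 and num_test=0, half of the nodes are held out in data.val_mask, the remaining nodes are used for training, and data.test_mask stays empty.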
transform = torch_geometric.transforms.RandomNodeSplit(split='train_rest', num_val=0.5, num_test=0)
data = transform(data)
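The effect of such a split can be sketched in a few lines of NumPy: shuffle the node indices, mark a fraction as validation nodes, and use the rest for training. The function `train_rest_split` is a hypothetical stand-in that mimics the `split='train_rest'` setting with `num_test=0`; it is not the PyTorch Geometric implementation, and the `seed` parameter is an assumption added for reproducibility.

```python
import numpy as np


def train_rest_split(num_nodes, val_fraction, seed=0):
    """Return boolean train/val/test masks; every non-validation node is used for training."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)
    num_val = round(num_nodes * val_fraction)

    val_mask = np.zeros(num_nodes, dtype=bool)
    val_mask[perm[:num_val]] = True
    test_mask = np.zeros(num_nodes, dtype=bool)  # no test nodes are requested
    train_mask = ~(val_mask | test_mask)
    return train_mask, val_mask, test_mask


train_mask, val_mask, test_mask = train_rest_split(num_nodes=34, val_fraction=0.5)
print(train_mask.sum(), val_mask.sum(), test_mask.sum())
```

Note that the test mask contains no nodes in this configuration, so any evaluation has to use the validation mask.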
We then train our model for 1000 epochs on the training set.
epochs = 1000
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
losses = []
model.train()
for epoch in range(epochs):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = torch.nn.functional.binary_cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    losses.append(loss.cpu().detach().numpy())
plt.plot(range(epochs), losses)
plt.grid()
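The loss minimized in this loop is the binary cross-entropy between the predicted probabilities and the 0/1 labels of the training nodes. The following NumPy sketch spells out the formula. The clipping constant `eps` is an assumption of this sketch to avoid log(0) and may differ from how PyTorch guards against that case.

```python
import numpy as np


def binary_cross_entropy(probabilities, targets, eps=1e-12):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over all samples."""
    p = np.clip(probabilities, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p)))


targets = np.array([0.0, 1.0, 1.0])

# An uninformed model that always predicts 0.5 has loss log(2), about 0.693.
print(binary_cross_entropy(np.array([0.5, 0.5, 0.5]), targets))

# Confident and correct predictions give a much smaller loss.
print(binary_cross_entropy(np.array([0.1, 0.9, 0.8]), targets))
```

A loss curve that starts near log(2) and decreases therefore indicates that the model moves away from uninformed predictions on the training nodes.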
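We evaluate the model on the held-out nodes and calculate the adjusted mutual information between the predicted groups and the ground truth. Because the split above uses num_test=0, data.test_mask is empty and would trivially yield a score of 1.0, so we evaluate on data.val_mask instead. The resulting score depends on the random split and the weight initialization.
model.eval()
predicted_groups = model(data.x, data.edge_index).round().long()
metrics.adjusted_mutual_info_score(data.y[data.val_mask].squeeze().cpu().numpy(), predicted_groups[data.val_mask].squeeze().cpu().numpy())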
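As a complement to the adjusted mutual information, plain accuracy on the held-out nodes is easy to compute by hand. Note that this is a different metric from the one used above: unlike mutual information, it is sensitive to swapped label names. The helper `masked_accuracy` is hypothetical; it returns `None` for a mask that selects no nodes, since a score computed on zero nodes carries no information.

```python
import numpy as np


def masked_accuracy(y_true, y_pred, mask):
    """Fraction of correct predictions among the nodes selected by mask."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return None  # nothing to evaluate
    return float(np.mean(np.asarray(y_true)[mask] == np.asarray(y_pred)[mask]))


# Toy example with five nodes, three of which are held out.
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])
held_out = np.array([False, True, True, True, False])

print(masked_accuracy(y_true, y_pred, held_out))
print(masked_accuracy(y_true, y_pred, np.zeros(5, dtype=bool)))
```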
We visualize the node representations learned by the model. The training nodes are colored by group, while the held-out validation nodes are greyed out.
# get activations of the first layer
embedding = model[0](data.x, data.edge_index)
# dimensionality reduction to two dimensions
svd = TruncatedSVD()
low_dim = svd.fit_transform(embedding.cpu().detach().numpy())
# color training nodes by group, grey out validation nodes
colors = {}
for v in range(g.N):
    if data.val_mask[v]:
        colors[v] = 'grey'
    elif data.y[v].item() == 0.0:
        colors[v] = 'blue'
    else:
        colors[v] = 'orange'
plt.scatter(low_dim[:,0], low_dim[:,1], c=list(colors.values()));
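The dimensionality reduction used for this plot can be sketched with a plain singular value decomposition: the embedding matrix is projected onto its two leading right singular vectors. The snippet below uses random numbers in place of the learned 34 x 8 embedding, and the function `project_2d` is hypothetical. Results from `TruncatedSVD` may differ in sign and, depending on the solver, slightly in value.

```python
import numpy as np


def project_2d(embedding):
    """Project rows onto the two leading singular directions (no centering)."""
    u, s, _ = np.linalg.svd(embedding, full_matrices=False)
    return u[:, :2] * s[:2]


# Random stand-in for the first-layer activations (34 nodes, 8 hidden units).
rng = np.random.default_rng(0)
embedding = rng.normal(size=(34, 8))

low_dim = project_2d(embedding)
print(low_dim.shape)  # (34, 2)
```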
This simple code puts thousands of networks with various meta information at your fingertips, to which you can directly apply graph learning models provided in PyG or deep graph learning architectures that you define yourself.