Basic pathpyG Concepts¶
Prerequisites¶
First, we need to set up our Python environment that has PyTorch, PyTorch Geometric and PathpyG installed. Depending on where you are executing this notebook, this might already be (partially) done. E.g. Google Colab has PyTorch installed by default so we only need to install the remaining dependencies. The DevContainer that is part of our GitHub Repository on the other hand already has all of the necessary dependencies installed.
In the following, we install the packages for usage in Google Colab using Jupyter magic commands. For other environments comment in or out the commands as necessary. For more details on how to install pathpyG
especially if you want to install it with GPU-support, we refer to our documentation. Note that %%capture
discards the full output of the cell to not clutter this tutorial with unnecessary installation details. If you want to print the output, you can comment %%capture
out.
%%capture
# !pip install torch
!pip install torch_geometric
!pip install git+https://github.com/pathpy/pathpyG.git
Motivation and Learning Objectives¶
This first step of our multi-stage introductory tutorial introduces the key concepts of pathpyG
. While pathpyG
targets GPU-accelerated analysis and learning in time series data on temporal graphs, it is also a great tool to represent, analyze and visualize static graphs. For this, it provides a Graph
class that is implemented based on the torch_geometric.data.Data
object, which has the advantage that we can directly apply pyG
transforms and use the underlying data for graph learning tasks.
In this basic tutorial you will learn how we can use pathpyG
to represent static graphs. We start with basic features to create directed and undirected graphs with node-, edge-, and graph-level attributes. We further discuss how we can calculate node centralities, and how we can read graph data from the netzschleuder
database. We finally show how we can implement graph algorithms that are based on a traversal of edges.
We first import the modules torch
, torch_geometric
and pathpyG
. By setting the device used by torch
, we specify whether we want to run our code on the CPU or on the GPU. For a CPU-based execution, set the torch.device
configuration to cpu
. Set the device to cuda
if you want to run it on the GPU instead.
import torch
import torch_geometric as pyG
from torch_geometric.data import Data
import pathpyG as pp
pp.config['torch']['device'] = 'cpu'
Creating Graphs¶
Let's start by generating a simple, directed graph with three nodes a
, b
, c
and three edges (a,b)
, (b,c)
and (a,b)
. The three nodes a
, b
, and c
can be represented by integer indices $0, 1$ and $2$ respectively. Following the tensor-based representation in pyG
, we use an edge_index
tensor with shape (2,m)
to represent the m
edges of a graph.
We can then add this to a Data
object that can possibly hold additional node and edge attributes, and pass the Data
object to the constructor of the Graph
class.
Using the mapping of node names to indices specified above, the following code generates a directed Graph
with three edges (a,c)
, (b,c)
and (a,b)
.
d = Data(edge_index = torch.tensor([[0,1,0], [2,2,1]]))
g = pp.Graph(d)
print(g)
Directed graph with 3 nodes and 3 edges Graph attributes num_nodes <class 'int'>
If we do not need additional node or edge attributes, we can use the class function Graph.from_edge_index
to directly create a graph based on an edge index:
g = pp.Graph.from_edge_index(torch.tensor([[0,1,0], [2,2,1]]))
print(g)
Directed graph with 3 nodes and 3 edges Graph attributes num_nodes <class 'int'>
We may want to inlude isolated nodes that do not have an edge. We can do so by passing a num_nodes
parameter.
g = pp.Graph.from_edge_index(torch.tensor([[0,1,0], [2,2,1]]), num_nodes=4)
print(g)
Directed graph with 4 nodes and 3 edges Graph attributes num_nodes <class 'int'>
In both cases, the Graph
instance has a property g.data
that stores a pyG
Data
object that includes the edge index as well as any further node-, edge- or graph-level attributes.
print(g.data)
Data(edge_index=[2, 3], num_nodes=4)
Note that the edge_index
is actually of type pyG.EdgeIndex
, which is a subclass of torch.Tensor
. Any tensor passed as an edge index in the constructor of Graph
will automatically be converted to an EdgeIndex
instance, as this allows us to provide efficient edge traveral routines. To support this, the edge index will be automatically sorted by row when the Graph
object is created. If you want to avoid this, you can directly pass an already sorted EdgeIndex
object in the Data
object in the constructor or in the from_edge_index
class function.
print(g.data.edge_index)
EdgeIndex([[0, 0, 1], [2, 1, 2]], sparse_size=(4, 4), nnz=3, sort_order=row)
We can use the generators nodes
and edges
to iterate through the nodes and edges of a graph as follows:
for v in g.nodes:
print(v)
for e in g.edges:
print(e)
0 1 2 3 (0, 2) (0, 1) (1, 2)
While the index-based representation of nodes allows for efficient tensor-based operations, it is often convenient to use string identifiers for nodes. To simplify the handling of graphs with node labels, pathpyG
provides a class IndexMap
that transparently maps string identifiers to indices. For our small example graph, we can create an IndexMap
that associates node indices with string IDs. For our example, we can create a mapping as follows:
m = pp.IndexMap(['a', 'b', 'c', 'd'])
print(m)
a -> 0 b -> 1 c -> 2 d -> 3
We could now use the functions IndexMap.to_id
or IndexMap.to_idx
to map a node to an index or an ID:
m.to_id(0)
'a'
m.to_idx('b')
1
To make this mapping transparent for the user, we can add the mapping to Graph
object, either by passing it in the constructor or by setting the mapping
attribute of an existing instance.
g.mapping = m
If we now iterate through the nodes and edges of the graph, we get:
for v in g.nodes:
print(v)
for e in g.edges:
print(e)
a b c d ('a', 'c') ('a', 'b') ('b', 'c')
We can achieve the same result if we pass the IndexMap
object in the constructor of a graph. This transparently applies the mapping in all future function calls.
g = pp.Graph.from_edge_index(torch.tensor([[0,1,0], [2,2,1]]), num_nodes = 4, mapping=m)
print(g)
Directed graph with 4 nodes and 3 edges Graph attributes num_nodes <class 'int'>
Above, we have created a graph based on an edge index tensor and we then additionally applied a mapping that we manually defined. However, we often have data in the form on an edge list, where edges are given as tuples of non-numeric node identifiers. The class function Graph.from_edge_list
simplifies the construction of a Graph
from such edge lists. This will automatically generate an internal integer-based representation of the edge index, as well as the associated IndexMap
, where the indices are created based on the lexicographic order of node IDs.
g = pp.Graph.from_edge_list([('a','b'), ('b','c'), ('a','c')])
print(g)
print(g.data.edge_index)
print(g.mapping)
Directed graph with 3 nodes and 3 edges Graph attributes num_nodes <class 'int'> EdgeIndex([[0, 0, 1], [1, 2, 2]], sparse_size=(3, 3), nnz=3, sort_order=row) a -> 0 b -> 1 c -> 2
We could alternatively pass our own index mapping, e.g. mapping node c
to idex 1 and node b
to index 2.
g = pp.Graph.from_edge_list([('a','b'), ('a','c'), ('b','c')], mapping = pp.IndexMap(['a', 'c', 'b']))
print(g.data.edge_index)
print(g.mapping)
EdgeIndex([[0, 0, 2], [2, 1, 1]], sparse_size=(3, 3), nnz=3, sort_order=row) a -> 0 c -> 1 b -> 2
Traversing Graphs¶
The Graph
object provides get_successors
and get_predecessors
functions, which return the indices of nodes that are connected to a node with a given index. Based on the CSR and CSC representations cached for the sorted EdgeIndex
, the access of the successors and predecessors is done in constant time, i.e. it does not require us to enumerate the edge_index
tensor.
For node a
with index $0$ in our directed network we obtain:
g.get_successors(0)
tensor([2, 1])
g.get_predecessors(0)
tensor([], dtype=torch.int64)
Note that, even if a mapping is defined, the get_successors
and get_predecessors
functions always return a tensor with node indices, rather than node IDs. This is useful to support fast tensor-based operations on the list of successors and predecessors. We could always manually map the node indices using the IndexMap
object defined in the mapping
attribute.
We can also use the successors
and predecessors
generators of the Graph
object, which -- if an ID-Index mapping is defined - yield the string labels of successor or predecessor nodes for a given node (also identified by its string label).
for v in g.successors('a'):
print(v)
b c
for v in g.predecessors('c'):
print(v)
a b
We can easily check, whether an edge exists in the graph:
g.is_edge('a', 'b')
True
Alternatively, we can use the following code to check whether node b
is a successor of a
'b' in g.successors('a')
True
By default, a graph object in pathpyG
is directed, i.e. for the graph above, the edge (b,a)
does not exist, which we can verify as follows:
'a' in g.successors('b')
False
To check the (directed) in- and out-degrees of nodes, we can use the properties in_degrees
and out_degrees
, which return a dictionary that maps node IDs to their degrees:
g.in_degrees
{'a': 0, 'c': 2, 'b': 1}
g.in_degrees['b']
1
g.in_degrees['c']
2
g.data
Data(edge_index=[2, 3], num_nodes=3)
Importantly, irrespective of how we have generated the graph object, the actual node and edge data are always stored as a pyG
data object. This allows us to use the full power of torch
and pyG
, including the application of transforms, splits, or any easy migration between CPU and GPU-based computation. In general, pathpyG
will use the device specified in the torch.device
configuration (see above) whenver it internally creates a torch tensor. Since above, we have specified the cpu
device, the data object of the graph generated above will reside in main memory:
g.data.is_cuda
False
If we instead set the device to cuda
, the Data
object will internally be created in main memory instead.
pp.config['torch']['device'] = 'cuda'
g = pp.Graph.from_edge_list([('a','b'), ('b','c'), ('a','c')])
g.data.is_cuda
True
Node-, Edge- or Graph-Level Attributes¶
Real-world graphs often have node-, edge-, or graph-level attributes. In pathpyG
, we can simply add attributes as tensors, either by directly assigning them to the pyG
data object of an existing graph (or by adding them to the Data
object that we pass in the constructor). Following the pyG
semantics of attribute names, we must use the prefixes node_
and edge_
to refer to node- and edge-level attributes. Attributes with other names will be assumed to refer to graph-level attributes.
g.data['node_class'] = torch.tensor([[0], [0], [1]], device=pp.config['torch']['device'])
g.data['edge_weight'] = torch.tensor([[1], [2], [3]], device=pp.config['torch']['device'])
g.data['feature'] = torch.tensor([3, 2], device=pp.config['torch']['device'])
Once we have added attributes to nodes, edges, or the graph, those attributes, along with their type and shape will be shown when you print a string representation of the graph object:
print(g)
Directed graph with 3 nodes and 3 edges Node attributes node_class <class 'torch.Tensor'> -> torch.Size([3, 1]) Edge attributes edge_weight <class 'torch.Tensor'> -> torch.Size([3, 1]) Graph attributes feature <class 'torch.Tensor'> -> torch.Size([2]) num_nodes <class 'int'>
To simplify the access to attribute values, the Graph
class provides getter and setter functions that allow an indexed access based on node identifiers. To access the feature node_feature
of node a
, we can write:
g['node_class', 'a']
tensor([0], device='cuda:0')
To access the weight of edge (a, b)
we can write:
g['edge_weight', 'a', 'b']
tensor([1], device='cuda:0')
And finally, graph-based attributes can accessed as follows:
g['feature']
tensor([3, 2], device='cuda:0')
We can also use the setter functions to change attributes:
g['node_class'] = torch.tensor([[7], [2], [3]], device='cuda')
g['node_class', 'a']
tensor([7], device='cuda:0')
To create a sparse adjacency matrix representations of the topology of a graph, we can use the following function:
print(g.get_sparse_adj_matrix())
(0, 1) 1.0 (0, 2) 1.0 (1, 2) 1.0
This returns a scipy.sparse.coo_matrix
object, which can be turned into a dense numpy
matrix as follows:
print(g.get_sparse_adj_matrix().todense())
[[0. 1. 1.] [0. 0. 1.] [0. 0. 0.]]
By passing the name of the attribute, we can also use edge attributes in the creation of the adjacency matrix. To create a sparse, weighted adjacency matrix that uses the edge_weight
attribute of our graph object we can simply write:
print(g.get_sparse_adj_matrix(edge_attr='edge_weight').todense())
[[0 1 2] [0 0 3] [0 0 0]]
It is easy to apply GNN models to a graph object. We can add attributes based on one-hot-encodings of nodes and edges as follows:
g.add_node_ohe(attr_name='node_ohe_feature_1')
g.add_node_ohe(attr_name='node_ohe_feature_2', dim=4)
g.add_edge_ohe(attr_name='edge_ohe_feature_1', dim=5)
print(g)
print(g.data['node_ohe_feature_1'])
print(g.data['node_ohe_feature_2'])
print(g.data['edge_ohe_feature_1'])
Directed graph with 3 nodes and 3 edges Node attributes node_class <class 'torch.Tensor'> -> torch.Size([3, 1]) node_ohe_feature_2 <class 'torch.Tensor'> -> torch.Size([3, 4]) node_ohe_feature_1 <class 'torch.Tensor'> -> torch.Size([3, 3]) Edge attributes edge_ohe_feature_1 <class 'torch.Tensor'> -> torch.Size([3, 5]) edge_weight <class 'torch.Tensor'> -> torch.Size([3, 1]) Graph attributes feature <class 'torch.Tensor'> -> torch.Size([2]) num_nodes <class 'int'> tensor([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]], device='cuda:0') tensor([[1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.]], device='cuda:0') tensor([[1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.], [0., 0., 1., 0., 0.]], device='cuda:0')
By default, all graphs in pathpyG
are directed. To represent undirected graphs, we must add all edges in both directions. We can use the to_undirected()
function to make a directed graph undirected, i.e. to add all (missing) edges that point in the opposite direction. This will automatically duplicate and assign the corresponding edge attributes to the newly formed (directed) edges, i.e. edges are assumed to have the same attributes in both directions.
g_u = g.to_undirected()
print(g_u)
Undirected graph with 3 nodes and 6 (directed) edges Node attributes node_class <class 'torch.Tensor'> -> torch.Size([3, 1]) node_ohe_feature_2 <class 'torch.Tensor'> -> torch.Size([3, 4]) node_ohe_feature_1 <class 'torch.Tensor'> -> torch.Size([3, 3]) Edge attributes edge_ohe_feature_1 <class 'torch.Tensor'> -> torch.Size([6, 5]) edge_weight <class 'torch.Tensor'> -> torch.Size([6, 1]) Graph attributes feature <class 'torch.Tensor'> -> torch.Size([2]) num_nodes <class 'int'>
By default, the Graph
object can contain multiple identical edges, so the following is possible:
g = pp.Graph.from_edge_list([('a', 'b'), ('b', 'c'), ('c', 'a'), ('a', 'b')])
print(g.data.edge_index)
EdgeIndex([[0, 0, 1, 2], [1, 1, 2, 0]], device='cuda:0', sparse_size=(3, 3), nnz=4, sort_order=row)
It is often convenient, to coalesce multi-edges into weighted single-edges, i.e. in the example above we may prefer a graph where each edge occurs once in the edge index, but the edge a->b
has a weight attribute of two, while the two other edges have one.
In pathpyG
we can do this as follows:
g_w = g.to_weighted_graph()
print(g_w.data.edge_index)
print(g_w['edge_weight', 'b', 'c'])
print(g_w['edge_weight', 'a', 'b'])
print(g_w['edge_weight', 'c', 'a'])
EdgeIndex([[0, 1, 2], [1, 2, 0]], device='cuda:0', sparse_size=(3, 3), nnz=3, sort_order=row) tensor(1., device='cuda:0') tensor(2., device='cuda:0') tensor(1., device='cuda:0')
As we will see in a separate notebook focussing on the advanced (temporal) graph visualization features of pathpyG
, it is easy to generate (interactive) HTML plots of graphs, that are embedded into jupyter notebooks. You can simply call the pp.plot
function on the Graph object:
pp.plot(g, edge_color='gray', node_label=[v for v in g.mapping.node_ids]);
Node Centralities¶
To calculate node centralities, we can use a networkx
delegate mechanism implemented in the module pathpyG.algorithms.centrality
. Simply speaking, you can call any function implented in the networkx centrality module that starts with the string centrality_
. The pathpyG
will be internally converted to a networkx.DiGraph
object, the corresponding centrality function (with all of its parameters) will be called, and the result will be mapped to the nodes based on their IDs.
In order to calculate the closeness centralities of all nodes for the graph above, we can call:
pp.algorithms.centrality.closeness_centrality(g)
{'a': 0.6666666666666666, 'b': 0.6666666666666666, 'c': 0.6666666666666666}
pp.algorithms.centrality.betweenness_centrality(g)
defaultdict(<function pathpyG.algorithms.centrality.betweenness_centrality.<locals>.<lambda>()>, {'c': 1.0, 'b': 2.0, 'a': 1.0})