Basic pathpyG Concepts¶

Prerequisites¶

First, we need to set up our Python environment that has PyTorch, PyTorch Geometric and PathpyG installed. Depending on where you are executing this notebook, this might already be (partially) done. E.g. Google Colab has PyTorch installed by default so we only need to install the remaining dependencies. The DevContainer that is part of our GitHub Repository on the other hand already has all of the necessary dependencies installed.

In the following, we install the packages for usage in Google Colab using Jupyter magic commands. For other environments comment in or out the commands as necessary. For more details on how to install pathpyG especially if you want to install it with GPU-support, we refer to our documentation. Note that %%capture discards the full output of the cell to not clutter this tutorial with unnecessary installation details. If you want to print the output, you can comment %%capture out.

In [1]:

Copied!





%%capture
# !pip install torch
!pip install torch_geometric
!pip install git+https://github.com/pathpy/pathpyG.git
%%capture
# !pip install torch
!pip install torch_geometric
!pip install git+https://github.com/pathpy/pathpyG.git

Motivation and Learning Objectives¶

This first step of our multi-stage introductory tutorial introduces the key concepts of pathpyG. While pathpyG targets GPU-accelerated analysis and learning in time series data on temporal graphs, it is also a great tool to represent, analyze and visualize static graphs. For this, it provides a Graph class that is implemented based on the torch_geometric.data.Data object, which has the advantage that we can directly apply pyG transforms and use the underlying data for graph learning tasks.

In this basic tutorial you will learn how we can use pathpyG to represent static graphs. We start with basic features to create directed and undirected graphs with node-, edge-, and graph-level attributes. We further discuss how we can calculate node centralities, and how we can read graph data from the netzschleuder database. We finally show how we can implement graph algorithms that are based on a traversal of edges.

We first import the modules torch, torch_geometric and pathpyG. By setting the device used by torch, we specify whether we want to run our code on the CPU or on the GPU. For a CPU-based execution, set the torch.device configuration to cpu. Set the device to cuda if you want to run it on the GPU instead.

In [9]:

Copied!





import torch
import torch_geometric as pyG
from torch_geometric.data import Data

import pathpyG as pp
pp.config['torch']['device'] = 'cpu'
import torch
import torch_geometric as pyG
from torch_geometric.data import Data

import pathpyG as pp
pp.config['torch']['device'] = 'cpu'

Creating Graphs¶

Let's start by generating a simple, directed graph with three nodes a, b, c and three edges (a,b), (b,c) and (a,b). The three nodes a, b, and c can be represented by integer indices $0, 1$ and $2$ respectively. Following the tensor-based representation in pyG, we use an edge_index tensor with shape (2,m) to represent the m edges of a graph. We can then add this to a Data object that can possibly hold additional node and edge attributes, and pass the Data object to the constructor of the Graph class.

Using the mapping of node names to indices specified above, the following code generates a directed Graph with three edges (a,c), (b,c) and (a,b).

In [10]:

Copied!

d = Data(edge_index = torch.tensor([[0,1,0], [2,2,1]]))
g = pp.Graph(d)
print(g)
d = Data(edge_index = torch.tensor([[0,1,0], [2,2,1]]))
g = pp.Graph(d)
print(g)

Directed graph with 3 nodes and 3 edges

Graph attributes
	num_nodes		<class 'int'>

If we do not need additional node or edge attributes, we can use the class function Graph.from_edge_index to directly create a graph based on an edge index:

In [11]:

Copied!

g = pp.Graph.from_edge_index(torch.tensor([[0,1,0], [2,2,1]]))
print(g)
g = pp.Graph.from_edge_index(torch.tensor([[0,1,0], [2,2,1]]))
print(g)

Directed graph with 3 nodes and 3 edges

Graph attributes
	num_nodes		<class 'int'>

We may want to inlude isolated nodes that do not have an edge. We can do so by passing a num_nodes parameter.

In [15]:

Copied!

g = pp.Graph.from_edge_index(torch.tensor([[0,1,0], [2,2,1]]), num_nodes=4)
print(g)
g = pp.Graph.from_edge_index(torch.tensor([[0,1,0], [2,2,1]]), num_nodes=4)
print(g)

Directed graph with 4 nodes and 3 edges

Graph attributes
	num_nodes		<class 'int'>

In both cases, the Graph instance has a property g.data that stores a pyG Data object that includes the edge index as well as any further node-, edge- or graph-level attributes.

In [16]:

Copied!

print(g.data)
print(g.data)

Data(edge_index=[2, 3], num_nodes=4)

Note that the edge_index is actually of type pyG.EdgeIndex, which is a subclass of torch.Tensor. Any tensor passed as an edge index in the constructor of Graph will automatically be converted to an EdgeIndex instance, as this allows us to provide efficient edge traveral routines. To support this, the edge index will be automatically sorted by row when the Graph object is created. If you want to avoid this, you can directly pass an already sorted EdgeIndex object in the Data object in the constructor or in the from_edge_index class function.

In [17]:

Copied!

print(g.data.edge_index)
print(g.data.edge_index)

EdgeIndex([[0, 0, 1],
           [2, 1, 2]], sparse_size=(4, 4), nnz=3, sort_order=row)

We can use the generators nodes and edges to iterate through the nodes and edges of a graph as follows:

In [18]:

Copied!

for v in g.nodes:
    print(v)

for e in g.edges:
    print(e)
for v in g.nodes:
    print(v)

for e in g.edges:
    print(e)

0
1
2
3
(0, 2)
(0, 1)
(1, 2)

While the index-based representation of nodes allows for efficient tensor-based operations, it is often convenient to use string identifiers for nodes. To simplify the handling of graphs with node labels, pathpyG provides a class IndexMap that transparently maps string identifiers to indices. For our small example graph, we can create an IndexMap that associates node indices with string IDs. For our example, we can create a mapping as follows:

In [19]:

Copied!

m = pp.IndexMap(['a', 'b', 'c', 'd'])
print(m)
m = pp.IndexMap(['a', 'b', 'c', 'd'])
print(m)

a -> 0
b -> 1
c -> 2
d -> 3

We could now use the functions IndexMap.to_id or IndexMap.to_idx to map a node to an index or an ID:

In [20]:

Copied!

m.to_id(0)
m.to_id(0)

Out[20]:

'a'

In [21]:

Copied!

m.to_idx('b')
m.to_idx('b')

Out[21]:

To make this mapping transparent for the user, we can add the mapping to Graph object, either by passing it in the constructor or by setting the mapping attribute of an existing instance.

In [22]:

Copied!

g.mapping = m
g.mapping = m

If we now iterate through the nodes and edges of the graph, we get:

In [23]:

Copied!

for v in g.nodes:
    print(v)

for e in g.edges:
    print(e)
for v in g.nodes:
    print(v)

for e in g.edges:
    print(e)

a
b
c
d
('a', 'c')
('a', 'b')
('b', 'c')

We can achieve the same result if we pass the IndexMap object in the constructor of a graph. This transparently applies the mapping in all future function calls.

In [25]:

Copied!

g = pp.Graph.from_edge_index(torch.tensor([[0,1,0], [2,2,1]]), num_nodes = 4, mapping=m)
print(g)
g = pp.Graph.from_edge_index(torch.tensor([[0,1,0], [2,2,1]]), num_nodes = 4, mapping=m)
print(g)

Directed graph with 4 nodes and 3 edges

Graph attributes
	num_nodes		<class 'int'>

Above, we have created a graph based on an edge index tensor and we then additionally applied a mapping that we manually defined. However, we often have data in the form on an edge list, where edges are given as tuples of non-numeric node identifiers. The class function Graph.from_edge_list simplifies the construction of a Graph from such edge lists. This will automatically generate an internal integer-based representation of the edge index, as well as the associated IndexMap, where the indices are created based on the lexicographic order of node IDs.

In [26]:

Copied!





g = pp.Graph.from_edge_list([('a','b'), ('b','c'), ('a','c')])
print(g)
print(g.data.edge_index)
print(g.mapping)
g = pp.Graph.from_edge_list([('a','b'), ('b','c'), ('a','c')])
print(g)
print(g.data.edge_index)
print(g.mapping)

Directed graph with 3 nodes and 3 edges

Graph attributes
	num_nodes		<class 'int'>

EdgeIndex([[0, 0, 1],
           [1, 2, 2]], sparse_size=(3, 3), nnz=3, sort_order=row)
a -> 0
b -> 1
c -> 2

We could alternatively pass our own index mapping, e.g. mapping node c to idex 1 and node b to index 2.

In [27]:

Copied!

g = pp.Graph.from_edge_list([('a','b'), ('a','c'), ('b','c')], mapping = pp.IndexMap(['a', 'c', 'b']))
print(g.data.edge_index)
print(g.mapping)
g = pp.Graph.from_edge_list([('a','b'), ('a','c'), ('b','c')], mapping = pp.IndexMap(['a', 'c', 'b']))
print(g.data.edge_index)
print(g.mapping)

EdgeIndex([[0, 0, 2],
           [2, 1, 1]], sparse_size=(3, 3), nnz=3, sort_order=row)
a -> 0
c -> 1
b -> 2

Traversing Graphs¶

The Graph object provides get_successors and get_predecessors functions, which return the indices of nodes that are connected to a node with a given index. Based on the CSR and CSC representations cached for the sorted EdgeIndex, the access of the successors and predecessors is done in constant time, i.e. it does not require us to enumerate the edge_index tensor.

For node a with index $0$ in our directed network we obtain:

In [28]:

Copied!

g.get_successors(0)
g.get_successors(0)

Out[28]:

tensor([2, 1])

In [29]:

Copied!

g.get_predecessors(0)
g.get_predecessors(0)

Out[29]:

tensor([], dtype=torch.int64)

Note that, even if a mapping is defined, the get_successors and get_predecessors functions always return a tensor with node indices, rather than node IDs. This is useful to support fast tensor-based operations on the list of successors and predecessors. We could always manually map the node indices using the IndexMap object defined in the mapping attribute.

We can also use the successors and predecessors generators of the Graph object, which -- if an ID-Index mapping is defined - yield the string labels of successor or predecessor nodes for a given node (also identified by its string label).

In [30]:

Copied!

for v in g.successors('a'):
    print(v)
for v in g.successors('a'):
    print(v)

b
c

In [31]:

Copied!

for v in g.predecessors('c'):
    print(v)
for v in g.predecessors('c'):
    print(v)

a
b

We can easily check, whether an edge exists in the graph:

In [32]:

Copied!

g.is_edge('a', 'b')
g.is_edge('a', 'b')

Out[32]:

True

Alternatively, we can use the following code to check whether node b is a successor of a

In [33]:

Copied!

'b' in g.successors('a')
'b' in g.successors('a')

Out[33]:

True

By default, a graph object in pathpyG is directed, i.e. for the graph above, the edge (b,a) does not exist, which we can verify as follows:

In [34]:

Copied!

'a' in g.successors('b')
'a' in g.successors('b')

Out[34]:

False

To check the (directed) in- and out-degrees of nodes, we can use the properties in_degrees and out_degrees, which return a dictionary that maps node IDs to their degrees:

In [35]:

Copied!

g.in_degrees
g.in_degrees

Out[35]:

{'a': 0, 'c': 2, 'b': 1}

In [36]:

Copied!

g.in_degrees['b']
g.in_degrees['b']

Out[36]:

In [37]:

Copied!

g.in_degrees['c']
g.in_degrees['c']

Out[37]:

In [38]:

Copied!

g.data
g.data

Out[38]:

Data(edge_index=[2, 3], num_nodes=3)

Importantly, irrespective of how we have generated the graph object, the actual node and edge data are always stored as a pyG data object. This allows us to use the full power of torch and pyG, including the application of transforms, splits, or any easy migration between CPU and GPU-based computation. In general, pathpyG will use the device specified in the torch.device configuration (see above) whenver it internally creates a torch tensor. Since above, we have specified the cpu device, the data object of the graph generated above will reside in main memory:

In [39]:

Copied!

g.data.is_cuda
g.data.is_cuda

Out[39]:

False

If we instead set the device to cuda, the Data object will internally be created in main memory instead.

In [40]:

Copied!

pp.config['torch']['device'] = 'cuda'

g = pp.Graph.from_edge_list([('a','b'), ('b','c'), ('a','c')])
g.data.is_cuda
pp.config['torch']['device'] = 'cuda'

g = pp.Graph.from_edge_list([('a','b'), ('b','c'), ('a','c')])
g.data.is_cuda

Out[40]:

True

Node-, Edge- or Graph-Level Attributes¶

Real-world graphs often have node-, edge-, or graph-level attributes. In pathpyG, we can simply add attributes as tensors, either by directly assigning them to the pyG data object of an existing graph (or by adding them to the Data object that we pass in the constructor). Following the pyG semantics of attribute names, we must use the prefixes node_ and edge_ to refer to node- and edge-level attributes. Attributes with other names will be assumed to refer to graph-level attributes.

In [41]:

Copied!

g.data['node_class'] = torch.tensor([[0], [0], [1]], device=pp.config['torch']['device'])
g.data['edge_weight'] = torch.tensor([[1], [2], [3]], device=pp.config['torch']['device'])
g.data['feature'] = torch.tensor([3, 2], device=pp.config['torch']['device'])
g.data['node_class'] = torch.tensor([[0], [0], [1]], device=pp.config['torch']['device'])
g.data['edge_weight'] = torch.tensor([[1], [2], [3]], device=pp.config['torch']['device'])
g.data['feature'] = torch.tensor([3, 2], device=pp.config['torch']['device'])

Once we have added attributes to nodes, edges, or the graph, those attributes, along with their type and shape will be shown when you print a string representation of the graph object:

In [42]:

Copied!

print(g)
print(g)

Directed graph with 3 nodes and 3 edges

Node attributes
	node_class		<class 'torch.Tensor'> -> torch.Size([3, 1])

Edge attributes
	edge_weight		<class 'torch.Tensor'> -> torch.Size([3, 1])

Graph attributes
	feature		<class 'torch.Tensor'> -> torch.Size([2])
	num_nodes		<class 'int'>

To simplify the access to attribute values, the Graph class provides getter and setter functions that allow an indexed access based on node identifiers. To access the feature node_feature of node a, we can write:

In [43]:

Copied!

g['node_class', 'a']
g['node_class', 'a']

Out[43]:

tensor([0], device='cuda:0')

To access the weight of edge (a, b) we can write:

In [44]:

Copied!

g['edge_weight', 'a', 'b']
g['edge_weight', 'a', 'b']

Out[44]:

tensor([1], device='cuda:0')

And finally, graph-based attributes can accessed as follows:

In [45]:

Copied!

g['feature']
g['feature']

Out[45]:

tensor([3, 2], device='cuda:0')

We can also use the setter functions to change attributes:

In [46]:

Copied!

g['node_class'] = torch.tensor([[7], [2], [3]], device='cuda')
g['node_class'] = torch.tensor([[7], [2], [3]], device='cuda')

In [47]:

Copied!

g['node_class', 'a']
g['node_class', 'a']

Out[47]:

tensor([7], device='cuda:0')

To create a sparse adjacency matrix representations of the topology of a graph, we can use the following function:

In [48]:

Copied!

print(g.get_sparse_adj_matrix())
print(g.get_sparse_adj_matrix())

  (0, 1)	1.0
  (0, 2)	1.0
  (1, 2)	1.0

This returns a scipy.sparse.coo_matrix object, which can be turned into a dense numpy matrix as follows:

In [49]:

Copied!

print(g.get_sparse_adj_matrix().todense())
print(g.get_sparse_adj_matrix().todense())

[[0. 1. 1.]
 [0. 0. 1.]
 [0. 0. 0.]]

By passing the name of the attribute, we can also use edge attributes in the creation of the adjacency matrix. To create a sparse, weighted adjacency matrix that uses the edge_weight attribute of our graph object we can simply write:

In [50]:

Copied!

print(g.get_sparse_adj_matrix(edge_attr='edge_weight').todense())
print(g.get_sparse_adj_matrix(edge_attr='edge_weight').todense())

[[0 1 2]
 [0 0 3]
 [0 0 0]]

It is easy to apply GNN models to a graph object. We can add attributes based on one-hot-encodings of nodes and edges as follows:

In [51]:

Copied!





g.add_node_ohe(attr_name='node_ohe_feature_1')
g.add_node_ohe(attr_name='node_ohe_feature_2', dim=4)
g.add_edge_ohe(attr_name='edge_ohe_feature_1', dim=5)
print(g)

print(g.data['node_ohe_feature_1'])
print(g.data['node_ohe_feature_2'])
print(g.data['edge_ohe_feature_1'])
g.add_node_ohe(attr_name='node_ohe_feature_1')
g.add_node_ohe(attr_name='node_ohe_feature_2', dim=4)
g.add_edge_ohe(attr_name='edge_ohe_feature_1', dim=5)
print(g)

print(g.data['node_ohe_feature_1'])
print(g.data['node_ohe_feature_2'])
print(g.data['edge_ohe_feature_1'])

Directed graph with 3 nodes and 3 edges

Node attributes
	node_class		<class 'torch.Tensor'> -> torch.Size([3, 1])
	node_ohe_feature_2		<class 'torch.Tensor'> -> torch.Size([3, 4])
	node_ohe_feature_1		<class 'torch.Tensor'> -> torch.Size([3, 3])

Edge attributes
	edge_ohe_feature_1		<class 'torch.Tensor'> -> torch.Size([3, 5])
	edge_weight		<class 'torch.Tensor'> -> torch.Size([3, 1])

Graph attributes
	feature		<class 'torch.Tensor'> -> torch.Size([2])
	num_nodes		<class 'int'>

tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]], device='cuda:0')
tensor([[1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.]], device='cuda:0')
tensor([[1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.]], device='cuda:0')

By default, all graphs in pathpyG are directed. To represent undirected graphs, we must add all edges in both directions. We can use the to_undirected() function to make a directed graph undirected, i.e. to add all (missing) edges that point in the opposite direction. This will automatically duplicate and assign the corresponding edge attributes to the newly formed (directed) edges, i.e. edges are assumed to have the same attributes in both directions.

In [52]:

Copied!

g_u = g.to_undirected()
print(g_u)
g_u = g.to_undirected()
print(g_u)

Undirected graph with 3 nodes and 6 (directed) edges

Node attributes
	node_class		<class 'torch.Tensor'> -> torch.Size([3, 1])
	node_ohe_feature_2		<class 'torch.Tensor'> -> torch.Size([3, 4])
	node_ohe_feature_1		<class 'torch.Tensor'> -> torch.Size([3, 3])

Edge attributes
	edge_ohe_feature_1		<class 'torch.Tensor'> -> torch.Size([6, 5])
	edge_weight		<class 'torch.Tensor'> -> torch.Size([6, 1])

Graph attributes
	feature		<class 'torch.Tensor'> -> torch.Size([2])
	num_nodes		<class 'int'>

By default, the Graph object can contain multiple identical edges, so the following is possible:

In [53]:

Copied!

g = pp.Graph.from_edge_list([('a', 'b'), ('b', 'c'), ('c', 'a'), ('a', 'b')])
print(g.data.edge_index)
g = pp.Graph.from_edge_list([('a', 'b'), ('b', 'c'), ('c', 'a'), ('a', 'b')])
print(g.data.edge_index)

EdgeIndex([[0, 0, 1, 2],
           [1, 1, 2, 0]], device='cuda:0', sparse_size=(3, 3), nnz=4,
          sort_order=row)

It is often convenient, to coalesce multi-edges into weighted single-edges, i.e. in the example above we may prefer a graph where each edge occurs once in the edge index, but the edge a->b has a weight attribute of two, while the two other edges have one.

In pathpyG we can do this as follows:

In [54]:

Copied!





g_w = g.to_weighted_graph()
print(g_w.data.edge_index)
print(g_w['edge_weight', 'b', 'c'])
print(g_w['edge_weight', 'a', 'b'])
print(g_w['edge_weight', 'c', 'a'])
g_w = g.to_weighted_graph()
print(g_w.data.edge_index)
print(g_w['edge_weight', 'b', 'c'])
print(g_w['edge_weight', 'a', 'b'])
print(g_w['edge_weight', 'c', 'a'])

EdgeIndex([[0, 1, 2],
           [1, 2, 0]], device='cuda:0', sparse_size=(3, 3), nnz=3,
          sort_order=row)
tensor(1., device='cuda:0')
tensor(2., device='cuda:0')
tensor(1., device='cuda:0')

As we will see in a separate notebook focussing on the advanced (temporal) graph visualization features of pathpyG, it is easy to generate (interactive) HTML plots of graphs, that are embedded into jupyter notebooks. You can simply call the pp.plot function on the Graph object:

In [58]:

Copied!

pp.plot(g, edge_color='gray', node_label=[v for v in g.mapping.node_ids]);
pp.plot(g, edge_color='gray', node_label=[v for v in g.mapping.node_ids]);

Node Centralities¶

To calculate node centralities, we can use a networkx delegate mechanism implemented in the module pathpyG.algorithms.centrality. Simply speaking, you can call any function implented in the networkx centrality module that starts with the string centrality_. The pathpyG will be internally converted to a networkx.DiGraph object, the corresponding centrality function (with all of its parameters) will be called, and the result will be mapped to the nodes based on their IDs.

In order to calculate the closeness centralities of all nodes for the graph above, we can call:

In [59]:

Copied!

pp.algorithms.centrality.closeness_centrality(g)
pp.algorithms.centrality.closeness_centrality(g)

Out[59]:

{'a': 0.6666666666666666, 'b': 0.6666666666666666, 'c': 0.6666666666666666}

In [60]:

Copied!

pp.algorithms.centrality.betweenness_centrality(g)
pp.algorithms.centrality.betweenness_centrality(g)

Out[60]:

defaultdict(<function pathpyG.algorithms.centrality.betweenness_centrality.<locals>.<lambda>()>,
            {'c': 1.0, 'b': 2.0, 'a': 1.0})