Paths and Higher-Order De Bruijn Graph Models¶
Prerequisites¶
First, we need to set up a Python environment with PyTorch, PyTorch Geometric, and PathpyG installed. Depending on where you execute this notebook, this might already be (partially) done. For example, Google Colab has PyTorch installed by default, so we only need to install the remaining dependencies. The DevContainer that is part of our GitHub repository, on the other hand, already has all necessary dependencies installed.
In the following, we install the packages for use in Google Colab using Jupyter magic commands. For other environments, comment the commands in or out as necessary. For more details on how to install pathpyG, especially if you want to install it with GPU support, we refer to our documentation. Note that %%capture discards the full output of the cell to avoid cluttering this tutorial with unnecessary installation details. If you want to print the output, you can comment %%capture out.
%%capture
# !pip install torch
!pip install torch_geometric
!pip install git+https://github.com/pathpy/pathpyG.git
Motivation and Learning Objective¶
While pathpyG is useful to handle and visualize static graphs, as the name suggests, its main advantage is that it facilitates the analysis of time series data that can be used to calculate paths in a graph. As we shall see in the following tutorial, there are various situations in which we naturally have access to data on paths, including data on (random) walks or trajectories, traces of dynamical processes giving rise to node sequences or directed acyclic graphs, or temporal graph data with time-stamped edges. pathpyG can be used to model patterns in such data based on higher-order De Bruijn graph models, a modelling framework that captures patterns in time series data on graphs.
In this first unit, we will introduce classes to store data on walks and directed acyclic graphs. We will show how such data are internally stored as tensors, and how this approach facilitates a GPU-based generation of higher-order De Bruijn graph models.
We first import the modules torch and pathpyG. By setting the device used by torch, we can specify whether we want to run our code on the CPU or on the GPU.
import torch
import pathpyG as pp
from torch_geometric.data import Data
from torch_geometric import EdgeIndex
pp.config['torch']['device'] = 'cpu'
For the following examples, we consider a simple directed graph with five nodes a, b, c, d, e and four edges:
g = pp.Graph.from_edge_list([('a', 'c'),
('b', 'c'),
('c', 'd'),
('c', 'e')])
pp.plot(g, node_label=g.mapping.node_ids, edge_color='gray');
Using WalkData to store walks in a graph¶
Assume that we have time series data that captures observations of walks in the graph above. For example, we could observe four walks of length two, two each of the following:
- 2 x a -> c -> d
- 2 x b -> c -> e
Note that we define the length of a walk as the number of edges that are traversed, i.e. a sequence that consists of a single node, e.g. a, is considered a walk of length zero, while every edge in a graph is a walk of length one.
pathpyG supports storing and modelling such data based on subclasses of the abstract class pp.PathData. Generally, this class provides functions to store different types of path data of length $l$ in the form of a tensor with shape $(2, l)$, which is essentially an ordered version of a pyG edge index. To better understand this, let us consider the mapping of node IDs to node indices that is given by the graph above:
print(g.mapping)
a -> 0 c -> 1 b -> 2 d -> 3 e -> 4
With this mapping, we can use the following ordered edge_index with source and target node indices to represent the walk a -> c -> d:
torch.tensor([[0, 1],  # source node indices of `a` and `c`
              [1, 3]   # target node indices of `c` and `d`
])
Note that this representation naturally extends the edge_index semantics of pyG. This allows us to efficiently generate higher-order De Bruijn graph models for path data based on a convolution operation that can be executed on the GPU.
Let us now use this representation to generate a walk data set that can be used in pathpyG. We first create an instance of the WalkData class, which is a subclass of PathData. To consistently map node IDs to indices across Graph and PathData objects, we can pass the IndexMap object from the Graph above in the constructor. We then use the add function to add observations of our two walks, where the freq argument indicates the number of times each walk has been observed.
paths = pp.WalkData(g.mapping)
paths.add(torch.tensor([[0,1],[1,3]]), freq=2) # a -> c -> d
paths.add(torch.tensor([[2,1],[1,4]]), freq=2) # b -> c -> e
print(paths)
PathData with 2 walks and total weight 4
Let us inspect how the walks are internally stored in the WalkData object. Each observation is internally assigned an integer identifier, and two dictionaries store the tensors capturing the observed paths as well as their frequencies.
paths.paths
{0: tensor([[0, 1], [1, 3]]), 1: tensor([[2, 1], [1, 4]])}
paths.path_freq
{0: 2, 1: 2}
For convenience, there is a helper function that transforms the tensor-based edge_index representation of a given walk into a simple sequence of traversed nodes, i.e. for the walk with index 0 above we get the node index sequence [0, 1, 3]:
pp.WalkData.walk_to_node_seq(paths.paths[0])
tensor([0, 1, 3])
Each PathData object has an edge_index property, which yields an edge index tensor of the aggregated graph representation of all paths. We can further get the number of nodes and the number of edges that are traversed by the paths.
print(f'Paths traverse {paths.num_nodes} nodes via {paths.num_edges} edges')
print(paths.edge_index)
Paths traverse 5 nodes via 4 edges tensor([[0, 1, 1, 2], [1, 3, 4, 1]])
We could use the Graph.from_edge_index function to generate a (first-order) aggregated graph representation that can be visualized:
pp.plot(pp.Graph.from_edge_index(paths.edge_index.contiguous(), paths.mapping), edge_color='gray', node_label=paths.mapping.node_ids);
We can also create a weighted edge_index where the edge weights capture the number of times each edge is traversed. This returns two tensors, the first being the edge index and the second being the edge weights. In our example, each edge is observed twice.
print(paths.edge_index_weighted)
(tensor([[0, 1, 1, 2], [1, 3, 4, 1]]), tensor([2., 2., 2., 2.]))
The internal tensor-based representation of walks directly corresponds to the pyG representation of edges, which provides several advantages. However, it is often cumbersome to manually create those tensors for walks that can easily be represented as a sequence of nodes. The WalkData class thus provides methods to add walks based on a sequence of node IDs, i.e. we can write the following more readable code to obtain the same WalkData instance as above:
paths_1 = pp.WalkData(g.mapping)
paths_1.add_walk_seq(('a', 'c', 'd'), freq=2)
paths_1.add_walk_seq(('b', 'c', 'e'), freq=2)
print(paths_1)
print(paths_1.edge_index_weighted)
PathData with 2 walks and total weight 4 (tensor([[0, 1, 1, 2], [1, 3, 4, 1]], dtype=torch.int32), tensor([2., 2., 2., 2.]))
At this point, you may ask why data on paths and walks are interesting in the first place. The answer is that they provide a lot of information on the causal topology of networked systems, i.e. which nodes can causally influence each other.
For this, let us assume that the four walks above tell us which paths information (or whatever you may be interested in) has taken in the graph above. That is, we observe something moving from a via c to d and from b via c to e, and each of those events occurs twice. However, we never observed that something moving from a to c ended up in e. Neither did we observe that something moving from b to c ended up in d. This means that, assuming we have completely observed all walks or paths, there is no way for a to causally influence e, and no way for b to causally influence d. Note that this is not what we would assume based on the topology of the underlying graph, where paths of length two exist between all four pairs of nodes (a, d), (a, e), (b, d), (b, e).
In contrast, consider the following four observations of walks in the same graph.
paths_2 = pp.WalkData(g.mapping)
paths_2.add_walk_seq(('a', 'c', 'd'), freq=1)
paths_2.add_walk_seq(('a', 'c', 'e'), freq=1)
paths_2.add_walk_seq(('b', 'c', 'd'), freq=1)
paths_2.add_walk_seq(('b', 'c', 'e'), freq=1)
print(paths_2)
PathData with 4 walks and total weight 4
Here we have observed walks along all four possible paths of length two. It is easy to see that the weighted edge index of this WalkData instance is identical to the one before:
print(paths_2.edge_index_weighted)
(tensor([[0, 1, 1, 2], [1, 3, 4, 1]], dtype=torch.int32), tensor([2., 2., 2., 2.]))
This is a first-order graph representation, as it only captures the (weighted) edges in the underlying path data, i.e. we could say that we only count the frequencies of paths (or walks) of length one. This naturally gives rise to an edge_index tensor with shape $(2, m)$, where $m$ is the number of unique edges in the graph that are traversed by the paths.
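This counting step can be sketched directly with torch: concatenating the edges traversed by the four observed walks into one tensor and counting unique columns yields exactly the weighted edge index shown above. This is an illustration of the principle, not pathpyG's internal code:

```python
import torch

# Edges traversed by the four observed walks (a->c->d twice, b->c->e twice),
# concatenated column-wise into one (2, n) tensor
traversed = torch.tensor([[0, 1, 0, 1, 2, 1, 2, 1],
                          [1, 3, 1, 3, 1, 4, 1, 4]])

# Counting unique columns yields the weighted first-order edge index
edge_index, counts = torch.unique(traversed, dim=1, return_counts=True)
print(edge_index)  # tensor([[0, 1, 1, 2], [1, 3, 4, 1]])
print(counts)      # tensor([2, 2, 2, 2])
```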
From Graphs to Higher-Order De Bruijn Graph Models¶
A key feature of pathpyG is that it allows us to generalize this first-order modelling perspective to $k$-th order De Bruijn graph models for paths, where the nodes in a $k$-th order De Bruijn graph model are sequences of $k$ nodes. Edges connect pairs of nodes that overlap in $k-1$ nodes and capture paths of length $k$.
A De Bruijn graph of order $k=1$ is simply a normal (weighted) graph consisting of nodes and edges. Pairs of nodes connected by edges overlap in $k-1=0$ nodes and capture paths of length $k=1$, i.e. simple edges in the underlying path data.
For a De Bruijn graph of order $k=2$ in our example above, an edge could connect a pair of nodes $(a,b)$ and $(b,c)$ that overlap in the $k-1=1$ node $b$; such an edge would represent the walk $a \rightarrow b \rightarrow c$ of length two.
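To make this construction concrete, here is a minimal pure-Python sketch (a hypothetical helper, independent of pathpyG) that extracts the edges of a $k$-th order De Bruijn graph model from a walk given as a node sequence:

```python
def debruijn_edges(walk, k):
    """Return the edges of a k-th order De Bruijn graph model of a walk.

    k-th order nodes are tuples of k consecutive nodes; two such nodes are
    connected if they overlap in k-1 nodes, so each edge encodes a subpath
    of length k.
    """
    nodes = [tuple(walk[i:i + k]) for i in range(len(walk) - k + 1)]
    return list(zip(nodes[:-1], nodes[1:]))

# The walk a -> b -> c from the example in the text
print(debruijn_edges(['a', 'b', 'c'], k=1))  # first-order: the plain edges
print(debruijn_edges(['a', 'b', 'c'], k=2))  # second-order: one edge (a,b) -> (b,c)
```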
All instances of PathData provide methods to calculate the edge index of such a $k$-th order De Bruijn graph. We can pass the order of the model that we seek to create as the argument k. For $k=1$ we obtain the same edge index and weights as before.
edge_index, weights = paths_1.edge_index_k_weighted(k=1)
print('first-order edges =', edge_index)
print('weights =', weights)
first-order edges = tensor([[0, 1, 1, 2], [1, 3, 4, 1]], dtype=torch.int32) weights = tensor([2., 2., 2., 2.])
For $k=2$, we get the edge index of a second-order De Bruijn graph where the second-order nodes are first-order edges and second-order edges represent walks of length two in the original graph. The edge weights capture the observation frequencies of those walks.
edge_index, weights = paths_1.edge_index_k_weighted(k=2)
print('second-order edges =', edge_index)
print('weights =', weights)
second-order edges = tensor([[[0, 1], [2, 1]], [[1, 3], [1, 4]]], dtype=torch.int32) weights = tensor([2., 2.])
Naturally extending the pyG-style edge_index to a higher-dimensional representation, the edge_index of a $k$-th order De Bruijn graph model with $m$ edges has shape $(2, m, k)$, i.e. it consists of a src and a dst tensor with $m$ entries, where each entry is a $k$-dimensional tensor that contains the $k$ nodes in the graph that constitute the higher-order node. For the example above, each node in this second-order model is thus represented by a tensor with two elements.
While this goes way beyond the scope of this tutorial, thanks to the tensor-based representation of paths, the construction of a higher-order De Bruijn graph model can actually be done based on efficient GPU operations, i.e. we can scale up the models for large graphs.
Let us have a closer look at our examples above. While the first-order edge indices of the two path objects paths_1 and paths_2 are the same, we find that the second-order edge indices are actually different:
edge_index, weights = paths_2.edge_index_k_weighted(k=2)
print('second-order edges =', edge_index)
print('weights =', weights)
second-order edges = tensor([[[0, 1], [0, 1], [2, 1], [2, 1]], [[1, 3], [1, 4], [1, 3], [1, 4]]], dtype=torch.int32) weights = tensor([1., 1., 1., 1.])
We thus find that the second-order De Bruijn graph representation of paths is sensitive to the differences in the causal topology, while a first-order graph is not. This is the basis to generalize network analysis and graph learning to causality-aware graph models for various kinds of time series data on graphs.
In particular, as we shall see in more detail in a later tutorial, we can use paths to generate $k$-th order graphs that can be used to generalize Graph Neural Networks. pathpyG provides the class HigherOrderGraph, which is a subclass of Graph and provides additional functionality that simplifies the analysis of higher-order De Bruijn graphs. For convenience, we can directly create a $k$-th order De Bruijn graph from any instance of PathData. For the two examples above, we can create and plot the associated second-order De Bruijn graph models as follows:
g2 = pp.HigherOrderGraph(paths_1, order=2)
print(g2)
HigherOrderGraph (k=2) with 4 nodes and 2 edges Total edge weight = 4.0 Edge attributes edge_weight <class 'torch.Tensor'> -> torch.Size([2]) Graph attributes num_nodes <class 'int'>
Just like for a "normal" first-order graph, we can iterate through the nodes of a higher-order graph, which are tuples with $k$ elements. Just as for a normal Graph object, the node indices are automatically mapped, yielding tuples of first-order node identifiers.
for n in g2.nodes:
print(n)
('a', 'c') ('c', 'd') ('c', 'e') ('b', 'c')
Edges are tuples with two elements, where each element is a k-th order node, i.e. a tuple of node IDs of length $k$.
for e in g2.edges:
print(e)
(('a', 'c'), ('c', 'd')) (('b', 'c'), ('c', 'e'))
The weight attribute stores a tensor whose entries capture the frequencies of edges, i.e. the frequencies of paths of length $k$.
for e in g2.edges:
print(e, g2['edge_weight', e[0], e[1]].item())
(('a', 'c'), ('c', 'd')) 2.0 (('b', 'c'), ('c', 'e')) 2.0
We can finally plot a higher-order De Bruijn graph in the same way as a first-order graph.
pp.plot(g2, node_label=[g2.mapping.to_id(x) for x in range(g2.N)], edge_color='gray');
Let us compare this to a second-order graph model of the second path data set from above, which corresponds to a system where all paths of length two are actually realized in terms of walks.
g2 = pp.HigherOrderGraph(paths_2, order=2)
print(g2)
pp.plot(g2, node_label=[g2.mapping.to_id(x) for x in range(g2.N)], edge_color='gray');
HigherOrderGraph (k=2) with 4 nodes and 4 edges Total edge weight = 4.0 Edge attributes edge_weight <class 'torch.Tensor'> -> torch.Size([4]) Graph attributes num_nodes <class 'int'>
Loading empirical walks from N-Gram Files¶
For real data on walks in graphs, it is not convenient to manually construct and add walks based on edge tensors. We can instead use the from_csv function of class WalkData to load such data from an n-gram file, i.e. a text file where each line corresponds to one observed walk consisting of comma-separated node IDs. If we set the argument freq=True, the last component of each line is considered to be the observation frequency of that particular walk.
As an example, the file data/tube_paths_train.ngram contains observed passenger itineraries between nodes in a graph that represents the network of London Tube stations. Each of those itineraries is associated with an observation frequency. The following is an excerpt from that file:
Southwark,Waterloo,212.0
Liverpool Street,Bank / Monument,1271.0
Barking,West Ham,283.0
Tufnell Park,Kentish Town,103.0
...
Let us now read this file:
paths_tube = pp.WalkData.from_csv('../data/tube_paths_train.ngram', sep=',', freq=True)
print(paths_tube)
print(f'London Tube network has {paths_tube.num_nodes} nodes and {paths_tube.num_edges} edges.')
PathData with 61748 walks and total weight 2147865 London Tube network has 268 nodes and 646 edges.
Note that this will automatically create an internal mapping of node IDs to indices.
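To clarify the file format described above, the following sketch parses a single n-gram line by hand (a hypothetical helper for illustration, not pathpyG's actual parser):

```python
def parse_ngram_line(line, freq=True):
    """Parse one line of an n-gram file into (walk, frequency).

    Each line holds comma-separated node IDs; if freq is True, the last
    field is the observation frequency of that walk.
    """
    fields = line.strip().split(',')
    if freq:
        return fields[:-1], float(fields[-1])
    return fields, 1.0

# First line of the excerpt above: a walk of length one, observed 212 times
walk, f = parse_ngram_line('Southwark,Waterloo,212.0')
print(walk, f)  # ['Southwark', 'Waterloo'] 212.0
```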
From Walks to Directed Acyclic Graphs¶
An important feature of pathpyG is that it allows us to model the causal topology of temporal graphs, i.e. the topology of time-stamped edges by which nodes can causally influence each other via time-respecting paths, i.e. paths that (minimally) follow the arrow of time.
As we shall see in the following detailed tutorial, time-respecting paths in temporal graphs naturally give rise to directed acyclic graphs (DAGs), where the directionality of edges is due to the arrow of time. The analysis of time-respecting paths in temporal graphs thus yields a (possibly large) collection of DAGs, which can be stored using the class DAGData, another subclass of the base class PathData.
A very simple example for a DAG is one that consists of the following edges:
- a -> b
- b -> c
- b -> d
This DAG captures that node a causally influences node b, which in turn causally influences the two nodes c and d (potentially at a later point in time). In pathpyG, such a DAG can be represented by a topologically sorted edge index, where the order of edges corresponds to the topological ordering. For the example above, we can add two observations of this DAG to a DAGData object as follows, defining a new mapping of node IDs to indices in the constructor:
paths = pp.DAGData(mapping = pp.IndexMap(['a', 'b', 'c', 'd']))
paths.add(torch.tensor([[0, 1, 1],
[1, 2, 3]]),
freq=2)
We can now again inspect the internal dictionaries holding our data:
print(paths.paths)
print(paths.path_freq)
{0: tensor([[0, 1, 1], [1, 2, 3]])} {0: 2}
At first glance, it may seem unnecessary to distinguish between walks and DAGs, as a walk is simply a special type of DAG in which all nodes have in- and out-degrees smaller than or equal to one. And indeed, you could ignore this distinction and store both walks and DAGs as DAGs. Nevertheless, pathpyG explicitly distinguishes between the two types of path data, since some downstream operations, specifically the creation of higher-order De Bruijn graph models, are much more straightforward and thus faster for walks (which are essentially just sequences of nodes) than for DAGs (which can have arbitrarily complex branching structures).
As shown before, we can use the PathData class to easily generate an edge_index tensor of a weighted graph representation, which essentially aggregates all of the observed walks or DAGs into a weighted static graph. For the DAG above, this graph has four nodes connected by three edges with weight two (because we have observed the DAG twice).
edge_index, edge_weight = paths.edge_index_weighted
print(edge_index)
print(edge_weight)
tensor([[0, 1, 1], [1, 2, 3]]) tensor([2., 2., 2.])
Let's have a look at a visualization of this graph:
g = pp.Graph.from_edge_index(edge_index.contiguous(), mapping=paths.mapping)
pp.plot(g, edge_color='gray', node_label=paths.mapping.node_ids);
for e in g.edges:
print(e)
('a', 'b') ('b', 'c') ('b', 'd')
We can also generate a higher-order De Bruijn graph model from this DAG data:
g2 = pp.HigherOrderGraph(paths, order=2)
print(g2)
pp.plot(g2, edge_color='gray', node_label=[g2.mapping.to_id(x) for x in range(g2.N)]);
HigherOrderGraph (k=2) with 3 nodes and 2 edges Total edge weight = 4.0 Edge attributes edge_weight <class 'torch.Tensor'> -> torch.Size([2]) Graph attributes num_nodes <class 'int'>
As we shall see in the following tutorial, the PathData and HigherOrderGraph classes are the basis for the GPU-based analysis and modelling of causal structures in temporal graphs. In particular, the underlying generalization of first-order static graph models to higher-order De Bruijn graphs allows us to easily build causality-aware graph neural network architectures that consider both the topology and the temporal ordering of time-stamped edges in a temporal graph.