## Understanding Graph Attention Networks (GAT)

Tuesday, Jan 21, 2020 | Tags: Graph Representation Learning

This is the 4th post in the series Explained: Graph Representation Learning. Let’s dive right in, assuming you have read the first three. GAT (Graph Attention Network) is a novel neural network architecture that operates on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods’ features, the method enables (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, GAT addresses several key challenges of spectral-based graph neural networks simultaneously, and makes the model readily applicable to inductive as well as transductive problems.

Analyzing and visualizing the learned attentional weights also leads to a more interpretable model in terms of the importance of neighbors.

But before getting into the meat of this method, I want you to be familiar and thorough with the Attention Mechanism, because we’ll be building GATs on the concepts of Self Attention and Multi-Head Attention introduced by Vaswani et al. If you are not, you may read this blog, The Illustrated Transformer by Jay Alammar.

# Can we do better than GCNs?

From the Graph Convolutional Network (GCN), we learnt that combining local graph structure and node-level features yields good performance on node classification tasks. However, the way GCN aggregates messages is structure-dependent, which may hurt its generalizability.

The fundamental novelty that GAT brings to the table is how the information from the one-hop neighborhood is aggregated. For GCN, a graph convolution operation produces the normalized sum of neighbors’ node features as follows:

$$h_i^{(l+1)}=\sigma\left(\sum_{j\in \mathcal{N}(i)} {\frac{1}{c_{ij}} W^{(l)}h^{(l)}_j}\right)$$

where $\mathcal{N}(i)$ is the set of one-hop neighbors of node $i$ (to include $v_{i}$ in the set, we simply add a self-loop to each node), $c_{ij}=\sqrt{|\mathcal{N}(i)|}\sqrt{|\mathcal{N}(j)|}$ is a normalization constant based on graph structure, $\sigma$ is an activation function (GCN uses ReLU), and $W^{(l)}$ is a shared weight matrix for node-wise feature transformation.
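To make the normalized sum concrete, here is a minimal sketch in plain Python on a hypothetical 3-node path graph. The graph, the scalar node features, and the choice of $W = 1$ are all assumptions made for illustration; the point is only that every weight $1/c_{ij}$ is fixed by node degrees and never looks at the features.

```python
import math

# Hypothetical toy graph: a path 0 - 1 - 2 (undirected).
edges = [(0, 1), (1, 2)]
num_nodes = 3

# Neighborhoods N(i); each node is included in its own set (self-loop).
neighbors = {i: {i} for i in range(num_nodes)}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# One scalar feature per node; W is assumed to be 1 to keep the arithmetic visible.
h = [1.0, 2.0, 3.0]
W = 1.0


def relu(x):
    return max(0.0, x)


def gcn_layer(h):
    """h_i' = ReLU(sum_j W * h_j / c_ij) with c_ij = sqrt(|N(i)|) * sqrt(|N(j)|)."""
    out = []
    for i in range(num_nodes):
        total = 0.0
        for j in sorted(neighbors[i]):
            c_ij = math.sqrt(len(neighbors[i])) * math.sqrt(len(neighbors[j]))
            total += W * h[j] / c_ij
        out.append(relu(total))
    return out


h_next = gcn_layer(h)
print(h_next)
```

Changing the values in `h` changes the output but not the weights $1/c_{ij}$, which is exactly the structure dependence that GAT replaces with learned attention.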

GAT introduces the attention mechanism as a substitute for the statically normalized convolution operation. The figure below clearly illustrates the key difference.

GCN vs GAT

# How does the GAT layer work?

The particular attentional setup utilized by GAT closely follows the work of Bahdanau et al. (2015), i.e., additive attention, but the framework is agnostic to the particular choice of attention mechanism.

The input to the layer is a set of node features, $\mathbf{h} = \{\vec{h}_1,\vec{h}_2,\ldots,\vec{h}_N\}, \vec{h}_i \in \mathbb{R}^{F}$, where $N$ is the number of nodes, and $F$ is the number of features in each node. The layer produces a new set of node features (of potentially different cardinality $F'$), $\mathbf{h'} = \{\vec{h'}_1,\vec{h'}_2,\ldots,\vec{h'}_N\}, \vec{h'}_i \in \mathbb{R}^{F'}$, as its output.

### The Attentional Layer broken into 4 separate parts:

1) Simple linear transformation: In order to obtain sufficient expressive power to transform the input features into higher-level features, at least one learnable linear transformation is required. To that end, as an initial step, a shared linear transformation, parametrized by a weight matrix $W \in \mathbb{R}^{F' \times F}$, is applied to every node.

$$z_i^{(l)}=W^{(l)}h_i^{(l)}$$

2) Attention Coefficients: We then perform self-attention on the nodes: a shared attentional mechanism $a : \mathbb{R}^{F'} \times \mathbb{R}^{F'} \to \mathbb{R}$ computes a pair-wise un-normalized attention score between two neighbors. It first concatenates the $z$ embeddings of the two nodes (where $||$ denotes concatenation), then takes the dot product of the result with a learnable weight vector $\vec a^{(l)}$, and finally applies a LeakyReLU. This form of attention is usually called additive attention, in contrast with the dot-product attention used in the Transformer model.

$$e_{ij}^{(l)}=\text{LeakyReLU}\left(\vec a^{(l)^T}(z_i^{(l)}||z_j^{(l)})\right)$$
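Steps 1 and 2 can be traced by hand on a single pair of nodes. The sketch below uses plain Python lists; the values of `W`, `a`, `h_i` and `h_j` are made up for illustration ($F = 3$, $F' = 2$), and the LeakyReLU slope of 0.2 is the value used later in this post.

```python
def leaky_relu(x, negative_slope=0.2):
    return x if x > 0 else negative_slope * x


def matvec(W, h):
    """Multiply an (F' x F) matrix, stored as a list of rows, by a length-F vector."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


# Hypothetical parameters: W has shape F' x F = 2 x 3, a has length 2F' = 4.
W = [[1.0, 0.0, -1.0],
     [0.5, 1.0, 0.0]]
a = [0.1, -0.2, 0.3, 0.4]

# Hypothetical input features of two neighboring nodes.
h_i = [1.0, 2.0, 3.0]
h_j = [0.0, 1.0, 1.0]

# Step 1: shared linear transformation.
z_i = matvec(W, h_i)
z_j = matvec(W, h_j)


def attention_score(z_src, z_dst, a):
    """Step 2: e = LeakyReLU(a^T [z_src || z_dst])."""
    concat = z_src + z_dst  # list concatenation plays the role of ||
    return leaky_relu(sum(w * x for w, x in zip(a, concat)))


e_ij = attention_score(z_i, z_j, a)
e_ji = attention_score(z_j, z_i, a)
print(z_i, z_j, e_ij, e_ji)
```

Because the two halves of `a` weight the source and the target embedding differently, `e_ij` and `e_ji` generally differ: the score is not symmetric.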

Q. Is this step the most important step?

Ans. Yes! This indicates the importance of node $j$'s features to node $i$. In its most general form, this step allows every node to attend to every other node, dropping all structural information.

NOTE: The graph structure is injected into the mechanism by performing masked attention: we only compute $e_{ij}$ for nodes $j \in \mathcal{N}(i)$, where $\mathcal{N}(i)$ is some neighborhood of node $i$ in the graph. In all the experiments, these will be exactly the first-order neighbors of $i$ (including $i$).

3) Softmax: To make coefficients easily comparable across different nodes, we normalize them across all choices of $j$ using the softmax function.

$$\alpha_{ij}^{(l)}=\frac{\exp(e_{ij}^{(l)})}{\sum_{k\in \mathcal{N}(i)}\exp(e_{ik}^{(l)})}$$
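The masking and the softmax fit together as in the following sketch, again in plain Python. The scores and the neighbor list are hypothetical; `masked_softmax` is an illustrative helper, not a library function.

```python
import math


def masked_softmax(scores, neighbor_ids):
    """Softmax over scores[j] for j in neighbor_ids; non-neighbors get weight 0."""
    # Subtracting the maximum does not change the result and avoids overflow.
    m = max(scores[j] for j in neighbor_ids)
    exps = {j: math.exp(scores[j] - m) for j in neighbor_ids}
    total = sum(exps.values())
    return [exps[j] / total if j in exps else 0.0 for j in range(len(scores))]


# Hypothetical raw scores e_{0j} of node 0 against four nodes.
e_0 = [0.5, -0.12, 2.0, 0.5]
# Node 2 has the highest raw score but is not a neighbor of node 0.
neighbors_of_0 = [0, 1, 3]

alpha_0 = masked_softmax(e_0, neighbors_of_0)
print(alpha_0)
```

The weights sum to 1 over the neighborhood, and node 2 receives exactly zero weight regardless of its raw score. A dense implementation can get the same effect by replacing non-neighbor scores with a very large negative number before the softmax.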

4) Aggregation: This step is similar to GCN. The embeddings from neighbors are aggregated together, scaled by the attention scores.

$$h_i^{(l+1)}=\sigma\left(\sum_{j\in \mathcal{N}(i)} {\alpha^{(l)}_{ij} z^{(l)}_j }\right)$$

An illustration of multi-head attention (with K = 3 heads) by node 1 on its neighborhood. Different arrow styles and colors denote independent attention computations. The aggregated features from each head are concatenated or averaged to obtain $\vec{h'}_{1}$.

Analogous to multiple channels in a Convolutional Net, GAT uses multi-head attention to enrich the model capacity and to stabilize the learning process. Specifically, $K$ independent attention mechanisms execute the transformation of the aggregation step (step 4 above), and then their outputs can be combined in two ways depending on the use:

$$\textbf{ \color{red}{Average} }: h_{i}^{(l+1)}=\sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j\in\mathcal{N}(i)}\alpha_{ij}^{k}W^{k}h^{(l)}_{j}\right)$$

$$\textbf{ \color{green}{Concatenation} }: h^{(l+1)}_{i}=||_{k=1}^{K}\sigma\left(\sum_{j\in \mathcal{N}(i)}\alpha_{ij}^{k}W^{k}h^{(l)}_{j}\right)$$

1) Concatenation

As can be seen in this setting, the final returned output, $h'$, will consist of $KF'$ features (rather than $F'$) for each node.

2) Averaging

If we perform multi-head attention on the final (prediction) layer of the network, concatenation is no longer sensible. Instead, averaging is employed, and the final nonlinearity (usually a softmax or logistic sigmoid for classification problems) is delayed until after the averaging.

Thus, concatenation is used for intermediate layers and averaging for the final layer.
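The difference between the two combination rules is mostly a matter of output shape, which a small plain-Python sketch makes visible. The per-head outputs below are hypothetical ($K = 3$ heads, $F' = 2$), and ELU is assumed as the hidden-layer nonlinearity $\sigma$.

```python
import math


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


# Hypothetical aggregated (pre-activation) outputs of one node from K = 3 heads,
# each with F' = 2 features.
head_outputs = [[1.0, -1.0],
                [0.5, 2.0],
                [-0.5, 0.0]]


def combine_concat(heads):
    """Hidden layers: apply the nonlinearity per head, then concatenate (K * F' values)."""
    return [elu(x) for head in heads for x in head]


def combine_average(heads):
    """Output layer: average over heads (F' values); the final nonlinearity comes afterwards."""
    K = len(heads)
    return [sum(column) / K for column in zip(*heads)]


hidden = combine_concat(head_outputs)
logits = combine_average(head_outputs)
print(len(hidden), len(logits))
```

With concatenation the node ends up with $KF' = 6$ features, while averaging keeps $F' = 2$, which is why averaging suits a prediction layer whose width must equal the number of classes.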

# Implementing GAT Layer in PyTorch

## Imports

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(2020) # seed for reproducible numbers
```

## GAT Layer

```python
class GATLayer(nn.Module):
    """
    Simple PyTorch Implementation of the Graph Attention layer.
    """

    def __init__(self, in_features, out_features, dropout, alpha, concat=True):
        super(GATLayer, self).__init__()
        self.dropout = dropout            # drop prob = 0.6
        self.in_features = in_features    # number of input features per node, F
        self.out_features = out_features  # number of output features per node, F'
        self.alpha = alpha                # LeakyReLU with negative input slope, alpha = 0.2
        self.concat = concat              # concat = True for all layers except the output layer.

        # Xavier Initialization of Weights
        # Alternatively use weights_init to apply weights of choice
        self.W = nn.Parameter(torch.zeros(size=(in_features, out_features)))
        nn.init.xavier_uniform_(self.W.data, gain=1.414)

        self.a = nn.Parameter(torch.zeros(size=(2*out_features, 1)))
        nn.init.xavier_uniform_(self.a.data, gain=1.414)

        # LeakyReLU
        self.leakyrelu = nn.LeakyReLU(self.alpha)

    def forward(self, input, adj):
        # Linear Transformation
        h = torch.mm(input, self.W)
        N = h.size()[0]  # number of nodes

        # Attention Mechanism: build every pair (z_i || z_j), shape (N, N, 2F')
        a_input = torch.cat([h.repeat(1, N).view(N * N, -1), h.repeat(N, 1)], dim=1).view(N, -1, 2 * self.out_features)
        e = self.leakyrelu(torch.matmul(a_input, self.a).squeeze(2))

        # Masked Attention: non-edges get a large negative score, so softmax sends them to ~0
        zero_vec = -9e15*torch.ones_like(e)
        attention = torch.where(adj > 0, e, zero_vec)

        attention = F.softmax(attention, dim=1)
        attention = F.dropout(attention, self.dropout, training=self.training)
        h_prime = torch.matmul(attention, h)

        if self.concat:
            return F.elu(h_prime)
        else:
            return h_prime
```
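A note on cost: the layer above materializes every concatenated pair, so `a_input` holds $N^2 \cdot 2F'$ numbers. As a side observation (not something the code above does), the score can be decomposed by splitting the attention vector into the half that multiplies $z_i$ and the half that multiplies $z_j$, written here as $\vec a_1$ and $\vec a_2$ (names introduced only for this derivation):

$$\vec a = \vec a_1 || \vec a_2, \qquad \vec a^{T}(z_i || z_j) = \vec a_1^{T} z_i + \vec a_2^{T} z_j$$

$$e_{ij} = \text{LeakyReLU}\left(s_i + t_j\right), \qquad s_i = \vec a_1^{T} z_i, \quad t_j = \vec a_2^{T} z_j$$

Under this reading, only two length-$N$ vectors $s$ and $t$ need to be computed, and the $N \times N$ score matrix follows from adding them with broadcasting, which would avoid the $2F'$ factor in memory.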

# Implementing GAT on Citation Datasets using PyTorch Geometric

### PyG Imports

```python
from torch_geometric.data import Data
from torch_geometric.nn import GATConv
from torch_geometric.datasets import Planetoid
import torch_geometric.transforms as T

import matplotlib.pyplot as plt
%matplotlib notebook

import warnings
warnings.filterwarnings("ignore")
```

```python
name_data = 'Cora'
dataset = Planetoid(root= '/tmp/' + name_data, name = name_data)
dataset.transform = T.NormalizeFeatures()

print(f"Number of Classes in {name_data}:", dataset.num_classes)
print(f"Number of Node Features in {name_data}:", dataset.num_node_features)
```

### Model

```python
class GAT(torch.nn.Module):
    def __init__(self):
        super(GAT, self).__init__()
        self.hid = 8
        self.in_head = 8
        self.out_head = 1

        self.conv1 = GATConv(dataset.num_features, self.hid, heads=self.in_head, dropout=0.6)
        self.conv2 = GATConv(self.hid*self.in_head, dataset.num_classes, concat=False,
                             heads=self.out_head, dropout=0.6)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        # Dropout before the GAT layer is used to avoid overfitting in small datasets like Cora.
        # One can skip them if the dataset is sufficiently large.
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv1(x, edge_index)
        x = F.elu(x)
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv2(x, edge_index)

        return F.log_softmax(x, dim=1)
```

### Train

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = GAT().to(device)
data = dataset[0].to(device)  # Cora holds a single graph; take it out of the dataset

optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)

model.train()
for epoch in range(1000):
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])

    if epoch%200 == 0:
        print(loss)

    loss.backward()
    optimizer.step()
```

### Evaluate

```python
model.eval()
_, pred = model(data).max(dim=1)
correct = float(pred[data.test_mask].eq(data.y[data.test_mask]).sum().item())
acc = correct / data.test_mask.sum().item()
print('Accuracy: {:.4f}'.format(acc))
```

You can find our implementation made using PyTorch Geometric at GAT_PyG, with GAT trained on a citation network, the Cora dataset.

Code & GitHub Repository

Graph Attention Networks

Graph attention network, DGL by Zhang et al.

Attention Is All You Need

The Illustrated Transformer

Mechanics of Seq2seq Models With Attention

Attention? Attention!

### Written By

• Anirudh Dagar