Towards Neural Scaling Laws for Foundation Models on Temporal Graphs (2024)

Razieh Shirzadkhani1* Tran Gia Bao Ngo2* Kiarash Shamsi2
Shenyang Huang1,3 Farimah Poursafaei1,3 Poupak Azad2 Reihaneh Rabbany1,3,7
Baris Coskunuzer4 Guillaume Rabusseau1,5,6 Cuneyt Gurcan Akcora8
1Mila - Quebec AI Institute, 2Department of Computer Science, University of Manitoba,
3School of Computer Science, McGill University, 4University of Texas at Dallas,
6DIRO, Université de Montréal, 7CIFAR AI Chair,
8AI Initiative - University of Central Florida
*Equal contribution

Abstract

The field of temporal graph learning aims to learn from evolving network data to forecast future interactions. Given a collection of observed temporal graphs, is it possible to predict the evolution of an unseen network from the same domain? To answer this question, we first present the Temporal Graph Scaling (TGS) dataset, a large collection of temporal graphs consisting of eighty-four ERC20 token transaction networks collected from 2017 to 2023. Next, we evaluate the transferability of Temporal Graph Neural Networks (TGNNs) for the temporal graph property prediction task by pre-training on a collection of up to sixty-four token transaction networks and then evaluating the downstream performance on twenty unseen token networks. We find that the neural scaling law observed in NLP and computer vision also applies to temporal graph learning: pre-training on a greater number of networks leads to improved downstream performance. To the best of our knowledge, this is the first empirical demonstration of transferability in temporal graph learning. On downstream token networks, the largest pre-trained model outperforms single-model TGNNs on thirteen unseen test networks. Therefore, we believe that this is a promising first step towards building foundation models for temporal graphs.

1 Introduction

Many real-world relations can be modeled as temporal graphs, where nodes represent entities and edges represent interactions between entities that evolve over time. Examples include social networks [1, 2, 3], financial transaction networks [4], contact networks [5], and biological systems [6].

Recently, foundation models have revolutionized various fields such as natural language processing (NLP) [7, 8, 9] and computer vision (CV) [10, 11] by providing robust pre-trained architectures that can be transferred to a multitude of tasks. Foundation Models (FMs) aim to learn from large amounts of pre-training data and transfer the knowledge to unseen downstream tasks. These models have been recognized for their remarkable transfer capabilities and promising efficacy with few-shot and zero-shot learning on novel datasets and tasks [12, 13, 9].

Despite these advances in NLP and CV, foundation models for graph representation learning remain relatively unexplored. There has been some notable work on foundation models for graph neural networks (GNNs) that demonstrates their potential [14, 15, 16, 17]. However, the majority of research to date has focused on static graph learning, leaving the exploration of temporal graph neural networks largely untapped.

[Figure 1: Scaling behavior of the foundation models; performance on twenty unseen token networks improves as the number of training networks increases.]

Furthermore, to effectively train foundation models, a large collection of datasets is essential. Networks within the same domain often exhibit similar trends and statistics [18]. These datasets are crucial for assessing the performance of TGNNs, driving innovation, and ensuring that new methods generalize across various applications. To facilitate research on foundation models for temporal graphs, we introduce the Temporal Graph Scaling (TGS) benchmark, a comprehensive dataset containing 84 novel temporal graphs derived from Ethereum transaction networks. TGS offers temporal networks with up to 128k nodes and 0.5 million edges each, totaling 3 million nodes and 19 million edges across all networks, with diverse durations and evolving recent activity, which enables the training of foundation models for temporal graph learning. In addition, we train the first foundation model on temporal graphs and demonstrate that training on a large number of temporal graphs results in surprisingly strong downstream performance. Figure 1 shows the scaling behavior of our foundation model: its performance on twenty unseen token networks increases as the number of training networks increases. Notably, without fine-tuning on the test networks, the FMs achieve significant performance advantages over models trained on individual test networks. This demonstrates the strong transfer potential of foundation models on temporal graphs.

Our main contributions can be summarized as follows:

  • Novel Collection of Temporal Networks. We release a comprehensive collection of 84 datasets derived from token transaction networks with labels for the graph property prediction task. These datasets provide the foundation for studying scaling behavior, transferability, and multi-network learning on temporal graphs.

  • Neural Scaling Law on TGNNs. We explore the potential of foundation models on temporal graphs by showing that the neural scaling law also applies to temporal graphs: training TGNNs with more temporal graphs (up to 64) offers a significant performance boost on downstream test networks.

  • Transferability Across Networks. We demonstrate that by pre-training on a large number of temporal graphs, our foundation model transfers directly to 20 downstream unseen token networks while outperforming single models trained on the test networks. This shows that it is possible to learn an overall distribution across temporal graphs and transfer it to novel networks.

Reproducibility. Our code is available on GitHub and the TGS datasets are publicly available at https://zenodo.org/doi/10.5281/zenodo.11455827. The TGS website provides detailed documentation.

2 Related Work

Temporal Graph Benchmarks. Numerous graph benchmark datasets have been introduced to advance research within the temporal graph learning community. Poursafaei et al. [19] introduced six dynamic graph datasets while proposing visualization techniques and novel negative edge sampling strategies to facilitate link prediction on dynamic graphs. Following the good practice of OGB [20], Huang et al. introduced TGB [21], which provides automated and reproducible results with a novel standardized evaluation pipeline for both link and node property prediction tasks. However, these datasets belong to different domains, making them unsuitable for studying the scaling laws of neural network models trained on a large number of datasets from the same domain. Li et al. [22] provide a temporal benchmark for evaluating graph neural networks on link prediction tasks, though their focus does not extend to multiple networks. Conversely, the Live Graph Lab dataset by Zhang et al. [23] offers a temporal dataset and benchmark employed for tasks such as temporal node classification using TGNNs. In this work, we aim to explore multi-network training and to understand transferability across temporal graphs; thus, we curate a collection of temporal graphs rather than the individual networks of prior work.

Discrete Time Dynamic Graphs. A common approach in discrete time models treats each snapshot individually to capture spatial characteristics and then adopts an RNN-based method to learn temporal dependencies [24, 25, 26, 27, 28]. GCRN stacks a graph CNN for feature extraction and an LSTM cell for temporal reasoning [24]. Differing from GCRN, EvolveGCN [3] uses an RNN to control the parameters of a GCN at each snapshot. Employing two attention blocks, DySAT first generates static node embeddings at each snapshot by running a GAT-style GNN and then computes new embeddings using a temporal self-attention block [25]. Most recently, GraphPulse [29] leverages Mapper, a key tool in topological data analysis, to extract essential information from temporal graphs. However, in all previous studies, the training process of every model was limited to a single dataset, and the effectiveness of training TGNNs with diverse networks to enhance their generalization capabilities remains unexplored.

Neural Scaling Laws. Neural scaling laws [30, 31, 32] characterize the relationship between model performance and three main factors: the number of parameters, the size of the training dataset, and the amount of computation. This relationship is usually described as a power law, which can be understood by viewing learning as movement on a smooth data manifold [33]. Bahri et al. exhibited all four scaling regimes with respect to the number of model parameters as well as the dataset size, underscoring the different mechanisms driving improvements in loss [33]. Aghajanyan et al. [34] provided valuable insights into the design and training of mixed-modal generative models by studying mixed-modal scaling laws, indicating the generality of scaling laws across different domains and applications. Recently, Liu et al. [35] investigated neural scaling laws for static graphs by observing the performance of GNNs as the model size (number of layers and parameters) and training set size (number of edges) increase. To the best of our knowledge, we are the first to investigate neural scaling laws for temporal graphs.

Foundation Models. Foundation models are an emerging paradigm that aims to develop models capable of generalizing across different domains and tasks using the knowledge obtained from massive data during the pre-training stage. Recently, Rasul et al. introduced Lag-Llama [9], a general-purpose foundation model for univariate probabilistic time series forecasting based on a simple decoder-only transformer architecture that uses lags as covariates. Galkin et al. introduced ULTRA, a foundation model for knowledge graphs, which handles complex relational data and supports diverse downstream tasks effectively [36]. Similarly, Beaini et al. presented Graphium, a collection of molecular graph datasets that facilitates the development of foundation models for molecular applications, highlighting the importance of domain-specific datasets in enhancing the performance and generalizability of foundation models [16]. Lastly, Xia et al. proposed OpenGraph, an initiative towards open foundation models for graphs, emphasizing the need for transparency, reproducibility, and community-driven advancements in graph representation learning [37]. These works underscore the growing recognition of the importance of foundation models and their transformative potential across domains such as molecular graphs. However, foundation models for temporal graphs remain unexplored.

3 Preliminaries

Temporal graphs are generally categorized into two types: Continuous Time Dynamic Graphs (CTDGs) and Discrete Time Dynamic Graphs (DTDGs) [38]. We focus on DTDGs because this approach aligns well with our objective of capturing and analyzing a graph's dynamics at specific time intervals, such as on a weekly basis. In DTDGs, the graph's temporal evolution is represented in discrete time steps, simplifying the analysis and modeling of large collections of temporal networks. Each time step provides a snapshot of the graph at a specific moment, facilitating straightforward comparisons and the identification of temporal patterns.

Definition 1 (Discrete Time Dynamic Graphs).

Formally, a DTDG represents the network as a sequence of graph snapshots $\mathcal{G}=\{\mathcal{G}_{t_{1}},\mathcal{G}_{t_{2}},\mathcal{G}_{t_{3}},\ldots,\mathcal{G}_{t_{n}}\}$ where $t_{i}<t_{j}$ for $i<j$. Each $\mathcal{G}_{t_{i}}=(\mathcal{V}_{t_{i}},\mathcal{E}_{t_{i}},\mathbf{X}_{t_{i}},\mathbf{Y}_{t_{i}})$ is the graph at timestamp $t_{i}$, where $\mathcal{V}_{t_{i}}$ and $\mathcal{E}_{t_{i}}$ represent the sets of nodes and edges, $\mathbf{X}_{t_{i}}$ denotes the node feature matrix, and $\mathbf{Y}_{t_{i}}$ represents the edge feature matrix of graph $\mathcal{G}_{t_{i}}$. Therefore, a collection of discrete time dynamic graphs is defined as $D=\{\mathcal{G}^{1},\mathcal{G}^{2},\ldots,\mathcal{G}^{m}\}$, where $m$ is the number of DTDGs.
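To make the notation concrete, the following minimal Python sketch shows one way a DTDG and a collection of DTDGs could be represented in code; the class and field names are illustrative and not part of the released TGS tooling.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Snapshot:
    """One snapshot G_t = (V_t, E_t, X_t, Y_t) of a discrete time dynamic graph."""
    nodes: np.ndarray       # node ids present at time t
    edges: np.ndarray       # shape (|E_t|, 2): directed (src, dst) pairs
    node_feats: np.ndarray  # X_t, shape (|V_t|, d_node)
    edge_feats: np.ndarray  # Y_t, shape (|E_t|, d_edge), e.g. transfer amounts

@dataclass
class DTDG:
    """A DTDG is an ordered sequence of snapshots G_{t_1}, ..., G_{t_n}."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def __len__(self) -> int:
        return len(self.snapshots)

# A collection D = {G^1, ..., G^m} is then simply a list of DTDGs:
# collection: List[DTDG] = [dtdg_1, ..., dtdg_m]
```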

Temporal Graph Property Prediction. For the task of temporal graph property prediction, we aim to forecast a temporal graph property within a future time interval of a DTDG. More specifically, given a DTDG $\mathcal{G}$, we consider a time duration $[t_{\delta_{1}},t_{\delta_{2}}]$, where $\delta_{1}$ and $\delta_{2}$ are non-negative integers with $\delta_{1}\leq\delta_{2}$. Then, at a specific time $t_{k}$, the goal is to predict the target graph property within the specified future interval $[t_{k+\delta_{1}},t_{k+\delta_{2}}]$. As for the graph property, characteristics such as temporal global efficiency, the temporal-correlation coefficient, and temporal betweenness centrality can also be explored in the future.

Hyperbolic Graph Neural Networks. Hyperbolic geometry has been increasingly recognized for its ability to achieve state-of-the-art performance in several static graph embedding tasks [39]. HTGN is a recent hyperbolic model that shows strong performance in learning over dynamic graphs in a DTDG manner. The model employs a hyperbolic graph neural network (HGNN) to learn the topological dependencies of the nodes and a hyperbolic gated recurrent unit (HGRU) to capture the temporal dependencies. Given feature vectors $X^{E}_{t}$ of snapshot $t$ in Euclidean space, an HGNN layer first applies an exponential map to project the Euclidean vectors into hyperbolic space, $X^{\mathcal{H}}_{t}=\exp^{c}(X^{E}_{t})$, and then performs aggregation and activation as in a GNN but in a hyperbolic manner, $\tilde{X}_{t}^{\mathcal{H}}=\mathbf{HGNN}(X_{t}^{\mathcal{H}})$. To prevent the recurrent unit from emphasizing only the most recent time steps and to ensure stability and generalization of the embeddings, HTGN uses temporal contextual attention (HTA) to generalize the latest $w$ hidden states, $\tilde{H}_{t-1}^{\mathcal{H}}=\mathbf{HTA}(H_{t-w};\ldots;H_{t-1})$ [39]. HGRU takes the output of the HGNN, $\tilde{X}_{t}^{\mathcal{H}}$, and the attentive hidden state from HTA, $\tilde{H}_{t-1}^{\mathcal{H}}$, as input to update its gates and memory cells, and outputs the latest hidden state, $H_{t}^{\mathcal{H}}=\mathbf{HGRU}(\tilde{X}_{t}^{\mathcal{H}},\tilde{H}_{t-1}^{\mathcal{H}})$.
In addition, HTGN enables updating the model's state at test time to incorporate new information, which makes it a good candidate for studying the scaling law of TGNNs. We further describe HTGN in Appendix Section E.
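The composition described above can be summarized with a highly simplified PyTorch-style sketch. The hyperbolic machinery (exponential/logarithmic maps, Möbius operations) is replaced with Euclidean placeholders, and the HGNN, HTA, and HGRU blocks are stand-ins, so this illustrates only the data flow, not the actual HTGN implementation.

```python
import torch
import torch.nn as nn
from typing import List

class HTGNStyleEncoder(nn.Module):
    """Schematic HGNN -> HTA -> HGRU data flow; all hyperbolic operations are
    approximated by Euclidean placeholders for illustration only."""

    def __init__(self, dim: int, window: int):
        super().__init__()
        self.hgnn = nn.Linear(dim, dim)          # stand-in for hyperbolic GNN aggregation
        self.hta = nn.Linear(window * dim, dim)  # stand-in for temporal contextual attention
        self.hgru = nn.GRUCell(dim, dim)         # stand-in for the hyperbolic GRU
        self.window = window

    def forward(self, x_t: torch.Tensor, history: List[torch.Tensor]) -> torch.Tensor:
        # x_t: Euclidean features of snapshot t, shape (num_nodes, dim);
        # history: at least `window` past hidden states, each (num_nodes, dim).
        x_h = torch.tanh(x_t)                     # placeholder for the exponential map
        x_tilde = torch.relu(self.hgnn(x_h))      # structural encoding of snapshot t
        h_context = torch.cat(history[-self.window:], dim=-1)
        h_tilde = self.hta(h_context)             # summarizes the last w hidden states
        return self.hgru(x_tilde, h_tilde)        # temporal update -> new hidden state H_t
```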

4 Dataset

We utilize a dataset of temporal graphs sourced from the Ethereum blockchain [40], one of the leading blockchain platforms. In this section, we describe Ethereum, explain our data pipeline, and conclude with the defining characteristics of the resulting dataset.

Ethereum and ERC20 Token Networks. Blockchain [41] is a decentralized and secure database technology composed of blocks of transactions that can be verified and confirmed without the need for a central authority. Ethereum is one of the most popular blockchains, designed to store and execute complex structures such as software code, known as smart contracts. A smart contract is a computerized transaction protocol that executes the terms of a pre-defined agreement [42]. Typically implemented on the Ethereum blockchain, smart contracts ensure that the terms of the contract are automatically enforced and executed when certain conditions are met [43]. These contracts have their own account addresses, which can be called to perform actions such as buying or selling digital tokens [43]. As contracts proliferated, code standards [44], such as ERC20, were created to define required functions (e.g., transfer()) for sales of assets, which are called tokens. The most widely used standard, ERC20, defines asset networks over fungible tokens, which form our dataset. Fungible tokens are interchangeable and uniform; each token is identical in value and functionality to another token of the same type, similar to how one unit of currency is equivalent to another unit of the same currency.

Block to Graph Data. We create our transaction network data by first installing an Ethereum node and accessing the P2P network using the Ethereum client Geth (https://github.com/ethereum/go-ethereum). Then, we use Ethereum-ETL (https://github.com/blockchain-etl/ethereum-etl) to parse all ERC20 tokens and extract asset transactions. We extracted more than sixty thousand ERC20 tokens from the entire history of the Ethereum blockchain. However, during the lifespans of most token networks, there are interim periods without any transactions, and a significant number of tokens live for only a short time span. To avoid training data quality issues, we use 84 token networks that have at least one transaction every day during their lifespan and are large enough to serve as a benchmark for foundation model training.
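As an illustration of this filtering step, the sketch below groups extracted ERC20 transfers by token contract and keeps only tokens with at least one transaction on every day of their lifespan. The file name and column names are hypothetical; the actual schema produced by our extraction pipeline may differ.

```python
import pandas as pd

# Hypothetical schema for the extracted ERC20 transfer records.
transfers = pd.read_csv(
    "token_transfers.csv",
    usecols=["token_address", "from_address", "to_address", "value", "block_timestamp"],
    parse_dates=["block_timestamp"],
)

networks = {}
for token, df in transfers.groupby("token_address"):
    days = df["block_timestamp"].dt.normalize()
    lifespan = pd.date_range(days.min(), days.max(), freq="D")
    if days.nunique() == len(lifespan):       # at least one transaction every day
        edges = df[["from_address", "to_address", "value"]].copy()
        edges["day"] = days.values            # one timestamped edge per transfer
        networks[token] = edges               # temporal edge list of this token network
```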

[Figure 2: Overview of the TGS pipeline, from dataset extraction to foundation model training.]

Temporal Networks. Each token network is inherently a temporal graph due to the time-stamped nature of transactions. In these graphs, nodes (addresses), edges (transactions), and edge weights (transaction values) change over time, reflecting the dynamic evolution of the network. This temporal aspect allows patterns, trends, and anomalies in transaction behavior to be captured effectively. Collecting a group of temporal graphs from different ERC20 token networks offers significant advantages, enabling comparative analysis to identify common patterns and unique behaviors across tokens, which enhances the robustness and generalizability of models trained on the data. Additionally, different tokens often share addresses, i.e., unique account identifiers, across networks, as the same investors participate in multiple token networks. These similarities across token networks facilitate transfer learning for various temporal tasks, enabling models to be adapted from one network to another and enhancing our understanding of the ecosystem. Figure 2 illustrates the TGS pipeline from dataset extraction to the foundation model training step.

Ethics and Privacy. All Ethereum transaction data is publicly available to users who have the necessary resources, such as fast SSDs, large RAM, and ample disk space, to synchronize Ethereum clients and manually extract blocks. Additionally, all Ethereum data is accessible on numerous Ethereum explorer sites such as etherscan.io. An Ethereum user’s privacy depends on whether personally identifiable information (PII) is associated with any of their blockchain addresses, which serve as account handles and are considered pseudonymous. If such PII were obtained from other sources, our datasets could potentially be used to link Ethereum addresses. However, real-life identities can only be discovered using IP tracking information, which we neither have nor share. Our data does not contain any PII. Furthermore, we have developed a request form to exclude an address from the dataset.

[Figure 3: Distributions of nodes, edges, timestamps, and novelty scores across the 84 TGS token networks.]

Dataset Statistics. Our TGS dataset is a collection of 84 ERC20 token networks derived from Ethereum from 2017 to 2023. Each token network is represented as a dynamic graph in which each address is a node and each transaction between addresses is a directed edge. The largest TGS token network contains 128,159 unique addresses and 554,705 transactions, while the smallest token network has 1,454 nodes. TGS contains a diverse set of dynamic graphs in terms of nodes, edges, and timestamps, as shown in Figure 3; detailed statistics are given in Appendix D. Most networks have more than 10k nodes and over 100k edges. The lifespan of TGS networks varies from 107 days to 6 years, and each network has at least one transaction every day. Figure 3.a shows the novelty scores, i.e., the average ratio of unseen edges in each timestamp, introduced by Poursafaei et al. [19]. Most of the 84 networks have novelty scores greater than 0.3, indicating that each day sees a considerable proportion of new edges in these token networks. We adopt a 70-15-15 train-validation-test split for each token network and calculate the surprise score [19], which measures the fraction of edges that appear only in the test data. As Appendix Table 2 shows, the token networks have quite high surprise values, with an average of 0.82. We also provide the node, edge, and length distributions for the train and test sets separately in Appendix Figure 5. Overall, the training networks mostly have more nodes than those in the test set, while the numbers of edges and days are in the same range for both. A more detailed overview of the characteristics of the TGS datasets is presented in Appendix D.

5 Methodology

In this work, we use Temporal Graph Neural Networks (TGNNs) as the foundation model architecture. We choose the state-of-the-art Hyperbolic Temporal Graph Network (HTGN) [39] as an example architecture for our experiments. This section explains our choice and details our training algorithm on multiple networks.

5.1 Multi-network Training on Temporal Graphs

Existing temporal graph learning models typically train on a single temporal graph, limiting their ability to capture similar behaviors and generalize across different networks [1, 39]. We introduce TGS-train, the first algorithm designed to train across multiple temporal graphs, built by modifying a state-of-the-art single-network training model with two crucial steps: shuffling and resets. These steps, described below, render the algorithm network-agnostic and capable of learning from various temporal graphs to generalize effectively to unseen networks.

Algorithm 1 shows TGS-train in detail. As the first step, we load a list of $m$ temporal graphs $D=\{\mathcal{G}^{1},\mathcal{G}^{2},\ldots,\mathcal{G}^{m}\}$, where each temporal graph $\mathcal{G}^{i}$ is represented as a sequence of snapshots $\{\mathcal{G}^{i}_{t_{1}},\mathcal{G}^{i}_{t_{2}},\ldots,\mathcal{G}^{i}_{t_{n}}\}$. For each epoch, we shuffle the order of the list of datasets $D$ to preserve the Independent and Identically Distributed (IID) assumption of neural network training.

IID training. To preserve the IID assumption in neural network training, we include a shuffling step at each epoch. The randomized ordering of networks during training at each epoch is important because it helps prevent the model from learning spurious correlations that could arise if the data were presented in a fixed order. By shuffling the datasets, we promote randomness in the training process, which contributes to more robust and generalizable model performance.

Sequentially, for each dataset $\mathcal{G}^{i}$, we first initialize the historical embeddings, then train the complete model (i.e., encoder-decoder) on $\mathcal{G}^{i}$ in the same manner as training a single model, and evaluate the performance on the corresponding validation set of $\mathcal{G}^{i}$. After training on the $m$ datasets in $D$, we compute the average validation results across these datasets. This average is used to select the best model, which is then saved for inference. Early stopping is applied if needed.

Context switching. Many TGNNs store and utilize node embeddings from previous timestamps at later timestamps; we refer to those embeddings as historical embeddings [39, 26, 3]. Resetting historical embeddings before training on each network is a key step in training a temporal model across multiple networks for several reasons. First, it helps prevent the model from carrying over biases or assumptions from one network to another, ensuring that it can adapt effectively to the unique characteristics of each network. Starting with fresh historical embeddings enables the model to learn the most relevant and up-to-date information from the current network, leading to improved performance and generalization across different networks. Additionally, resetting historical embeddings helps mitigate catastrophic forgetting, where the model may gradually lose information about previous networks as it learns new ones.

Algorithm 1: TGS-train

Input: A temporal graph dataset $D=\{\mathcal{G}^{1},\mathcal{G}^{2},\ldots,\mathcal{G}^{m}\}$, where $\mathcal{G}^{i}=\{\mathcal{G}_{t_{1}}^{i},\mathcal{G}_{t_{2}}^{i},\ldots,\mathcal{G}_{t_{n}}^{i}\}$; $m$ = number of networks in training; $\mathbf{TGNN}$ and $\mathbf{Decoder}$

foreach epoch do
    Shuffle($D$)   // IID training
    foreach network $\mathcal{G}^{i}\in D$ do
        Initialize historical embeddings (reset)   // context switching
        foreach training snapshot $\mathcal{G}^{i}_{t_{j}}\in\mathcal{G}^{i}$ do
            $\mathcal{H}_{t_{j}}=\mathbf{TGNN}(\mathcal{G}^{i}_{t_{j}})$
            $\hat{y}_{t_{j}}=\mathbf{Decoder}(\mathcal{H}_{t_{j}})$
            $\mathcal{L}=\mathbf{Loss}(y_{t_{j}},\hat{y}_{t_{j}})$
            Backpropagation
            Update historical embeddings with $\mathcal{H}_{t_{j}}$
        Evaluate on the validation snapshots of $\mathcal{G}^{i}$
    Average validation results across all datasets to select the best model
Save the best model for inference
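A rough Python rendering of Algorithm 1 is given below. The encoder/decoder interfaces (reset_history, update_history, the snapshot iterators, and the evaluation helper) are illustrative placeholders rather than the exact implementation used in our experiments.

```python
import random

def tgs_train(datasets, tgnn, decoder, loss_fn, optimizer, evaluate_fn, epochs):
    """Sketch of TGS-train: multi-network training with shuffling and resets."""
    best_score, best_state = float("-inf"), None
    for epoch in range(epochs):
        random.shuffle(datasets)                       # IID training
        val_scores = []
        for graph in datasets:
            tgnn.reset_history()                       # context switching: fresh embeddings
            for snapshot, label in graph.train_snapshots():
                h = tgnn(snapshot)                     # node embeddings H_t
                y_hat = decoder(h)                     # graph-level prediction
                loss = loss_fn(y_hat, label)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                tgnn.update_history(h.detach())        # store embeddings for later snapshots
            val_scores.append(evaluate_fn(tgnn, decoder, graph.val_snapshots()))
        avg_val = sum(val_scores) / len(val_scores)    # average validation score across datasets
        if avg_val > best_score:                       # keep the best multi-network model
            best_score = avg_val
            best_state = {"tgnn": tgnn.state_dict(), "decoder": decoder.state_dict()}
    return best_state
```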

Inference on an unseen network. To evaluate the transferability of each foundation model, we test it on unseen datasets. We begin by loading all the weights of the foundation model, including the pre-trained encoder and decoder parameters, while initializing fresh historical embeddings. Then, we perform a single forward pass over the train and validation splits to adapt the historical embeddings to the test dataset.
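Continuing the sketch above, zero-shot inference on an unseen network could look as follows: pre-trained weights are loaded, historical embeddings are re-initialized, and a single forward pass over the train and validation snapshots adapts the embeddings before predicting on the test snapshots. Interfaces are again illustrative.

```python
import torch

def tgs_inference(checkpoint, tgnn, decoder, test_graph):
    """Sketch of zero-shot inference of a pre-trained foundation model on an unseen network."""
    tgnn.load_state_dict(checkpoint["tgnn"])
    decoder.load_state_dict(checkpoint["decoder"])
    tgnn.reset_history()                               # fresh historical embeddings

    with torch.no_grad():
        # Single forward pass over train + validation snapshots to adapt the embeddings.
        for snapshot, _ in list(test_graph.train_snapshots()) + list(test_graph.val_snapshots()):
            tgnn.update_history(tgnn(snapshot))
        # Predict the graph property on the unseen test snapshots.
        preds = [decoder(tgnn(snapshot)) for snapshot, _ in test_graph.test_snapshots()]
    return preds
```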

6 Experiments

Weekly forecasts are common in the financial context for facilitating financial decisions [45]. Similarly, for the temporal graph property prediction task (defined in Section 3), we set $\delta_{1}=3$ and $\delta_{2}=10$, thus predicting the graph property over weekly snapshots. For the experiments, we use network growth [28] in terms of edge count as the predicted graph property. See Appendix C for the dataset documentation, hosting, and maintenance plan.
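A sketch of how the binary growth labels could be constructed from per-snapshot edge counts is shown below. The reference window used to decide "growth" here (future window vs. the most recent window of equal length) is one plausible reading of the task; the released labels may be computed slightly differently.

```python
import numpy as np

def growth_labels(edge_counts: np.ndarray, d1: int = 3, d2: int = 10) -> np.ndarray:
    """Binary labels for edge-count growth over the future window [k+d1, k+d2]."""
    w = d2 - d1 + 1                                    # window length in snapshots
    labels = []
    for k in range(w - 1, len(edge_counts) - d2):
        future = edge_counts[k + d1 : k + d2 + 1].sum()
        recent = edge_counts[k - w + 1 : k + 1].sum()  # most recent window of equal length
        labels.append(int(future > recent))
    return np.array(labels)
```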

6.1 Prediction Baselines

Persistence forecast model. For our basic baseline model, we employ a naive deterministic heuristic, the persistence forecast [46], for label generation. In this approach, we use data from the previous and current weeks to predict the next week's property: if we observe an increasing trend in the number of transactions in the current week compared to the previous week, we predict a similar increasing trend for the following week. This simple model is based on the assumption that trends in transaction networks persist over time.
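A minimal version of this baseline is sketched below, assuming one edge count per weekly snapshot.

```python
import numpy as np

def persistence_forecast(weekly_edge_counts: np.ndarray) -> np.ndarray:
    """Persistence forecast: predict that the current week's trend continues next week."""
    grew = (weekly_edge_counts[1:] > weekly_edge_counts[:-1]).astype(int)
    # The prediction for week w+1 is the trend observed at week w,
    # so grew[:-1] aligns with the ground-truth labels grew[1:].
    return grew[:-1]
```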

Single model. We adopt the standard training process for HTGN [39] over a single dataset and make predictions for the same dataset. In each epoch, the model processes all snapshots in chronological order, with the node embeddings reset at the end of every epoch. To address graph-level tasks, we add an extra graph pooling layer as the final layer. This layer, a Multi-Layer Perceptron (MLP), takes the mean of all node embeddings, concatenates it with four graph-level snapshot features (mean in-degree, in-degree weight, out-degree, and out-degree weight), and outputs a binary classification prediction. We use Binary Cross-Entropy (BCE) as the loss function and Adam [47] as the optimization algorithm. The graph pooling layer, loss function, and optimization algorithm are shared with the foundation model training setup.
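A sketch of such a graph pooling decoder is shown below; the hidden size and exact feature ordering are illustrative, not the configuration used in our experiments.

```python
import torch
import torch.nn as nn

class GraphReadoutDecoder(nn.Module):
    """Mean-pools node embeddings, concatenates four graph-level snapshot features
    (mean in-degree, in-degree weight, out-degree, out-degree weight), and applies an MLP."""

    def __init__(self, embed_dim: int, num_snapshot_feats: int = 4, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + num_snapshot_feats, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, node_embeddings: torch.Tensor, snapshot_feats: torch.Tensor) -> torch.Tensor:
        pooled = node_embeddings.mean(dim=0)                    # (embed_dim,)
        logits = self.mlp(torch.cat([pooled, snapshot_feats]))  # concat graph-level features
        return torch.sigmoid(logits)                            # probability of growth
```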

We train every single model for 250 epochs with the learning rate set to $15\times10^{-4}$. We adopt a 70%-15%-15% ratio for the train, validation, and test split, respectively, for each training token network. The best model is selected based on the AUC results on the validation sets, and then the model's performance is evaluated using the test sets. To reduce the time complexity of training HTGN, we applied early stopping, with patience and tolerance set to 20 and $5\times10^{-2}$, respectively. Notably, the best model selection and early stopping are only applied after a minimum of 100 training epochs.
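The early-stopping rule can be sketched as below; whether the monitored quantity is the validation AUC or the validation loss, and the exact semantics of the tolerance, follow one plausible reading of the setup described above.

```python
class EarlyStopper:
    """Stop when the monitored validation metric (higher is better) has not improved
    by more than `tolerance` for `patience` consecutive epochs, and only after `min_epochs`."""

    def __init__(self, patience: int = 20, tolerance: float = 5e-2, min_epochs: int = 100):
        self.patience, self.tolerance, self.min_epochs = patience, tolerance, min_epochs
        self.best = float("-inf")
        self.bad_epochs = 0

    def should_stop(self, epoch: int, val_metric: float) -> bool:
        if val_metric > self.best + self.tolerance:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return epoch >= self.min_epochs and self.bad_epochs >= self.patience
```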

6.2 Foundation Model Training Setup

While following a similar training approach as in the single model training, we make specific adjustments for the foundation model training. We set the number of epochs to 300 with a learning rate of $10^{-4}$ and a chronological train-validation-test split ratio of 70-15-15. Early stopping is applied based on the validation loss with a tolerance of $5\times10^{-2}$ and a patience of 30. The best model is selected based on the validation AUC and used to predict on the unseen test datasets. We train six foundation models, each with a different number of networks corresponding to $2^{n}$ datasets, where $n\in[1,6]$. We name each foundation model based on the number of datasets used in training; for example, FM-16 is trained with 16 datasets.

For graph classification tasks on foundation models, we ran all experiments on an NVIDIA Quadro RTX 8000 (48G memory) with 4 standard CPU nodes. We repeated each experiment three times and report the average and standard deviation across runs. In Appendix Figure 7, we report the time per epoch for each foundation model.

6.3 Results

In this section, we present the performance of our foundation models, trained with varying numbers of datasets, on 20 unseen test datasets. We compare our results with the persistence forecast and single model baselines explained in Section 6.1. For visual clarity, Figure 4 shows the test AUC results for FM-4, FM-16, and FM-64 only, while the performance of all six foundation models is shown in Appendix Figure 6. Overall, an upward trend from FM-2 to FM-64 is observed on most datasets, such as QOM, MIR, and BEPRO, highlighting the power of larger foundation models in temporal graph learning.

[Figure 4: Test AUC of FM-4, FM-16, and FM-64 on the 20 unseen token networks, compared with the baselines.]

Table 1: Top rank, average rank, and win ratio (over the single model) of each model across the 20 unseen test networks.

Model             | Top rank ↑ | Avg. rank ↓ | Win ratio ↑
Persist. forecast | 0          | 7.7         | 0.05
Single model      | 6          | 4.5         | -
FM-2              | 0          | 6.1         | 0.40
FM-4              | 0          | 5.5         | 0.50
FM-8              | 3          | 4.3         | 0.60
FM-16             | 2          | 2.9         | 0.65
FM-32             | 3          | 2.7         | 0.70
FM-64             | 6          | 2.6         | 0.65

In Figure 4, FM-64 yields the best AUC on 13 out of 20 test datasets. This result is significant because the foundation models outperform the single models that are specifically trained on these datasets. We detail the prediction performance in Table 1, where we rank all the foundation models and the baselines by their AUC values on each test dataset and report the average rank for each model. The average rank improves with an increasing number of training networks, up to FM-64. We observe a steep decrease in the average rank from FM-2, which has a rank of 6.1 out of 8, to FM-64, which has a rank of 2.6. In other words, training on sixty-four networks instead of two improves the average rank of the foundation model by roughly 50%. In Table 1, we also present the win ratio of each model over the single model. FM-32 has the best win ratio of 0.7; however, its average rank is slightly worse than that of FM-64.

7 Conclusion

In this work, we aim to answer the question: given a collection of observed temporal graphs, is it possible to predict the evolution of an unseen network from the same domain? The answer is yes: it is possible to learn from temporal networks within the same domain and forecast future trends on unseen networks. First, we collected and released a collection of 84 temporal networks for the temporal graph property prediction task. These datasets serve as the foundation for studying neural scaling laws and foundation models on temporal graphs. Next, to learn from a large number of temporal graphs, we presented TGS-train, the first algorithm for training TGNNs across multiple temporal networks. Experimentally, we showed that the neural scaling law also applies to temporal graphs; in particular, the more training networks are used, the better the model performance on unseen test networks. In addition, our trained foundation models can outperform single models trained on individual test networks. Our empirical observations show the high potential of training foundation models on temporal graphs. We believe our TGS benchmark will enable future work to develop novel foundation models for temporal graphs and to study transferability across networks.

References

  • [1]E.Rossi, B.Chamberlain, F.Frasca, D.Eynard, F.Monti, and M.M. Bronstein, “Temporal graph networks for deep learning on dynamic graphs,” CoRR, vol.abs/2006.10637, 2020.
  • [2]T.Bai, Y.Zhang, B.Wu, and J.Nie, “Temporal graph neural networks for social recommendation,” in 2020 IEEE International Conference on Big Data (IEEE BigData 2020), Atlanta, GA, USA, December 10-13, 2020 (X.Wu, C.Jermaine, L.Xiong, X.Hu, O.Kotevska, S.Lu, W.Xu, S.Aluru, C.Zhai, E.Al-Masri, Z.Chen, and J.Saltz, eds.), pp.898–903, IEEE, 2020.
  • [3]A.Pareja, G.Domeniconi, J.Chen, T.Ma, T.Suzumura, H.Kanezashi, T.Kaler, T.B. Schardl, and C.E. Leiserson, “Evolvegcn: Evolving graph convolutional networks for dynamic graphs,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp.5363–5370, AAAI Press, 2020.
  • [4]K.Shamsi, F.Victor, M.Kantarcioglu, Y.Gel, and C.G. Akcora, “Chartalist: Labeled graph datasets for utxo and account-based blockchains,” Advances in Neural Information Processing Systems, vol.35, pp.34926–34939, 2022.
  • [5]S.Huang, F.Poursafaei, J.Danovitch, M.Fey, W.Hu, E.Rossi, J.Leskovec, M.Bronstein, G.Rabusseau, and R.Rabbany, “Temporal graph benchmark for machine learning on temporal graphs,” Advances in Neural Information Processing Systems, vol.36, 2024.
  • [6]Y.You, T.Chen, Y.Sui, T.Chen, Z.Wang, and Y.Shen, “Graph contrastive learning with augmentations,” Advances in neural information processing systems, vol.33, pp.5812–5823, 2020.
  • [7]S.Bubeck, V.Chandrasekaran, R.Eldan, J.Gehrke, E.Horvitz, E.Kamar, P.Lee, Y.T. Lee, Y.Li, S.M. Lundberg, H.Nori, H.Palangi, M.T. Ribeiro, and Y.Zhang, “Sparks of artificial general intelligence: Early experiments with GPT-4,” CoRR, vol.abs/2303.12712, 2023.
  • [8]T.B. Brown, B.Mann, N.Ryder, M.Subbiah, J.Kaplan, P.Dhariwal, A.Neelakantan, P.Shyam, G.Sastry, A.Askell, S.Agarwal, A.Herbert-Voss, G.Krueger, T.Henighan, R.Child, A.Ramesh, D.M. Ziegler, J.Wu, C.Winter, C.Hesse, M.Chen, E.Sigler, M.Litwin, S.Gray, B.Chess, J.Clark, C.Berner, S.McCandlish, A.Radford, I.Sutskever, and D.Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (H.Larochelle, M.Ranzato, R.Hadsell, M.Balcan, and H.Lin, eds.), 2020.
  • [9]K.Rasul, A.Ashok, A.R. Williams, H.Ghonia, R.Bhagwatkar, A.Khorasani, M.J.D. Bayazi, G.Adamopoulos, R.Riachi, N.Hassen, M.Biloš, S.Garg, A.Schneider, N.Chapados, A.Drouin, V.Zantedeschi, Y.Nevmyvaka, and I.Rish, “Lag-llama: Towards foundation models for probabilistic time series forecasting,” 2024.
  • [10]A.Radford, J.W. Kim, C.Hallacy, A.Ramesh, G.Goh, S.Agarwal, G.Sastry, A.Askell, P.Mishkin, J.Clark, G.Krueger, and I.Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (M.Meila and T.Zhang, eds.), vol.139 of Proceedings of Machine Learning Research, pp.8748–8763, PMLR, 2021.
  • [11]M.Awais, M.Naseer, S.Khan, R.M. Anwer, H.Cholakkal, M.Shah, M.-H. Yang, and F.S. Khan, “Foundational models defining a new era in vision: A survey and outlook,” arXiv preprint arXiv:2307.13721, 2023.
  • [12]R.Bommasani, D.A. Hudson, E.Adeli, R.B. Altman, S.Arora, S.von Arx, M.S. Bernstein, J.Bohg, A.Bosselut, E.Brunskill, E.Brynjolfsson, S.Buch, D.Card, R.Castellon, N.S. Chatterji, A.S. Chen, K.Creel, J.Q. Davis, D.Demszky, C.Donahue, M.Doumbouya, E.Durmus, S.Ermon, J.Etchemendy, K.Ethayarajh, L.Fei-Fei, C.Finn, T.Gale, L.Gillespie, K.Goel, N.D. Goodman, S.Grossman, N.Guha, T.Hashimoto, P.Henderson, J.Hewitt, D.E. Ho, J.Hong, K.Hsu, J.Huang, T.Icard, S.Jain, D.Jurafsky, P.Kalluri, S.Karamcheti, G.Keeling, F.Khani, O.Khattab, P.W. Koh, M.S. Krass, R.Krishna, R.Kuditipudi, and etal., “On the opportunities and risks of foundation models,” CoRR, vol.abs/2108.07258, 2021.
  • [13]Q.Dong, L.Li, D.Dai, C.Zheng, Z.Wu, B.Chang, X.Sun, J.Xu, L.Li, and Z.Sui, “A survey for in-context learning,” CoRR, vol.abs/2301.00234, 2023.
  • [14]H.Mao, Z.Chen, W.Tang, J.Zhao, Y.Ma, T.Zhao, N.Shah, M.Galkin, and J.Tang, “Graph foundation models,” 2024.
  • [15]M.Galkin, X.Yuan, H.Mostafa, J.Tang, and Z.Zhu, “Towards foundation models for knowledge graph reasoning,” 2024.
  • [16]D.Beaini, S.Huang, J.A. Cunha, Z.Li, G.Moisescu-Pareja, O.Dymov, S.Maddrell-Mander, C.McLean, F.Wenkel, L.Müller, etal., “Towards foundational models for molecular learning on large-scale multi-task datasets,” in The Twelfth International Conference on Learning Representations, 2023.
  • [17]O.Méndez-Lucio, C.Nicolaou, and B.Earnshaw, “Mole: a molecular foundation model for drug discovery,” arXiv preprint arXiv:2211.02657, 2022.
  • [18]S.Jin and R.Zafarani, “The spectral zoo of networks: Embedding and visualizing networks with spectral moments,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.1426–1434, 2020.
  • [19]F.Poursafaei, S.Huang, K.Pelrine, and R.Rabbany, “Towards better evaluation for dynamic link prediction,” in Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022 (S.Koyejo, S.Mohamed, A.Agarwal, D.Belgrave, K.Cho, and A.Oh, eds.), 2022.
  • [20]W.Hu, M.Fey, M.Zitnik, Y.Dong, H.Ren, B.Liu, M.Catasta, and J.Leskovec, “Open graph benchmark: Datasets for machine learning on graphs,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (H.Larochelle, M.Ranzato, R.Hadsell, M.Balcan, and H.Lin, eds.), 2020.
  • [21]S.Huang, F.Poursafaei, J.Danovitch, M.Fey, W.Hu, E.Rossi, J.Leskovec, M.M. Bronstein, G.Rabusseau, and R.Rabbany, “Temporal graph benchmark for machine learning on temporal graphs,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 (A.Oh, T.Naumann, A.Globerson, K.Saenko, M.Hardt, and S.Levine, eds.), 2023.
  • [22]J.Li, H.Shomer, H.Mao, S.Zeng, Y.Ma, N.Shah, J.Tang, and D.Yin, “Evaluating graph neural networks for link prediction: Current pitfalls and new benchmarking,” Advances in Neural Information Processing Systems, vol.36, 2024.
  • [23]Z.Zhang, B.Luo, S.Lu, and B.He, “Live graph lab: Towards open, dynamic and real transaction graphs with NFT,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 (A.Oh, T.Naumann, A.Globerson, K.Saenko, M.Hardt, and S.Levine, eds.), 2023.
  • [24]Y.Seo, M.Defferrard, P.Vandergheynst, and X.Bresson, “Structured sequence modeling with graph convolutional recurrent networks,” 2016.
  • [25]A.Sankar, Y.Wu, L.Gou, W.Zhang, and H.Yang, “Dynamic graph representation learning via self-attention networks,” 2019.
  • [26]J.Chen, X.Wang, and X.Xu, “GC-LSTM: graph convolution embedded LSTM for dynamic network link prediction,” Appl. Intell., vol.52, no.7, pp.7513–7528, 2022.
  • [27]J.Li, Z.Han, H.Cheng, J.Su, P.Wang, J.Zhang, and L.Pan, “Predicting path failure in time-evolving graphs,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019 (A.Teredesai, V.Kumar, Y.Li, R.Rosales, E.Terzi, and G.Karypis, eds.), pp.1279–1289, ACM, 2019.
  • [28]K.Shamsi, F.Poursafaei, S.Huang, B.T.G. Ngo, B.Coskunuzer, and C.G. Akcora, “Graphpulse: Topological representations for temporal graph property prediction,” in The Twelfth International Conference on Learning Representations, 2024.
  • [29]K.Shamsi, F.Poursafaei, S.Huang, B.T.G. Ngo, B.Coskunuzer, and C.G. Akcora, “Graphpulse: Topological representations for temporal graph property prediction,” in The Twelfth International Conference on Learning Representations, 2023.
  • [30]J.S. Rosenfeld, A.Rosenfeld, Y.Belinkov, and N.Shavit, “A constructive prediction of the generalization error across scales,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020.
  • [31]J.Kaplan, S.McCandlish, T.Henighan, T.B. Brown, B.Chess, R.Child, S.Gray, A.Radford, J.Wu, and D.Amodei, “Scaling laws for neural language models,” CoRR, vol.abs/2001.08361, 2020.
  • [32]S.Abnar, M.Dehghani, B.Neyshabur, and H.Sedghi, “Exploring the limits of large scale pre-training,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, OpenReview.net, 2022.
  • [33]Y.Bahri, E.Dyer, J.Kaplan, J.Lee, and U.Sharma, “Explaining neural scaling laws,” CoRR, vol.abs/2102.06701, 2021.
  • [34]A.Aghajanyan, L.Yu, A.Conneau, W.Hsu, K.Hambardzumyan, S.Zhang, S.Roller, N.Goyal, O.Levy, and L.Zettlemoyer, “Scaling laws for generative mixed-modal language models,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (A.Krause, E.Brunskill, K.Cho, B.Engelhardt, S.Sabato, and J.Scarlett, eds.), vol.202 of Proceedings of Machine Learning Research, pp.265–279, PMLR, 2023.
  • [35]J.Liu, H.Mao, Z.Chen, T.Zhao, N.Shah, and J.Tang, “Neural scaling laws on graphs,” CoRR, vol.abs/2402.02054, 2024.
  • [36]M.Galkin, X.Yuan, H.Mostafa, J.Tang, and Z.Zhu, “Towards foundation models for knowledge graph reasoning,” in The Twelfth International Conference on Learning Representations, 2023.
  • [37]L.Xia, B.Kao, and C.Huang, “Opengraph: Towards open graph foundation models,” arXiv preprint arXiv:2403.01121, 2024.
  • [38]S.M. Kazemi, R.Goel, K.Jain, I.Kobyzev, A.Sethi, P.Forsyth, and P.Poupart, “Representation learning for dynamic graphs: A survey,” Journal of Machine Learning Research, vol.21, no.70, pp.1–73, 2020.
  • [39]M.Yang, M.Zhou, M.Kalander, Z.Huang, and I.King, “Discrete-time temporal network embedding via implicit hierarchical learning in hyperbolic space,” in KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021 (F.Zhu, B.C. Ooi, and C.Miao, eds.), pp.1975–1985, ACM, 2021.
  • [40]G.Wood etal., “Ethereum: A secure decentralised generalised transaction ledger,” Ethereum project yellow paper, vol.151, no.2014, pp.1–32, 2014.
  • [41]C.G. Akcora, Y.R. Gel, and M.Kantarcioglu, “Blockchain: A graph primer,” CoRR, vol.abs/1708.08749, 2017.
  • [42]N.Szabo, “The idea of smart contracts,” Nick Szabo’s Papers and Concise Tutorials, 1997.
  • [43]Z.Zheng, S.Xie, H.Dai, W.Chen, X.Chen, J.Weng, and M.Imran, “An overview on smart contracts: Challenges, advances and platforms,” Future Gener. Comput. Syst., vol.105, pp.475–491, 2020.
  • [44]M.DiAngelo and G.Salzer, “Tokens, types, and standards: identification and utilization in ethereum,” in 2020 IEEE International Conference on Decentralized Applications and Infrastructures (DAPPS), pp.1–10, IEEE, 2020.
  • [45]H.-M. Kim, G.-W. Bock, and G.Lee, “Predicting ethereum prices with machine learning based on blockchain information,” Expert Systems with Applications, vol.184, p.115480, 2021.
  • [46]S.Salcedo-Sanz, D.Casillas-Pérez, J.D. Ser, C.Casanova-Mateo, L.Cuadra, M.Piles, and G.Camps-Valls, “Persistence in complex systems,” Physics Reports, vol.957, pp.1–73, 2022.
  • [47]D.P. Kingma and J.Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (Y.Bengio and Y.LeCun, eds.), 2015.

Appendix A Broader Impact

Foundation models have broad applications across various domains, including the financial sector, which is our primary focus. Our goal is to develop a temporal foundation model capable of predicting future trends with minimal training. This model can identify similar behaviors and be utilized in real-time scenarios such as network trend analysis or token price prediction.

Negative Impact. Although this work aims to pave the way for significant advances in temporal graph learning, there are potential negative impacts requiring careful consideration. First, the focus on pre-training models on the TGS collection of token transaction networks may inadvertently bias the models towards specific types of data, reducing their generalizability and effectiveness when applied to other domains or types of temporal graphs. Second, the observed neural scaling law, which indicates that larger pre-training datasets lead to better performance, requires significant computational resources. On one hand, extensive model pre-training entails high energy consumption and potential environmental costs. On the other hand, such computational requirements could lead to a concentration of advancements in well-funded institutions, potentially stifling innovation and diversity of thought in the field. Finally, the emphasis on model performance might overshadow the importance of interpretability and transparency. Addressing these potential negative impacts is crucial to ensure the responsible development and deployment of temporal graph learning.

Appendix B Limitations

Our work has the following limitations. i) Our scaling results indicate that training with a larger number of networks enhances model generalizability. Although we limited the foundation model to sixty-four networks due to resource constraints, training on a larger number of networks could further improve performance. ii) While we used the discrete-time dynamic graph setting as our benchmark, this approach can be generalized to continuous-time graphs, representing a promising area for future research. iii) Although our current focus is on financial networks, the temporal scaling law should also be studied for other domains, such as social media or transportation networks, which we plan to explore in future work.

Appendix C Dataset Documentation and Intended Use

All datasets introduced by TGS are intended for academic usage under MIT license. We, as authors, bear all responsibility in case of violation of rights. Here are the relevant links for code, dataset and website:

Maintenance plan. To create a comprehensive, reliable, and reproducible benchmark for temporal graph scaling, we plan to continuously develop and maintain TGS with input and involvement from the community. Our objective is to expand the dataset by extracting and adding more token networks to support the training of larger foundation models in the future. The TGS dataset is hosted and maintained by the Digital Research Alliance of Canada, funded by the Government of Canada.

Appendix D Additional Dataset Statistics

We summarize detailed statistics of each token network in the TGS dataset in Table 2. In the table, the growth rate is the ratio of label 1, indicating an increase in the number of edge counts with respect to the problem definition in Section 3. In addition, the novelty score, the average ratio of new edges in each timestamp, and the surprise score, the ratio of edges that only appear in the test set, both introduced by Poursafaei et al. [19], are defined as follows:

$\mathrm{novelty} = \frac{1}{T}\sum_{t=1}^{T}\frac{|E^{t}\setminus E^{t}_{seen}|}{|E^{t}|}$,   (1a)
$\mathrm{surprise} = \frac{|E_{test}\setminus E_{train}|}{|E_{test}|}$.   (1b)

where $E^{t}$ and $E^{t}_{seen}$ denote the set of edges present at timestamp $t$ and the set of edges seen in previous timestamps, respectively. $E_{test}$ denotes the edges that appear in the test set, and $E_{train}$ the edges that appear in the train set.
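
For concreteness, the following is a minimal sketch of how these two scores could be computed from per-snapshot edge lists; the function names and the data layout (a list of edge sets, one per timestamp) are illustrative assumptions rather than the released TGS loading code.

```python
def novelty_score(snapshots):
    """Average fraction of edges in each snapshot never seen in earlier snapshots."""
    seen, ratios = set(), []
    for edges in snapshots:          # `snapshots`: list of edge sets, one per timestamp
        if edges:
            ratios.append(len(edges - seen) / len(edges))
        seen |= edges
    return sum(ratios) / len(ratios)


def surprise_score(train_snapshots, test_snapshots):
    """Fraction of test edges that never appear in the training split."""
    train_edges = set().union(*train_snapshots)
    test_edges = set().union(*test_snapshots)
    return len(test_edges - train_edges) / len(test_edges)
```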

Comparison between training and testing sets. To learn the scaling law of TGNNs, we divide TGS into two disjoint sets: 64 randomly selected token networks are used for training, and the remaining 20 token networks are used to evaluate performance. The distributions of nodes, transactions, and duration (in days) over the training and test sets are shown in Figure 5. The training set supports the foundation model in generalizing the characteristics of the entire TGS dataset, since the node, edge, and duration distributions of the training networks in Figures 5(a), 5(b), and 5(c) closely resemble those across all 84 token networks of TGS. In addition, the variation of dataset characteristics in the test set is shown in Figures 5(d), 5(e), and 5(f).
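
As a point of reference for reproducing such a split, below is a minimal sketch that randomly partitions the 84 token networks into 64 pre-training and 20 held-out networks; the `token_names` list and the seed are placeholders, not the exact split used in our experiments.

```python
import random

def split_tgs(token_names, n_train=64, seed=0):
    """Randomly partition token networks into disjoint pre-training/test sets.

    `token_names`: the 84 TGS token identifiers (placeholder input).
    The seed is arbitrary and not the one behind the reported split.
    """
    rng = random.Random(seed)
    shuffled = list(token_names)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```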

[Figure 5: Distributions of nodes, transactions, and duration (in days) for the training set (a–c) and the test set (d–f).]

Token | Node | Transaction | Timestamp (days) | Growth rate | Novelty | Surprise
ARC11325709686060.430.320.88
CELR6535023580716910.490.560.96
CMT868952059613090.450.720.92
DRGN11345334184921640.440.570.97
GHST3515618095511460.430.510.93
INU8556663151540.270.410.59
IOTX6307928846919930.450.560.99
QSP11797729967121780.450.670.99
REP832822248433460.460.690.96
RFD232081736951690.30.390.6
TNT8824731635212160.430.550.93
TRAC7166729918121100.460.540.97
RLB280332402911290.430.490.76
steCRV1907921153810330.450.530.9
ALBT6304243488111520.430.440.89
POLS12815955470511320.450.610.94
SWAP6923050976912130.460.450.79
SUPER832995020309860.470.460.85
RARI8718650296012070.430.470.91
KP3R3932349325811020.430.330.88
MIR7998444499810660.450.430.92
aUSDC2374247568010670.460.40.73
LUSD258524304739430.480.360.87
PICKLE2849843026211490.480.340.69
DODO4704639044311310.470.450.91
YFII4396439198411960.440.440.96
STARL715903699138560.460.480.86
LQTY346873742309430.450.340.91
FEG11829436758410070.40.620.92
AUDIO9121836268511080.450.580.95
OHM457283770686900.430.460.88
WOOL168743511787160.410.180.41
Metis525863431419070.440.480.89
cDAI5275335805014370.450.460.9
BITCOIN340513470541780.480.390.63
INJ6047231282211130.460.520.98
MIM230382693668850.440.40.89
GLM5338523491210800.50.530.96
Mog145902406801070.370.380.55
DPI4062723424611500.490.50.86
LINA4534222714711440.450.460.95
Yf-DAI2246622687511580.420.310.87
BOB428062120991990.350.480.73
RGT3527721193211100.440.460.98
TVK4253920808210620.410.480.93
RSR506452059066590.470.620.91
WOJAK343411986532010.370.480.73
ANT3651720026211070.470.460.93
LADYS374861921761810.370.520.79
ETH2x-FLI110081990889650.470.280.84
TURBO386381890481890.330.480.72
REPv23906119136711940.480.50.97
NOIA2979818552811330.460.370.7
0x0215311824302830.510.460.81
PSYOP254501688961690.320.390.59
ShibDoge400231346976800.430.530.8
ADX1456712375511880.440.40.91
BAG118601226342980.310.440.87
QOM217571182925980.460.410.81
BEPRO2652112026111320.460.480.87
AIOZ292311199269470.430.490.89
PRE4047611862511130.50.550.86
CRU1999011771211440.50.430.95
POOH272451116411930.260.490.69
DERC242771112058240.450.490.83
stkAAVE3735511092411280.420.570.71
BTRFLY84501083714530.480.340.44
SDEX91271048692400.410.440.75
XCN200851041856070.460.420.84
HOP370041026505140.410.60.88
MAHA18401961807490.430.470.91
DINO15837941403580.440.440.74
bendWETH1454968985930.510.210.51
PUSH14501931039360.460.380.83
SPONGE25852904681840.310.660.81
sILV212838929056110.40.340.48
SLP66759536811510.430.360.91
crvUSD2950886471740.610.370.73
MUTE12426823459770.430.460.95
EVERMOON7552798681630.240.350.52
HOICHI5075773614360.360.320.71
DOGE2.07664790471230.450.380.66
ORN4401023945111340.460.470.87
aDAI1364818705010680.450.460.82

Appendix E Hyperbolic Temporal Graph Network (HTGN)

To interpret hyperbolic embeddings, Yang et al. [39] adopt the Poincaré ball model with negative curvature $-c$ (for $c>0$), which corresponds to the Riemannian manifold $\mathbb{H}^{n,c}=\{x\in\mathbb{R}^{n}: c\|x\|^{2}<1\}$, an open $n$-dimensional ball. Given a Euclidean vector $x_{i}^{E}\in\mathbb{R}^{d}$, we consider it as a point in the tangent space $\mathcal{T}_{x'}\mathbb{H}^{d,c}$ and adopt the exponential map to project it into hyperbolic space:

$x_{i}^{\mathcal{H}}=\exp_{x'}^{c}(x_{i}^{E})$   (2)

This results in $x_{i}^{\mathcal{H}}\in\mathbb{H}^{d,c}$, which then serves as input to the HGNN layer as follows [39]:

$\mathbf{m}_{i}^{\mathcal{H}} = W\otimes^{c}\mathbf{x}_{i}^{\mathcal{H}}\oplus^{c}\mathbf{b}$,   (3a)
$\tilde{\mathbf{m}}_{i}^{\mathcal{H}} = \exp_{\mathbf{x}'}^{c}\big(\sum_{j\in\mathcal{N}(i)}\alpha_{ij}\log_{\mathbf{x}'}^{c}(\mathbf{m}_{j}^{\mathcal{H}})\big)$,   (3b)
$\tilde{\mathbf{x}}_{i}^{\mathcal{H}} = \exp_{\mathbf{x}'}^{c}\big(\sigma(\log_{\mathbf{x}'}^{c}(\tilde{\mathbf{m}}_{i}^{\mathcal{H}}))\big)$.   (3c)

where $W$ and $\mathbf{b}$ are learnable parameters, and the hyperbolic activation function $\sigma$ is realized by applying the logarithmic and exponential maps. The HGNN leverages attention-based aggregation, assigning an attention score $\alpha_{ij}$ that indicates the importance of neighbour $j$ to node $i$, computed as follows:

$\alpha_{ij} = \mathrm{softmax}_{j\in\mathcal{N}(i)}(s_{ij}) = \frac{\exp(s_{ij})}{\sum_{j'\in\mathcal{N}(i)}\exp(s_{ij'})}$,   (4)
$s_{ij} = \mathrm{LeakyReLU}\big(a^{T}[\log_{0}^{c}(m_{i}^{l})\,\|\,\log_{0}^{c}(m_{j}^{l})]\big)$,

where $a$ is a trainable vector and $\|$ denotes the concatenation operation.
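
To make the attention-based aggregation in Equations (3b) and (4) concrete, here is a minimal PyTorch-style sketch that computes the attention weights on tangent-space features; the dense adjacency matrix and the pre-computed log-mapped messages are simplifying assumptions, not a reproduction of the HTGN implementation.

```python
import torch
import torch.nn.functional as F

def hyperbolic_attention(m_tan, adj, a):
    """Attention weights of Eq. (4) computed on tangent-space messages.

    m_tan: (N, d) messages already mapped through log_0^c.
    adj:   (N, N) binary adjacency matrix (dense, for simplicity).
    a:     (2d,)  trainable attention vector.
    """
    n = m_tan.size(0)
    # s_ij = LeakyReLU(a^T [log_0^c(m_i) || log_0^c(m_j)])
    pair = torch.cat(
        [m_tan.unsqueeze(1).expand(n, n, -1),   # m_i broadcast over columns j
         m_tan.unsqueeze(0).expand(n, n, -1)],  # m_j broadcast over rows i
        dim=-1,
    )
    scores = F.leaky_relu(pair @ a)
    # softmax restricted to each node's neighbourhood N(i)
    scores = scores.masked_fill(adj == 0, float("-inf"))
    return torch.softmax(scores, dim=1)
```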

The output of the HGNN, $\tilde{X}_{t}^{\mathcal{H}}$, is then used as input to the HGRU along with the attentive hidden state $\tilde{H}_{t-1}^{\mathcal{H}}$ obtained by HTA, which generalizes $H_{t-1}$ to the latest $w$ snapshots $\{H_{t-w},\dots,H_{t-1}\}$ [39]. The operations behind the HGRU are characterized by the following equations [39]:

$X_{t}^{E}=\log_{\mathbf{x}'}^{c}(\tilde{X}_{t}^{\mathcal{H}})$,   (5a)
$H_{t-1}^{E}=\log_{\mathbf{x}'}^{c}(\tilde{H}_{t-1}^{\mathcal{H}})$,   (5b)
$P_{t}^{E}=\sigma(W_{z}X_{t}^{E}+U_{z}H_{t-1}^{E})$,   (5c)
$R_{t}^{E}=\sigma(W_{r}X_{t}^{E}+U_{r}H_{t-1}^{E})$,   (5d)
$\tilde{H}_{t}^{E}=\tanh(W_{h}X_{t}^{E}+U_{h}(R_{t}^{E}\odot H_{t-1}^{E}))$,   (5e)
$H_{t}^{E}=(1-P_{t}^{E})\odot\tilde{H}_{t}^{E}+P_{t}^{E}\odot H_{t-1}^{E}$,   (5f)
$H_{t}^{\mathcal{H}}=\exp_{\mathbf{x}'}^{c}(H_{t}^{E})$.   (5g)

where $W_{z},W_{r},W_{h},U_{z},U_{r},U_{h}$ are the trainable weight matrices, $P_{t}^{E}$ is the update gate that controls the output, and $R_{t}^{E}$ is the reset gate that balances the input and memory [39].
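
In other words, Equations (5a)–(5g) map the hyperbolic states to the tangent space, apply a standard GRU update, and map the result back. Below is a minimal sketch under the simplifying assumption that all maps are taken at the origin of the Poincaré ball (using the standard closed forms for $\exp_{0}^{c}$ and $\log_{0}^{c}$); it is illustrative rather than the HTGN reference implementation, and `nn.GRUCell` stands in for the explicit gates $W_{z},W_{r},W_{h},U_{z},U_{r},U_{h}$.

```python
import torch
import torch.nn as nn

def exp0(v, c):
    """Exponential map at the origin of the Poincaré ball with curvature -c."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def log0(y, c):
    """Logarithmic map at the origin (inverse of exp0)."""
    sqrt_c = c ** 0.5
    norm = y.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return torch.atanh((sqrt_c * norm).clamp(max=1 - 1e-5)) * y / (sqrt_c * norm)

class TangentHGRU(nn.Module):
    """GRU cell applied in tangent space, mirroring Eqs. (5a)-(5g)."""

    def __init__(self, dim, c=1.0):
        super().__init__()
        self.c = c
        self.cell = nn.GRUCell(dim, dim)  # bundles the W_* and U_* gate matrices

    def forward(self, x_hyp, h_prev_hyp):
        x_e = log0(x_hyp, self.c)         # Eq. (5a)
        h_e = log0(h_prev_hyp, self.c)    # Eq. (5b)
        h_new = self.cell(x_e, h_e)       # Eqs. (5c)-(5f)
        return exp0(h_new, self.c)        # Eq. (5g)
```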

Appendix F Additional Results

Here we present the test results for the six foundation models trained on different numbers of pre-training networks, as well as the results of the single model and the persistence forecast. Figure 6 illustrates the AUC of these models on the test set. On most datasets, the foundation models outperform the single model, and on all datasets they outperform the persistence forecast.

Table 3 presents the average and standard deviation of AUC for all models. FM-64 achieves the highest performance on six datasets and second-best on five, while FM-32 achieves the highest performance on three datasets and second-best on eight. These results show the capability of foundation models to perform downstream tasks on unseen datasets.

[Figure 6: AUC of the foundation models, the single model, and the persistence forecast on the twenty test token networks.]

Token | Per. Fore. | Single Model | FM-2 | FM-4 | FM-8 | FM-16 | FM-32 | FM-64
WOJAK | 0.378 | 0.479 ± 0.005 | 0.766 ± 0.007 | 0.769 ± 0.001 | 0.807 ± 0.004 | 0.794 ± 0.010 | 0.777 ± 0.008 | 0.774 ± 0.022
DOGE2.0 | 0.250 | 0.590 ± 0.059 | 0.763 ± 0.077 | 0.734 ± 0.025 | 0.759 ± 0.022 | 0.796 ± 0.036 | 0.808 ± 0.032 | 0.823 ± 0.011
EVERMOON | 0.241 | 0.512 ± 0.023 | 0.598 ± 0.028 | 0.645 ± 0.019 | 0.683 ± 0.009 | 0.666 ± 0.006 | 0.662 ± 0.032 | 0.671 ± 0.020
QOM | 0.334 | 0.633 ± 0.017 | 0.665 ± 0.066 | 0.691 ± 0.041 | 0.714 ± 0.038 | 0.725 ± 0.013 | 0.751 ± 0.024 | 0.755 ± 0.023
SDEX | 0.423 | 0.762 ± 0.034 | 0.712 ± 0.106 | 0.745 ± 0.046 | 0.790 ± 0.051 | 0.758 ± 0.081 | 0.839 ± 0.062 | 0.883 ± 0.016
ETH2x-FLI | 0.355 | 0.610 ± 0.059 | 0.582 ± 0.092 | 0.598 ± 0.013 | 0.661 ± 0.025 | 0.715 ± 0.020 | 0.710 ± 0.015 | 0.721 ± 0.006
BEPRO | 0.393 | 0.655 ± 0.038 | 0.668 ± 0.016 | 0.696 ± 0.010 | 0.716 ± 0.002 | 0.731 ± 0.024 | 0.735 ± 0.009 | 0.750 ± 0.014
XCN | 0.592 | 0.668 ± 0.099 | 0.761 ± 0.017 | 0.737 ± 0.042 | 0.733 ± 0.024 | 0.769 ± 0.021 | 0.770 ± 0.024 | 0.763 ± 0.038
BAG | 0.792 | 0.673 ± 0.227 | 0.719 ± 0.072 | 0.751 ± 0.060 | 0.781 ± 0.056 | 0.779 ± 0.019 | 0.799 ± 0.022 | 0.750 ± 0.045
TRAC | 0.400 | 0.712 ± 0.071 | 0.743 ± 0.029 | 0.761 ± 0.007 | 0.774 ± 0.010 | 0.786 ± 0.009 | 0.781 ± 0.002 | 0.779 ± 0.013
DERC | 0.353 | 0.683 ± 0.013 | 0.659 ± 0.013 | 0.675 ± 0.016 | 0.688 ± 0.011 | 0.732 ± 0.027 | 0.716 ± 0.029 | 0.739 ± 0.030
Metis | 0.423 | 0.715 ± 0.122 | 0.713 ± 0.043 | 0.727 ± 0.010 | 0.713 ± 0.034 | 0.734 ± 0.003 | 0.744 ± 0.008 | 0.743 ± 0.005
REPv2 | 0.321 | 0.760 ± 0.012 | 0.728 ± 0.017 | 0.756 ± 0.007 | 0.751 ± 0.011 | 0.785 ± 0.014 | 0.782 ± 0.016 | 0.780 ± 0.012
DINO | 0.431 | 0.730 ± 0.195 | 0.654 ± 0.023 | 0.751 ± 0.012 | 0.760 ± 0.015 | 0.749 ± 0.036 | 0.748 ± 0.010 | 0.728 ± 0.005
HOICHI | 0.374 | 0.808 ± 0.047 | 0.739 ± 0.083 | 0.793 ± 0.024 | 0.788 ± 0.008 | 0.794 ± 0.018 | 0.787 ± 0.035 | 0.804 ± 0.011
MUTE | 0.536 | 0.649 ± 0.015 | 0.580 ± 0.015 | 0.600 ± 0.018 | 0.593 ± 0.007 | 0.620 ± 0.017 | 0.622 ± 0.005 | 0.635 ± 0.014
GLM | 0.427 | 0.830 ± 0.029 | 0.653 ± 0.080 | 0.724 ± 0.025 | 0.749 ± 0.045 | 0.798 ± 0.038 | 0.823 ± 0.027 | 0.807 ± 0.036
MIR | 0.327 | 0.750 ± 0.005 | 0.552 ± 0.069 | 0.568 ± 0.015 | 0.652 ± 0.039 | 0.715 ± 0.018 | 0.711 ± 0.007 | 0.725 ± 0.016
stkAAVE | 0.426 | 0.702 ± 0.042 | 0.626 ± 0.029 | 0.597 ± 0.020 | 0.637 ± 0.028 | 0.658 ± 0.022 | 0.685 ± 0.016 | 0.667 ± 0.024
ADX | 0.362 | 0.769 ± 0.018 | 0.702 ± 0.011 | 0.701 ± 0.003 | 0.685 ± 0.042 | 0.701 ± 0.009 | 0.700 ± 0.004 | 0.696 ± 0.012

Appendix G Computing Resources

For graph classification tasks on the foundation models, we ran all experiments on an NVIDIA Quadro RTX 8000 GPU (48 GB memory) with 4 standard CPU nodes (either Milan Zen 3, 2.8 GHz, with 768 GB of memory each, or Rome Zen 2, 2.5 GHz, with 256 GB of memory each). We repeated each experiment three times and report the average and standard deviation over the runs. In Appendix Figure 7, we report the time per epoch for each foundation model.

[Figure 7: Time per epoch for each foundation model.]
