Towards Neural Scaling Laws for Foundation Models on Temporal Graphs (2024)

Razieh Shirzadkhani1* Tran Gia Bao Ngo2* Kiarash Shamsi2
Shenyang Huang1,3 Farimah Poursafaei1,3 Poupak Azad2 Reihaneh Rabbany1,3,7
Baris Coskunuzer4 Guillaume Rabusseau1,5,6 Cuneyt Gurcan Akcora8
1Mila - Quebec AI Institute, 2Department of Computer Science, University of Manitoba,
3School of Computer Science, McGill University, 4University of Texas at Dallas,
6DIRO, Université de Montréal, 7CIFAR AI Chair,
8AI Initiative - University of Central Florida
*Equal contribution

Abstract

The field of temporal graph learning aims to learn from evolving network data to forecast future interactions. Given a collection of observed temporal graphs, is it possible to predict the evolution of an unseen network from the same domain? To answer this question, we first present the Temporal Graph Scaling (TGS) dataset, a large collection of temporal graphs consisting of eighty-four ERC20 token transaction networks collected from 2017 to 2023. Next, we evaluate the transferability of Temporal Graph Neural Networks (TGNNs) for the temporal graph property prediction task by pre-training on a collection of up to sixty-four token transaction networks and then evaluating the downstream performance on twenty unseen token networks. We find that the neural scaling law observed in NLP and computer vision also applies to temporal graph learning: pre-training on a greater number of networks leads to improved downstream performance. To the best of our knowledge, this is the first empirical demonstration of transferability in temporal graph learning. On downstream token networks, the largest pre-trained model outperforms single-model TGNNs on thirteen unseen test networks. Therefore, we believe that this is a promising first step towards building foundation models for temporal graphs.

1 Introduction

Many real-world relations can be modeled as temporal graphs, where nodes represent entities and edges represent interactions between entities that evolve over time. Examples include social networks [1, 2, 3], financial transaction networks [4], contact networks [5], and biological systems [6].

Recently, foundation models have revolutionized various fields such as natural language processing (NLP) [7, 8, 9] and computer vision (CV) [10, 11] by providing robust pre-trained architectures that can be transferred to a multitude of tasks. Foundation Models (FMs) aim to learn from large amounts of pre-training data and transfer the knowledge to unseen downstream tasks. These models have been recognized for their remarkable transfer capabilities and promising efficacy with few-shot and zero-shot learning on novel datasets and tasks [12, 13, 9].

Despite these advances in NLP and CV, foundation models for graph representation learning remain relatively unexplored. There has been some notable work on foundation models for graph neural networks (GNNs) that demonstrates their potential [14, 15, 16, 17]. However, the majority of research to date has focused on static graph learning, leaving the exploration of temporal graph neural networks largely untapped.

[Figure 1: Scaling behavior of the foundation models; performance on twenty unseen token networks improves as the number of training networks increases.]

Furthermore, to effectively train foundation models, a large collection of datasets is essential. Networks within the same domain often exhibit similar trends and statistics [18]. These datasets are crucial for assessing the performance of TGNNs, driving innovation, and ensuring that new methods generalize across various applications. To facilitate research on foundation models for temporal graphs, we introduce the Temporal Graph Scaling (TGS) benchmark, a comprehensive dataset containing 84 novel temporal graphs derived from Ethereum transaction networks. TGS offers temporal networks with up to 128k nodes and 0.5 million edges each, totaling 3 million nodes and 19 million edges across all networks, with diverse durations and evolving recent activity, which enables the training of foundation models for temporal graph learning. In addition, we train the first foundation model on temporal graphs and demonstrate that training on a large number of temporal graphs results in surprisingly strong downstream performance. Figure 1 shows the scaling behavior of our foundation model: its performance on twenty unseen token networks increases as the number of training networks increases. Notably, without fine-tuning on the test networks, the FMs achieve significant performance advantages over models trained on individual test networks. This demonstrates the strong transfer potential of foundation models on temporal graphs.

Our main contributions can be summarized as follows:

  • Novel Collection of Temporal Networks. We release a comprehensive collection of 84 datasets derived from token transaction networks with labels for the graph property prediction task. These datasets provide the foundation for studying scaling behavior, transferability, and multi-network learning on temporal graphs.

  • Neural Scaling Law on TGNNs. We explore the potential of foundation models on temporal graphs by showing that the neural scaling law also applies to temporal graphs: training TGNNs with more temporal graphs (up to 64) offers a significant performance boost on downstream test networks.

  • Transferability Across Networks. We demonstrate that by pre-training on a large number of temporal graphs, our foundation model transfers directly to 20 downstream unseen token networks while outperforming single models trained on the test networks. This shows that it is possible to learn an overall distribution across temporal graphs and transfer it to novel networks.

Reproducibility. Our code is available on GitHub and the TGS datasets are publicly available at https://zenodo.org/doi/10.5281/zenodo.11455827. The TGS website provides detailed documentation.

2 Related Work

Temporal Graph Benchmarks. Numerous graph benchmark datasets have been introduced to advance research within the temporal graph learning community. Poursafaei et al. [19] introduced six dynamic graph datasets while proposing visualization techniques and novel negative edge sampling strategies to facilitate link prediction on dynamic graphs. Following the good practice of OGB [20], Huang et al. introduced TGB [21], which provides automated and reproducible results with a novel standardized evaluation pipeline for both link and node property prediction tasks. However, these datasets belong to different domains, making them unsuitable for studying the scaling laws of neural network models trained on a large number of datasets from the same domain. Li et al. [22] provide a temporal benchmark for evaluating graph neural networks on link prediction tasks, though their focus does not extend to multiple networks. Conversely, the Live Graph Lab dataset by Zhang et al. [23] offers a temporal dataset and benchmark employed for tasks such as temporal node classification using TGNNs. In this work, we aim to explore multi-network training and to understand transferability across temporal graphs; thus, we curate a collection of temporal graphs rather than the individual networks of prior work.

Discrete Time Dynamic Graphs. A common approach in discrete time models treats each snapshot individually to capture spatial characteristics and then adopts an RNN-based method to learn temporal dependencies [24, 25, 26, 27, 28]. GCRN stacks a graph CNN for feature extraction and an LSTM cell for temporal reasoning [24]. Differing from GCRN, EvolveGCN [3] uses an RNN to control the parameters of a GCN at each snapshot. Employing two attention blocks, DySAT first generates static node embeddings at each snapshot by running a GAT-style GNN and then computes new embeddings using a temporal self-attention block [25]. Most recently, GraphPulse [29] leverages Mapper, a key tool in topological data analysis, to extract essential information from temporal graphs. However, in all previous studies, the training process of every model was limited to a single dataset, and the effectiveness of training TGNNs with diverse networks to enhance their generalization capabilities remains unexplored.

Neural Scaling Laws. Neural scaling laws [30, 31, 32] characterize the relationship between model performance and three main factors: the number of parameters, the size of the training dataset, and the amount of computation. This relationship is usually described as a power law, which can be understood by viewing learning as movement on a smooth data manifold [33]. Bahri et al. exhibited all four scaling regimes with respect to the number of model parameters as well as the dataset size, underscoring the different mechanisms driving improvements in loss [33]. Aghajanyan et al. [34] provided valuable insights into the design and training of mixed-modal generative models by studying mixed-modal scaling laws, indicating the generality of scaling laws across different domains and applications. Recently, Liu et al. [35] investigated neural scaling laws for static graphs by observing the performance of GNNs as the model size (number of layers and parameters) and training set size (number of edges) increase. To the best of our knowledge, we are the first to investigate neural scaling laws for temporal graphs.

Foundation Models. Foundation models are an emerging paradigm that aims to develop models capable of generalizing across different domains and tasks using the knowledge obtained from massive data during the pre-training stage. Recently, Rasul et al. introduced Lag-Llama [9], a general-purpose foundation model for univariate probabilistic time series forecasting based on a simple decoder-only transformer architecture that uses lags as covariates. Galkin et al. introduced ULTRA, a foundation model for knowledge graphs, which handles complex relational data and supports diverse downstream tasks effectively [36]. Similarly, Beaini et al. presented Graphium, a collection of molecular graph datasets that facilitates the development of foundation models for molecular applications, highlighting the importance of domain-specific datasets in enhancing the performance and generalizability of foundation models [16]. Lastly, Xia et al. proposed OpenGraph, an initiative towards open foundation models for graphs, emphasizing the need for transparency, reproducibility, and community-driven advancements in graph representation learning [37]. These works underscore the growing recognition of the importance of foundation models and their transformative potential across domains such as molecular graphs. However, foundation models for temporal graphs remain unexplored.

3 Preliminaries

Temporal graphs are generally categorized into two types: Continuous Time Dynamic Graphs (CTDGs) and Discrete Time Dynamic Graphs (DTDGs) [38]. We focus on DTDGs because this approach aligns well with our objective of capturing and analyzing a graph's dynamics at specific time intervals, such as on a weekly basis. In DTDGs, the graph's temporal evolution is represented in discrete time steps, simplifying the analysis and modeling of large collections of temporal networks. Each time step provides a snapshot of the graph at a specific moment, facilitating straightforward comparisons and the identification of temporal patterns.

Definition 1 (Discrete Time Dynamic Graphs).

Formally, a DTDG represents the network as a sequence of graph snapshots $\mathcal{G}=\{\mathcal{G}_{t_{1}},\mathcal{G}_{t_{2}},\mathcal{G}_{t_{3}},\ldots,\mathcal{G}_{t_{n}}\}$ where $t_{i}<t_{j}$ for $i<j$. Each $\mathcal{G}_{t_{i}}=(\mathcal{V}_{t_{i}},\mathcal{E}_{t_{i}},\mathbf{X}_{t_{i}},\mathbf{Y}_{t_{i}})$ is the graph at timestamp $t_{i}$, where $\mathcal{V}_{t_{i}}$ and $\mathcal{E}_{t_{i}}$ represent the sets of nodes and edges, $\mathbf{X}_{t_{i}}$ denotes the node feature matrix, and $\mathbf{Y}_{t_{i}}$ represents the edge feature matrix of graph $\mathcal{G}_{t_{i}}$. Therefore, a collection of discrete time dynamic graphs is defined as $D=\{\mathcal{G}^{1},\mathcal{G}^{2},\ldots,\mathcal{G}^{m}\}$, where $m$ is the number of DTDGs.
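To make the notation concrete, the following minimal Python sketch shows one way a DTDG and a collection of DTDGs could be represented in code; the class and field names are illustrative and not part of the released TGS tooling.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Snapshot:
    """One snapshot G_t = (V_t, E_t, X_t, Y_t) of a discrete time dynamic graph."""
    nodes: np.ndarray       # node ids present at time t
    edges: np.ndarray       # shape (|E_t|, 2): directed (src, dst) pairs
    node_feats: np.ndarray  # X_t, shape (|V_t|, d_node)
    edge_feats: np.ndarray  # Y_t, shape (|E_t|, d_edge), e.g. transfer amounts

@dataclass
class DTDG:
    """A DTDG is an ordered sequence of snapshots G_{t_1}, ..., G_{t_n}."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def __len__(self) -> int:
        return len(self.snapshots)

# A collection D = {G^1, ..., G^m} is then simply a list of DTDGs:
# collection: List[DTDG] = [dtdg_1, ..., dtdg_m]
```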

Temporal Graph Property Prediction. For the task of temporal graph property prediction, we aim to forecast a temporal graph property within a future time interval of a DTDG. More specifically, given a DTDG $\mathcal{G}$, we consider a time duration $[t_{\delta_{1}},t_{\delta_{2}}]$, where $\delta_{1}$ and $\delta_{2}$ are non-negative integers with $\delta_{1}\leq\delta_{2}$. Then, at a specific time $t_{k}$, the goal is to predict the target graph property within the specified future interval $[t_{k+\delta_{1}},t_{k+\delta_{2}}]$. As for the graph property, characteristics such as temporal global efficiency, the temporal-correlation coefficient, and temporal betweenness centrality can also be explored in the future.

Hyperbolic Graph Neural Networks. Hyperbolic geometry has been increasingly recognized for its ability to achieve state-of-the-art performance in several static graph embedding tasks [39]. HTGN is a recent hyperbolic model that shows strong performance in learning over dynamic graphs in a DTDG manner. The model employs a hyperbolic graph neural network (HGNN) to learn the topological dependencies of the nodes and a hyperbolic gated recurrent unit (HGRU) to capture the temporal dependencies. Given feature vectors $X^{E}_{t}$ of snapshot $t$ in Euclidean space, an HGNN layer first applies an exponential map to project the Euclidean vectors into hyperbolic space, $X^{\mathcal{H}}_{t}=\exp^{c}(X^{E}_{t})$, and then performs aggregation and activation as in a GNN but in a hyperbolic manner, $\tilde{X}_{t}^{\mathcal{H}}=\mathbf{HGNN}(X_{t}^{\mathcal{H}})$. To prevent the recurrent unit from emphasizing only the most recent time steps and to ensure stability and generalization of the embeddings, HTGN uses temporal contextual attention (HTA) to generalize the latest $w$ hidden states, $\tilde{H}_{t-1}^{\mathcal{H}}=\mathbf{HTA}(H_{t-w};\ldots;H_{t-1})$ [39]. HGRU takes the output of the HGNN, $\tilde{X}_{t}^{\mathcal{H}}$, and the attentive hidden state from HTA, $\tilde{H}_{t-1}^{\mathcal{H}}$, as input to update its gates and memory cells, and outputs the latest hidden state, $H_{t}^{\mathcal{H}}=\mathbf{HGRU}(\tilde{X}_{t}^{\mathcal{H}},\tilde{H}_{t-1}^{\mathcal{H}})$.
In addition, HTGN enables updating the model's state at test time to incorporate new information, which makes it a good candidate for studying the scaling law of TGNNs. We further describe HTGN in Appendix Section E.
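The composition described above can be summarized with a highly simplified PyTorch-style sketch. The hyperbolic machinery (exponential/logarithmic maps, Möbius operations) is replaced with Euclidean placeholders, and the HGNN, HTA, and HGRU blocks are stand-ins, so this illustrates only the data flow, not the actual HTGN implementation.

```python
import torch
import torch.nn as nn
from typing import List

class HTGNStyleEncoder(nn.Module):
    """Schematic HGNN -> HTA -> HGRU data flow; all hyperbolic operations are
    approximated by Euclidean placeholders for illustration only."""

    def __init__(self, dim: int, window: int):
        super().__init__()
        self.hgnn = nn.Linear(dim, dim)          # stand-in for hyperbolic GNN aggregation
        self.hta = nn.Linear(window * dim, dim)  # stand-in for temporal contextual attention
        self.hgru = nn.GRUCell(dim, dim)         # stand-in for the hyperbolic GRU
        self.window = window

    def forward(self, x_t: torch.Tensor, history: List[torch.Tensor]) -> torch.Tensor:
        # x_t: Euclidean features of snapshot t, shape (num_nodes, dim);
        # history: at least `window` past hidden states, each (num_nodes, dim).
        x_h = torch.tanh(x_t)                     # placeholder for the exponential map
        x_tilde = torch.relu(self.hgnn(x_h))      # structural encoding of snapshot t
        h_context = torch.cat(history[-self.window:], dim=-1)
        h_tilde = self.hta(h_context)             # summarizes the last w hidden states
        return self.hgru(x_tilde, h_tilde)        # temporal update -> new hidden state H_t
```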

4 Dataset

We utilize a dataset of temporal graphs sourced from the Ethereum blockchain [40], one of the leading blockchain platforms. In this section, we describe Ethereum, explain our data pipeline, and conclude with the defining characteristics of the resulting dataset.

Ethereum and ERC20 Token Networks. Blockchain [41] is a decentralized and secure database technology composed of blocks of transactions that can be verified and confirmed without the need for a central authority. Ethereum is one of the most popular blockchains, designed to store and execute complex structures such as software code, known as smart contracts. A smart contract is a computerized transaction protocol that executes the terms of a pre-defined agreement [42]. Typically implemented on the Ethereum blockchain, smart contracts ensure that the terms of the contract are automatically enforced and executed when certain conditions are met [43]. These contracts have their own account addresses, which can be called to perform actions such as buying or selling digital tokens [43]. As contracts proliferated, code standards [44], such as ERC20, were created to define required functions (e.g., transfer()) for sales of assets, which are called tokens. The most widely used standard, ERC20, defines asset networks over fungible tokens, which form our dataset. Fungible tokens are interchangeable and uniform; each token is identical in value and functionality to another token of the same type, similar to how one unit of currency is equivalent to another unit of the same currency.

Block to Graph Data. We create our transaction network data by first installing an Ethereum node and accessing the P2P network using the Ethereum client Geth (https://github.com/ethereum/go-ethereum). Then, we use Ethereum-ETL (https://github.com/blockchain-etl/ethereum-etl) to parse all ERC20 tokens and extract asset transactions. We extracted more than sixty thousand ERC20 tokens from the entire history of the Ethereum blockchain. However, during the lifespans of most token networks, there are interim periods without any transactions, and a significant number of tokens live for only a short time span. To avoid training data quality issues, we use 84 token networks that have at least one transaction every day during their lifespan and are large enough to serve as a benchmark for foundation model training.
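As an illustration of this filtering step, the sketch below groups extracted ERC20 transfers by token contract and keeps only tokens with at least one transaction on every day of their lifespan. The file name and column names are hypothetical; the actual schema produced by our extraction pipeline may differ.

```python
import pandas as pd

# Hypothetical schema for the extracted ERC20 transfer records.
transfers = pd.read_csv(
    "token_transfers.csv",
    usecols=["token_address", "from_address", "to_address", "value", "block_timestamp"],
    parse_dates=["block_timestamp"],
)

networks = {}
for token, df in transfers.groupby("token_address"):
    days = df["block_timestamp"].dt.normalize()
    lifespan = pd.date_range(days.min(), days.max(), freq="D")
    if days.nunique() == len(lifespan):       # at least one transaction every day
        edges = df[["from_address", "to_address", "value"]].copy()
        edges["day"] = days.values            # one timestamped edge per transfer
        networks[token] = edges               # temporal edge list of this token network
```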

[Figure 2: Overview of the TGS pipeline, from dataset extraction to foundation model training.]

Temporal Networks. Each token network is inherently a temporal graph due to the time-stamped nature of transactions. In these graphs, nodes (addresses), edges (transactions), and edge weights (transaction values) change over time, reflecting the dynamic evolution of the network. This temporal aspect allows patterns, trends, and anomalies in transaction behavior to be captured effectively. Collecting a group of temporal graphs from different ERC20 token networks offers significant advantages, enabling comparative analysis to identify common patterns and unique behaviors across tokens, which enhances the robustness and generalizability of models trained on the data. Additionally, different tokens often share addresses, i.e., unique account identifiers, across networks, as the same investors participate in multiple token networks. These similarities across token networks facilitate transfer learning for various temporal tasks, enabling models to be adapted from one network to another and enhancing our understanding of the ecosystem. Figure 2 illustrates the TGS pipeline from dataset extraction to the foundation model training step.

Ethics and Privacy. All Ethereum transaction data is publicly available to users who have the necessary resources, such as fast SSDs, large RAM, and ample disk space, to synchronize Ethereum clients and manually extract blocks. Additionally, all Ethereum data is accessible on numerous Ethereum explorer sites such as etherscan.io. An Ethereum user’s privacy depends on whether personally identifiable information (PII) is associated with any of their blockchain addresses, which serve as account handles and are considered pseudonymous. If such PII were obtained from other sources, our datasets could potentially be used to link Ethereum addresses. However, real-life identities can only be discovered using IP tracking information, which we neither have nor share. Our data does not contain any PII. Furthermore, we have developed a request form to exclude an address from the dataset.

[Figure 3: Distributions of nodes, edges, timestamps, and novelty scores across the 84 TGS token networks.]

Dataset Statistics. Our TGS dataset is a collection of 84 ERC20 token networks derived from Ethereum from 2017 to 2023. Each token network is represented as a dynamic graph in which each address is a node and each transaction between addresses is a directed edge. The largest TGS token network contains 128,159 unique addresses and 554,705 transactions, while the smallest token network has 1,454 nodes. TGS contains a diverse set of dynamic graphs in terms of nodes, edges, and timestamps, as shown in Figure 3; detailed statistics are given in Appendix D. Most networks have more than 10k nodes and over 100k edges. The lifespan of TGS networks varies from 107 days to 6 years, and each network has at least one transaction every day. Figure 3.a shows the novelty scores, i.e., the average ratio of unseen edges in each timestamp, introduced by Poursafaei et al. [19]. Most of the 84 networks have novelty scores greater than 0.3, indicating that each day sees a considerable proportion of new edges in these token networks. We adopt a 70-15-15 train-validation-test split for each token network and calculate the surprise score [19], which measures the fraction of edges that appear only in the test data. As Appendix Table 2 shows, the token networks have quite high surprise values, with an average of 0.82. We also provide the node, edge, and length distributions for the train and test sets separately in Appendix Figure 5. Overall, the training networks mostly have more nodes than those in the test set, while the numbers of edges and days are in the same range for both. A more detailed overview of the characteristics of the TGS datasets is presented in Appendix D.

5 Methodology

In this work, we use Temporal Graph Neural Networks (TGNNs) as the foundation model architecture. We choose the state-of-the-art Hyperbolic Temporal Graph Network (HTGN) [39] as an example architecture for our experiments. This section explains our choice and details our training algorithm on multiple networks.

5.1 Multi-network Training on Temporal Graphs

Existing temporal graph learning models typically train on a single temporal graph, limiting their ability to capture similar behaviors and generalize across different networks [1, 39]. We introduce TGS-train, the first algorithm designed to train across multiple temporal graphs, built by modifying a state-of-the-art single-network training model with two crucial steps: shuffling and resets. These steps, described below, render the algorithm network-agnostic and capable of learning from various temporal graphs to generalize effectively to unseen networks.

Algorithm 1 shows TGS-train in detail. As the first step, we load a list of $m$ temporal graphs $D=\{\mathcal{G}^{1},\mathcal{G}^{2},\ldots,\mathcal{G}^{m}\}$, where each temporal graph $\mathcal{G}^{i}$ is represented as a sequence of snapshots $\{\mathcal{G}^{i}_{t_{1}},\mathcal{G}^{i}_{t_{2}},\ldots,\mathcal{G}^{i}_{t_{n}}\}$. For each epoch, we shuffle the order of the list of datasets $D$ to preserve the Independent and Identically Distributed (IID) assumption of neural network training.

IID training. To preserve the IID assumption in neural network training, we include a shuffling step at each epoch. The randomized ordering of networks during training at each epoch is important because it helps prevent the model from learning spurious correlations that could arise if the data were presented in a fixed order. By shuffling the datasets, we promote randomness in the training process, which contributes to more robust and generalizable model performance.

Sequentially, for each dataset $\mathcal{G}^{i}$, we first initialize the historical embeddings, then train the complete model (i.e., encoder-decoder) on $\mathcal{G}^{i}$ in the same manner as training a single model, and evaluate the performance on the corresponding validation set of $\mathcal{G}^{i}$. After training on the $m$ datasets in $D$, we compute the average validation results across these datasets. This average is used to select the best model, which is then saved for inference. Early stopping is applied if needed.

Context switching. Many TGNNs store and utilize node embeddings from previous timestamps at later timestamps; we refer to those embeddings as historical embeddings [39, 26, 3]. Resetting historical embeddings before training on each network is a key step in training a temporal model across multiple networks for several reasons. First, it helps prevent the model from carrying over biases or assumptions from one network to another, ensuring that it can adapt effectively to the unique characteristics of each network. Starting with fresh historical embeddings enables the model to learn the most relevant and up-to-date information from the current network, leading to improved performance and generalization across different networks. Additionally, resetting historical embeddings helps mitigate catastrophic forgetting, where the model may gradually lose information about previous networks as it learns new ones.

Algorithm 1: TGS-train

Input: A temporal graph dataset $D=\{\mathcal{G}^{1},\mathcal{G}^{2},\ldots,\mathcal{G}^{m}\}$, where $\mathcal{G}^{i}=\{\mathcal{G}_{t_{1}}^{i},\mathcal{G}_{t_{2}}^{i},\ldots,\mathcal{G}_{t_{n}}^{i}\}$; $m$ = number of networks in training; $\mathbf{TGNN}$ and $\mathbf{Decoder}$

foreach epoch do
    Shuffle($D$)   // IID training
    foreach network $\mathcal{G}^{i}\in D$ do
        Initialize historical embeddings (reset)   // context switching
        foreach training snapshot $\mathcal{G}^{i}_{t_{j}}\in\mathcal{G}^{i}$ do
            $\mathcal{H}_{t_{j}}=\mathbf{TGNN}(\mathcal{G}^{i}_{t_{j}})$
            $\hat{y}_{t_{j}}=\mathbf{Decoder}(\mathcal{H}_{t_{j}})$
            $\mathcal{L}=\mathbf{Loss}(y_{t_{j}},\hat{y}_{t_{j}})$
            Backpropagation
            Update historical embeddings with $\mathcal{H}_{t_{j}}$
        Evaluate on the validation snapshots of $\mathcal{G}^{i}$
    Average validation results across all datasets to select the best model
Save the best model for inference
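A rough Python rendering of Algorithm 1 is given below. The encoder/decoder interfaces (reset_history, update_history, the snapshot iterators, and the evaluation helper) are illustrative placeholders rather than the exact implementation used in our experiments.

```python
import random

def tgs_train(datasets, tgnn, decoder, loss_fn, optimizer, evaluate_fn, epochs):
    """Sketch of TGS-train: multi-network training with shuffling and resets."""
    best_score, best_state = float("-inf"), None
    for epoch in range(epochs):
        random.shuffle(datasets)                       # IID training
        val_scores = []
        for graph in datasets:
            tgnn.reset_history()                       # context switching: fresh embeddings
            for snapshot, label in graph.train_snapshots():
                h = tgnn(snapshot)                     # node embeddings H_t
                y_hat = decoder(h)                     # graph-level prediction
                loss = loss_fn(y_hat, label)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                tgnn.update_history(h.detach())        # store embeddings for later snapshots
            val_scores.append(evaluate_fn(tgnn, decoder, graph.val_snapshots()))
        avg_val = sum(val_scores) / len(val_scores)    # average validation score across datasets
        if avg_val > best_score:                       # keep the best multi-network model
            best_score = avg_val
            best_state = {"tgnn": tgnn.state_dict(), "decoder": decoder.state_dict()}
    return best_state
```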

Inference on an unseen network. To evaluate the transferability of each foundation model, we test it on unseen datasets. We begin by loading all the weights of the foundation model, including the pre-trained encoder and decoder parameters, while initializing fresh historical embeddings. Then, we perform a single forward pass over the train and validation splits to adapt the historical embeddings to the test dataset.
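Continuing the sketch above, zero-shot inference on an unseen network could look as follows: pre-trained weights are loaded, historical embeddings are re-initialized, and a single forward pass over the train and validation snapshots adapts the embeddings before predicting on the test snapshots. Interfaces are again illustrative.

```python
import torch

def tgs_inference(checkpoint, tgnn, decoder, test_graph):
    """Sketch of zero-shot inference of a pre-trained foundation model on an unseen network."""
    tgnn.load_state_dict(checkpoint["tgnn"])
    decoder.load_state_dict(checkpoint["decoder"])
    tgnn.reset_history()                               # fresh historical embeddings

    with torch.no_grad():
        # Single forward pass over train + validation snapshots to adapt the embeddings.
        for snapshot, _ in list(test_graph.train_snapshots()) + list(test_graph.val_snapshots()):
            tgnn.update_history(tgnn(snapshot))
        # Predict the graph property on the unseen test snapshots.
        preds = [decoder(tgnn(snapshot)) for snapshot, _ in test_graph.test_snapshots()]
    return preds
```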

6 Experiments

Weekly forecasts are common in the financial context for facilitating financial decisions [45]. Similarly, for the temporal graph property prediction task (defined in Section 3), we set $\delta_{1}=3$ and $\delta_{2}=10$, thus predicting the graph property over weekly snapshots. For the experiments, we use network growth [28] in terms of edge count as the predicted graph property. See Appendix C for the dataset documentation, hosting, and maintenance plan.
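A sketch of how the binary growth labels could be constructed from per-snapshot edge counts is shown below. The reference window used to decide "growth" here (future window vs. the most recent window of equal length) is one plausible reading of the task; the released labels may be computed slightly differently.

```python
import numpy as np

def growth_labels(edge_counts: np.ndarray, d1: int = 3, d2: int = 10) -> np.ndarray:
    """Binary labels for edge-count growth over the future window [k+d1, k+d2]."""
    w = d2 - d1 + 1                                    # window length in snapshots
    labels = []
    for k in range(w - 1, len(edge_counts) - d2):
        future = edge_counts[k + d1 : k + d2 + 1].sum()
        recent = edge_counts[k - w + 1 : k + 1].sum()  # most recent window of equal length
        labels.append(int(future > recent))
    return np.array(labels)
```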

6.1 Prediction Baselines

Persistence forecast model. For our basic baseline model, we employ a naive deterministic heuristic, the persistence forecast [46], for label generation. In this approach, we use data from the previous and current weeks to predict the next week's property: if we observe an increasing trend in the number of transactions in the current week compared to the previous week, we predict a similar increasing trend for the following week. This simple model is based on the assumption that trends in transaction networks persist over time.
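A minimal version of this baseline is sketched below, assuming one edge count per weekly snapshot.

```python
import numpy as np

def persistence_forecast(weekly_edge_counts: np.ndarray) -> np.ndarray:
    """Persistence forecast: predict that the current week's trend continues next week."""
    grew = (weekly_edge_counts[1:] > weekly_edge_counts[:-1]).astype(int)
    # The prediction for week w+1 is the trend observed at week w,
    # so grew[:-1] aligns with the ground-truth labels grew[1:].
    return grew[:-1]
```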

Single model. We adopt the standard training process for HTGN [39] over a single dataset and make predictions for the same dataset. In each epoch, the model processes all snapshots in chronological order, with the node embeddings reset at the end of every epoch. To address graph-level tasks, we add an extra graph pooling layer as the final layer. This layer, a Multi-Layer Perceptron (MLP), takes the mean of all node embeddings, concatenates it with four graph-level snapshot features (mean in-degree, in-degree weight, out-degree, and out-degree weight), and outputs a binary classification prediction. We use Binary Cross-Entropy (BCE) as the loss function and Adam [47] as the optimization algorithm. The graph pooling layer, loss function, and optimization algorithm are shared with the foundation model training setup.
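A sketch of such a graph pooling decoder is shown below; the hidden size and exact feature ordering are illustrative, not the configuration used in our experiments.

```python
import torch
import torch.nn as nn

class GraphReadoutDecoder(nn.Module):
    """Mean-pools node embeddings, concatenates four graph-level snapshot features
    (mean in-degree, in-degree weight, out-degree, out-degree weight), and applies an MLP."""

    def __init__(self, embed_dim: int, num_snapshot_feats: int = 4, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + num_snapshot_feats, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, node_embeddings: torch.Tensor, snapshot_feats: torch.Tensor) -> torch.Tensor:
        pooled = node_embeddings.mean(dim=0)                    # (embed_dim,)
        logits = self.mlp(torch.cat([pooled, snapshot_feats]))  # concat graph-level features
        return torch.sigmoid(logits)                            # probability of growth
```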

We train every single model for 250 epochs with the learning rate set to $15\times10^{-4}$. We adopt a 70%-15%-15% ratio for the train, validation, and test split, respectively, for each training token network. The best model is selected based on the AUC results on the validation sets, and then the model's performance is evaluated using the test sets. To reduce the time complexity of training HTGN, we applied early stopping, with patience and tolerance set to 20 and $5\times10^{-2}$, respectively. Notably, the best model selection and early stopping are only applied after a minimum of 100 training epochs.
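The early-stopping rule can be sketched as below; whether the monitored quantity is the validation AUC or the validation loss, and the exact semantics of the tolerance, follow one plausible reading of the setup described above.

```python
class EarlyStopper:
    """Stop when the monitored validation metric (higher is better) has not improved
    by more than `tolerance` for `patience` consecutive epochs, and only after `min_epochs`."""

    def __init__(self, patience: int = 20, tolerance: float = 5e-2, min_epochs: int = 100):
        self.patience, self.tolerance, self.min_epochs = patience, tolerance, min_epochs
        self.best = float("-inf")
        self.bad_epochs = 0

    def should_stop(self, epoch: int, val_metric: float) -> bool:
        if val_metric > self.best + self.tolerance:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return epoch >= self.min_epochs and self.bad_epochs >= self.patience
```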

6.2 Foundation Model Training Setup

While following a similar training approach as in the single model training, we make specific adjustments for the foundation model training. We set the number of epochs to 300 with a learning rate of $10^{-4}$ and a chronological train-validation-test split ratio of 70-15-15. Early stopping is applied based on the validation loss with a tolerance of $5\times10^{-2}$ and a patience of 30. The best model is selected based on the validation AUC and used to predict on the unseen test datasets. We train six foundation models, each with a different number of networks corresponding to $2^{n}$ datasets, where $n\in[1,6]$. We name each foundation model based on the number of datasets used in training; for example, FM-16 is trained with 16 datasets.

For graph classification tasks on foundation models, we ran all experiments on an NVIDIA Quadro RTX 8000 (48G memory) with 4 standard CPU nodes. We repeated each experiment three times and report the average and standard deviation across runs. In Appendix Figure 7, we report the time per epoch for each foundation model.

6.3 Results

In this section, we present the performance of our foundation models, trained with varying numbers of datasets, on 20 unseen test datasets. We compare our results with the persistence forecast and single model baselines explained in Section 6.1. For visual clarity, Figure 4 shows the test AUC results for FM-4, FM-16, and FM-64 only, while the performance of all six foundation models is shown in Appendix Figure 6. Overall, an upward trend from FM-2 to FM-64 is observed on most datasets, such as QOM, MIR, and BEPRO, highlighting the power of larger foundation models in temporal graph learning.

[Figure 4: Test AUC of FM-4, FM-16, and FM-64 on the 20 unseen token networks, compared with the baselines.]

Table 1: Top rank, average rank, and win ratio (over the single model) of each model across the 20 unseen test networks.

Model             | Top rank ↑ | Avg. rank ↓ | Win ratio ↑
Persist. forecast | 0          | 7.7         | 0.05
Single model      | 6          | 4.5         | -
FM-2              | 0          | 6.1         | 0.40
FM-4              | 0          | 5.5         | 0.50
FM-8              | 3          | 4.3         | 0.60
FM-16             | 2          | 2.9         | 0.65
FM-32             | 3          | 2.7         | 0.70
FM-64             | 6          | 2.6         | 0.65

In Figure 4, FM-64 yields the best AUC on 13 out of 20 test datasets. This result is significant because the foundation models outperform the single models that are specifically trained on these datasets. We detail the prediction performance in Table 1, where we rank all the foundation models and the baselines by their AUC values on each test dataset and report the average rank for each model. The average rank improves with an increasing number of training networks, up to FM-64. We observe a steep decrease in the average rank from FM-2, which has a rank of 6.1 out of 8, to FM-64, which has a rank of 2.6. In other words, training on sixty-four networks instead of two improves the average rank of the foundation model by roughly 50%. In Table 1, we also present the win ratio of each model over the single model. FM-32 has the best win ratio of 0.7; however, its average rank is slightly worse than that of FM-64.

7 Conclusion

In this work, we aim to answer the question: given a collection of observed temporal graphs, is it possible to predict the evolution of an unseen network from the same domain? The answer is yes: it is possible to learn from temporal networks within the same domain and forecast future trends on unseen networks. First, we collected and released a collection of 84 temporal networks for the temporal graph property prediction task. These datasets serve as the foundation for studying neural scaling laws and foundation models on temporal graphs. Next, to learn from a large number of temporal graphs, we presented TGS-train, the first algorithm for training TGNNs across multiple temporal networks. Experimentally, we showed that the neural scaling law also applies to temporal graphs; in particular, the more training networks are used, the better the model performance on unseen test networks. In addition, our trained foundation models can outperform single models trained on individual test networks. Our empirical observations show the high potential of training foundation models on temporal graphs. We believe our TGS benchmark will enable future work to develop novel foundation models for temporal graphs and to study transferability across networks.

References

  • [1]E.Rossi, B.Chamberlain, F.Frasca, D.Eynard, F.Monti, and M.M. Bronstein, “Temporal graph networks for deep learning on dynamic graphs,” CoRR, vol.abs/2006.10637, 2020.
  • [2]T.Bai, Y.Zhang, B.Wu, and J.Nie, “Temporal graph neural networks for social recommendation,” in 2020 IEEE International Conference on Big Data (IEEE BigData 2020), Atlanta, GA, USA, December 10-13, 2020 (X.Wu, C.Jermaine, L.Xiong, X.Hu, O.Kotevska, S.Lu, W.Xu, S.Aluru, C.Zhai, E.Al-Masri, Z.Chen, and J.Saltz, eds.), pp.898–903, IEEE, 2020.
  • [3]A.Pareja, G.Domeniconi, J.Chen, T.Ma, T.Suzumura, H.Kanezashi, T.Kaler, T.B. Schardl, and C.E. Leiserson, “Evolvegcn: Evolving graph convolutional networks for dynamic graphs,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp.5363–5370, AAAI Press, 2020.
  • [4]K.Shamsi, F.Victor, M.Kantarcioglu, Y.Gel, and C.G. Akcora, “Chartalist: Labeled graph datasets for utxo and account-based blockchains,” Advances in Neural Information Processing Systems, vol.35, pp.34926–34939, 2022.
  • [5]S.Huang, F.Poursafaei, J.Danovitch, M.Fey, W.Hu, E.Rossi, J.Leskovec, M.Bronstein, G.Rabusseau, and R.Rabbany, “Temporal graph benchmark for machine learning on temporal graphs,” Advances in Neural Information Processing Systems, vol.36, 2024.
  • [6]Y.You, T.Chen, Y.Sui, T.Chen, Z.Wang, and Y.Shen, “Graph contrastive learning with augmentations,” Advances in neural information processing systems, vol.33, pp.5812–5823, 2020.
  • [7]S.Bubeck, V.Chandrasekaran, R.Eldan, J.Gehrke, E.Horvitz, E.Kamar, P.Lee, Y.T. Lee, Y.Li, S.M. Lundberg, H.Nori, H.Palangi, M.T. Ribeiro, and Y.Zhang, “Sparks of artificial general intelligence: Early experiments with GPT-4,” CoRR, vol.abs/2303.12712, 2023.
  • [8]T.B. Brown, B.Mann, N.Ryder, M.Subbiah, J.Kaplan, P.Dhariwal, A.Neelakantan, P.Shyam, G.Sastry, A.Askell, S.Agarwal, A.Herbert-Voss, G.Krueger, T.Henighan, R.Child, A.Ramesh, D.M. Ziegler, J.Wu, C.Winter, C.Hesse, M.Chen, E.Sigler, M.Litwin, S.Gray, B.Chess, J.Clark, C.Berner, S.McCandlish, A.Radford, I.Sutskever, and D.Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (H.Larochelle, M.Ranzato, R.Hadsell, M.Balcan, and H.Lin, eds.), 2020.
  • [9]K.Rasul, A.Ashok, A.R. Williams, H.Ghonia, R.Bhagwatkar, A.Khorasani, M.J.D. Bayazi, G.Adamopoulos, R.Riachi, N.Hassen, M.Biloš, S.Garg, A.Schneider, N.Chapados, A.Drouin, V.Zantedeschi, Y.Nevmyvaka, and I.Rish, “Lag-llama: Towards foundation models for probabilistic time series forecasting,” 2024.
  • [10]A.Radford, J.W. Kim, C.Hallacy, A.Ramesh, G.Goh, S.Agarwal, G.Sastry, A.Askell, P.Mishkin, J.Clark, G.Krueger, and I.Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (M.Meila and T.Zhang, eds.), vol.139 of Proceedings of Machine Learning Research, pp.8748–8763, PMLR, 2021.
  • [11]M.Awais, M.Naseer, S.Khan, R.M. Anwer, H.Cholakkal, M.Shah, M.-H. Yang, and F.S. Khan, “Foundational models defining a new era in vision: A survey and outlook,” arXiv preprint arXiv:2307.13721, 2023.
  • [12]R.Bommasani, D.A. Hudson, E.Adeli, R.B. Altman, S.Arora, S.von Arx, M.S. Bernstein, J.Bohg, A.Bosselut, E.Brunskill, E.Brynjolfsson, S.Buch, D.Card, R.Castellon, N.S. Chatterji, A.S. Chen, K.Creel, J.Q. Davis, D.Demszky, C.Donahue, M.Doumbouya, E.Durmus, S.Ermon, J.Etchemendy, K.Ethayarajh, L.Fei-Fei, C.Finn, T.Gale, L.Gillespie, K.Goel, N.D. Goodman, S.Grossman, N.Guha, T.Hashimoto, P.Henderson, J.Hewitt, D.E. Ho, J.Hong, K.Hsu, J.Huang, T.Icard, S.Jain, D.Jurafsky, P.Kalluri, S.Karamcheti, G.Keeling, F.Khani, O.Khattab, P.W. Koh, M.S. Krass, R.Krishna, R.Kuditipudi, and etal., “On the opportunities and risks of foundation models,” CoRR, vol.abs/2108.07258, 2021.
  • [13]Q.Dong, L.Li, D.Dai, C.Zheng, Z.Wu, B.Chang, X.Sun, J.Xu, L.Li, and Z.Sui, “A survey for in-context learning,” CoRR, vol.abs/2301.00234, 2023.
  • [14]H.Mao, Z.Chen, W.Tang, J.Zhao, Y.Ma, T.Zhao, N.Shah, M.Galkin, and J.Tang, “Graph foundation models,” 2024.
  • [15]M.Galkin, X.Yuan, H.Mostafa, J.Tang, and Z.Zhu, “Towards foundation models for knowledge graph reasoning,” 2024.
  • [16]D.Beaini, S.Huang, J.A. Cunha, Z.Li, G.Moisescu-Pareja, O.Dymov, S.Maddrell-Mander, C.McLean, F.Wenkel, L.Müller, etal., “Towards foundational models for molecular learning on large-scale multi-task datasets,” in The Twelfth International Conference on Learning Representations, 2023.
  • [17]O.Méndez-Lucio, C.Nicolaou, and B.Earnshaw, “Mole: a molecular foundation model for drug discovery,” arXiv preprint arXiv:2211.02657, 2022.
  • [18]S.Jin and R.Zafarani, “The spectral zoo of networks: Embedding and visualizing networks with spectral moments,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.1426–1434, 2020.
  • [19]F.Poursafaei, S.Huang, K.Pelrine, and R.Rabbany, “Towards better evaluation for dynamic link prediction,” in Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022 (S.Koyejo, S.Mohamed, A.Agarwal, D.Belgrave, K.Cho, and A.Oh, eds.), 2022.
  • [20]W.Hu, M.Fey, M.Zitnik, Y.Dong, H.Ren, B.Liu, M.Catasta, and J.Leskovec, “Open graph benchmark: Datasets for machine learning on graphs,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (H.Larochelle, M.Ranzato, R.Hadsell, M.Balcan, and H.Lin, eds.), 2020.
  • [21]S.Huang, F.Poursafaei, J.Danovitch, M.Fey, W.Hu, E.Rossi, J.Leskovec, M.M. Bronstein, G.Rabusseau, and R.Rabbany, “Temporal graph benchmark for machine learning on temporal graphs,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 (A.Oh, T.Naumann, A.Globerson, K.Saenko, M.Hardt, and S.Levine, eds.), 2023.
  • [22]J.Li, H.Shomer, H.Mao, S.Zeng, Y.Ma, N.Shah, J.Tang, and D.Yin, “Evaluating graph neural networks for link prediction: Current pitfalls and new benchmarking,” Advances in Neural Information Processing Systems, vol.36, 2024.
  • [23]Z.Zhang, B.Luo, S.Lu, and B.He, “Live graph lab: Towards open, dynamic and real transaction graphs with NFT,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 (A.Oh, T.Naumann, A.Globerson, K.Saenko, M.Hardt, and S.Levine, eds.), 2023.
  • [24]Y.Seo, M.Defferrard, P.Vandergheynst, and X.Bresson, “Structured sequence modeling with graph convolutional recurrent networks,” 2016.
  • [25]A.Sankar, Y.Wu, L.Gou, W.Zhang, and H.Yang, “Dynamic graph representation learning via self-attention networks,” 2019.
  • [26]J.Chen, X.Wang, and X.Xu, “GC-LSTM: graph convolution embedded LSTM for dynamic network link prediction,” Appl. Intell., vol.52, no.7, pp.7513–7528, 2022.
  • [27]J.Li, Z.Han, H.Cheng, J.Su, P.Wang, J.Zhang, and L.Pan, “Predicting path failure in time-evolving graphs,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019 (A.Teredesai, V.Kumar, Y.Li, R.Rosales, E.Terzi, and G.Karypis, eds.), pp.1279–1289, ACM, 2019.
  • [28]K.Shamsi, F.Poursafaei, S.Huang, B.T.G. Ngo, B.Coskunuzer, and C.G. Akcora, “Graphpulse: Topological representations for temporal graph property prediction,” in The Twelfth International Conference on Learning Representations, 2024.
  • [29]K.Shamsi, F.Poursafaei, S.Huang, B.T.G. Ngo, B.Coskunuzer, and C.G. Akcora, “Graphpulse: Topological representations for temporal graph property prediction,” in The Twelfth International Conference on Learning Representations, 2023.
  • [30]J.S. Rosenfeld, A.Rosenfeld, Y.Belinkov, and N.Shavit, “A constructive prediction of the generalization error across scales,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020.
  • [31]J.Kaplan, S.McCandlish, T.Henighan, T.B. Brown, B.Chess, R.Child, S.Gray, A.Radford, J.Wu, and D.Amodei, “Scaling laws for neural language models,” CoRR, vol.abs/2001.08361, 2020.
  • [32]S.Abnar, M.Dehghani, B.Neyshabur, and H.Sedghi, “Exploring the limits of large scale pre-training,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, OpenReview.net, 2022.
  • [33]Y.Bahri, E.Dyer, J.Kaplan, J.Lee, and U.Sharma, “Explaining neural scaling laws,” CoRR, vol.abs/2102.06701, 2021.
  • [34]A.Aghajanyan, L.Yu, A.Conneau, W.Hsu, K.Hambardzumyan, S.Zhang, S.Roller, N.Goyal, O.Levy, and L.Zettlemoyer, “Scaling laws for generative mixed-modal language models,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (A.Krause, E.Brunskill, K.Cho, B.Engelhardt, S.Sabato, and J.Scarlett, eds.), vol.202 of Proceedings of Machine Learning Research, pp.265–279, PMLR, 2023.
  • [35]J.Liu, H.Mao, Z.Chen, T.Zhao, N.Shah, and J.Tang, “Neural scaling laws on graphs,” CoRR, vol.abs/2402.02054, 2024.
  • [36]M.Galkin, X.Yuan, H.Mostafa, J.Tang, and Z.Zhu, “Towards foundation models for knowledge graph reasoning,” in The Twelfth International Conference on Learning Representations, 2023.
  • [37]L.Xia, B.Kao, and C.Huang, “Opengraph: Towards open graph foundation models,” arXiv preprint arXiv:2403.01121, 2024.
  • [38]S.M. Kazemi, R.Goel, K.Jain, I.Kobyzev, A.Sethi, P.Forsyth, and P.Poupart, “Representation learning for dynamic graphs: A survey,” Journal of Machine Learning Research, vol.21, no.70, pp.1–73, 2020.
  • [39]M.Yang, M.Zhou, M.Kalander, Z.Huang, and I.King, “Discrete-time temporal network embedding via implicit hierarchical learning in hyperbolic space,” in KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021 (F.Zhu, B.C. Ooi, and C.Miao, eds.), pp.1975–1985, ACM, 2021.
  • [40]G.Wood etal., “Ethereum: A secure decentralised generalised transaction ledger,” Ethereum project yellow paper, vol.151, no.2014, pp.1–32, 2014.
  • [41]C.G. Akcora, Y.R. Gel, and M.Kantarcioglu, “Blockchain: A graph primer,” CoRR, vol.abs/1708.08749, 2017.
  • [42]N.Szabo, “The idea of smart contracts,” Nick Szabo’s Papers and Concise Tutorials, 1997.
  • [43]Z.Zheng, S.Xie, H.Dai, W.Chen, X.Chen, J.Weng, and M.Imran, “An overview on smart contracts: Challenges, advances and platforms,” Future Gener. Comput. Syst., vol.105, pp.475–491, 2020.
  • [44]M.DiAngelo and G.Salzer, “Tokens, types, and standards: identification and utilization in ethereum,” in 2020 IEEE International Conference on Decentralized Applications and Infrastructures (DAPPS), pp.1–10, IEEE, 2020.
  • [45]H.-M. Kim, G.-W. Bock, and G.Lee, “Predicting ethereum prices with machine learning based on blockchain information,” Expert Systems with Applications, vol.184, p.115480, 2021.
  • [46]S.Salcedo-Sanz, D.Casillas-Pérez, J.D. Ser, C.Casanova-Mateo, L.Cuadra, M.Piles, and G.Camps-Valls, “Persistence in complex systems,” Physics Reports, vol.957, pp.1–73, 2022.
  • [47]D.P. Kingma and J.Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (Y.Bengio and Y.LeCun, eds.), 2015.

Appendix A Broader Impact

Foundation models have broad applications across various domains, including the financial sector, which is our primary focus. Our goal is to develop a temporal foundation model capable of predicting future trends with minimal training. This model can identify similar behaviors and be utilized in real-time scenarios such as network trend analysis or token price prediction.

Negative Impact. Although this work aims to pave the way for significant advances in temporal graph learning, there are potential negative impacts requiring careful consideration. First, the focus on pre-training models on the TGS collection of token transaction networks may inadvertently bias the models towards specific types of data, reducing their generalizability and effectiveness when applied to other domains or types of temporal graphs. Second, the observed neural scaling law, which indicates that larger pre-training datasets lead to better performance, requires significant computational resources. On one hand, extensive model pre-training entails high energy consumption and potential environmental costs. On the other hand, such computational requirements could lead to a concentration of advancements in well-funded institutions, potentially stifling innovation and diversity of thought in the field. Finally, the emphasis on model performance might overshadow the importance of interpretability and transparency. Addressing these potential negative impacts is crucial to ensure the responsible development and deployment of temporal graph learning.

Appendix B Limitations

Our work has the following limitations. i) Our scaling results indicate that training with a larger number of networks enhances model generalizability. Although we limited the foundation model to sixty-four networks due to resource constraints, training on a larger number of networks could further improve performance. ii) While we used the discrete-time dynamic graph setting as our benchmark, this approach can be generalized to continuous-time graphs, representing a promising area for future research. iii) Although our current focus is on financial networks, the temporal scaling law should also be studied for other domains, such as social media or transportation networks, which we plan to explore in future work.

Appendix C Dataset Documentation and Intended Use

All datasets introduced by TGS are intended for academic usage under MIT license. We, as authors, bear all responsibility in case of violation of rights. Here are the relevant links for code, dataset and website:

Maintenance plan. To create a comprehensive, reliable, and reproducible benchmark for temporal graph scaling, we plan to continuously develop and maintain TGS with input and involvement from the community. Our objective is to expand the dataset by extracting and adding more token networks to support the training of larger foundation models in the future. The TGS dataset is hosted and maintained by the Digital Research Alliance of Canada, funded by the Government of Canada.

Appendix D Additional Dataset Statistics

We summarize detailed statistics of each token network in the TGS dataset in Table 2. In the table, the growth rate is the ratio of label 1, indicating an increase in the number of edge counts with respect to the problem definition in Section 3. In addition, the novelty score, the average ratio of new edges in each timestamp, and the surprise score, the ratio of edges that only appear in the test set, both introduced by Poursafaei et al. [19], are defined as follows:

$\mathrm{novelty} = \frac{1}{T}\sum_{t=1}^{T}\frac{|E^{t}\setminus E^{t}_{seen}|}{|E^{t}|}$,   (1a)
$\mathrm{surprise} = \frac{|E_{test}\setminus E_{train}|}{|E_{test}|}$.   (1b)

where $E^{t}$ and $E^{t}_{seen}$ denote the set of edges present at timestamp $t$ and the set of edges seen in previous timestamps, respectively. $E_{test}$ denotes the edges that appear in the test set, and $E_{train}$ the edges that appear in the train set.
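
For concreteness, the following is a minimal sketch of how these two scores could be computed from per-snapshot edge lists; the function names and the data layout (a list of edge sets, one per timestamp) are illustrative assumptions rather than the released TGS loading code.

```python
def novelty_score(snapshots):
    """Average fraction of edges in each snapshot never seen in earlier snapshots."""
    seen, ratios = set(), []
    for edges in snapshots:          # `snapshots`: list of edge sets, one per timestamp
        if edges:
            ratios.append(len(edges - seen) / len(edges))
        seen |= edges
    return sum(ratios) / len(ratios)


def surprise_score(train_snapshots, test_snapshots):
    """Fraction of test edges that never appear in the training split."""
    train_edges = set().union(*train_snapshots)
    test_edges = set().union(*test_snapshots)
    return len(test_edges - train_edges) / len(test_edges)
```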

Comparison between training and testing sets. To learn the scaling law of TGNNs, we divide TGS into two disjoint sets: 64 randomly selected token networks are used for training, and the remaining 20 token networks are used to evaluate performance. The distributions of nodes, transactions, and duration (in days) over the training and test sets are shown in Figure 5. The training set supports the foundation model in generalizing the characteristics of the entire TGS dataset, since the node, edge, and duration distributions of the training networks in Figures 5(a), 5(b), and 5(c) closely resemble those across all 84 token networks of TGS. In addition, the variation of dataset characteristics in the test set is shown in Figures 5(d), 5(e), and 5(f).
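
As a point of reference for reproducing such a split, below is a minimal sketch that randomly partitions the 84 token networks into 64 pre-training and 20 held-out networks; the `token_names` list and the seed are placeholders, not the exact split used in our experiments.

```python
import random

def split_tgs(token_names, n_train=64, seed=0):
    """Randomly partition token networks into disjoint pre-training/test sets.

    `token_names`: the 84 TGS token identifiers (placeholder input).
    The seed is arbitrary and not the one behind the reported split.
    """
    rng = random.Random(seed)
    shuffled = list(token_names)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```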

[Figure 5: Distributions of nodes, transactions, and duration (in days) for the training set (a–c) and the test set (d–f).]

Token | Node | Transaction | Timestamp (days) | Growth rate | Novelty | Surprise
ARC11325709686060.430.320.88
CELR6535023580716910.490.560.96
CMT868952059613090.450.720.92
DRGN11345334184921640.440.570.97
GHST3515618095511460.430.510.93
INU8556663151540.270.410.59
IOTX6307928846919930.450.560.99
QSP11797729967121780.450.670.99
REP832822248433460.460.690.96
RFD232081736951690.30.390.6
TNT8824731635212160.430.550.93
TRAC7166729918121100.460.540.97
RLB280332402911290.430.490.76
steCRV1907921153810330.450.530.9
ALBT6304243488111520.430.440.89
POLS12815955470511320.450.610.94
SWAP6923050976912130.460.450.79
SUPER832995020309860.470.460.85
RARI8718650296012070.430.470.91
KP3R3932349325811020.430.330.88
MIR7998444499810660.450.430.92
aUSDC2374247568010670.460.40.73
LUSD258524304739430.480.360.87
PICKLE2849843026211490.480.340.69
DODO4704639044311310.470.450.91
YFII4396439198411960.440.440.96
STARL715903699138560.460.480.86
LQTY346873742309430.450.340.91
FEG11829436758410070.40.620.92
AUDIO9121836268511080.450.580.95
OHM457283770686900.430.460.88
WOOL168743511787160.410.180.41
Metis525863431419070.440.480.89
cDAI5275335805014370.450.460.9
BITCOIN340513470541780.480.390.63
INJ6047231282211130.460.520.98
MIM230382693668850.440.40.89
GLM5338523491210800.50.530.96
Mog145902406801070.370.380.55
DPI4062723424611500.490.50.86
LINA4534222714711440.450.460.95
Yf-DAI2246622687511580.420.310.87
BOB428062120991990.350.480.73
RGT3527721193211100.440.460.98
TVK4253920808210620.410.480.93
RSR506452059066590.470.620.91
WOJAK343411986532010.370.480.73
ANT3651720026211070.470.460.93
LADYS374861921761810.370.520.79
ETH2x-FLI110081990889650.470.280.84
TURBO386381890481890.330.480.72
REPv23906119136711940.480.50.97
NOIA2979818552811330.460.370.7
0x0215311824302830.510.460.81
PSYOP254501688961690.320.390.59
ShibDoge400231346976800.430.530.8
ADX1456712375511880.440.40.91
BAG118601226342980.310.440.87
QOM217571182925980.460.410.81
BEPRO2652112026111320.460.480.87
AIOZ292311199269470.430.490.89
PRE4047611862511130.50.550.86
CRU1999011771211440.50.430.95
POOH272451116411930.260.490.69
DERC242771112058240.450.490.83
stkAAVE3735511092411280.420.570.71
BTRFLY84501083714530.480.340.44
SDEX91271048692400.410.440.75
XCN200851041856070.460.420.84
HOP370041026505140.410.60.88
MAHA18401961807490.430.470.91
DINO15837941403580.440.440.74
bendWETH1454968985930.510.210.51
PUSH14501931039360.460.380.83
SPONGE25852904681840.310.660.81
sILV212838929056110.40.340.48
SLP66759536811510.430.360.91
crvUSD2950886471740.610.370.73
MUTE12426823459770.430.460.95
EVERMOON7552798681630.240.350.52
HOICHI5075773614360.360.320.71
DOGE2.07664790471230.450.380.66
ORN4401023945111340.460.470.87
aDAI1364818705010680.450.460.82

Appendix E Hyperbolic Temporal Graph Network (HTGN)

To interpret hyperbolic embeddings, Yang et al. [39] adopt the Poincaré ball model with negative curvature $-c$ (for $c>0$), which corresponds to the Riemannian manifold $\mathbb{H}^{n,c}=\{x\in\mathbb{R}^{n}: c\|x\|^{2}<1\}$, an open $n$-dimensional ball. Given a Euclidean vector $x_{i}^{E}\in\mathbb{R}^{d}$, we consider it as a point in the tangent space $\mathcal{T}_{x'}\mathbb{H}^{d,c}$ and adopt the exponential map to project it into hyperbolic space:

$x_{i}^{\mathcal{H}}=\exp_{x'}^{c}(x_{i}^{E})$   (2)

This results in $x_{i}^{\mathcal{H}}\in\mathbb{H}^{d,c}$, which then serves as input to the HGNN layer as follows [39]:

$\mathbf{m}_{i}^{\mathcal{H}} = W\otimes^{c}\mathbf{x}_{i}^{\mathcal{H}}\oplus^{c}\mathbf{b}$,   (3a)
$\tilde{\mathbf{m}}_{i}^{\mathcal{H}} = \exp_{\mathbf{x}'}^{c}\big(\sum_{j\in\mathcal{N}(i)}\alpha_{ij}\log_{\mathbf{x}'}^{c}(\mathbf{m}_{j}^{\mathcal{H}})\big)$,   (3b)
$\tilde{\mathbf{x}}_{i}^{\mathcal{H}} = \exp_{\mathbf{x}'}^{c}\big(\sigma(\log_{\mathbf{x}'}^{c}(\tilde{\mathbf{m}}_{i}^{\mathcal{H}}))\big)$.   (3c)

where $W$ and $\mathbf{b}$ are learnable parameters, and the hyperbolic activation function $\sigma$ is realized by applying the logarithmic and exponential maps. The HGNN leverages attention-based aggregation, assigning an attention score $\alpha_{ij}$ that indicates the importance of neighbour $j$ to node $i$, computed as follows:

$\alpha_{ij} = \mathrm{softmax}_{j\in\mathcal{N}(i)}(s_{ij}) = \frac{\exp(s_{ij})}{\sum_{j'\in\mathcal{N}(i)}\exp(s_{ij'})}$,   (4)
$s_{ij} = \mathrm{LeakyReLU}\big(a^{T}[\log_{0}^{c}(m_{i}^{l})\,\|\,\log_{0}^{c}(m_{j}^{l})]\big)$,

where $a$ is a trainable vector and $\|$ denotes the concatenation operation.
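
To make the attention-based aggregation in Equations (3b) and (4) concrete, here is a minimal PyTorch-style sketch that computes the attention weights on tangent-space features; the dense adjacency matrix and the pre-computed log-mapped messages are simplifying assumptions, not a reproduction of the HTGN implementation.

```python
import torch
import torch.nn.functional as F

def hyperbolic_attention(m_tan, adj, a):
    """Attention weights of Eq. (4) computed on tangent-space messages.

    m_tan: (N, d) messages already mapped through log_0^c.
    adj:   (N, N) binary adjacency matrix (dense, for simplicity).
    a:     (2d,)  trainable attention vector.
    """
    n = m_tan.size(0)
    # s_ij = LeakyReLU(a^T [log_0^c(m_i) || log_0^c(m_j)])
    pair = torch.cat(
        [m_tan.unsqueeze(1).expand(n, n, -1),   # m_i broadcast over columns j
         m_tan.unsqueeze(0).expand(n, n, -1)],  # m_j broadcast over rows i
        dim=-1,
    )
    scores = F.leaky_relu(pair @ a)
    # softmax restricted to each node's neighbourhood N(i)
    scores = scores.masked_fill(adj == 0, float("-inf"))
    return torch.softmax(scores, dim=1)
```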

The output of the HGNN, $\tilde{X}_{t}^{\mathcal{H}}$, is then used as input to the HGRU along with the attentive hidden state $\tilde{H}_{t-1}^{\mathcal{H}}$ obtained by HTA, which generalizes $H_{t-1}$ to the latest $w$ snapshots $\{H_{t-w},\dots,H_{t-1}\}$ [39]. The operations behind the HGRU are characterized by the following equations [39]:

$X_{t}^{E}=\log_{\mathbf{x}'}^{c}(\tilde{X}_{t}^{\mathcal{H}})$,   (5a)
$H_{t-1}^{E}=\log_{\mathbf{x}'}^{c}(\tilde{H}_{t-1}^{\mathcal{H}})$,   (5b)
$P_{t}^{E}=\sigma(W_{z}X_{t}^{E}+U_{z}H_{t-1}^{E})$,   (5c)
$R_{t}^{E}=\sigma(W_{r}X_{t}^{E}+U_{r}H_{t-1}^{E})$,   (5d)
$\tilde{H}_{t}^{E}=\tanh(W_{h}X_{t}^{E}+U_{h}(R_{t}^{E}\odot H_{t-1}^{E}))$,   (5e)
$H_{t}^{E}=(1-P_{t}^{E})\odot\tilde{H}_{t}^{E}+P_{t}^{E}\odot H_{t-1}^{E}$,   (5f)
$H_{t}^{\mathcal{H}}=\exp_{\mathbf{x}'}^{c}(H_{t}^{E})$.   (5g)

where $W_{z},W_{r},W_{h},U_{z},U_{r},U_{h}$ are the trainable weight matrices, $P_{t}^{E}$ is the update gate that controls the output, and $R_{t}^{E}$ is the reset gate that balances the input and memory [39].
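
In other words, Equations (5a)–(5g) map the hyperbolic states to the tangent space, apply a standard GRU update, and map the result back. Below is a minimal sketch under the simplifying assumption that all maps are taken at the origin of the Poincaré ball (using the standard closed forms for $\exp_{0}^{c}$ and $\log_{0}^{c}$); it is illustrative rather than the HTGN reference implementation, and `nn.GRUCell` stands in for the explicit gates $W_{z},W_{r},W_{h},U_{z},U_{r},U_{h}$.

```python
import torch
import torch.nn as nn

def exp0(v, c):
    """Exponential map at the origin of the Poincaré ball with curvature -c."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def log0(y, c):
    """Logarithmic map at the origin (inverse of exp0)."""
    sqrt_c = c ** 0.5
    norm = y.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return torch.atanh((sqrt_c * norm).clamp(max=1 - 1e-5)) * y / (sqrt_c * norm)

class TangentHGRU(nn.Module):
    """GRU cell applied in tangent space, mirroring Eqs. (5a)-(5g)."""

    def __init__(self, dim, c=1.0):
        super().__init__()
        self.c = c
        self.cell = nn.GRUCell(dim, dim)  # bundles the W_* and U_* gate matrices

    def forward(self, x_hyp, h_prev_hyp):
        x_e = log0(x_hyp, self.c)         # Eq. (5a)
        h_e = log0(h_prev_hyp, self.c)    # Eq. (5b)
        h_new = self.cell(x_e, h_e)       # Eqs. (5c)-(5f)
        return exp0(h_new, self.c)        # Eq. (5g)
```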

Appendix F Additional Results

Here we present the test results for the six foundation models trained on different numbers of pre-training networks, as well as the results of the single model and the persistence forecast. Figure 6 illustrates the AUC of these models on the test set. On most datasets, the foundation models outperform the single model, and on all datasets they outperform the persistence forecast.

Table 3 presents the average and standard deviation of AUC for all models. FM-64 achieves the highest performance on six datasets and second-best on five, while FM-32 achieves the highest performance on three datasets and second-best on eight. These results show the capability of foundation models to perform downstream tasks on unseen datasets.

[Figure 6: AUC of the foundation models, the single model, and the persistence forecast on the twenty test token networks.]

Token | Per. Fore. | Single Model | FM-2 | FM-4 | FM-8 | FM-16 | FM-32 | FM-64
WOJAK | 0.378 | 0.479 ± 0.005 | 0.766 ± 0.007 | 0.769 ± 0.001 | 0.807 ± 0.004 | 0.794 ± 0.010 | 0.777 ± 0.008 | 0.774 ± 0.022
DOGE2.0 | 0.250 | 0.590 ± 0.059 | 0.763 ± 0.077 | 0.734 ± 0.025 | 0.759 ± 0.022 | 0.796 ± 0.036 | 0.808 ± 0.032 | 0.823 ± 0.011
EVERMOON | 0.241 | 0.512 ± 0.023 | 0.598 ± 0.028 | 0.645 ± 0.019 | 0.683 ± 0.009 | 0.666 ± 0.006 | 0.662 ± 0.032 | 0.671 ± 0.020
QOM | 0.334 | 0.633 ± 0.017 | 0.665 ± 0.066 | 0.691 ± 0.041 | 0.714 ± 0.038 | 0.725 ± 0.013 | 0.751 ± 0.024 | 0.755 ± 0.023
SDEX | 0.423 | 0.762 ± 0.034 | 0.712 ± 0.106 | 0.745 ± 0.046 | 0.790 ± 0.051 | 0.758 ± 0.081 | 0.839 ± 0.062 | 0.883 ± 0.016
ETH2x-FLI | 0.355 | 0.610 ± 0.059 | 0.582 ± 0.092 | 0.598 ± 0.013 | 0.661 ± 0.025 | 0.715 ± 0.020 | 0.710 ± 0.015 | 0.721 ± 0.006
BEPRO | 0.393 | 0.655 ± 0.038 | 0.668 ± 0.016 | 0.696 ± 0.010 | 0.716 ± 0.002 | 0.731 ± 0.024 | 0.735 ± 0.009 | 0.750 ± 0.014
XCN | 0.592 | 0.668 ± 0.099 | 0.761 ± 0.017 | 0.737 ± 0.042 | 0.733 ± 0.024 | 0.769 ± 0.021 | 0.770 ± 0.024 | 0.763 ± 0.038
BAG | 0.792 | 0.673 ± 0.227 | 0.719 ± 0.072 | 0.751 ± 0.060 | 0.781 ± 0.056 | 0.779 ± 0.019 | 0.799 ± 0.022 | 0.750 ± 0.045
TRAC | 0.400 | 0.712 ± 0.071 | 0.743 ± 0.029 | 0.761 ± 0.007 | 0.774 ± 0.010 | 0.786 ± 0.009 | 0.781 ± 0.002 | 0.779 ± 0.013
DERC | 0.353 | 0.683 ± 0.013 | 0.659 ± 0.013 | 0.675 ± 0.016 | 0.688 ± 0.011 | 0.732 ± 0.027 | 0.716 ± 0.029 | 0.739 ± 0.030
Metis | 0.423 | 0.715 ± 0.122 | 0.713 ± 0.043 | 0.727 ± 0.010 | 0.713 ± 0.034 | 0.734 ± 0.003 | 0.744 ± 0.008 | 0.743 ± 0.005
REPv2 | 0.321 | 0.760 ± 0.012 | 0.728 ± 0.017 | 0.756 ± 0.007 | 0.751 ± 0.011 | 0.785 ± 0.014 | 0.782 ± 0.016 | 0.780 ± 0.012
DINO | 0.431 | 0.730 ± 0.195 | 0.654 ± 0.023 | 0.751 ± 0.012 | 0.760 ± 0.015 | 0.749 ± 0.036 | 0.748 ± 0.010 | 0.728 ± 0.005
HOICHI | 0.374 | 0.808 ± 0.047 | 0.739 ± 0.083 | 0.793 ± 0.024 | 0.788 ± 0.008 | 0.794 ± 0.018 | 0.787 ± 0.035 | 0.804 ± 0.011
MUTE | 0.536 | 0.649 ± 0.015 | 0.580 ± 0.015 | 0.600 ± 0.018 | 0.593 ± 0.007 | 0.620 ± 0.017 | 0.622 ± 0.005 | 0.635 ± 0.014
GLM | 0.427 | 0.830 ± 0.029 | 0.653 ± 0.080 | 0.724 ± 0.025 | 0.749 ± 0.045 | 0.798 ± 0.038 | 0.823 ± 0.027 | 0.807 ± 0.036
MIR | 0.327 | 0.750 ± 0.005 | 0.552 ± 0.069 | 0.568 ± 0.015 | 0.652 ± 0.039 | 0.715 ± 0.018 | 0.711 ± 0.007 | 0.725 ± 0.016
stkAAVE | 0.426 | 0.702 ± 0.042 | 0.626 ± 0.029 | 0.597 ± 0.020 | 0.637 ± 0.028 | 0.658 ± 0.022 | 0.685 ± 0.016 | 0.667 ± 0.024
ADX | 0.362 | 0.769 ± 0.018 | 0.702 ± 0.011 | 0.701 ± 0.003 | 0.685 ± 0.042 | 0.701 ± 0.009 | 0.700 ± 0.004 | 0.696 ± 0.012

Appendix G Computing Resources

For graph classification tasks on the foundation models, we ran all experiments on an NVIDIA Quadro RTX 8000 GPU (48 GB memory) with 4 standard CPU nodes (either Milan Zen 3, 2.8 GHz, with 768 GB of memory each, or Rome Zen 2, 2.5 GHz, with 256 GB of memory each). We repeated each experiment three times and report the average and standard deviation over the runs. In Appendix Figure 7, we report the time per epoch for each foundation model.

[Figure 7: Time per epoch for each foundation model.]
