From 79d4d216388dcc07688227cbcb9696c7325030d3 Mon Sep 17 00:00:00 2001 From: Marco Walz Date: Wed, 6 May 2026 17:01:52 +0200 Subject: [PATCH 1/2] docs: add evolution & scaling concept page (fault tolerance, subnet creation, chain evolution) Migrates 4 Learn Hub articles from the "Evolution & Scaling" section into a single docs/concepts/evolution-scaling.md page covering fault tolerance and node recovery, horizontal subnet scaling and XNet messaging, and governance- driven protocol upgrades via epoch-boundary transitions. Updates chain-key-cryptography.md: replaces Learn Hub chain-evolution link. Updates glossary.md: replaces Learn Hub fault-tolerance link. --- .../evolution-scaling/chain-evolution.md | 59 -------------- .../evolution-scaling/evolution-scaling.md | 18 ----- .../evolution-scaling/fault-tolerance.md | 57 -------------- .../evolution-scaling/subnet-creation.md | 31 -------- docs/concepts/chain-key-cryptography.md | 2 +- docs/concepts/evolution-scaling.md | 78 +++++++++++++++++++ docs/references/glossary.md | 2 +- 7 files changed, 80 insertions(+), 167 deletions(-) delete mode 100644 .migration/learn-hub/how-does-icp-work/evolution-scaling/chain-evolution.md delete mode 100644 .migration/learn-hub/how-does-icp-work/evolution-scaling/evolution-scaling.md delete mode 100644 .migration/learn-hub/how-does-icp-work/evolution-scaling/fault-tolerance.md delete mode 100644 .migration/learn-hub/how-does-icp-work/evolution-scaling/subnet-creation.md create mode 100644 docs/concepts/evolution-scaling.md diff --git a/.migration/learn-hub/how-does-icp-work/evolution-scaling/chain-evolution.md b/.migration/learn-hub/how-does-icp-work/evolution-scaling/chain-evolution.md deleted file mode 100644 index 450894a..0000000 --- a/.migration/learn-hub/how-does-icp-work/evolution-scaling/chain-evolution.md +++ /dev/null @@ -1,59 +0,0 @@ ---- -learn_hub_id: 34210120121748 -learn_hub_url: "https://learn.internetcomputer.org/hc/en-us/articles/34210120121748-Chain-Evolution" 
-learn_hub_title: "Chain Evolution" -learn_hub_section: "Evolution & Scaling" -learn_hub_category: "How does ICP work?" -migrated: false ---- - -# Chain Evolution - -The Internet Computer is governed by the [Network Nervous System (NNS)](https://learn.internetcomputer.org/hc/en-us/articles/33692645961236), its fully onchain governance system. One of the many duties of the NNS is to orchestrate upgrades of ICP to a new protocol version. Upgrading a blockchain protocol requires solutions to several challenging problems posed by the nature of decentralized systems including how to allow arbitrary changes to the protocol, preserve state of all canister smart contracts, minimize downtime, and roll out upgrades autonomously. - -Any software needs to be updated on a regular basis to stay competitive in the market. This could be to fix bugs, add new features, change the algorithms, change the underlying technology, etc. Blockchain protocols are no different. As a community, we keep learning better ways to solve our problems and would like to upgrade our blockchain protocols accordingly. For example, Ethereum had the “The Merge” upgrade, which upgraded their protocol from Proof of Work to Proof of Stake. Bitcoin had the “Taproot” upgrade, which extended the options for transaction verification. - -While upgrading a blockchain protocol is extremely crucial for its success, most blockchains including Bitcoin and Ethereum are not designed to do so easily and frequently. This is primarily because blockchains are not controlled by a single authority. Every upgrade proposal has to be evaluated by the community. However, the community's opinion on the proposals may be split. There is no quick and formal framework to finalize the decisions and build new features. Upgrades to the protocol potentially cause a fork in the network. As a result, upgrading a blockchain protocol could take years of joint effort by the community. 
Ethereum went through only [18 protocol upgrades in a 7.5 year time span](https://ethereum.org/en/history/). - -The Internet Computer is a unique blockchain that is designed to be easily upgradeable with minimal user-perceived downtime and without any forks, while still requiring consensus by the community for each upgrade. In the more than three years after genesis, ICP has upgraded many times, approximately once per week, adding crucial features such as deterministic time slicing, Bitcoin integration, HTTPS outcalls, chain-key signatures for ECDSA, Schnorr, and EdDSA, increased stable memory, etc. - -The “protocol upgrades” feature is designed with the following goals: (1) Allow arbitrary changes to the Internet Computer Protocol; (2) Preserve the state between upgrades; (3) Minimize downtime; (4) Roll out upgrades autonomously. - -Protocol upgrades are made feasible by the blockchain governance system called the Network Nervous System (NNS). In the NNS, there is a component called “registry”, which stores all the configuration of the Internet Computer. A versioning system is implemented for the configuration. Each mutation to the configuration shows up as a new version in the registry. The registry has a record for each subnet which includes a replica version, the list of nodes in the subnet, the cryptographic key material to be used by the subnet, etc. Note that the registry stores the desired configuration. The subnets might actually be running one of the older configurations. - -![Registry implements versioning mechanism](https://csojb-wiaaa-aaaal-qjftq-cai.icp0.io/_astro/registry-versions.-WLMQ1AE_Z2rzSoX.webp) - -To trigger a protocol upgrade, one has to submit a proposal in the NNS to change the configuration of the registry. The proposal can be voted on by anyone who staked their ICP tokens. If a majority of voters accept the proposal, then the registry is changed accordingly. 
- -![Proposal to upgrade a subnet to a new replica version](https://csojb-wiaaa-aaaal-qjftq-cai.icp0.io/_astro/upgrade-proposal.CEzVfpIO_2t9Hbw.webp) - -Protocol upgrades are rolled-out on a per-subnet basis. Each subnet is run by many nodes. Each node runs 2 processes — (1) the Replica and (2) the Orchestrator. The replica consists of the 4-layer software stack that maintains the blockchain. The orchestrator downloads and manages the replica software. The orchestrator regularly queries the NNS registry for any updates. If there is a new registry version, the orchestrator downloads the corresponding change and informs the replica about it. - -In each consensus round, one of the nodes in the subnet (called the block maker) proposes a block. In every block, the block maker includes the latest registry version it downloaded from the registry canister. Other nodes notarize the block only when they have the referenced registry available. - -If the subnet record in the registry indicates a replica version change, the orchestrator downloads the corresponding software. After all the nodes in the subnet agree upon the latest registry version via consensus, the obvious next step is to switch to the new version. To avoid forks, it is crucial that all the nodes coordinate and switch their version at the same block height. To achieve this, the consensus protocol is divided into epochs. Each epoch is a few hundred consensus rounds (can be configured in the registry). Throughout an epoch, all the replicas in the subnet run the same Replica version, even if a newer Replica version is found in the registry and included in the blocks. Protocol upgrades happen only at the epoch boundaries. 
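The epoch arithmetic described above can be sketched in a few lines. This is an illustrative model only (the function names are invented, and the epoch length is a registry-configurable parameter, not a fixed constant): because the version to run is a pure function of block height and the per-epoch configuration agreed via summary blocks, every node switches at exactly the same height.

```python
# Illustrative sketch; names and the concrete epoch length are assumptions.
EPOCH_LENGTH = 500  # consensus rounds per epoch (configurable in the registry)

def epoch_of(height: int) -> int:
    """Epoch that the block at `height` belongs to."""
    return height // EPOCH_LENGTH

def version_at(height: int, epoch_versions: dict[int, str]) -> str:
    """Replica version to run at `height`, given the version agreed
    (via summary blocks) for each epoch."""
    return epoch_versions[epoch_of(height)]

# Even though the registry already names "v2", epochs 0-3 keep running
# "v1"; every node switches exactly at height 2000, an epoch boundary.
agreed = {0: "v1", 1: "v1", 2: "v1", 3: "v1", 4: "v2"}
assert version_at(1999, agreed) == "v1"
assert version_at(2000, agreed) == "v2"
```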
- -![Protocol upgrade happens at epoch boundaries](https://csojb-wiaaa-aaaal-qjftq-cai.icp0.io/_astro/protocol-transition.CvfhxtsH_Z1q5r7c.webp) - -The first block in each epoch is a summary block, which consists of the configuration information (including registry version and cryptographic key material) that will be used during the epoch. The summary block of epoch x specifies both the registry version to be used throughout epoch x, and the registry version to be used throughout epoch x+1. Therefore, all the nodes agree on what registry version to use for an epoch long before the epoch starts. - -Suppose a protocol upgrade of the subnet is supposed to be done at the beginning of epoch x, indicated by a replica version change in the registry version the nodes agreed on. A block maker first proposes the summary block. The nodes then stop processing any new update call messages, but produce a series of empty blocks until the summary block is finalized, executed, and the complete replicated state is certified. Query calls are executed normally during this time. Next, all the nodes create a catch up package (CUP), which contains the relevant information that needs to be transferred from the old replica software to the new replica software (see Section 8 of the whitepaper for more details) and is signed by more than 2/3 of the subnet nodes for validity. The CUP gives enough context for the new replica software to resume consensus. The replicas send the CUP to the orchestrator. The orchestrator runs the new replica software with the CUP as input. - -![Catch Up Package is handed over to new replica version](https://csojb-wiaaa-aaaal-qjftq-cai.icp0.io/_astro/handing-cup.DC6sx848_Z5jOcg.webp) - -To prevent cross-version contamination, blocks and other consensus artifacts are tagged with protocol versions. With the exception of CUPs, the replica software only processes artifacts of its own version. 
As a consequence, CUPs must be decipherable by both pre-upgrade and post-upgrade replica software. - -Note that the registry records the desired configurations but does not track real-time subnet versions. Subnets may operate on older versions than indicated in the registry until they have completed the process outlined above. Therefore, nodes determine the currently used version by querying peers for the highest valid CUP. - -## Additional information - -[Blogpost on upgrading the Internet Computer Protocol](https://medium.com/dfinity/upgrading-the-internet-computer-protocol-45bf6424b268) - -[Whitepaper, see Section 8](https://internetcomputer.org/whitepaper.pdf) - -[10min video on core protocol upgrades](https://www.youtube.com/watch?v=mPjiO2bk2lI) - -[55min video on NNS-governed Canister Upgrades](https://www.youtube.com/watch?v=oEEPLJVX5DE) - diff --git a/.migration/learn-hub/how-does-icp-work/evolution-scaling/evolution-scaling.md b/.migration/learn-hub/how-does-icp-work/evolution-scaling/evolution-scaling.md deleted file mode 100644 index 612c229..0000000 --- a/.migration/learn-hub/how-does-icp-work/evolution-scaling/evolution-scaling.md +++ /dev/null @@ -1,18 +0,0 @@ ---- -learn_hub_id: 34576974172692 -learn_hub_url: "https://learn.internetcomputer.org/hc/en-us/articles/34576974172692-Evolution-Scaling" -learn_hub_title: "Evolution & Scaling" -learn_hub_section: "Evolution & Scaling" -learn_hub_category: "How does ICP work?" -migrated: false ---- - -# Evolution & Scaling - -The Internet Computer has the capability to adapt to changing application needs. In case of growing demand for resources, the creation of new subnets provides horizontal scalability. The protocol is also upgraded regularly, allowing for improvements in efficiency as well as extension of functionality. - - * [Subnet creation: ](https://learn.internetcomputer.org/hc/en-us/articles/34209955782420)The capacity of the network scales in response to user demand. 
To achieve this, the Internet Computer's architecture allows for the seamless addition of nodes and subnets, effectively expanding the network's resources and ensuring it can handle increasing usage. - * [Chain evolution: ](https://learn.internetcomputer.org/hc/en-us/articles/34210120121748) To meet the changing demands of its users, the Internet Computer has been designed to evolve over time. Upgradeability must not come at the expense of the system's fundamental principles: decentralization and security. The Internet Computer must maintain its robust guarantees in these areas even as it evolves. - - - diff --git a/.migration/learn-hub/how-does-icp-work/evolution-scaling/fault-tolerance.md b/.migration/learn-hub/how-does-icp-work/evolution-scaling/fault-tolerance.md deleted file mode 100644 index bb33f49..0000000 --- a/.migration/learn-hub/how-does-icp-work/evolution-scaling/fault-tolerance.md +++ /dev/null @@ -1,57 +0,0 @@ ---- -learn_hub_id: 34210647901460 -learn_hub_url: "https://learn.internetcomputer.org/hc/en-us/articles/34210647901460-Fault-Tolerance" -learn_hub_title: "Fault Tolerance" -learn_hub_section: "Evolution & Scaling" -learn_hub_category: "How does ICP work?" -migrated: false ---- - -# Fault Tolerance - -In any large-scale distributed system, it is inevitable that individual nodes fail at any time due to hardware outages, network connectivity issues, or even attacks. ICP is fault tolerant, which means that the protocol will make progress even if some nodes fail or misbehave. When failures are detected, the [Network Nervous System (NNS)](https://learn.internetcomputer.org/hc/en-us/articles/33692645961236) selects a spare node that replaces the failed node in its subnet. The new node then joins the subnet, performs state synchronization with the subnet’s existing nodes, and begins contributing to the subnet blockchain’s consensus protocol. 
- -## Node failures - -In each round, a block is produced by the [consensus layer](https://learn.internetcomputer.org/hc/en-us/articles/34207558615956), and the messages in the block are processed subsequently by the [execution layer](https://learn.internetcomputer.org/hc/en-us/articles/34208985618836). The proposed block and the resulting state need to be agreed upon by more than 2/3rd of the nodes in the subnet in order for the subnet to make progress. As long as less than 1/3rd of the nodes in a subnet fail or misbehave, even in an arbitrary, Byzantine manner, the subnet will continue making progress. - -If less than 1/3rd of the nodes in a subnet fail while the remaining nodes of the subnet continue to make progress, a failed node can recover automatically and catch up with the operational nodes. A newly joined node also uses the same process to catch up with the existing nodes in the subnet. - -Here’s one natural solution. A failed or newly joined node could download all the consensus blocks it missed from its peers, and process each block, one by one. Unfortunately, new nodes will take a long time to catch up if they have to process all the blocks from subnet genesis. Another solution is to let the failed or newly joined node directly copy the latest state from its peers. However, as the peers are continuously updating their state as they process new blocks, copying the latest state while the peers are updating it may lead to inconsistencies. - -ICP uses a mix of both the approaches. The consensus protocol is divided into epochs. Each epoch comprises a few hundred consensus rounds. At the beginning of each epoch, all the nodes create a checkpoint of their blockchain state and a catch-up package (CUP). The CUP at height h contains all relevant information required for consensus to resume from height h. This includes the hash of the blockchain state after processing the block at height h. The CUP is then signed by at least 2/3rd of the nodes in the subnet. 
Each normally-operating node then broadcasts the CUP. - -All the nodes in the subnet listen to the CUP messages broadcast by their peers. Suppose a node observes that a received CUP has a valid signature (signed by at least 2/3 of the nodes in the subnet) and has a different blockchain state hash than the locally available state hash for that height. Then the node initiates the [state sync protocol](https://learn.internetcomputer.org/hc/en-us/articles/34471579767572) to sync the blockchain state at that height (the height at which the CUP is published). - -Note that while the failed/newly joined nodes are syncing the blockchain state, the well-functioning nodes continue to process new blocks and make progress. The well-functioning nodes use their backup copy of the blockchain state (created at the same time as the CUP) to supply the state to syncing nodes. After the syncing node finishes syncing the blockchain state, it will request the consensus blocks generated since the CUP and process the blocks one by one. Once fully synced, the node can then process messages regularly like the other nodes. - -If a failed node does not recover, or if a node keeps lagging behind or fails often, then a proposal to replace this node with another one may be submitted to the NNS. - -## Recovery of regular subnets - -In rare cases, an entire subnet can get stuck and fail to make progress. A subnet can fail for many reasons, such as software bugs that lead to non-deterministic execution. This can also happen when more than 1/3rd of the nodes in the subnet fail at the same time. In this case, the well-functioning nodes fail to create and sign a catch-up package (CUP), so the failed nodes cannot recover automatically. - -When a subnet fails, manual intervention is needed for recovery. In a nutshell, as the subnet nodes fail to create and sign a CUP automatically, someone needs to manually create a CUP. 
The CUP needs to be created at the maximum blockchain height where the state is certified by at least 2/3rd of the nodes in the subnet. The subnet nodes naturally cannot trust a manually created CUP. Community consensus that the CUP is valid is required. Subnet recovery proceeds via a proposal to the NNS to use the created CUP for the subnet. Anyone who staked their ICP can vote on the proposal. If a majority of the voters accept the proposal, the CUP is stored in the NNS registry. - -Each node runs 2 processes — (1) Replica and (2) Orchestrator. The replica consists of the 4-layer software stack that maintains the blockchain. The orchestrator downloads and manages the replica software. The orchestrator regularly queries the NNS registry for any updates. If the orchestrator observes a new CUP in the registry, then the orchestrator restarts the replica process with the newly created CUP as input. As described earlier, the CUP at height h has information relevant to resume the consensus from height h. Once the replica starts, it will initiate a state sync protocol if it observes that the blockchain state hash in the CUP differs from the local state hash. Once the state is synced, it will resume processing consensus blocks. - -Note that this recovery process requires submitting a proposal to the NNS, and therefore works only for recovering regular subnets (not the NNS subnet). This process of recovering a subnet is often termed disaster recovery in many Internet Computer docs. - -## Handling NNS canister failures - -The Internet Computer's NNS comprises the canisters that govern the entire Internet Computer. This includes the root canister, governance canister, ledger canister, registry canister, etc. - -Suppose a canister in the NNS fails while the NNS subnet continues to make progress. This could be due to a software bug in the canister’s code. In this case, the canister needs to be “upgraded”, i.e., restarted with new WebAssembly code. 
Generally speaking, each canister in the Internet Computer has a (possibly empty) list of “controllers”. The controller has the right to upgrade the canister’s WASM code. The lifeline canister is assigned as a controller for the root canister. The root canister is assigned as a controller for all the other NNS canisters. The root canister has a method to upgrade other NNS canisters. Similarly, the lifeline canister has a method to upgrade the root canister. - -Suppose the governance canister is working. Then one can manually submit an NNS proposal to call the root/lifeline canister’s method to upgrade the failed canister. Anyone who staked ICP can vote on the proposal. If a majority of the voters accept, then the failed canister will be upgraded. - -## Handling NNS subnet failures - -In the worst case, the subnet which hosts the NNS canisters could get stuck and fail to make progress. In such a case, all the node providers who contributed a node to the NNS subnet need to manually intervene, create a CUP and restart their node with the new CUP. - -## Additional resources - -[12min video on resumption](https://www.youtube.com/watch?v=H7HCqonSMFU) - -[20min video on state synchronization](https://www.youtube.com/watch?v=WaNJINjGleg) - diff --git a/.migration/learn-hub/how-does-icp-work/evolution-scaling/subnet-creation.md b/.migration/learn-hub/how-does-icp-work/evolution-scaling/subnet-creation.md deleted file mode 100644 index af0d8f9..0000000 --- a/.migration/learn-hub/how-does-icp-work/evolution-scaling/subnet-creation.md +++ /dev/null @@ -1,31 +0,0 @@ ---- -learn_hub_id: 34209955782420 -learn_hub_url: "https://learn.internetcomputer.org/hc/en-us/articles/34209955782420-Subnet-Creation" -learn_hub_title: "Subnet Creation" -learn_hub_section: "Evolution & Scaling" -learn_hub_category: "How does ICP work?" -migrated: false ---- - -# Subnet Creation - -Ever wondered about the meaning behind DFINITY? It’s Decentralized + Infinity. 
It’s named that way because the Internet Computer is designed to scale infinitely. It means that the Internet Computer can host an unlimited number of canisters (smart contracts), store an unlimited amount of memory, and process an unlimited number of transactions per second. In simple words, the Internet Computer is designed to host even large scale applications like social media platforms in a fully decentralized way. - -There are two types of widely-used approaches to improve the scalability of a system: (1) Vertical Scaling, and (2) Horizontal Scaling. Vertical scaling means adding more CPU, RAM and disk to a single computer. Horizontal scaling means adding more computers to the system. There is a limit to vertical scaling. But with horizontal scaling, one can achieve unlimited scalability. The Internet Computer is one of the first blockchains to successfully use horizontal scaling. - -The Internet Computer scales its capacity horizontally by creating new subnets that host additional canisters — just like traditional cloud infrastructure scales by adding new machines. More precisely, the nodes in the Internet Computer are divided into subnets, each containing a few dozen nodes. The set of nodes in a subnet together maintain one blockchain. Each subnet can host thousands of canisters and process messages received by those canisters. Each subnet has a limited capacity in terms of the number of canisters (around hundred thousand), amount of storage (hundreds of GBs), and bandwidth (a few hundred transactions per second). But as more subnets are added to the Internet Computer, its overall capacity increases proportionately. Once the IC’s Network Nervous System (NNS) decides to create a new subnet, it selects a group of spare nodes that have joined the IC but have not yet been allocated to any subnet and creates the initial configuration of the new subnet. The selected group of nodes then begins to form a new subnet blockchain. 
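The proportional growth described above can be made concrete with a toy calculation. The per-subnet figures below are illustrative assumptions in the same rough ranges mentioned in the text, not official limits; the point is only that total capacity is linear in the number of subnets.

```python
# Assumed per-subnet limits, for illustration only (not official figures).
PER_SUBNET = {"canisters": 100_000, "storage_gb": 700, "tps": 300}

def network_capacity(num_subnets: int) -> dict[str, int]:
    """Aggregate capacity under horizontal scaling: each added subnet
    contributes its full per-subnet capacity to the network total."""
    return {resource: limit * num_subnets for resource, limit in PER_SUBNET.items()}

# Doubling the subnet count doubles every capacity dimension.
assert network_capacity(40)["canisters"] == 4_000_000
assert network_capacity(80)["tps"] == 2 * network_capacity(40)["tps"]
```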
- -![Internet Computer is divided into subnets](https://csojb-wiaaa-aaaal-qjftq-cai.icp0.io/_astro/add-new-subnet.34gYPhhU_ZY0r2F.webp) - -Another crucial design aspect is the inter-subnet (Xnet) communication of canisters: A canister of a subnet can send asynchronous messages to any canister on any other subnet. XNet messages are ingested by the receiving subnet’s consensus layer and their integrity is validated based on the sending subnet’s threshold signature — another application of [chain-key cryptography](https://learn.internetcomputer.org/hc/en-us/articles/34209486239252). This architecture of XNet messaging leads to a “loose coupling” of the subnets that does not require a central component such as a shard chain as used in other blockchains with multiple “shards” that would create a bottleneck when scaling out. Therefore newly added subnets can immediately send and receive XNet messages to any other subnet and an increasing number of subnets does not hit a natural bottleneck as in other, more simplistic, architectures. - -Creating a new subnet has two steps. (1) Adding new nodes to the Internet Computer, and (2) Creating a subnet with the available nodes. Anyone can purchase the node hardware and add it to the Internet Computer by following the [node provider onboarding process](https://wiki.internetcomputer.org/wiki/Node_Provider_Documentation). - -We now describe how to create a new subnet with the available nodes. The Internet Computer has a decentralized governance system called [Network Nervous System (NNS)](https://learn.internetcomputer.org/hc/en-us/articles/33692645961236). Essentially, the NNS consists of a group of canisters that manage the Internet Computer. In the NNS, there is a component called “registry”, which stores the full configuration of the Internet Computer. The registry has a record for each subnet which includes a protocol version, the list of nodes in the subnet, protocol configuration parameters, etc. 
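A registry subnet record like the one described above can be sketched as a small data structure. The field names here are hypothetical (the real registry uses its own protobuf schema); the sketch only captures the idea that the registry stores the desired configuration per subnet.

```python
from dataclasses import dataclass, field

# Hypothetical shape of a registry subnet record, per the description above.
# The registry stores the *desired* configuration; a subnet may still be
# running an older configuration while it catches up.
@dataclass
class SubnetRecord:
    subnet_id: str
    replica_version: str                       # protocol version the subnet should run
    node_ids: list = field(default_factory=list)
    config: dict = field(default_factory=dict)  # other protocol parameters

record = SubnetRecord("subnet-1", "v2", ["node-a", "node-b"], {"epoch_length": 500})
assert "node-a" in record.node_ids
assert record.config["epoch_length"] == 500
```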
- -![Proposal to create a new subnet.](https://csojb-wiaaa-aaaal-qjftq-cai.icp0.io/_astro/new-subnet-proposal.DhFWWB9r_1YulmL.webp) - -To add a new subnet, one has to submit a proposal to the NNS to add a record for a new subnet to the registry. The proposal consists of the list of nodes to be included in the new subnet. The status of all proposals can be viewed on the [IC Dashboard](https://www.dashboard.internetcomputer.org). The proposal can be voted on by anyone who staked their ICP tokens. If a majority of voters accept the proposal, then the registry canister instructs the NNS subnet to generate — in a fully decentralized way using [chain-key cryptography](https://learn.internetcomputer.org/hc/en-us/articles/34209486239252) — the cryptographic key material to be used by the new subnet and a catch up package containing the genesis block. The registry canister then adds a record containing the configuration of the subnet. - -We now describe how a new subnet is created after a record is added to the registry. Each node runs 2 main processes, the (1) Replica and the (2) Orchestrator. The replica consists of the 4-layer software stack that maintains the blockchain and executes the canister messages. The orchestrator downloads and manages the replica software. When a new node is onboarded, the node provider has to install IC OS on the node, which contains the orchestrator software. The orchestrator regularly queries the NNS registry for any updates. If the orchestrator sees in a registry record that the node is included in a newly created subnet, then the orchestrator downloads the corresponding replica software, and runs the replica with the Catch Up Package included in the registry as input. The replica then starts accepting messages and the consensus protocol extends the genesis block present in the catch up package. 
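The orchestrator behavior just described can be sketched as a single polling step. All names here are invented for illustration; this is not the real orchestrator API, only the decision it makes: if this node appears in a subnet record, fetch the named replica version and start it with the catch-up package from the registry.

```python
# Minimal sketch (all names are assumptions) of one orchestrator poll.
def orchestrator_tick(registry_records, my_node_id, start_replica):
    """registry_records: list of subnet records, each a dict with
    'subnet_id', 'node_ids', 'replica_version', and 'genesis_cup'.
    Returns the subnet joined, or None if this node is still a spare."""
    for record in registry_records:
        if my_node_id in record["node_ids"]:
            # Run the replica version named in the record; consensus then
            # extends the genesis block contained in the CUP.
            start_replica(record["replica_version"], record["genesis_cup"])
            return record["subnet_id"]
    return None  # not assigned to any subnet yet

started = []
records = [{"subnet_id": "s1", "node_ids": ["n1", "n2"],
            "replica_version": "v2", "genesis_cup": "cup-0"}]
assert orchestrator_tick(records, "n1", lambda v, c: started.append((v, c))) == "s1"
assert started == [("v2", "cup-0")]
assert orchestrator_tick(records, "n9", lambda v, c: None) is None
```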
- diff --git a/docs/concepts/chain-key-cryptography.md b/docs/concepts/chain-key-cryptography.md index bcc916d..63c1850 100644 --- a/docs/concepts/chain-key-cryptography.md +++ b/docs/concepts/chain-key-cryptography.md @@ -99,7 +99,7 @@ The same threshold cryptographic infrastructure that enables signing also enable Combined with the NNS governance system, this enables **autonomous protocol upgrades**: the NNS approves an upgrade, the orchestrator on each node downloads the new replica software, and the subnet transitions at the next epoch boundary: all while preserving canister state and maintaining the same public key. -For more on how upgrades work at the protocol level, see the [Chain Evolution](https://learn.internetcomputer.org/hc/en-us/articles/34210120121748) article on the Learn Hub. +For more on how upgrades work at the protocol level, see [Chain evolution](evolution-scaling.md#chain-evolution). ## Next steps diff --git a/docs/concepts/evolution-scaling.md b/docs/concepts/evolution-scaling.md new file mode 100644 index 0000000..3aa525c --- /dev/null +++ b/docs/concepts/evolution-scaling.md @@ -0,0 +1,78 @@ +--- +title: "Evolution & Scaling" +description: "How ICP scales horizontally through subnet creation, maintains liveness under node failures, and upgrades its protocol without forks." +--- + +The Internet Computer is designed to adapt to changing demands. When more resources are needed, new subnets can be added, expanding capacity horizontally. When nodes fail, the protocol continues making progress and recovers automatically. When the protocol itself needs to improve, upgrades roll out without forks and with minimal downtime. All of this happens under governance by the NNS. + +## Fault tolerance + +In any large-scale distributed system, individual nodes will fail due to hardware outages, network issues, or attacks. 
ICP is fault-tolerant: the protocol continues making progress as long as fewer than one third of the nodes in a subnet are faulty (including Byzantine failures, where nodes behave arbitrarily rather than simply going offline). + +When a node fails, the subnet continues producing blocks. The failed node can recover automatically using the [state synchronization protocol](protocol/state-synchronization.md). The consensus protocol is divided into epochs, each comprising several hundred consensus rounds. At the start of each epoch, all nodes create a checkpoint and a catch-up package (CUP). A CUP contains the replicated state hash and enough context for any node to resume consensus from that point. The CUP is signed by at least two thirds of the subnet's nodes. + +When a failed or newly joined node comes back online, it: + +1. Listens for CUP messages from peers. +2. Validates the CUP (verifying the threshold signature). +3. If the CUP's state hash differs from its local state, initiates state sync to download the checkpoint. +4. After syncing the checkpoint, replays the blocks produced since that CUP. +5. Rejoins consensus normally. + +If a node consistently lags behind or fails repeatedly, an NNS proposal can be submitted to replace it with a spare node. + +### Subnet recovery + +In rare cases an entire subnet can get stuck: for example, if more than one third of its nodes fail simultaneously, or if a software bug causes non-deterministic execution. In this case, the nodes cannot collectively produce a valid CUP, so automatic recovery is not possible. + +Recovery requires community action: a recovery coordinator manually creates a CUP at the highest certified block height, then submits an NNS proposal containing it. If the community approves, the NNS stores the CUP in its registry. Each node's orchestrator process detects the new CUP and restarts the replica using it, resuming from the certified state. 
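The two checks a rejoining node performs above (validate the CUP's threshold signature, then compare state hashes) can be sketched as follows. The function names are illustrative, and the signature check is reduced to counting signers; a real CUP carries a single threshold signature verified against the subnet's public key.

```python
# Illustrative sketch of the rejoin decision; names are assumptions.
def cup_is_valid(cup_signers: set, subnet_nodes: set) -> bool:
    """A CUP counts as valid only with at least 2/3 of the subnet behind it.
    (Real CUPs carry one threshold signature, not a list of signers.)"""
    return 3 * len(cup_signers & subnet_nodes) >= 2 * len(subnet_nodes)

def needs_state_sync(cup_state_hash: str, local_state_hash: str) -> bool:
    """State sync is triggered only when the certified hash differs."""
    return cup_state_hash != local_state_hash

nodes = {"a", "b", "c", "d"}
assert cup_is_valid({"a", "b", "c"}, nodes)        # 3 of 4 is at least 2/3
assert not cup_is_valid({"a", "b"}, nodes)         # 2 of 4 is below 2/3
assert needs_state_sync("hash-new", "hash-old")    # node fell behind: sync
```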
+ +This governance-gated recovery process applies to regular subnets. The NNS subnet itself requires coordinated action by all NNS node providers to restart manually. + +## Subnet creation + +ICP scales horizontally by creating new subnets. Each subnet hosts thousands of canisters and processes messages independently. Adding a subnet adds proportional capacity to the network: more canisters, more storage, more throughput. + +Subnets on the Internet Computer communicate using cross-subnet (XNet) messaging. A canister on any subnet can send asynchronous messages to any canister on any other subnet. XNet messages are included in the receiving subnet's consensus blocks and authenticated using [chain-key cryptography](chain-key-cryptography.md). This loosely coupled architecture means newly created subnets can immediately exchange messages with all existing subnets, without a central bottleneck. + +### How a new subnet is created + +1. **Onboard nodes.** New nodes must be onboarded to the network first. A node provider installs IC-OS, and the node's orchestrator registers with the NNS. The node is then available as a spare. + +2. **Submit a proposal.** Anyone can submit an NNS proposal specifying which spare nodes should form the new subnet. The proposal includes the subnet configuration: the node list, protocol version, and other parameters. + +3. **Community vote.** Anyone who has staked ICP can vote on the proposal. If a majority approve, the NNS registry canister records the new subnet configuration and instructs the NNS subnet to generate the initial cryptographic key material for the subnet using chain-key cryptography. + +4. **Subnet genesis.** Each selected node's orchestrator sees the new subnet record in the registry, downloads the correct replica software, and starts the replica with the genesis catch-up package. The nodes form the new subnet blockchain and begin accepting messages. 
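The governance steps above can be sketched as a toy tally plus a registry update. This is a deliberate simplification: the code below weighs votes by staked amount and requires a strict majority, whereas real NNS voting operates on neurons with voting power derived from stake, dissolve delay, and age.

```python
# Toy stake-weighted majority vote; the real NNS uses neuron voting power.
def tally(votes: dict) -> bool:
    """votes: voter -> (accept?, staked ICP). Strict majority of voting stake."""
    yes = sum(stake for accept, stake in votes.values() if accept)
    total = sum(stake for _, stake in votes.values())
    return 2 * yes > total

registry = []  # stand-in for the registry canister's subnet records
proposal = {"node_ids": ["n1", "n2"], "replica_version": "v2"}
if tally({"alice": (True, 600), "bob": (False, 400)}):
    registry.append(proposal)  # adopted: record the new subnet configuration

assert len(registry) == 1
assert not tally({"alice": (True, 500), "bob": (False, 500)})  # tie is rejected
```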
+
+## Chain evolution
+
+ICP upgrades its protocol approximately once per week, driven by NNS governance. These upgrades can change any part of the protocol: they can fix bugs, add features, update algorithms, or alter the underlying technology. They are applied without forks and with minimal downtime, and the full state of all canisters is preserved across upgrades.
+
+### How protocol upgrades work
+
+The NNS registry stores the complete configuration of the Internet Computer, including the replica version each subnet should run. A version change in the registry triggers the upgrade process.
+
+Upgrades roll out on a per-subnet basis. Within a subnet, all nodes must switch to the new protocol version simultaneously to avoid a fork. This coordination is achieved using epochs:
+
+- The consensus protocol divides time into epochs, each several hundred rounds long.
+- At each epoch boundary, nodes produce a summary block containing the configuration (including replica version and cryptographic key material) to use for the next epoch.
+- If the registry indicates a new replica version for the upcoming epoch, all nodes download it in advance.
+- At the epoch boundary, the nodes stop processing update calls and produce empty blocks until the summary block is finalized, executed, and the state is certified. Query calls continue normally during this pause.
+- All nodes produce a CUP containing the state needed to resume at the new version, signed by more than two thirds of the subnet's nodes.
+- Each node's orchestrator receives the CUP and starts the new replica software with it as input.
+- The new replica resumes consensus immediately from the handed-off state.
+
+Blocks and consensus artifacts are tagged with the protocol version that produced them. A replica only processes artifacts from its own version, except CUPs, which must be readable by both the pre-upgrade and post-upgrade replica.
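The epoch-boundary handoff can be modeled as a simple round loop. This is a toy sketch under stated assumptions: real epochs span hundreds of rounds (here `EPOCH_LEN` is 4 for brevity), and the version strings are invented.

```python
# Toy model of the epoch-boundary upgrade handoff described above.

EPOCH_LEN = 4  # rounds per epoch, shortened for the example

def run_rounds(rounds, current_version, registry_version):
    """Yield (round, version, action) for each consensus round."""
    version = current_version
    for r in range(rounds):
        if r > 0 and r % EPOCH_LEN == 0 and version != registry_version:
            # Epoch boundary with a pending upgrade: stop update calls,
            # certify state, produce a CUP, and restart on the new version.
            yield (r, version, "produce CUP, hand off state")
            version = registry_version
        else:
            yield (r, version, "process update calls")

log = list(run_rounds(8, "v1", "v2"))
print(log[3])  # (3, 'v1', 'process update calls')
print(log[4])  # (4, 'v1', 'produce CUP, hand off state')
print(log[5])  # (5, 'v2', 'process update calls')
```

Note that the handoff round itself still runs on the old version: the CUP is the last artifact the pre-upgrade replica produces and the first one the post-upgrade replica consumes, which is why CUPs must be readable by both versions.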
+ +### Upgrade governance + +To trigger a protocol upgrade, anyone submits an NNS proposal to update the registry with a new replica version. ICP token holders who have staked their tokens can vote. If a majority approves, the registry is updated and the upgrade rolls out automatically. No hard fork or manual intervention is needed. + +## Further reading + +- [Protocol Stack](protocol/index.md) — the four-layer architecture that runs inside each subnet +- [State synchronization](protocol/state-synchronization.md) — catch-up packages and how nodes rejoin +- [Chain-key cryptography](chain-key-cryptography.md) — the key management underlying subnet creation and XNet messaging + + diff --git a/docs/references/glossary.md b/docs/references/glossary.md index e456386..146f754 100644 --- a/docs/references/glossary.md +++ b/docs/references/glossary.md @@ -166,7 +166,7 @@ artifacts from the Internet Computer. #### consensus -In distributed computing, **consensus** is a [fault-tolerant](https://learn.internetcomputer.org/hc/en-us/articles/34210647901460-Fault-Tolerance) mechanism by +In distributed computing, **consensus** is a [fault-tolerant](../concepts/evolution-scaling.md#fault-tolerance) mechanism by means of which a number of [nodes](#node) can reach agreement about a value or state. From 1cdbdc4c508abf3cda4a0c9091d0f430932c8b5a Mon Sep 17 00:00:00 2001 From: Marco Walz Date: Wed, 6 May 2026 18:40:05 +0200 Subject: [PATCH 2/2] fix: remove broken protocol/* links and replace em-dashes in evolution-scaling --- docs/concepts/evolution-scaling.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/concepts/evolution-scaling.md b/docs/concepts/evolution-scaling.md index 3aa525c..dfbd6a5 100644 --- a/docs/concepts/evolution-scaling.md +++ b/docs/concepts/evolution-scaling.md @@ -9,7 +9,7 @@ The Internet Computer is designed to adapt to changing demands. 
When more resour In any large-scale distributed system, individual nodes will fail due to hardware outages, network issues, or attacks. ICP is fault-tolerant: the protocol continues making progress as long as fewer than one third of the nodes in a subnet are faulty (including Byzantine failures, where nodes behave arbitrarily rather than simply going offline). -When a node fails, the subnet continues producing blocks. The failed node can recover automatically using the [state synchronization protocol](protocol/state-synchronization.md). The consensus protocol is divided into epochs, each comprising several hundred consensus rounds. At the start of each epoch, all nodes create a checkpoint and a catch-up package (CUP). A CUP contains the replicated state hash and enough context for any node to resume consensus from that point. The CUP is signed by at least two thirds of the subnet's nodes. +When a node fails, the subnet continues producing blocks. The failed node can recover automatically using the state synchronization protocol. The consensus protocol is divided into epochs, each comprising several hundred consensus rounds. At the start of each epoch, all nodes create a checkpoint and a catch-up package (CUP). A CUP contains the replicated state hash and enough context for any node to resume consensus from that point. The CUP is signed by at least two thirds of the subnet's nodes. 
When a failed or newly joined node comes back online, it: @@ -71,8 +71,6 @@ To trigger a protocol upgrade, anyone submits an NNS proposal to update the regi ## Further reading -- [Protocol Stack](protocol/index.md) — the four-layer architecture that runs inside each subnet -- [State synchronization](protocol/state-synchronization.md) — catch-up packages and how nodes rejoin -- [Chain-key cryptography](chain-key-cryptography.md) — the key management underlying subnet creation and XNet messaging +- [Chain-key cryptography](chain-key-cryptography.md): the key management underlying subnet creation and XNet messaging