Skip to content

OmniLedger: a secure, scale-out decentralized ledger via sharding

February 9, 2018

OmniLedger: A secure, scale-out, decentralized ledger via sharding Kokoris-Kogias et al., IEEE S&P 2018

OmniLedger makes a nice complement to Chainspace that we looked at yesterday. The two systems were developed independently at the same time. OmniLedger combines Visa levels of scalability (caution: the authors compare against the average Visa tps, the peak tps in the Visa network is considerably higher) with a secure decentralised ledger. It’s also a demonstration of how quickly the field is progressing, and something of a wake-up call if you’ve been working in the field of distributed systems and transaction processing but so far ignoring developments in decentralised ledgers. Standard building blocks are emerging and being combined in novel ways, and there’s a lot to learn!

This paper introduces OmniLedger, the first distributed ledger architecture that provides “scale-out” transaction processing capacity competitive with centralized payment processing systems such as Visa, without compromising security or support for permissionless decentralisation.

The goals of OmniLedger are as follows:

  1. Full decentralisation, with no trusted third parties and no single points of failure
  2. Shard robustness (OmniLedger partitions state into multiple shards processing in parallel) — each shard must correctly and continuously process transactions assigned to it.
  3. Secure atomic transactions, but within and across shards
  4. Scale-out performance with throughput that increases linearly in the number of participating validators
  5. Low storage overhead so that validators do not need to store the full transaction history
  6. Low latency for transaction confirmations.

    A lot of moving parts need to come together to support all of this.

First we need a way of enabling new validators to join the system, and a secure way of sharding validators so that adversaries cannot easily overpower a shard (per the discussion we had yesterday around Chainspace). Underpinning this part of the system is an identity blockchain, and secure distributed randomness generation with RandHound (a scalable secure multi-party computation protocol providing unbiased decentralised randomness in a Byzantine setting) and cryptographic sortition (that we first saw with Algorand).

Next we need a secure and reliable way of processing transactions within a shard, for which OmniLedger introduces the Omnicon protocol which combines a tree communication pattern based on groups with a PBFT-like view-change procedure. Transactions are recorded on the transaction ledger. UTXOs (the unspent transaction output model) give us a (explicit) causal+ consistency model which means we can identify non-conflicting transactions and process blocks of these in parallel. The causality graph is maintained across blocks, not individual transactions, reducing some of the metadata overheads.

It’s not enough to reach consensus within a shard though, we also need atomic transactions across shards. For this OmniLedger introduces the Atomix protocol, based an a lock-then-unlock protocol.

The pieces all come together like this:


(Enlarge)

Sharding securely

We can’t allow the validators themselves to choose a shard to join, as this would permit an adversary to concentrate all his validators in one shard. If the assignment of validators to shards is random though, then with high probability all shards will have about the same fraction of malicious nodes. Suppose we have a suitable source of randomness, then we can proceed as follows:

  • Validators who wish to participate in the ledger starting from epoch e have to first register to a global identity blockchain.
  • Identities are created through a Sybil-attack resistant mechanism in epoch e-1 and broadcast, together with the respective proofs, on the gossip network at most \Delta before epoch e-1 ends.
  • Randomness is used to elect a leader for epoch e and assign nodes to shards (more on this shortly)
  • The leader requests a (BFT) collective signature on a block with all identities provably established thus far. If at least 2/3 of these validators endorse the block it becomes valid and the leader appends it to the identity blockchain.

The security of OmniLedger’s validator assignment mechanism is modeled as a random sampling problem with two possible outcomes (honest or malicious). Assuming an infinite pool of potential validators, we can use the binomial distribution…

A failed assignment is one which allows a shard to be controlled by an adversary. For given adversarial power (percentage of controlled nodes), the following chart shows the required shard sizes to peg the failure probability at ~$10^{-6}$.

But where does the source of randomness come from?

We require that the distributed randomness generation protocol provides unbiasability, unpredictability, third-party verifiability, and scalability. Multiple proposals exist… we focus on RandHound due to better documentation and open-source implementation.

RandHound itself relies on a leader to orchestrate the protocol run. So now we need away to select one of the validators for the role. Cryptographic sortition is used for this. Cryptographic sortition is based on verifiable random functions, VRFs.

At the beginning of an epoch e, each validator i computes a ticket \mathrm{ticket}_{i,e,v} = \mathrm{VRF}_{sk_i}(\mathrm{leader}\ || \mathrm{config}_e\ || v) where \mathrm{config}_e is the configuration containing all properly registered validators of epoch e (as stored in the identity blockchain) and v is a view counter.

Validators gossip the tickets for a time \Delta, after which they lock in the lowest value valid ticket they have seen thus far and accept the corresponding node as the leader of the RandHound protocol run.

To maintain operability during transition phases, OmniLedger swaps in the new validators gradually in each shard per epoch. See section IV.B for details.

Cross-shard transactions

In the UTXO model (which OmniLedger adopts), the outputs of a transaction create new UTXOs, and inputs completely “spend” existing UTXOs. With UTXOs randomly assigned to shards for processing, we can expect cross-shard transactions to be common.

OmniLedger uses a Byzantine Shard Atomic Commit protocol called Atomix to atomically process transactions across shards. It builds on the fact that shards are collectively honest, do not crash infinitely, and run ByzCoin internally (providing BFT consensus). The protocol is client-driven and proceeds in three phases:

  1. In the initialize phase a client creates a cross-shard transaction spending UTXOs of some input shards, and creating new UTXOs in some output shards. The transaction is gossiped on the network and eventually reaches all input shards.
  2. In the lock phase all input shards associated with a transaction first validate the transaction to be sure the inputs can be spent. Then if the transaction is valid the transaction is logged in the shard’s ledger and a proof-of-acceptance is gossiped. (Think “PREPARE” in classic 2PC). If the transaction is not accepted then a proof-of-rejection is gossiped instead. The transaction inputs are now locked, but the transaction is not yet committed. The client can inspect the input shard ledgers to verify the proofs and that the transaction was indeed locked. The client holds enough proofs to either commit the transaction or abort it and reclaim any locked funds, but not both.
  3. In the unlock phase the client either unlocks-to-commit or unlocks-to-abort by creating an gossiping the appropriate unlock transaction. Each involved output shard validates the transaction and includes it in the next block of its ledger in order to update the state and enable the expenditure of the new funds.

If you’ve had any operational experience with distributed transaction systems before, you may be wondering what happens if the client crashes or otherwise fails to proceed to phase 3 leaving the transactions in-doubt in the ledger. In this case funds are not automatically reclaimed. The funds themselves provide the incentive for clients to complete transactions.

We argue that a client who crashes indefinitely is equivalent to a client who lost his private key, which prevents him from spending the corresponding UTXOs. Furthermore, any entity in the system, for example a validator in exchange for a fee, can fill in for the client to create an unlock transaction, as all necessary information is gossiped.

Scalable BFT-consensus with Omnicon

OmniLedger builds on the ByzCoin Byzantine consensus scheme, which uses collective signing (CoSi) to make PBFT more scalable. ByzCoin distributes blocks using multicast trees for performance, and falls back to a less-scalable star topology for fault tolerance.

Omnicon trades-off some of ByzCoin’s high scalability to achieve better fault tolerance without resorting to a PBFT like all-to-all communication pattern:

During the setup of OmniLedger in an epoch, the generated randomness is not only used to assign validators to shards, but also to assign them evenly to groups within a shard…. At the beginning of an Omnicon roundtrip, the protocol leader randomly selects one of the validators in each group to the group leader, who is responsible for managing communication between the protocol leader and the respective group members.

Transactions that don’t conflict can be committed in different blocks and safely processed in parallel. A block-based DAG in which each block can have multiple parents captures the concurrent processing of blocks. The metadata overhead of tracking causality is reduced by noting that UTXO dependencies are transitive, so we only need to track causality between blocks, not individual transactions within blocks.

Low latency transactions

For clients with frequent latency-sensitive low-value transactions, OmniLedger supports an optional “trust but verify” model. Optimistic validators process transaction quickly, and core validators subsequently verify the transactions again to provide finality and ensure verifiability.

As a result, some bad transactions might be committed but ultimately core validators verify all provisional commitments, detecting any inconsistencies and their culprits, which makes it possible to punish rogue validators and to compensate the defrauded customer for the damages.

Ledger pruning

Bitcoin’s blockchain grows by about 144MB per day, but next-generation systems with Visa-level throughput (e.g., 4000 tps, and 500 B/tx) can produce over 150GB per day. This is a problem if a new validator needs to download and process the entire ledger in order to bootstrap. So OmniLedger introduces stable checkpoint state blocks. At the end of an epoch, the shard’s leader stores the UTXOs in an ordered Merkle tree, and puts the Merkle tree’s root hash in the header of a state block. Validators run consensus on this block, and if approved it becomes the genesis block for the next epoch.

OmniLedger and Chainspace

Our approach is synergistic to Chainspace as we focus on an open scalable UTXO style DL (distributed ledger), whereas Chainspace focuses on sharded smart-contracts and small-scale shards that can be deployed only under weak adversaries (e.g., in a permissioned setting). As a result, combining OmniLedger and Chainspace has great potential to create an open, scalable, smart-contract platform that provides scalability and security under strong adversaries.

Chainspace: a sharded smart contracts platform

February 8, 2018

Chainspace: a sharded smart contracts platform Al-Bassam et al., NDSS’18

Chainspace is a DApp (decentralised application) platform based on smart contracts, designed for higher scalability than is currently achievable with Bitcoin or Ethereum.

Our modest testbed of 60 cores achieves 350 transactions per second, as compared with a peak rate of less than 7 transactions per second for Bitcoin over 6K full nodes. Ethereum currently processes 4 transactions per second, out of a theoretical maximum of 25.

Chainspace manages state in objects, which are sharded across nodes. One very interesting twist is the distinction between procedures that execute a computation, and independent checkers that can verify a computation was carried out correctly. Only one party executes a transaction, the others just need to verify it. That makes for lots to think about as you work through the paper.

Objects, contracts, and transactions

State in Chainspace is stored in objects. Every object has a unique identifier (which can be used to verify its provenance, as we’ll see later). Objects are also immutable, so the only way to ‘update’ state is to create a new object (with a new id).

Contracts define a set of initial objects that are created when the contract is first created within Chainspace. A contract defines a set of types for all the inputs and outputs, a set of procedures, and a checker.

Procedures take as input both object ‘inputs’ and references. An object that is passed to a procedure as an input is consumed by that procedure and marked inactive — it can never be passed to another procedure again. References give read-only access to object state on the chain. The output of a procedure is a set of (new) objects. Within the procedure there may be local parameters and local return values (from calling other contracts). Some of these may contain secret information and require confidentiality. Procedures do not have to be pure functions, and may be randomised, keep state, or have side effects.

Associated with each smart contract c, we define a checker denoted as v. Those checkers are pure functions (i.e., deterministic and have no side-effects), and return a Boolean value. A checker v is defined by a contract, and takes as parameters a procedure p, as well as inputs, outputs, references and locals. Note that checkers do not take any secret local parameters.

A transaction is an atomic application of one or more procedures to active input objects, and possibly some referenced objects, to create some new output objects. “A user client executes all the computations necessary to determine the outputs of one or more procedures forming a transaction, and provides enough evidence to the system to check the validity of the execution and the new objects.

Once a transaction is accepted in the system it ‘consumes’ the input objects, that become inactive, and brings to life all new output objects that start their life by being active. References on the other hand must be active for the transaction to succeed, and remain active once a transaction has been successfully committed.

It’s all very reminiscent of ownership typing in Rust.

Within Chainspace, a transaction is represented by a sequence of traces of the executions of the procedures that compose it. Traces contain information about the contract, procedures, input objects, references and local parameters, as well as output objects and local returns (but not secret objects or returns). A trace can depend on other traces that came before it (causality), and is valid if its sequence of dependencies is valid and, (i) all inputs and references are valid, (ii) a trace that produces output objects must also consume some input objects, and (iii) all objects passed to the checker must be types defined by the smart contract.

The identifier of a trace is a cryptographic hash of all the information contained within it, and the identifier of an object is derived through the application of a cryptographic hash function to the identifier of the trace that created the object, as well as a unique name assigned to the output object by the procedures creating the trace.

An object identifier is (thus) a high-integrity handle that may be used to authenticate the full history that led to the existence of the object.

It all makes sense in principle, but I found it difficult to get my head around how I would practically write a checker for a procedure. Things clicked for me when I thought of it in terms of Hoare triples: we are given the procedure inputs, and can verify the pre-conditions. Then the procedure itself executes, which can be a black box to us. We’re left with the outputs, which we can verify meet the post-conditions (conditioned on the inputs).

Perhaps an example will also help. Consider a smart voting system with a contract SVote and three types SVote.Token, SVote.Vote, and SVote.Tally. There are three procedures: SVote.createElection, SVote.addVote, and SVote.tally. When add vote is called, it is passed a new vote to add, homomorphically encrypted and signed by the voter. The voter also provides a zero-knowledge proof certifying that her vote is a binary value and that she voted for exactly one option. The checker can assert the correctness of the votes by verifying the associated signatures and zero-knowledge proofs, without over having to learn the clear value of the votes.

Defining smart contract logic as checkers allows Chainspace to support privacy friendly-contracts by design. In such contracts some information in objects is not in the clear, but instead either encrypted using a public key, or committed using a secure commitment scheme. The transaction only contains a valid proof that the logic or invariants of the smart contract procedure were applied correctly or hold respectively, and can take the from of a zero-knowledge proof, or a Succinct Argument of Knowledge (SNARK)… the checker runs the verifier part of the proof or SNARK that validates the invariants of the transactions, without revealing the secrets within the objects to the verifiers.

System design

It’s time to take a brief look under the hood!

A Chainspace network comprises a set of nodes that manage valid objects and ensure only valid transactions get committed. Nodes are organised into shards. Shards partition the set of objects.

Nodes within a shard reach consensus on whether to accept or reject a transaction, whether an object is active or inactive, and whether traces from contracts they know check. Transactions can span shards, so there is voting at the shard level to reach consensus on commit. This requires unanimous agreement across all shards. Nodes in each shard periodically publish a signed hash chain of checkpoints, each block recording evidence of transactions processed in the current epoch.

An honest shard is one with at least 3f+1 nodes, where adversaries control no more than f faulty nodes. Safety and liveness is guaranteed in honest shards. With dishonest shards (adversary controls at least \frac{f}{3f + 1} of the nodes) correctness or liveness cannot be guaranteed, but foul play is guaranteed to be detectable.

There are many options for ensuring that concerned nodes in each shard do not reach an inconsistent state for the accepted transactions, such as Nakamoto consensus through proof-of-work, two-phase commit protocols and classical consensus protocols like Paxos. However, these approaches lack in performance, scalability, and/or security. We design an open, scalable and decentralized mechanism to perform Sharded Byzantine Atomic Commit or S-BAC.

S-BAC is a combination of Byzantine agreement (using the ModSmart implementation of PBFT) and atomic commit using a two-phase commit protocol inspired by Gray and Lamport. You can find more details in §IV.C. Here’s a quick visual summary:


(Enlarge)

The higher levels of Chainspace functionality are implemented using system smart contracts:

Effectively, instantiation of Chainspace is the combination of nodes running the basic S-BAC protocol, as well as a set of system smart contracts providing flexible policies about managing shards, smart contract creation, auditing and accounting.

Since Chainspace is intended to be an open system, there is also a CSCoin smart contract for tracking value. It can be composed with other procedures to enable payments for processing transactions. Shards can advertise that they will only consider actions valid if some value of CSCoin is transferred to their constituent nodes.

Integrity

It seems much easier for an adversary to overpower one or more shards in a Chainspace system than it does in systems where all nodes are contributing globally. The set of chainspace nodes are divided into shards, and within a shard an adversary just needs to supply/control \frac{f}{3f+1} nodes. I couldn’t see anything in the paper about the rules for new nodes joining shards. If that is also open, then I could flood a shard with my own nodes? If it is not open, then is there a central point of trust for node membership?

In the limitations and future work section, the authors have this to say:

In case one or more shards are malicious, we provide an auditing mechanism for honest nodes in honest shards to detect the inconsistency and to trace the malicious shard. Through the Hash-DAG structure, it is also possible to fully audit the histories of two objects, and to ensure that the validity rules hold jointly — in particular the double-use rules. However, it is not clear how to automatically recover from detecting such an inconsistency…

Blockstack: a global naming and storage system secured by blockchain

February 7, 2018

Blockstack: a global naming and storage system secured by blockchains Ali et al., USENIX ATC’16

We’ll be back in the world of cryptocurrencies and blockchains for the rest of this week, kicking off with this 2016 paper on Blockstack. It’s interesting both for the lessons learned trying to build a system on top of the Namecoin blockchain, as well the subsequent design of Blockstack that was informed by those experiences. Blockstack itself has evolved since this 2016 paper (see the latest white paper) but the core ideas are all here.

A blockchain-based identity service

One good match for blockchain technology is as an immutable ledger securely binding names to values. Things you might ideally want in a naming system include human-readable names, a strong sense of ownership (name-value pairs can be owned by cryptographic keypairs), and no central trusted party. Zooko’s triangle suggested that you can’t have all three properties at once, but Namecoin showed that the combination is possible using a blockchain-based approach.

Namecoin itself was started to create an alternate DNS-like system that replaces DNS root servers with a blockchain for mapping domain names to DNS records.

Given that blockchains don’t have central points of trust, a blockchain-based DNS is much harder to censor and registered names cannot be seized from owners without getting access to their respective private keys.

Registration fees are required to prevent people from domain-squatting, but in Namecoin the recipient of the registration fees is a “black hole” cryptographic address from which money cannot be retrieved. Registration itself is a two-phase process in which a user first pre-orders a name hash, and then later registers the name-value pair by revealing the actual name and the associated value.

The authors of Blockstack wanted to create a Blockstack ID service along similar lines, which maps a name to public keys along with other profile data.

How (not to) build a decentralised PKI system

Namecoin has a d/ namespace used for domain names, and the first version of Blockstack ID was built on top of Namecoin using a new u/ namespace. After a year in production, a number of worries had started to surface around security, network reliability and throughput, selfish mining, the impact of hard forks, and the failure of merged mining. The authors distill this into five key lessons, with the following bottom line:

We strongly believe that decentralized applications and services need to be on the largest, most secure, and most actively maintained blockchain. Currently [2016] no other blockchain even comes close to Bitcoin in terms of these security requirements.

Lesson one: there is a fundamental tradeoff between blockchain security and introducing new functionality to blockchains.

Starting a new blockchain is always tempting, but new chains are significantly less secure than Bitcoin because of the much reduced mining activity.

The most important factor in the security of a blockchain is the total cost of attacking the blockchain and tampering with recently written data.

If a single miner (or pool) gets to 51% of the computational power mining on the chain, then all decentralised bets are off. In 2014, the authors noticed that a single mining pool consistently had more than 51% of the compute power on Namecoin, and it subsequently got worse with concentrations of power of up to 75% in a given week. “At such concentration, Namecoin is effectively controlled by a single party.

Lesson two: there is currently a significant difference between the network reliability of the largest public blockchain network (Bitcoin) and network reliability of the long tail of alternate blockchains.

The authors noticed that some miners did indeed seem to be exploiting their concentrated power, either intentionally refusing or unable to package Blockstack transactions in their blocks. There were also cases of latency spikes caused by software issues in Namecoin itself.

Lesson three: selfish-mining is not just a theoretical attack, but selfish-mining like behaviour can already be observed in production blockchains.

Whenever software updates were issued on Namecoin, there was a considerable fluctuation of computing power: miners aren’t incentivised to upgrade software quickly for small chains which aren’t their main reason for operating a pool. This leads to…

Lesson four: Other than the engineering problems, consensus-breaking changes are complicated because of the fundamental incentive structures of the parties involved. System designers have never dealt with consensus-breaking changes before cryptocurrencies; it’s a novel challenge.

Finally, merged mining was supposed to address the problem of lesser mining power on smaller chains, by allowing Bitcoin miners to participate in the new network without spending extra compute cycles. Namecoin used merged mining with Bitcoin. “One of our key findings is that merged mining is currently failing in practice: the leading merged-mined blockchain, Namecoin, is vulnerable to the 51% attack.” This leads to lesson five:

Lesson five: At the current stage in the evolution of blockchains, there are not enough compute cycles dedicated to mining to support multiple secure blockchains.

After considering all of the above factors, it was an easy decision to move our PKI system from Namecoin to Bitcoin.

Blockstack v2: built on top of Bitcoin

The most fundamental idea in Blockstack will be familiar to anyone with a background in networking: separate the control plane from the data plane. In particular, Blockstack keeps the control plane on the blockchain, and the data plane off the chain but anchored in an on-chain root of trust. This alleviates problems with data storage limits in blocks (the data isn’t on the chain, go ahead and store as much as you like!). You could actually use any underlying blockchain of your choosing, the deployment chooses Bitcoin.

Blockstack maintains a naming system as a separate logical layer on top of the underlying blockchain on which it operates. Blockstack uses the underlying blockchain to achieve consensus on the state of this naming system and bind names to data records. Specifically, it uses the underlying blockchain as a communication channel for announcing state changes, as any changes to the state of name-value pairs can only be announced in new blockchain blocks.


(Enlarge)

The control plane comprises the underlying blockchain itself, and a logically separate “virtualchain” built on top of it. The data plane handles data storage and availability with a routing layer that tells you where to find things, and a storage layer that looks after the data itself.

The Virtualchain layer

A key contribution of Blockstack is the introduction of a logically separate layer on top of a blockchain that can construct an arbitrary state machine after processing information from the underlying blockchain. We call this layer a virtualchain. A virtualchain treats transactions from the underlying blockchain as inputs to the state machine and valid inputs trigger state changes. At any given time, where time is defined by the block number, the state machine can be in exactly one global state.

You could also a state machine like this using a smart contract. The paper doesn’t explicitly discuss why that option was rejected, but I can see two arguments: firstly, that would restrict the choice of underlying chain, and rule out the Bitcoin blockchain; and secondly, depending on design parameters, the state within the state machine may get too large to easily store on chain. The 2017 Blockstack white paper does touch on this issue, additionally saying, “Nodes on the network should not be required to compute complex untrusted programs just to stay synced with the network.”

Only Blockstack nodes are aware of the virtualchain layer, and blockstack operations are encoded in valid blockchain transactions as additional metadata. The virtualchain idea is independent of the actual state machine you use it with, Blockstack defines a global naming and storage system state machine. Here are the states and transitions for the naming system state machine:

What I can’t easily tell from the paper, is how I know I can trust the state machine logic. It’s not on the underlying blockchain (as it would be with a smart contract), so presumably it’s just embodied in code running on Blockstack nodes. If it essentially becomes impossible to upgrade those nodes without everyone noticing, and the code is open source, I guess that arguably gives similar properties to a smart contract. The second part of the answer here is you don’t necessarily need to inspect and validate the logic of the state machine, so long as you can validate the resulting state (i.e., the state is all that matters). Drawing again from the 2017 white paper on this topic:

Application nodes replay their logs to achieve application level consensus at each block b, such that two nodes will agree on a block if and only if the application transactions in that block leave the nodes in an identical state. If their resulting state after executing the operations in block b are identical, then their generated consensus hash for that block will be the same. Consensus hashes enable nodes to independently audit and efficiently query their histories, as well as detect forks and then migrate state between blockchains.

The routing layer

The task of routing requests is separated from the task of actually storing data. Blockstack uses zone files (c.f. DNS zone files) for storing routing information. The virtualchain binds names to their respective hash(zonefile) and stores these bindings in the control plane. The zone files themselves are stored in the routing layer.

Users do not need to trust the routing layer because the integrity of zone files can be verified by checking the hash(zonefile) in the control plane.

As of the time of writing the paper, nodes formed a DHT-based peer network for storing zone files. Only files whose hash has previously been announced on the chain will be stored. Blockstack today has migrated away from the DHT approach, and uses a new unstructured peer network called the Atlas network. All Atlas nodes maintain a 100% state replica and use gossip to keep up to date.

The storage layer

The storage layer hosts the actual data values for the name-value pairs. Stored values are signed by the key of the owner of the corresponding name.

By storing data values outside of the blockchain, Blockstack allows values of arbitrary size and allows for a variety of storage backends. Users do not need to trust the storage layer because they can verify the integrity of the data values in the control plane.

In mutable storage node, signed updates to the value can proceed as fast as the underlying storage system allows. There is a danger here though of readers consuming stale data. This problem is punted to the user: “readers and writers must employ a data versioning scheme to avoid reading stale data.” In immutable storage mode, a TXT record is also written in the zone file containing hash(data). Readers can verify data integrity by checking this hash. Throughput and latency for immutable writes is of course constrained by the underlying blockchain.

Blockstack today has implemented an overlay storage system called Gaia that sits on top of existing cloud storage providers.

Name verification

We briefly discussed the consensus hashes of state machine state in the virtualchain section. Consensus hashes also help new nodes come up to speed faster without needing to process the entire chain from block zero.

The process of verifying the authenticity of a prior name operation with a later trusted consensus hash is called Simplified Name Verification (SNV). SNV enables support for ‘thin clients,’ which can query the past state of the system without running Blockstack nodes or having access to the full blockchain history.

The construction of the consensus hash allows a user to verify the authenticity of any virtualchain operation from a block with height h_{prior}, less that the current height h, using only a logarithmic number of queries.

For the latest on blockstack, see https://blockstack.org.

Quantum algorithms: an overview

February 6, 2018

Quantum algorithms: an overview Montanaro,  npj Quantum Information 2016

This is a paper that Preskill cited in his keynote address (see yesterday’s post). It covers some of the same ground that we looked at yesterday, but also has some additional material and perspective of interest — and I’ll focus on those parts today. We’ll touch on Grover’s algorithm, amplitude amplification, and quantum walks among others. The first fun thing to know is that the Quantum Algorithm Zoo maintains a catalog of all known quantum algorithms – and there are a lot of them! For each algorithm you get the name, speedup, and a short description. A great resource. In ‘Quantum algorithms: an overview’, Montanaro offers a broad sweep of potential applications rather than details on how particular applications work. He also introduces us to a couple of complexity classes we haven’t seen before in The Morning Paper, BPP and QMA:

Shor’s algorithm, which we looked at last week, turns out to solve a special case of a mathematical problem known as the hidden subgroup problem (HSP).

For Shor’s algorithm, G = \mathbb{Z}. But if we use other groups G, the same quantum HSP algorithm can also attack other cryptosystems:

Grover’s algorithm

Beyond Shor’s algorithm, the best known quantum algorithm must be Grover’s algorithm. It’s a pretty amazing result for the unstructured search problem:

A classical algorithm must evaluate f 2^n times in the worst case, but Grover’s quantum algorithm solves the problem using O(\sqrt{N}) evaluations of f in the worst case. It is a bounded-error algorithm: it fails with probability \epsilon for some arbitrarily small (but fixed) \epsilon greater than 0. Grover’s algorithm does not rely on or exploit any internal structure in the function f.

Grover’s algorithm can be applied to any problem in the complexity class NP, since we can efficiently check if we have a solution in polynomial time (i.e., check whether f(x) = 1). In the following, we present a certificate which is a candidate proof of a solution, and the certificate can be verified in polynomial time:

Given a problem in NP that has a certificate length m, by applying Grover’s algorithm to A (the classical checking algorithm) and searching over all possible certificates, we obtain an algorithm which uses time O(2^{m/2}\ \mathrm{poly}(m)) rather than the O(2^m\ \mathrm{poly}(m)) used by classical exhaustive search over all certificates. This (nearly) quadratic speedup is less dramatic than the super-polynomial speedup achieved by Shor’s algorithm, but can still be rather substantial.

Amplitude amplification

The heuristic search problem is related to unstructured search, but starts with a probabilistic guessing algorithm to produce a guess which can then be tested:

The classical solution can find an answer in an average of O(1/\epsilon) evaluations of f, simply by repeatedly running A and checking the output each time. A quantum algorithm known as amplitude amplification, due to Brassard et al., can find w with only O(1/\sqrt{\epsilon}) uses of f, and a failure probability arbitrarily close to 0 — a quadratic speedup.

3-SAT is an NP-complete problem that runs in time O((4/3)^n \ \mathrm{poly}(n)). Applying amplitude amplification to the problem gives a quantum algorithm with runtime O((4/3)^{n/2} \ \mathrm{poly}(n)), “illustrating that quantum computers can speed-up non-trivial classical algorithms for NP-complete problems.”

Combinations…

Grover’s algorithm and amplitude amplification are powerful subroutines which can be used as part of more complicated quantum algorithms, allowing quantum speedups to be obtained for many other problems.

For example: finding the minimum of an unsorted list of N integers (quantum computers can do this in O(\sqrt{N})); determining graph connectivity, and pattern matching (find a pattern P of length M within a text T of length N).

If we can find the minimum of an unsorted list of N integers in O(\sqrt{N}), we can also find the minimum of an arbitrary and initially unknown function f : {0,1}^n \rightarrow \mathbb{Z} with O(\sqrt{N}) evaluations of f.

Adiabatic optimisation

The adiabatic optimisation algorithm can be applied to any constraint satisfaction problem (CSP) where we are given a sequence of constraints applied to some input bits, and are asked to output an assignment to the input bits which maximises the number of satisfied constraints…. Unlike the algorithms described in the rest of this survey, the adiabatic algorithm lacks general, rigorous worst-case upper bounds on its runtime.

The neat thing about the adiabatic algorithm is that we can directly implement it on a physical system whose Hamiltonian can be varied smoothly between the desired initial and final Hamiltonians. This is the approach taken by the D-Wave Systems machines.

Quantum walks

In classical computer science the concept of the random walk or Markov chain is a powerful algorithmic tool, and is often applied to search or sampling problems. Quantum walks provide a similarly powerful and general framework for designing fast quantum algorithms… a quantum walk is based on the simulated coherent quantum evolution of a particle moving on a graph.

Quantum walks can outperform classical random walks in two ways:

  1. faster hitting: the time it takes to find a target vertex from a source vertex (for some graphs, this time can be exponentially less), and
  2. faster mixing: the time taken to spread out over all vertices after starting from one source vertex. The mixing speedup can be quadratic, but no more than this.

One application of quantum walks is in fast evaluation of boolean formulae – a formula on N binary inputs can be evaluated in ‘slightly more than’ O(N^{1/2}) operations (vs O(N^{0.753...}) in the worst case for classical algorithms).

Quantum walks can also be used to obtain a very general speedup over classical algorithms based on Markov chains.

Why aren’t there more exponential speedups?

There are a number of instances of quadratic speedups, but comparatively fewer exponential speedups. Why is this?

One reason is that strong lower bounds have been proven on the power of quantum computation in the query complexity model, where one only considers the number of queries to the input as the measure of complexity… in order to achieve an exponential speedup over classical computation in the query complexity model there has to be a promise on the input, i.e., some possible inputs must be disallowed.

(This is the hidden structure that Shor’s algorithm successfully exploits for example).

Another observation is that many quantum algorithms seem to be composed of just a few common building blocks (e.g., quantum walks and the quantum Fourier transform). It turns out (van Dam) that any quantum circuit can be approximated using only Toffoli and Hadamard quantum gates. The first of these is a purely classical gate, the second is equivalent to the Fourier transform over the group \mathbb{Z}^2.

Thus any quantum algorithm whatsoever can be expressed as the use of quantum Fourier transforms interspersed with classical processing. (However, the intuition behind the quantum algorithms described above is much more varied than this observation would suggest).

Quantum computing in the NISQ era and beyond

February 5, 2018
tags:

Quantum computing in the NISQ era and beyond Preskill, Q2B 2017

Today’s paper is based on the keynote address given by John Preskill at the December 2017 ‘Quantum computing for business’ conference. It provides a great overview of the state of quantum computing today, and what we might reasonably expect to see over the coming years.

The NISQ era and the quantum chasm

… we are now entering a pivotal new era in quantum technology. For this talk, I needed a name to describe this impending new era, so I made up a word: NISQ. This stands for Noisy Intermediate Scale Quantum.

“Intermediate scale” refers to computers with between 50 and a few hundred qubits. The 50 qubit milestone is significant because that takes us beyond what we can simulate by brute force using the most powerful existing supercomputers. “Noisy” emphasises that we’ll have imperfect control over those qubits. Because of the noise, we expect a limit of about 1000 gates in a circuit – i.e., 1000 fundamental two-qubit operations. Executing a single gate is about 1000 times slower on an ion trap quantum processor than on a superconducting circuit.

Eventually we expect to be able to protect quantum systems and scale up quantum computers using the principle of quantum error correction… Unfortunately, there is a significant overhead cost for doing quantum error correction, so reliable quantum computers using quantum error correction are not likely to be available very soon.

For example, using quantum error correction we would need physical systems with millions of qubits in order to run algorithms involving thousands of protected (fault-tolerant) qubits. For the next few years, our limit is on the order of a hundred physical qubits.

Crossing the “quantum chasm,” from hundreds of physical qubits to millions of physical qubits, is going to take some time, but we’ll get there eventually… It is important to realize that we will need significant advances — in basic science as well as in systems engineering — to attain fully scalable fault-tolerant quantum computers.

What about those D-Wave machines available today then, which already have 2000 qubits? It’s complicated, but these are not circuit-based quantum computers, rather they are quantum annealers. We’ll take more about those later.

Quantum potential

If quantum error correction is our basis for thinking that quantum computers will be scalable to large devices solving hard problems, quantum complexity is our basis for thinking that quantum computing is powerful. We have at least three good reasons for thinking that quantum computers have capabilities surpassing classical computers:

  1. We know of problems believed to be hard for classical computers, but for which quantum algorithms have been discovered which can solve these problems easily. The best known example is Shor’s algorithm for prime factorisation.
  2. Theoretical computer scientists have proved that quantum states which are easy to prepare with a quantum computer have super-classical properties; “specifically, if we measure all the qubits in such a state we are sampling from a correlated probability distribution that can’t be sampled from by any efficient classical means.”
  3. No known classical algorithm can simulate a quantum computer, even after many decades of trying.

It’s a remarkable claim — one of the most amazing ideas I’ve encountered in my scientific life — that there is a distinction between problems that are classically hard and problems that are quantumly hard. And it is a compelling challenge to understand better what problems are classically hard but quantumly easy.

We don’t expect quantum computers to be able to solve the hard instances of NP-hard problems. But when will quantum computers be able to solve problems we care about faster than classical computers, and for what problems?

Quantum speedups

Quantum speedup refers to a quantum computer solving a problem faster than competing classical computers using the best available hardware and running the best algorithm which performs the same task.

A few years ago I spoke enthusiastically about quantum supremacy as an impeding milestone for human civilization. I suggested this term as a way to characterize computational tasks performable by quantum devices, where one could argue persuasively that no existing (or easily foreseeable) classical device could perform the same task, disregarding whether the task is useful in any other respect… But from a commercial perspective, obviously we should pay attention to whether the task is useful!

Preskill then goes on to outline several areas where quantum computers hold promise for outperforming their classical cousins, including optimisation, deep learning, matrix inversion, recommendation systems, semidefinite programming, and quantum simulation. Let’s take a brief look at each of them.

Optimisers

We don’t expect quantum computers to be able to efficiently solve worst case instances of NP-hard problems (such as combinatorial optimisation), but they might be better than classical computers at finding approximate solutions. That is, they might find better approximations, and/or they might find approximations faster.

For many problems there is a big gap between the approximation achieved by the best classical algorithm we currently know and the barrier of NP-hardness. So it would not be shocking to discover that quantum computers have an advantage over classical ones for the task of finding approximate solutions, an advantage some users might find quite valuable.

The emerging approach is a hybrid quantum-classical algorithm where a quantum processor prepares an n-qubit state, and a classical optimiser processes the measurement outcomes, instructing the quantum processor how to alter the n-qubit state for the next iteration. Iteration continues until the algorithm converges on a quantum state from which the approximate solution can be extracted.

If applied to classical approximation problems, this goes by the name Quantum Approximate Optimization Algorithm (QAOA), when applied to quantum problems it is called a Variation Quantum Eigensolver (VQE).

Quantum annealers (such as the DWave 2000Q machine) are noisy versions of something called adiabatic quantum computing, and we don’t have a convincing theoretical argument indicating that they can outperform the best classical hardware. (We do in the noiseless version). So far quantum annealers have mostly been applied to cases where the annealing is stochastic, which means it is relatively easy for a classical computer to simulate what the quantum annealer is doing.

What’s coming soon are non-stochastic quantum annealers, which may have greater potential for achieving speedups over what the best classical algorithms can do.

Deep learning

In quantum deep learning (or just quantum machine learning) we can construct quantum analogs of e.g. a restricted Boltzmann machine, but with the spins represented by qubits rather than classical bits.

It may be that quantum deep learning networks have advantages over classical ones; for example they might be easier to train for some purposes. But we don’t really know — it’s another opportunity to try it and see how well it works. One possible reason for being hopeful about the potential of quantum machine learning rests on the concept known as QRAM — quantum random access memory.

QRAM can represent a large amount of classical data very succinctly, encoding a vector with_N_ components in just log N qubits. Even with QRAM though, we have to take into account the costs of encoding the input into QRAM in the first place. Moreover, when reading the results we can recover only log N classical bits (not N) from a single shot measurement of log N qubits.

Thus quantum deep learning has most advantage in scenarios where both the input and output are quantum states. “… quantum deep learning networks might be very well suited for quantum tasks, but for applications of deep learning that are widely pursued today it is unclear why quantum networks would have an advantage.”

Matrix inversion

QRAM also helps with matrix inversion, where the HHL algorithm gives an exponential quantum speedup, running in time O(log N). Once again, we have to pay encoding costs though if applying it to classical data.

We do have good reason to believe this quantum matrix inversion algorithm is powerful, because it solves what we call a BQP-complete problem. That is, any problem that can be solved efficiently with a quantum computer can be encoded as an instance of this matrix inversion problem.

Unfortunately, HHL is not likely to be feasible in the NISQ era, “the algorithm is probably just too expensive to be executed successfully by a quantum computer which doesn’t use error correction.”

Recommendation systems

A quantum algorithm has been proposed which gives an exponential speedup over the best currently known classical algorithm for the task of making high-value recommendations.

The goal is to recommend a product to a customer that the customer will probably like, based on limited knowledge of the preferences of that customer and other customers.

Whereas the classical algorithm attempts to reconstruct the full recommendation matrix, the quantum one samples efficiently from a low-rank approximation to the preference matrix.

This is a significant quantum speedup for a real-world application of machine learning, encouraging the hope that other such speedups will be discovered. However, we don’t currently have a convincing theoretical argument indicating that the task performed by quantum recommendation systems (returning a high-quality recommendation in polylog(mn) time) is impossible classically.

Alas, the algorithm is also probably too costly to be convincingly validated in the NISQ era.

Semidefinite programming

Semidefinite programming is the task of optimising a linear function subject to matrix inequality constraints. Convex optimization problems of this type have widespread applications. “A recently discovered quantum algorithm finds an approximate solutions to the problem in time polylog(N), an exponential speedup.”

The good news: a quantum solver for semidefinite programs might be within reach of NISQ technology!

Quantum simulation

Finally, quantum computers should be really good for running quantum simulations! We can get started in the NISQ era, but the most exciting discoveries are perhaps beyond it. The theory of classical chaos advanced rapidly once we could simulate chaotic dynamical systems. Quantum simulation may promote similar advances in our understanding of quantum chaos.

Valuable insights might already be gleaned using noisy devices with of order 100 qubits.

The last word

Quantum technology is rife with exhilarating opportunities, and surely many rousing surprises lie ahead. But the challenges we face are still formidable. All quantumists should appreciate that our field can fulfill its potential only through sustained, inspired effort over decades. If we pay that price, the ultimate rewards will more than vindicate our efforts.

Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer

February 2, 2018

Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer Shor, 1996

We’re sticking with the “Great moments in computing” series again today, and it’s the turn of Shor’s algorithm, the breakthrough work that showed it was possible to efficiently factor primes on a quantum computer (with all of the consequences for cryptography that implies). Wikipedia has a neat timeline of developments in quantum computing which tells us that Shor’s algorithm was actually executed on a real quantum computer for the first time in 2001 – when it factored 15!

The presentation of the algorithm in the paper is hard to follow when you first read it cold, but thankfully there are lots of materials out there which help to bring it to life. Here’s Scott Aaronson’s explanation using only basic math: “Shor, I’ll do it.” I also really enjoyed Kelsey Houston-Edward’s videos in the PBS Infinite Series (danger: you can lose many hours in that YouTube channel!), ‘The Math Behind Quantum Computers’ and ‘Hacking at Quantum Speed with Shor’s Algorithm.’

Beyond the algorithm itself, the paper also touches on interesting discussions around what it is that makes a problem amenable to an efficient solution on a quantum computer, and hence what classes of problems quantum computers may efficiently solve.

In the history of computer science… most important problems have turned out to be either polynomial-time or NP-complete. Thus quantum computers will likely not become widely useful unless they can solve NP-complete problems. Solving NP-complete problems efficiently is a Holy Grail of theoretical computer science which very few people expect to be possible on a classical computer. Finding polynomial-time algorithms for solving these problems on a quantum computer would be a momentous discovery.

Writing in 1996, Shor saw some ‘weak indications’ that quantum computers would not be able to solve NP-complete problems, but wasn’t yet prepared to rule that possibility out. Scott Aaronson has a wonderful 2008 piece in Scientific American exploring this question, ‘The limits of quantum.’ The key diagram is this one:

For the latest thinking on quantum supremacy and complexity spaces see ‘Complexity-theoretic foundations of quantum supremacy.’

At the outer level we have PSPACE problems – these are problems that a conventional computer can solve using only a polynomial amount of memory, but possibly an exponential number of steps. In other words, they are polynomial in space (PSPACE), but not necessarily in time.

An important subset of PSPACE problems are the NP problems. These are problems whereby it takes a potentially exponential number of steps to find a solution, but once a solution has been found, you can verify that it is indeed a solution in polynomial time.

Within NP there are two further disjoint subsets: NP-complete problems, and P problems. NP-complete problems are a subset of NP problems that all share a fundamental structure. If you can find an efficient algorithm to solve any one of the NP-complete problems, then it is possible to adapt it to also solve any of the others. We have no known algorithm to efficiently (in polynomial time) solve any NP-complete problem. P problems are the subset of NP problems that we do know how to solve efficiently in polynomial time.

That’s the classical computing territory. How do quantum computers impact this landscape? The BQP class of problems (Bounded-error, Quantum, Polynomial time) are those that quantum computers can solve in polynomial time. All P problems are also in BQP (anything a classical computer can solve in polynomial time, so can a quantum computer), and all BQP problems are also in PSPACE (i.e., quantum computers don’t have any silver bullets to offer when it comes to resource consumption). All NP-complete problems are believed to be outside BQP (so quantum computers aren’t going to radically change everything — though Grover’s algorithm does show that we can get a quadratic speed up here). That leaves some problems in BQP that are also in NP, but not in P, where a quantum computer has a real edge over classical computers. These problems can be solved in polynomial time on a quantum computer, but require exponential time on a classical computer. The canonical examples of those problems are factoring integers and discrete logarithms. The quantum algorithms for these problems are the subject of today’s paper choice.

What is it about a problem that places it in BQP \ P?

Through the magic of superposition, a quantum computer with n qubits can be in an arbitrary superposition of up to 2^n different states simultaneously. When searching in a solution space therefore, a quantum algorithm can explore 2^n possible solutions in parallel, in the same total elapsed time as it takes to explore any one solution. The tricky part though, is that you can’t inspect all of those 2^n candidate solutions to find the one(s) that are genuine solutions. When you take a measurement (attempt to read the output of the computation), the superposition collapses and you get the results for just one of the candidate solutions. Which one? Think of it like a random sample drawn from the space. Assuming genuine solutions are rare, odds are you’ll end up finding out that something is not a solution.

Extending our probability distribution analogy for a moment, so far we’ve been assuming that the distribution is uniform. But the distribution doesn’t have to be uniform. If we can exploit certain global properties of the overall solution space (not just features of point solutions within the space) we may be able to shape the distribution in our favour. In particular, if we can make it more likely that a randomly drawn sample actually is a solution, we can increase our odds of resolving to a solution when we take a measurement. On a physical level, what we want is that non-solutions destructively interfere with each other, and solutions constructively interfere, thus increasing the amplitude of the wave at solution points.

Now all we have to do is run the quantum computation a sufficient number of times, (say 100), and the solutions should rise to the fore because we’ll sample them more often. Problems in BQP \ P are those where we can find something in the overall structure of the solution space that allows us to do this distribution shaping trick.

Shor’s algorithm for factoring primes

Shor’s algorithm uses a little number theory to convert the problem of finding prime factors into an equivalent problem which reveals an overall structure in the solution space we can exploit along the lines we’ve just been discussing.

Our quantum factoring algorithm takes asymptotically O((\log n)^2(\log \log n)(\log \log \log n)) steps on a quantum computer, along with a polynomial (in log n) amount of post-processing time on a classical computer that is used to convert the output of the quantum computer to factors of n.

A 2048-bit key has around 1400 decimal digits. So factoring it should take on the order of (1400^2)*3.14*0.5, or approximately 3 million steps.

Here I’m going to follow Kelsey Houston-Edward’s high-level explanation of the algorithm, which has 4 main steps. Suppose we want to find the prime factors p and q of N. We proceed as follows:

  1. Pick a random number a less than N. Repeat until you’ve found some a which is not a factor of N.
  2. Find r, the period of a mod N. (We’ll come back to this in just a moment)
  3. Check that r is even, and that a^{r/2} + 1 \not \equiv 0\ \textrm{mod}\  N.
  4. Let p = \gcd(a^{r/2} - 1, N), and let q = \gcd(a^{r/2}  + 1, N).

Step 2 is both the tricky part (i.e., where we need to take advantage of quantum computation), and also the part that enables us to find and exploit a global property of the solution space. We construct a sequence a\ \textrm{mod}\ N, a^2\ \textrm{mod}\ N, a^3\ \textrm{mod}\ N, .... Let’s try it with a = 2 and N = 15:

2\ \textrm{mod}\ 15, 4\ \textrm{mod}\ 15, 8\ \textrm{mod}\ 15, 16\ \textrm{mod}\ 15, 32\ \textrm{mod}\ 15, 64\ \textrm{mod}\ 15, 128\ \textrm{mod}\ 15, 256\ \textrm{mod}\ 15, ... = 2, 4, 8, 1, 2, 4, 8, 1, ...

The resulting pattern of digits (2,4,8,1) repeats with a period of 4. So r = 4 in this case, and p = \gcd(4-1,15) = 3 and q = \gcd(4 +1, 15) = 5. And indeed 3 and 5 are the prime factors of 15.

Steps 1, 3, and 4 we can do quite efficiently on a classical computer, but step 2 we can’t.

Following Aaronson now:

…suppose we could create an enormous quantum superposition over all the numbers in our sequence: a mod N, a^2 mod N, a^3 mod N, etc. Then maybe there’s some quantum operation we could perform on that superposition that would reveal the period.

The quantum operation that does this for us is the Quantum Fourier Transform (QFT)! This is described in §4 of the paper. Back to Kelsey Houston-Edwards, who has a nice easy to follow explanation in her video starting at 8:50.

Consider a set of dials, with increasing number of points equally spaced on a unit circle:

Every dial starts off pointing to 3 o’clock. Now for each number x in the sequence move the hand of each dial one place counter-clockwise. When x = 1, record the positions of the hands. Here are the positions after we’ve processed ‘2,4,8,1’ :

Above each dial we’ve placed a point. Move that point by 1 unit in the direction the hand is pointing to when recording the hand positions..

Keep doing this as you move through the sequence, and you’ll find that most of the points just trace out a shape eventually returning to where they started. But the dial with the same number of positions as the period of the sequence will always be pointing in the same direction each time, and it’s path will extend off into the distance.

The lengths of the paths traced out are like the amplitudes, or probabilities of the corresponding state. So by following this process, we’ve greatly increased the chances of sampling the right answer.

Following this general principle, the QFT is a linear transform mapping a quantum state encoding a sequence, to a quantum state encoding the period of the sequence.

Back to Aaronson:

Another way to think about this is in terms of interference. I mean, the key point about quantum mechanics — the thing that makes it different from classical probability theory — is that, whereas probabilities are always nonnegative, amplitudes in quantum mechanics can be positive, negative, or even complex. And because of this, the amplitudes corresponding to different ways of getting a particular answer can “interfere destructively” and cancel each other out.

For all periods other than the true one, the amplitude contributions cancel each other out, only for the true period do they all point in the same direction, “and that’s why, when we measure at the end, we’ll find the true period with high probability.

Section 6 in the paper describes the algorithm for computing discrete logarithms, which also makes use of the QFT, but we’ll leave that for another day!

Learning representations by back-propagating errors

February 1, 2018

Learning representations by back-propagating errors Rumelhart et al., Nature, 1986

It’s another selection from Martonosi’s 2015 Princeton course on “Great moments in computing” today: Rumelhart’s classic 1986 paper on back-propagation. (Geoff Hinton is also listed among the authors). You’ve almost certainly come across back-propagation before of course, but there’s still a lot of pleasure to be had from reading the paper and seeing again with fresh eyes the birth of the idea. Plus, we get the opportunity to take things nice and slow!

We want to build self-organising neural networks that can develop an internal structure appropriate for a particular task domain. Tasks are specified by giving the desired state vector of the output units for each state vector of the input units.

If the input units are directly connected to the output units it is relatively easy to find learning rules that iteratively adjust the relative strengths of the connections so as to progressively reduce the difference between the actual and desired output vectors. Learning becomes more difficult when we introduce hidden units whose actual or desired states are not specified by the task…

Back-propagation is a breakthrough (c. 1986!) algorithm that allows us to build and train neural networks with hidden layers. The learning procedure decides under what circumstances the hidden units should be active in order to achieve the desired input-output behaviour. “This amounts to deciding what these units should represent”. I.e., the hidden units are going to learn to represent some features of the input domain.

Consider networks with a layer on input units at the bottom, any number of intermediate (hidden) layers, and a layer of output units at the top. Connections within a layer, or from higher to lower layers are forbidden. Connections can skip intermediate layers though. (Skip connections would go on to be important components in the training of deep neural networks).

Feeding forward

Consider a unit j, with an input x_j. If the unit is in the input layer, the value x_j is provided by the input vector. But if the unit is in a higher layer, then its input will be a weighted sum of the outputs of all of the lower layer units that feed into it. For each such unit i that feeds into j, we have an output of that unit, y_i, adjusted by a weight on the link between i and j, w_{ij}. So:

[EQN 1] \displaystyle x_j = \sum_{i} y_{i} w_{ij}

We’ll assume that each unit is given an extra input which always has the value of 1. The weight on this extra input is the ‘bias’. (The bias lets us control the value of b in y = ax + b).

We now know how to calculate the input to unit j, but what should j output? The output should be some function of the input, which we call the activation function. The chosen activation function in the paper is the sigmoid function:

[EQN 2] \displaystyle y_j = \frac{1}{1 + e^{-x_j}}

This produces real-valued output, and is differentiable.

By repeating this procedure, starting with the input layer, we can feed-forward through the network layers and arrive at a set of outputs for the output layer.

Patching things up

We can now compare the calculated outputs to the desired output vector.

The aim is to find a set of weights that ensure that for each input vector the output vector produced by the network is the same as (or sufficiently close to) the desired output vector.

For a given input-output case, the error is given by:

[EQN 3] \displaystyle E = \frac{1}{2}\sum_{j}(y_j - d_j)^2

Where y_j is the computed output value of unit j in the output layer, and d_j is its desired state. We can get the total error across all training cases by simple summation of the individual case errors.

(Squaring things is handy when we sum, because it makes all differences positive so that positive and negative differences don’t cancel each other out).

We’d like to make the total error as small as possible (minimise it). To do that, we need to adjust the weights (the only parameters we have available to play with). But how should we adjust them to achieve our goal?

To minimize E by gradient descent it is necessary to compute the partial derivative of E with respect to each weight in the network.

We’d like to know how the error changes as we change the weights, \frac{\partial E}{\partial w}. We don’t know that straight away, but we do know how the error changes as we change the outputs y_j (\frac{\partial E}{\partial y_j}) as E and y are directly related in the equation for E (EQN 3).

\displaystyle \frac{\partial E}{\partial y_j} \frac{1}{2}\sum_{j}(y_j - d_j)^2 = y_j - d_j

So now we know how the error changes in relation to the outputs y. Recall that y_j itself is a function of the input x_j to unit j (EQN 2). So we can also work out how y changes as x changes, \frac{\partial y_j}{\partial x_j}.

If we denote the sigmoid function by \sigma then we have y_j = \sigma(x_j), and the derivative is:

\displaystyle \frac{\partial y_j}{\partial x_j} = \sigma(x_j)\cdot (1 - \sigma(x_j))

Getting closer! We know how the error changes with the output y, and we know how y changes with its input x. Now remember that x_j is itself a function of the weights w_{ij} and the outputs from lower layers which will be multiplied by those weights (EQN 1). So we can work out how x_j changes as we change the weights: \frac{\partial x_j}{\partial w_{ij}}. For a given weight w_{ij} from unit i to unit j the derivative is:
\displaystyle \frac{\partial x_j}{\partial w_{ij}} = y_i

Let’s put all those pieces together in a chain. If we know how the error changes with the output y, and how y changes with the input x, and how x changes with the weight w, then we can chain them together to figure out how the error changes with the the weights. Which is what we wanted to know in order to be able to adjust our weights using gradient descent. This is of course the chain rule from differentiation:
\displaystyle \frac{\partial E}{\partial w} = \frac{\partial E}{\partial y}  \cdot \frac{\partial y}{\partial x}  \cdot \frac{\partial x}{\partial w}

Almost there!

With \frac{\partial E}{\partial w} in hand, the simplest form of gradient descent would be to change each weight by an amount proportional to the accumulated gradient:
\displaystyle \Delta w = - \epsilon \frac{\partial E}{\partial w}

This method does not converge as rapidly as methods which make use of the second derivatives, but it is much simpler and can be easily implemented by local computations in parallel hardware. It can be significantly improved, without sacrificing the simplicity and locality, by using an acceleration method in which the current gradient is used to modify the velocity of the point point weight space instead of its position:

\displaystyle \Delta w(t) = - \epsilon \frac{\partial E}{\partial w}(t) + \alpha \Delta w(t-1)

Where t is incremented by 1 for each sweep through the whole set of input-output cases, and \alpha is an exponential decay factor between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change.

Variants on the learning procedure have been discovered independently by David Parker (personal communication) and Yann Le Cun.