Demystifying cloud spanner the world's biggest relational database

Google's cloud spanner is be the first scalable, globally-distributed, and strongly consistent database service built for the cloud specifically to combine the benefits of relational database structure with non-relational horizontal scale.

Properties	Spanner	SQL Systems	NOSQL Systems
Scalability	Horizontal	Vertical	Horizontal
Replication	Automatic	Configurable	Configurable
Consistency	Yes	Yes	No
SQL	Yes	No	Yes

Before we start digging deep in the internals just let you know this article will give overview of its working and references are provided for detailed explanation and this post is meant to be technical i.e it is intended for developers and anyone with CS background.

Many research articles exist today on this topics but I feel most of them is too complicated, not straight to the point and well So I decided to write my own.

-> Introduction

-> Working

-> Conclusion

-> References

Introduction

Long story short, Spanner is a NewSQL database that aims to meet all features of Sql databases along with horizontal scalability, consistency and high availability. It's world’s first horizontally-scalable relational database that can be stretched not only across multiple nodes in a single data center but also across multiple geo-distributed data centers, without compromising ACID transactional guarantees.

Working

Before digging deeper it's good to know few things in order to better understand how it is working internally.

Scalability -

Horizontal Scalability (systems that can be scaled as per needs by adding more machines to the network) has never been easier to achieve in relational database systems, it is one main aspects which lead to rise of NoSql databases lately.

As in scalable systems,the first thing comes to mind is sharding (in simple words partitioning the database without compromising it's consistency), which is difficult to implement in relational databases. Spanner does this by itself i.e automatic called load-based sharding which partitions the rows based on some range on basis of appropriately selected primary key.

Availability -

Next big commitment is higher availability, data is replicated synchronously for each storage request, i.e for each storage request first the data is replicated in other machines or datacenters in network (usually 3 to 4) after that further operations are carried on it.

What if a node in a network fail's, will the system crash?. As spanner is a distributed system spanner uses a concencus algorithm called paxos.Those who are not familiar with paxos, for now Paxos is a family of concencus algorithm.At high level of abstraction and for relevence of this article, it will prevent system failures and network from shutting down if any node in the network fails, as another surviving node will take it's place and retain knowledge of the results i.e data is recovered from it's replicated copies. in other words maintains high availability.

For implementing paxos algorithm in real world, requires all the nodes to be able to communicate in real time with as much low latency as possible which requires high performance systems and communication channels. As for the tech giant network performance is not much of an issue, as google lays off it's own fiber optic cables throughout the globe.

What about consistency and serializability ?

Consistency says transcation must change affected data only in allowed ways i.e data written must be valid and should not violate any defined rules or contraints specified in the system design or schema. For simplicity, serializability tells that the sequence of actions performed in real time should reflect the same in the database. For sequencing the transaction, every transaction holds a timestamp, and for this we require a global atomic clock. This clock should have high accurate and precision and is shared across continents. So google has something called truetime i.e servers with atomic clock throughout the world.

What is Truetime ? And how it manages transactions throughout the globe? TrueTime is a highly available, distributed clock that is provided to applications on all Google servers. TrueTime enables applications to generate monotonically increasing timestamps which is also used for external consistency.For now just understand this we will come to this in a bit.

Architecture

Write Operation -

As like most ACID databases, Spanner uses two-phase commit (2PC) and strict two-phase locking to ensure isolation and strong consistency. 2PC has been called the “anti-availability” protocol because all members must be up for it to work.I means in the commit phase,the nodes of that respective paxos group cannot serve other request (ex: read request) at that point of time. Spanner mitigates(makes it less severe) this by having each member itself be a Paxos group. So in that time frame 2PC “member” is available even if some of its Paxos participants are down and take more time. Data is divided into groups for storing and replicating data.

For two phase locking protocol, the TrueTime considers the time for acquiring the lock to be earlier by certain duration which is calculated considering latency for a request (for acquiring lock) to execute,usually in milliseconds. It also predicts the time at at which the lock would be released at it's latest or max, which google calls its as (now + epselon). which will be required for read operation to wait for.

Read Operation-

It's more intresting to know about read operation rather than write operation. What read operation basically does is that it returns the state of database at any given point. So data to be fetched for example, taking two different records located across different data center located in different continent should be the same for current time instance, any one of the copies should not be obsolete(expired) or newly modified or changes with respect to the time frame other wise the consistency of system is compromised.

A Blocking read operation is the option you might say, but it is leads to poor performance. For this,at any given time, spanner takes snapshots of the tables. When a read operation occurs spanner fetches the latest snapshot for that data. Retrieving snapshots spanner requests truetime to give the time at which it can fetch the latest snapshot of data. And if the time to wait in response exceeds a certain threshold limit then it queries the leader node of the paxos group for sending first the latest snapshot of data regardless of the time to wait.

What happens during a Partitioning

In general Spanner chooses C over A when a partition occurs. In practice, this is due to a few specific choices:

• Use of Paxos groups to achieve consensus on an update; if the leader cannot maintain a quorum due to a partition, updates are stalled and the system is not available (by the CAP definition). Eventually a new leader may emerge, but that also requires a majority.

• Use of two-phase commit for cross-group transactions also means that a partition of the members can prevent commits.

The most likely outcome of a partition in practice is that one side has a quorum and will continue on just fine, perhaps after electing some new leaders. Thus the service continues to be available, but users on the minority side have no access. But this is a case where differential availability matters: those users are likely to have other significant problems, such as no connectivity, and are probably also down. This means that multi-region services built on top of Spanner tend to work relatively well even during a partition. It is possible, but less likely, that some groups will not be available at all.

But never the less the tech giant made it's way of accomplishing this first.