Earlier articles covered the basics of Galera Cluster for MySQL and MariaDB, and MariaDB Galera Cluster in particular. To recap: Galera Cluster is a synchronous multi-master cluster that uses the InnoDB storage engine (and also XtraDB for MariaDB). It is built on the Galera Replication Plugin, which implements the wsrep API of the underlying DBMS, MySQL or MariaDB. Galera Cluster uses certification-based synchronous replication in a multi-master server setup. This article looks at the technical aspects of certification-based replication in Galera Cluster.
Certification-Based Database Replication
Database replication allows you to set up synchronized master-slave or master-master database clusters. Because data is synchronized across the individual databases, also called nodes, the setup is fault tolerant: when a node fails, the other nodes take over until it recovers, which ensures high availability. Replication can be either synchronous or asynchronous. In synchronous replication, transactions are committed on all nodes at the same time. In asynchronous replication, target nodes receive new transactions from the source node after the original commit, with a lag that is usually small but not guaranteed. Synchronous replication keeps all nodes in the same state, so the cluster stays highly available around the clock, with consistent replicas and no data loss when an individual node crashes. Another advantage is improved performance, because clients can read from and write to any node regardless of where a transaction originated.
The disadvantage of synchronous replication is longer lock times and possible conflicts caused by committing transactions in parallel on all nodes. This can hurt performance as the number of nodes and transactions grows. Asynchronous replication avoids the problem by letting each node independently commit the transactions it receives from the source node. This prevents concurrent locks and gives better availability with a large number of nodes and transactions. Galera Cluster addresses the performance issues of synchronous replication with certification-based synchronous replication, which is based on:
1. A group communication model that defines a pattern for inter-node communication.
2. Write-sets that bundle a set of writes into a single transaction message to be applied on the individual nodes.
3. A database state machine process that treats the read/write transactions on a database as its state at a given time, and then broadcasts each transaction to the other nodes so that they move to the same state as the source node.
4. Transaction reordering, which reorders transactions that were not certified or not committed because of a node failure, so that they are not lost.
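The state machine idea in the list above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Galera code: `WriteSet`, `Node`, and `deliver_in_total_order` are hypothetical names, and the group communication layer is simulated by sorting on a sequence number.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WriteSet:
    """Hypothetical write-set: one transaction's row changes, bundled."""
    seqno: int      # position in the global total order
    changes: tuple  # (primary_key, new_value) pairs


@dataclass
class Node:
    """Hypothetical node whose state is a primary key -> value mapping."""
    name: str
    rows: dict = field(default_factory=dict)

    def apply(self, ws: WriteSet) -> None:
        # Apply every change in the write-set as one unit.
        for key, value in ws.changes:
            self.rows[key] = value


def deliver_in_total_order(nodes, write_sets):
    """Simulated group communication: every node sees the same order."""
    for ws in sorted(write_sets, key=lambda w: w.seqno):
        for node in nodes:
            node.apply(ws)


nodes = [Node("node1"), Node("node2"), Node("node3")]
# Two write-sets touch the same row; arrival order is deliberately shuffled.
incoming = [
    WriteSet(seqno=2, changes=((1, "bob"),)),
    WriteSet(seqno=1, changes=((1, "alice"), (2, "carol"))),
]
deliver_in_total_order(nodes, incoming)
```

Because every node applies the same write-sets in the same order, all three nodes end up with identical rows, whichever node a transaction started on.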
The certification process uses a globally coordinated certification scheme in which a transaction from the source node is broadcast to the other nodes using a global total ordering of concurrent transactions to achieve global consistency. Certification-based replication works with databases that have the following features:
1. Transactional database with COMMIT and ROLLBACK capability.
2. Support for atomic changes: the entire transaction is applied on COMMIT, or nothing is applied at all.
3. Support for global ordering: replication events or transactions can be placed in a single global order.
How Certification-Based Replication Works
When a transaction (a series of changes) runs on a database node and issues a COMMIT, the node collects all the changes made by the transaction, together with the primary keys of the modified rows, into a write-set before the commit actually takes place. The source node then broadcasts this write-set to all other nodes. Each node in the cluster, including the originating node, then performs a deterministic certification test on the write-set, comparing the primary keys in the write-set with the primary key values on the node. The test checks the key constraint integrity of the write-set. If the test fails, the node drops the write-set and the cluster rolls back the original transaction. If the test succeeds, the transaction commits and the write-set is applied on all nodes in the cluster, which completes the replication.
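The commit path described above can be sketched as follows. Everything here is illustrative: `Transaction`, `deliver`, `commit`, and the placeholder test `no_conflict` are hypothetical names, nodes are plain dictionaries, and the real certification test is more involved than a lookup in a fixed set of conflicting keys.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """Hypothetical transaction that buffers row changes until COMMIT."""
    changes: dict = field(default_factory=dict)  # (table, pk) -> new row

    def write(self, table, pk, row):
        self.changes[(table, pk)] = row

    def to_write_set(self):
        # Bundle the changed rows with the primary keys used for certification.
        return {"keys": frozenset(self.changes), "rows": dict(self.changes)}


def deliver(write_set, node, certify):
    """Runs on every node, including the origin, when a write-set arrives."""
    if certify(write_set, node):
        node["rows"].update(write_set["rows"])
        return True
    return False  # certification failed: the write-set is dropped


def commit(txn, origin, nodes, certify):
    """Broadcast the write-set and report the outcome seen by the origin."""
    write_set = txn.to_write_set()
    outcome = {n["name"]: deliver(write_set, n, certify) for n in nodes}
    return "COMMIT" if outcome[origin["name"]] else "ROLLBACK"


def no_conflict(write_set, node):
    # Placeholder test (assumption): reject keys that the node has marked
    # as changed by a concurrent transaction.
    return not (write_set["keys"] & node["conflicting_keys"])


nodes = [
    {"name": n, "rows": {}, "conflicting_keys": {("users", 2)}}
    for n in ("node1", "node2", "node3")
]

ok = Transaction()
ok.write("users", 1, {"name": "alice"})
first = commit(ok, nodes[0], nodes, no_conflict)

clash = Transaction()
clash.write("users", 2, {"name": "bob"})
second = commit(clash, nodes[0], nodes, no_conflict)
```

The test is deterministic and every node holds the same information, so each node reaches the same verdict on its own. In this sketch the first transaction is applied on all three nodes and the second is dropped everywhere.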
In Galera Cluster, each transaction is assigned a global ordinal sequence number. During the deterministic certification test, the cluster compares the current transaction with the last successful transaction and checks every transaction ordered between the two for primary key conflicts. If a conflict is found, the test fails. After a successful certification test, all replicas apply the transaction in the same order. All nodes therefore reach consensus on the outcome of the transaction, and replication takes place. The originating node then notifies the client that the transaction has committed.
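A minimal sketch of a sequence-number-based certification test, assuming a simple index that maps each primary key to the sequence number of the last certified write-set that changed it. `Certifier`, `key_index`, and `last_seen` are hypothetical names, and this is not how Galera stores its certification index internally.

```python
class Certifier:
    """Toy certification index: primary key -> seqno of last certified write."""

    def __init__(self):
        self.last_seqno = 0
        self.key_index = {}

    def certify(self, keys, last_seen):
        """Assign the next global seqno and test the write-set.

        keys: primary keys modified by the transaction.
        last_seen: seqno of the last write-set the transaction had seen
                   when it executed on its originating node.
        """
        self.last_seqno += 1
        seqno = self.last_seqno
        # Conflict: a key was changed by a write-set ordered after
        # last_seen and before this one.
        for key in keys:
            if self.key_index.get(key, 0) > last_seen:
                return seqno, False
        # Certified: record this write-set as the latest writer of its keys.
        for key in keys:
            self.key_index[key] = seqno
        return seqno, True


certifier = Certifier()
r1 = certifier.certify({("users", 1)}, last_seen=0)  # first writer of the row
r2 = certifier.certify({("users", 1)}, last_seen=0)  # concurrent, same row
r3 = certifier.certify({("users", 2)}, last_seen=0)  # concurrent, other row
r4 = certifier.certify({("users", 1)}, last_seen=1)  # had already seen seqno 1
```

Here `r2` fails because write-set 1 changed the same row after the second transaction took its snapshot, while `r3` and `r4` pass. A node that replays the same write-sets in the same order reaches the same verdicts, which is what makes the test deterministic.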
Galera Cluster Replication Plugin and wsrep API
The core of Galera replication is the Galera Replication Plugin, which implements the write-set replication functionality. The DBMS (MySQL or MariaDB) used to set up the Galera Cluster talks to the plugin through an API called the Write Set Replication API, or wsrep API, which is defined as a replication plugin interface. The Galera Replication Plugin implements this interface and is therefore the replication provider, or wsrep provider. It consists of 3 components:
1. Certification layer that prepares the write-sets and performs the certification tests.
2. Replication layer that manages the replication process and global ordering.
3. Group communication framework that provides a plugin architecture for the group communication systems in the cluster.
The wsrep API treats the content of a database, including data and schema, as a state. When a client modifies the database, the modification is treated as a transaction, that is, a series of atomic changes to the database. To keep the state consistent across all nodes, the wsrep API uses a Global Transaction ID (GTID) for each transaction write-set. The GTID makes it possible to identify state changes and to compare two states, such as the current one and a previous one. In a Galera Cluster all nodes need to have the same state, and the synchronization and replication needed to keep the state consistent follow the serial order of the GTIDs. The GTID consists of 2 parts: a State UUID that identifies the state and the sequence of changes it has undergone, and an Ordinal Sequence Number (seqno) that denotes the position of the change in the sequence.
Example: 45eec521-2f34-11e0-0800-2a36050b826b:94530586304
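As an illustration, a GTID in the format above can be split into its two parts with the Python standard library. `Gtid`, `parse_gtid`, and `is_ahead` are hypothetical helper names for this sketch and are not part of the wsrep API.

```python
import uuid
from typing import NamedTuple


class Gtid(NamedTuple):
    state_uuid: uuid.UUID  # identifies the state and its sequence of changes
    seqno: int             # position of the change within that sequence


def parse_gtid(text: str) -> Gtid:
    """Split 'state-uuid:seqno' at the last colon."""
    state, _, seqno = text.rpartition(":")
    return Gtid(uuid.UUID(state), int(seqno))


def is_ahead(a: Gtid, b: Gtid) -> bool:
    """True if state a is later than state b in the same sequence of changes."""
    if a.state_uuid != b.state_uuid:
        raise ValueError("GTIDs with different state UUIDs are not comparable")
    return a.seqno > b.seqno


current = parse_gtid("45eec521-2f34-11e0-0800-2a36050b826b:94530586304")
previous = parse_gtid("45eec521-2f34-11e0-0800-2a36050b826b:94530586300")
```

Two states are only comparable when they share a State UUID. In that case the larger seqno denotes the later state, so `is_ahead(current, previous)` is true here.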
The DBMS calls the wsrep API through wsrep hooks defined in the database server. The dlopen() function, the standard dynamic library loader, acts as the bridge: it loads the wsrep provider (the Galera Replication Plugin) and makes its functions available to the wsrep hooks. The state change, that is, the atomic changes grouped into a transaction write-set with a GTID, is the key to implementing replication. The steps below show how replication is performed using the wsrep API.
1. A change occurs at any node, causing a state change.
2. The database invokes the replication provider using the wsrep hooks to create the write-set from the changes.
3. dlopen() connects the wsrep provider functions with the wsrep hooks in the database.
4. The Galera Replication Plugin performs the write-set certification and replication in the cluster.
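The four steps above can be sketched as a toy model of the hook and provider relationship. This is not the wsrep API: `ToyProvider`, `load_provider`, and `Database` are hypothetical names, a dictionary lookup stands in for dlopen(), and certification is assumed to always succeed.

```python
class ToyProvider:
    """Hypothetical stand-in for the wsrep provider."""

    def __init__(self):
        self.seqno = 0
        self.replicated = []

    def replicate(self, changes):
        # Step 4: certify and replicate the write-set
        # (certification always passes in this sketch).
        self.seqno += 1
        self.replicated.append((self.seqno, dict(changes)))
        return self.seqno


def load_provider(name, registry):
    """Step 3 stand-in for dlopen(): look the provider up by name."""
    return registry[name]()


class Database:
    """Hypothetical DBMS with a single commit-time hook."""

    def __init__(self, provider):
        self.provider = provider
        self.rows = {}

    def commit(self, changes):
        # Step 1: the changes represent a state change on this node.
        # Step 2: the hook hands them to the provider as a write-set.
        seqno = self.provider.replicate(changes)
        self.rows.update(changes)
        return seqno


db = Database(load_provider("toy", {"toy": ToyProvider}))
first_seqno = db.commit({1: "alice"})
second_seqno = db.commit({2: "bob"})
```

The point of the split is that the database only knows the hook interface, so the replication provider can be swapped without changing the database code.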