Real-time synchronization of online data is an important scenario that data integration software needs to handle. Unlike scenarios such as building real-time data warehouses and disaster recovery, real-time synchronization of online data requires much higher end-to-end latency and reliability for the data itself. The core reason for this is that the source-to-end data is applied in online business. In this scenario, the database on both ends of the source often requires ACID TP databases. If there are delays and data problems, it will have a significant impact directly on enterprise online business. Data software faces many challenges when dealing with real-time synchronization of online data, one of which we need to share today - efficient handling of UK conflicts in real-time synchronization of online data.
A significant feature of online data synchronization is that the relationship tables between source and destination endpoints are required for online business, and it is very common for table structures to include UK. Users will subscribe to real-time changes in a large number of tables at once. The key to efficiently synchronizing these data to the destination database in real time is designing an efficient concurrent model that can parallelize the real-time synchronization of this data while ensuring that the data written to the destination endpoint is eventually consistent.
Some of the more commonly thought of concurrent models are:
- Table-level parallelism: Data is partitioned by the table it belongs to and then processed in parallel. The advantage is that data within the same table is completely serialized, which can avoid UK conflicts or database deadlocks caused by concurrent data within the same table. However, when there are a small number of subscribed tables, it cannot fully utilize multi-core performance and overall concurrency capability may be limited.
- PK hash: Hashing is performed based on the primary key of each row of data. Generally speaking, if the primary keys are evenly distributed, this method will not have significant data skew issues and will also have good concurrency performance. Changes made on identical primary keys are executed serially and data consistency is guaranteed. However, this method does not consider unique keys (UK), which means that concurrent execution may result in write conflicts or database deadlocks.
The first option above may not perform well in some scenarios, while the second one cannot solve the problem caused by UK when executing concurrently. Before introducing the new concurrent model discussed in this article, let’s first understand the problem of concurrency synchronization and UK.
We assume that when synchronizing, row data is bucketed according to the PK value. Considering the following execution sequence, parallelism based on default PK hash will result in UK conflict issues and prevent data from being written. The subscription table structure is assumed as follows:
CREATE TABLE IF NOT EXISTS `test_table`(
We initialized 2 pieces of data, and in the order of id, name, and c_uk columns, the first value initialized is [1,wanshao,‘1’]. We will use this expression to represent a row of data later.
insert into test_table(`id`,`name`,`c_uk`) values (1,'wanshao','1');
We construct a sequence of DML executions to simulate the generation of binlog. The DML execution sequence in the following figure only swaps the values of UK in rows with id=1 and id=2, as follows.
If the above changes are written to the other end in parallel according to pk hash, assuming that the degree of parallelism is set to 2, the writing process is as follows. We have three SQL numbers that are written to the other end. In the concurrent model of pk hash, we can only guarantee that ② will definitely be executed before ①, but the order between ③ and ①② is uncontrollable. Only when the execution sequence is ①③② can it be written normally. If it is any other case, it will cause a UK conflict.
Assuming we continue to follow the default strategy of pk hash for writing, in addition to UK conflicts that prevent normal writing, it may also cause deadlocks in the remote database during writing. A table with a composite UK and concurrent UPDATE on it can result in deadlocks under certain data loads. We will construct such a scenario next. First, we prepare a table and some data:
CREATE TABLE IF NOT EXISTS `test_table`
Assuming there is a normal execution sequence:
[1,1] --> [1,2] --> [1,5] --> [2,2] --> [2,1] --> [2,3]
If the partitioning and parallel execution is based on PK HASH, there will be 2 partitions. Two threads will respectively handle changes on rows with pk=1 and pk=2, and when writing to the other end, two transactions will actually perform the write operation. The actual write scenario may result in the following transactions and write sequences. In this case, two transactions A and B will wait for each other to obtain locks on changes in column uk1, resulting in a deadlock exception that causes the write operation to fail.
A [1,1] --> [1,2] --> [1,5]
To solve the UK conflict and deadlock problems, we can design a concurrent model that is aware of UK conflicts. Based on the problem description in the previous section, we can understand that the root cause of these issues is that when executing changes in parallel, the order of row changes with identical UK values cannot be predicted due to lack of consideration for UKs, leading to deadlocks or UK conflicts. After understanding the root cause of this fundamental issue, we can consider designing an overall approach for a concurrent model that addresses UK conflicts:
- Detect conflicts in data flow changes
- Adjust concurrency sequences based on the situation with UK values after identifying conflicts so that changes on rows with identical UK values are executed serially during concurrency.
The workflow of the physical design is as follows:
- Ordered change data flows into the memory of the data integration software.
- Conflict detection begins for the data flow, identifying rows with UK conflicts and recording all primary key values in these rows as a set S. All rows in the current data flow with primary key values in set S are stripped out according to their original order and placed into an independent queue called a conflict queue. The data in this queue is partitioned and parallelized based on the specified degree of parallelism and number of tables. If no conflicts are detected, then the conflict queue will be empty, and non-conflicting data can be parallelized using default pk hash.
The reason for adopting this design is as follows:
- Doing secondary partitioning under the overall PK hash concurrency model has good data balance and performance: Adopting the PK HASH concurrency model, the overall data partitioning is relatively balanced. Even in scenarios where hot updates occur, the overall update will not have too serious data skew problems through hot data merging technology. With balanced data, parallel processing advantages can be better utilized.
- Secondary partitioning of conflict queues improves parallelism: To avoid a large number of UK conflicts leading to complete degradation into single-threaded serial execution, conflict queues are further partitioned according to the table to which the data belongs, further improving parallelism under conflicts.
How to detect UK conflicts
Detecting UK conflicts is the key to achieving a concurrent model that perceives UK conflicts. Issues to consider when implementing UK conflict detection include:
- The situation of use unique composite key
- The situation where multiple UK constraints are present.
Overall, the execution steps of UK detection in the UK can be summarized as follows:
- Load the schema metadata of tables associated with data and obtain key UK constraint information. It is necessary to cache schema information and ensure consistency here.
- Traverse all UK constraints, each of which specifies a set of columns called U for the UK column. For data that needs to be written to the other end, detect the values of columns in set U among them. If there are cases where identical values exist, it is considered a conflict detection. Put other rows with PK values corresponding to these rows into a conflict queue for subsequent secondary partitioning. Note that during the entire traversal process of UK, when detecting conflicting rows put into a conflict queue, it is still necessary to maintain order and ensure that it does not change the original execution sequence.
Real-time synchronization of online data is a very important issue when it comes to achieving high-performance writing while perceiving constraint conflicts. Similar problems include how to avoid performance degradation caused by constraint conflicts when encountering PK constraint conflicts. We will share relevant content on this topic in the future.