Recently, CloudCanal users and our team have provided feedback that the performance of writing to gp is not good enough. Therefore, I spent some time researching whether there are new ways of writing that can improve the performance of CloudCanal’s writing to gp/pg. After verification, it was found that using the copyManager method has better performance compared to CloudCanal’s traditional batch/multi-sql based writing method.
For a quick start of using copyManager, you can refer to the following.：Fast CSV and JSON Ingestion in PostgreSQL with COPY
Essentially, it can be understood as a form of physical replication, similar to MySQL’s LOAD DATA INFILE. Because it is a physical replication and directly appends content, the efficiency is much higher compared to traditional INSERT/UPDATE/DELETE processing. The database avoid SQL parsing and optimized execution, and directly writes data at the bottom layer, making it very fast. Not all databases have implemented similar features. For example, PostgreSQL’s JDBC driver provides a copy API that we can use to improve write efficiency.
How to handle UPDATE/DELETE
The COPY method only supports INSERT at the bottom level. For mixed write workloads, the performance advantages of the copy API can actually be utilized, but special processing is required for UPDATE and DELETE.
PostgreSQL does not support the expression ‘pk in (xx,yyy)’, so it is impossible to rely on PreparedStatement to issue SQL statements at one time. If issued separately, there will inevitably be a significant performance overhead. Here introduces a temporary table for deleting tables, which uses multiple temporary tables and batch writing to achieve high-performance deletion. The process can be referred to the following diagram. For more information about postgresql temporary tables, please refer to the documentation.
For the update event, first cooperate with the DELETE processing mentioned in the previous section to convert it into batch delete, and then convert it into INSERT using copy API for writing. If you want to optimize write performance based on this implementation method, you need to meet the following conditions:
- Have a primary key: When deleting from a table without a primary key, more data may be deleted than expected.
- Need to be able to obtain old values of columns that have not been updated in the update event: When writing, all columns must be written. If old values are missing, data integrity cannot be guaranteed.
Handling conflicts between format and delimiter
The official documentation COPY has a detailed explanation of the usage details of copy. Generally speaking, using the csv format is better to avoid writing key with duplicate column names, but the main side effect is the conflict between separators and values. This official document also has a special description as follows. The solution is to wrap the data that needs to be written with escape and quote symbols defined in advance, which avoids the problem of separator conflict due to pre-escaping.
The values in each record are separated by the DELIMITER character. If the value contains the delimiter character, the QUOTE character, the NULL string, a carriage return, or line feed character, then the whole value is prefixed and suffixed by the QUOTE character, and any occurrence within the value of a QUOTE character or the ESCAPE character is preceded by the escape character. You can also use FORCE QUOTE to force quotes when outputting non-NULL values in specific columns.
Null value processing
According to the official description below, when using csv format, pg cannot distinguish between NULL and empty strings. Therefore, by default, writing “” represents NULL (note that when writing null values, do not add quotes on both sides, otherwise it will be escaped and will not become null). If you need to write an empty string, you need to represent it as “” (using quotes on both sides for escaping).
The CSV format has no standard way to distinguish a NULL value from an empty string. PostgreSQL's COPY handles this by quoting. A NULL is output as the NULL string and is not quoted, while a data value matching the NULL string is quoted. Therefore, using the default settings, a NULL is written as an unquoted empty string, while an empty string is written with double quotes (""). Reading values follows similar rules. You can use FORCE NOT NULL to prevent NULL input comparisons for specific columns.
This section compares the performance difference between the traditional write method based on preparedStatement and the write method using copy API.
The machines and databases used for running the writing program are the same in different test groups. This test was performed on a local Mac OS, with GreenPlum 6.8.1 being used as the database in the testing environment.
The structure of the test table is as follows:
create table public.ak_sk_list
The main code used in the test environment is as follows, comparing the efficiency of writing over 800,000 rows of data using two implementation methods based on PreparedStatement and CopyManager. The specific way to obtain PgConnection can be implemented through a tool class.
You can see that if you simply use INSERT to write data, the performance gap will be more than a hundred times.
|Write method||Write rows||Cost|
|PreparedStatement Batch Insert||10240 rows||47910ms|
Application in data integration
High-performance writing has important application value in data migration and incremental sync. CopyManager is of great significance for optimizing the writing performance of postgresql/greenplum. However, for incremental data writing, some extra SQL merging work needs to be done to avoid PK constraint conflicts that may cause sync task interruption. Overall, the writing method based on CopyManager has the following advantages in data migration and synchronization:
- More extreme performance
- Work with SQL key merge feature and avoid performance degradation caused by constraint conflicts.