On August 27, the ChunJun and OceanBase communities jointly held an offline open source Meetup, where they released “OceanBase & ChunJun: Building an Integrated Data Integration Solution”.
This is the first release of the OceanBase & ChunJun joint solution. It provides a highly reliable data integration solution for four scenarios: real-time integration of sharded databases and tables, data integration across clusters/tenants, real-time integration of different data sources, and integrated full and incremental processing of log-type data.
The details follow. Feel free to share this article with other developers and enthusiasts so we can learn and discuss together.
What are ChunJun and OceanBase?
A stable, efficient, and easy-to-use data integration framework
ChunJun is an efficient, stable, and easy-to-use data integration framework. Built on the Apache Flink real-time compute engine, it currently supports batch reading and writing of data.
● ChunJun’s core competencies
• Multiple data sources: More than 30 data sources are currently supported, covering various types of databases, file systems, and more
• Flexible task operation modes: Supports out-of-the-box local mode as well as Flink standalone, YARN, Kubernetes (k8s), and other modes; supports big data scheduling platforms such as Taier, DolphinScheduler, and Dlinky
• Data restoration: Supports DML and DDL synchronization, keeping data and structure on the source and target sides as consistent as possible
• Resumable transfer: Relying on Flink’s checkpoint mechanism, a failed task can retry from the point of failure
• Rate control: Supports a variety of sharding methods, allowing users to adjust the sharding logic to suit their own services; supports adjusting read and write concurrency and limiting the amount of data read per second
• Dirty data management: Supports multiple ways to store dirty data, controls the dirty data lifecycle, and provides statistics
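The resumable-transfer idea can be illustrated with a small sketch in plain Python. This is not Flink's checkpoint API and not ChunJun code; `CheckpointedReader`, `checkpoint_every`, and `fail_at` are hypothetical names used only to show how a stored offset lets a retry skip work that was already completed.

```python
from typing import Iterator, List, Optional


class CheckpointedReader:
    """Toy reader that periodically snapshots the offset of the next row to read."""

    def __init__(self, rows: List[str], checkpoint_every: int = 2) -> None:
        self.rows = rows
        self.checkpoint_every = checkpoint_every
        self.checkpoint = 0  # offset restored on the next attempt

    def read(self, fail_at: Optional[int] = None) -> Iterator[str]:
        # Every attempt starts from the last completed checkpoint.
        for pos in range(self.checkpoint, len(self.rows)):
            if fail_at is not None and pos == fail_at:
                raise RuntimeError(f"simulated failure at position {pos}")
            yield self.rows[pos]
            if (pos + 1) % self.checkpoint_every == 0:
                self.checkpoint = pos + 1  # snapshot the offset


reader = CheckpointedReader(["r0", "r1", "r2", "r3", "r4"])

first_attempt = []
try:
    for row in reader.read(fail_at=3):
        first_attempt.append(row)
except RuntimeError:
    pass  # the attempt failed, but the last checkpoint survives

# The retry resumes from the checkpoint instead of from row 0.
second_attempt = list(reader.read())
```

In this sketch `r2` is emitted twice because it was read after the last snapshot. A retry from a checkpoint replays everything after that checkpoint, so the writing side has to tolerate or deduplicate replayed records.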
Enterprise-grade open source distributed HTAP database
OceanBase is an enterprise-grade open source distributed HTAP (Hybrid Transaction/Analytical Processing) database. Its native distributed architecture supports enterprise-class features such as financial-grade high availability, transparent horizontal scaling, distributed transactions, multi-tenancy, and syntax compatibility.
● OceanBase’s core capabilities
• High availability: Based on the Paxos protocol, with strong consistency; when a minority of replicas fail, no data is lost and service is not interrupted; RPO=0; RTO<30s
• High scalability: Online horizontal scale-out and scale-in; automatic load balancing
• Low cost: Does not rely on high-end hardware; a very high compression ratio saves storage cost
• HTAP: One compute engine supports mixed workloads at the same time; one database supports read/write splitting
• High compatibility: Compatible with the MySQL protocol and syntax, which reduces business transformation and migration costs
• Multi-tenancy: One environment runs multiple services independently while keeping tenant data secure
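The claim that a minority of replica failures loses no data follows from majority-quorum arithmetic, which Paxos-style protocols rely on. The sketch below shows only that arithmetic; it is not OceanBase's implementation, and the function names are illustrative.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can fail while a majority is still reachable."""
    return replicas - majority(replicas)


def is_committed(acks: int, replicas: int) -> bool:
    """A write counts as committed once a majority has acknowledged it."""
    return acks >= majority(replicas)


# With 3 replicas one failure is tolerated; with 5 replicas, two.
three_replica_tolerance = tolerated_failures(3)
five_replica_tolerance = tolerated_failures(5)
```

Because any two majorities overlap in at least one replica, a committed write is always visible to the next majority that forms, which is what makes RPO=0 possible when only a minority fails.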
ChunJun OceanBase Connector implementation
● OceanBase CDC
OceanBase is a distributed database, so its log information is spread across different machines in the cluster. A tool is needed to gather these logs into a correct and complete log stream.
OceanBase Community Edition handles this with its CDC component architecture, in which oblogproxy provides the log-pulling service. To process OceanBase incremental data, you can integrate oblogclient into your own business application. It has already been integrated with data integration frameworks such as ChunJun, Flink CDC, and Cloud Canal.
OceanBase Community Edition CDC component architecture
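To see why scattered logs have to be gathered before they are usable, consider a minimal sketch that merges per-server log streams into one ordered stream. It assumes each server's stream is already sorted by a commit timestamp. `LogRecord`, `commit_ts`, and `merge_logs` are hypothetical names, and this does not describe how oblogproxy actually works internally.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class LogRecord(NamedTuple):
    commit_ts: int  # hypothetical commit timestamp used for ordering
    server: str
    change: str


def merge_logs(*streams: Iterable[LogRecord]) -> Iterator[LogRecord]:
    """Merge per-server streams, each sorted by commit_ts, into one ordered stream."""
    return heapq.merge(*streams, key=lambda rec: rec.commit_ts)


server_a = [LogRecord(1, "a", "INSERT t1 id=1"), LogRecord(4, "a", "UPDATE t1 id=1")]
server_b = [LogRecord(2, "b", "INSERT t2 id=7"), LogRecord(3, "b", "DELETE t2 id=7")]

# A consumer sees one stream in commit order, regardless of which server held the log.
merged = list(merge_logs(server_a, server_b))
```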
The working mode of ChunJun Connectors
Reading and writing in ChunJun are implemented by connectors, which fall into categories such as RDB, CDC, NoSQL, MQ, and File.
• RDB Connectors: Built on the JDBC Connector. When the source table contains an auto-increment column and incremental data is insert-only, they support integrated full and incremental reads and writes through polling.
• CDC Connectors: Read incremental data from the database’s binlog or redo log.
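The polling approach used by the RDB connectors can be sketched with Python's built-in `sqlite3` module standing in for a JDBC source. The table `events`, its columns, and `poll_new_rows` are assumptions made for illustration, not ChunJun code.

```python
import sqlite3
from typing import List, Tuple


def poll_new_rows(conn: sqlite3.Connection, last_id: int) -> Tuple[List[tuple], int]:
    """Return rows with id > last_id, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, msg FROM events WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    new_last_id = rows[-1][0] if rows else last_id
    return rows, new_last_id


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, msg TEXT)")
conn.executemany("INSERT INTO events (msg) VALUES (?)", [("a",), ("b",)])

# Full phase: starting from 0 reads everything that already exists.
full_rows, last_id = poll_new_rows(conn, 0)

# Incremental phase: only rows inserted since the previous poll come back.
conn.execute("INSERT INTO events (msg) VALUES (?)", ("c",))
incremental_rows, last_id = poll_new_rows(conn, last_id)
```

The same query serves both phases, which is why full and incremental reads can be integrated. It also shows the limitation: an update or delete does not produce a larger `id`, so polling only works for insert-only data.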
● Flink streaming data and dynamic tables
Data read by ChunJun is ultimately processed in Flink. By defining the structure of a dynamic table, streaming data can be converted into a table that SQL can operate on; a continuous query over that table then yields a continuously updated result.
The following figure shows how stream data becomes a dynamic table: a table is defined on the stream data, and continuous queries run against it to produce continuously updated results.
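A continuous query can be pictured as a result that is recomputed each time a new event arrives. The plain-Python sketch below imitates a grouped count over a stream of click events. It does not use Flink, and the event fields and the name `continuous_count` are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# A stream of (user, url) click events; the field names are illustrative.
Click = Tuple[str, str]


def continuous_count(stream: Iterable[Click]) -> Iterator[Dict[str, int]]:
    """Emit an updated result after every event, similar in spirit to
    SELECT user, COUNT(url) FROM clicks GROUP BY user evaluated continuously."""
    counts: Counter = Counter()
    for user, _url in stream:
        counts[user] += 1
        yield dict(counts)  # snapshot of the continuously updated result


clicks = [("Mary", "/home"), ("Bob", "/cart"), ("Mary", "/prod")]
snapshots = list(continuous_count(clicks))
```

Each snapshot corresponds to the state of the result table after one more row has been appended to the dynamic table; the third event updates Mary's existing row rather than adding a new one.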
Implementation of ChunJun OceanBase Connector
In ChunJun, the ChunJun Core module is mainly responsible for reading data into Flink and writing it out of Flink. DynamicTableSourceFactory and DynamicTableSinkFactory support SQL-type tasks, and SourceFactory and SinkFactory support JSON-type tasks.
As shown in the figure below, the ChunJun OceanBase Connector is implemented in two main ways: one goes from ChunJun Core to the JDBC Connector to the OceanBase Connector; the other goes from ChunJun Core directly to the OceanBase CDC Connector.
ChunJun & OceanBase applications
● Scenario 1: Real-time data integration for database and table sharding
With the OceanBase CDC Connector, database and table names can be specified with fnmatch wildcards, which enables real-time data integration of sharded databases and tables. In this scenario, incremental synchronization or ETL operations can be performed on a single data stream.
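Wildcard matching of sharded names can be tried out with Python's standard `fnmatch` module, which implements shell-style patterns. The database and table names below are invented, `match_tables` is a hypothetical helper, and the exact pattern dialect accepted by the connector should be checked against its documentation.

```python
from fnmatch import fnmatchcase
from typing import List


def match_tables(tables: List[str], db_pattern: str, table_pattern: str) -> List[str]:
    """Keep 'db.table' names whose two parts match the given shell-style patterns."""
    matched = []
    for full_name in tables:
        db, table = full_name.split(".", 1)
        if fnmatchcase(db, db_pattern) and fnmatchcase(table, table_pattern):
            matched.append(full_name)
    return matched


tables = [
    "order_db_0.orders_00",
    "order_db_1.orders_01",
    "order_db_1.users",
    "audit.orders_00",
]

# One pattern pair selects every shard of the logical "orders" table.
sharded = match_tables(tables, "order_db_*", "orders_[0-9][0-9]")
```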
● Scenario 2: Data integration across clusters/tenants
At present, data from different tenants cannot be read through a single connection. To process data from different OceanBase tenants in a unified way, you need to read each one through its own database connection. ChunJun and the OceanBase connectors can then read data from different clusters and tenants into Flink.
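The one-connection-per-tenant pattern can be sketched as follows, with in-memory `sqlite3` databases standing in for separate tenants. The tenant names, the `customers` table, and both helper functions are assumptions for illustration only.

```python
import sqlite3
from typing import Dict, Iterator, List, Tuple


def make_tenant(rows: List[Tuple[int, str]]) -> sqlite3.Connection:
    """Create an isolated in-memory database that plays the role of one tenant."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    return conn


def read_all_tenants(conns: Dict[str, sqlite3.Connection]) -> Iterator[Tuple[str, int, str]]:
    """Read the same table through one connection per tenant, tagging each row with its origin."""
    for tenant, conn in conns.items():
        for row_id, name in conn.execute("SELECT id, name FROM customers ORDER BY id"):
            yield (tenant, row_id, name)


conns = {
    "tenant_a": make_tenant([(1, "Ann")]),
    "tenant_b": make_tenant([(1, "Bo"), (2, "Cy")]),
}
unified = list(read_all_tenants(conns))
```

Tagging each row with its tenant keeps primary keys distinguishable after the streams are unified, since the same `id` can exist in more than one tenant.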
● Scenario 3: Real-time data integration of different data sources
You can aggregate data from different kinds of data sources by using the connector for each database type to read its data into Flink.
● Scenario 4: Full incremental integrated processing of log type data
For data sources whose only incremental changes are inserts, full and incremental data are processed together based on an auto-increment column.
ChunJun & OceanBase’s future outlook
● Improve code quality
· Add test cases to cover all startup methods and common business scenarios
· Fully adapt to the MySQL 5.1.4x and 8.0 drivers
● 20+ rich task types
· Add support for sync tasks in non-transformer mode
· Add support for the Oracle mode of OceanBase Enterprise Edition
● Improve the reliability of the program
· Add transactional support for data reads
· Simplify the deployment of oblogproxy and support Docker deployment
· Add detailed usage documentation