Author / Muyun
Editor / Ah Han
According to the China Academy of Information and Communications Technology (CAICT), over the ten years from 2012 to 2021, the scale of China's digital economy grew from 12 trillion yuan to 45.5 trillion yuan, and its share of GDP rose from 21.6% to 39.8%. In line with this new trend, it has become an unquestioned consensus that data is a new factor of production.
If the rise of the data middle office represents enterprises' digital transformation from process-driven to data-driven, and from digitized to intelligent, then DataOps is an excellent concept, or methodology, for implementing the data middle office.
The concept of DataOps was proposed by Lenny Liebmann as early as 2014, and in 2018 it was formally included in Gartner's Hype Cycle for Data Management, marking the industry's official acceptance and promotion of DataOps. Although still at an early stage of development in China, DataOps is gaining popularity year by year and will see wider practical adoption in the foreseeable 2-5 years.
Kangaroo Cloud is one of these explorers. As a full-link digital technology and service provider, Kangaroo Cloud has been deeply engaged in the big data field since its founding. As digital and intelligent transformation accelerates across industries year by year, many problems in data governance and data management have gradually surfaced.
To this end, driven by technological progress and customers' digital transformation needs, DTinsight, the one-stop big data development and governance platform built by Kangaroo Cloud, applies the DataOps concept to the data value chain: it enforces quality supervision and development-process standards across the whole life cycle of data, safeguarding data governance.
Respond to changes
One of the core concepts of DataOps is to respond to changes on the demand side in a timely manner. The following is a typical enterprise data architecture diagram:
Data flows in from the source systems on the left; the intermediate links are various data processing tools, such as data lakes, data warehouses or data marts, and AI analysis. The data is cleaned, processed, aggregated, and governed, and is finally delivered to various consumers through BI tools, customized reports, APIs, and other channels.
When defining the platform architecture, data architects and managers in an enterprise generally focus on extreme performance, latency, load management, and similar production-environment concerns, and many computing engines and databases excel at exactly these. But such an architecture does not demonstrate the ability to "respond to rapid change":
It is like designing a highway with only normal traffic capacity in mind, ignoring temporary disruptions such as accidents, congestion, and heavy rain. After it "goes live", everyone is exhausted by firefighting while throughput barely improves (for example, in a tunnel with only two lanes, a minor scrape can block the whole tunnel). Data platforms inside enterprises face a similar situation: data workers respond to such changes every day or even every hour, and sometimes even a simple SQL change can take several days to reach production.
Considering change during the design phase allows for a more flexible and stable response. Here is the data architecture from a DataOps perspective.
Data architecture from a DataOps perspective
Data architects put forward some agility criteria at the outset, such as:
• A task must go from finished development to production release within 1 hour, without impacting the production environment
• Data errors must be identified before publishing to production
• Major changes must be completed within 1 day
There are also a number of environmental issues to consider, including:
• Separate development, test, and production environments must be maintained, with some degree of consistency between them, at least at the metadata level
• Data testing, quality monitoring, and deployment to production can be orchestrated manually or automatically
When architects start thinking about data quality, rapid releases, real-time monitoring of data, and more, organizations are one step closer to DataOps.
Decomposition and practice of DataOps architecture
Having looked at the data architecture from a DataOps perspective, let's turn to the decomposition and practice of the DataOps architecture. The specific practice of DataOps can be broken down into the following key points:
Management of multiple environments
The first step of DataOps is "environment management", which generally means independent development, test, and production environments, each of which can support task orchestration, monitoring, and automated testing.
At present, DataStack can support multiple environments at the same time: as long as the network is reachable, a single DataStack deployment can connect to and uniformly manage multiple different environments. Environments are distinguished by the concept of a "cluster" in the Console, and different clusters can flexibly connect to different types of computing engines, such as various open-source or vendor distributions of Hadoop, Transwarp Inceptor, Greenplum, OceanBase, and even traditional relational databases such as MySQL and Oracle, as shown in the following figure:
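To make the cluster idea concrete, here is a minimal sketch of how one platform instance might register several clusters, each mapped to an environment and an engine. The field names, engine labels, and connection strings are invented for illustration; they are not DTinsight's actual configuration schema.

```python
# Hypothetical multi-cluster registry: one platform, many environments.
# All names and fields below are illustrative assumptions.
CLUSTERS = {
    "dev": {
        "engine": "hadoop-open-source",   # e.g. an open-source Hadoop distribution
        "jdbc_url": "jdbc:hive2://dev-host:10000/default",
        "role": "development",
    },
    "test": {
        "engine": "hadoop-open-source",
        "jdbc_url": "jdbc:hive2://test-host:10000/default",
        "role": "testing",
    },
    "prod": {
        "engine": "oceanbase",            # different environments may run different engines
        "jdbc_url": "jdbc:oceanbase://prod-host:2881/app",
        "role": "production",
    },
}

def cluster_for(role: str) -> str:
    """Return the cluster name registered for a given environment role."""
    for name, cfg in CLUSTERS.items():
        if cfg["role"] == role:
            return name
    raise KeyError(f"no cluster registered for role {role!r}")

print(cluster_for("production"))  # -> prod
```

The point of the "cluster" abstraction is that tasks reference an environment role, while the Console resolves which physical engine that role maps to.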
Building on multi-environment management, tasks must be released across environments during actual development. Assuming development, test, and production environments exist, releases need to cascade from one environment to the next, as shown in the following figure:
For this kind of multi-environment release, DataStack supports three release-management approaches:
● One-click publishing
When the networks of the various environments are connected, a single platform can connect to each environment and achieve "one-click release". In the one-click release process, only users with the required permissions can perform release actions, which improves the stability of the production environment. At the same time, key environment-specific information can be replaced automatically, such as data source connection parameters in synchronization tasks and compute configurations in different environments. One-click publishing is better suited to SaaS or internal cloud platform management.
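The automatic replacement of environment-specific information can be sketched as simple placeholder substitution: the same task definition is promoted unchanged, while per-environment values are swapped in at release time. The task structure, placeholder syntax, and parameter names here are assumptions for illustration, not the platform's real format.

```python
# Hedged sketch of "replace key environment information on publish":
# one task definition, environment-specific parameters resolved per target.
TASK = {
    "name": "ods_orders_sync",
    "source": "${jdbc_url}",          # placeholder resolved per environment
    "parallelism": "${parallelism}",  # compute configuration differs per environment
}

ENV_PARAMS = {
    "test": {"jdbc_url": "jdbc:mysql://test-db:3306/ods", "parallelism": "2"},
    "prod": {"jdbc_url": "jdbc:mysql://prod-db:3306/ods", "parallelism": "8"},
}

def render_task(task: dict, env: str) -> dict:
    """Substitute every ${name} placeholder with the target environment's value."""
    params = ENV_PARAMS[env]
    rendered = {}
    for key, value in task.items():
        for name, repl in params.items():
            value = value.replace("${" + name + "}", repl)
        rendered[key] = value
    return rendered

print(render_task(TASK, "prod")["source"])  # -> jdbc:mysql://prod-db:3306/ods
```

Because substitution happens at release time, developers never hand-edit production connection strings, which is what keeps one-click release safe.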
● Import/export publishing
In the vast majority of domestic scenarios we have encountered, the production environment adopts strict physical isolation to achieve a higher level of security. In this scenario, cross-environment release can be achieved through import and export: users manually import new, changed, or deleted tasks into downstream environments.
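The import/export flow can be sketched as packaging tasks into a file in the upstream environment, carrying it across the air gap, and unpacking it downstream. The package layout (a JSON file with a version field and a task list) is an assumption for illustration only.

```python
# Hedged sketch of import/export release across physically isolated environments.
import json
import os
import tempfile

def export_tasks(tasks: list, path: str) -> None:
    """Upstream environment: write tasks into a portable release package."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"version": 1, "tasks": tasks}, f, ensure_ascii=False)

def import_tasks(path: str) -> list:
    """Downstream environment: read the package and return its task list."""
    with open(path, encoding="utf-8") as f:
        package = json.load(f)
    return package["tasks"]

# Simulate one round trip through a package file.
pkg = os.path.join(tempfile.gettempdir(), "release_package.json")
export_tasks([{"name": "dwd_orders", "action": "update"}], pkg)
print(import_tasks(pkg)[0]["name"])  # -> dwd_orders
```

A real package would also carry scheduling configuration and dependencies, but the shape of the flow is the same: the file is the only thing that crosses the isolation boundary.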
● DevOps release
Some customers may have purchased or built their own company-wide release tools. In this case, DataStack provides customized interfaces, and the relevant change information is published by a third-party CI tool (such as Jenkins).
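From the CI tool's side, this integration amounts to assembling a change payload and POSTing it to a platform endpoint. The endpoint URL and payload fields below are hypothetical, since the customized interface is customer-specific; the sketch only builds the request rather than sending it.

```python
# Hedged sketch of what a Jenkins (or other CI) step might assemble
# when pushing change information to a custom platform interface.
# Endpoint path and payload shape are invented for illustration.
import json

def build_release_request(env: str, tasks: list, operator: str) -> dict:
    """Assemble the change payload a CI pipeline would POST to the platform."""
    return {
        "url": f"https://dataops.example.com/api/v1/envs/{env}/releases",
        "body": json.dumps({"tasks": tasks, "operator": operator}),
    }

req = build_release_request("prod", ["dwd_orders"], "jenkins")
print(req["url"])  # -> https://dataops.example.com/api/v1/envs/prod/releases
```

In a real pipeline this request would be sent by a post-build step, with the CI system's credentials standing in for the "user with release permissions" from the one-click flow.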
Version management of code
Each cross-environment release needs to record the code version for later troubleshooting. In practice, it is often necessary to compare code between different versions, roll back to a previous version, and perform similar operations.
In addition to comparing code content, DataStack also supports comparing other task-related information, including the task's scheduling-cycle configuration, execution parameters, and environment parameters, and supports "one-click rollback" to a specified version.
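The comparison-and-rollback idea can be sketched with each publish storing a full snapshot of the task (code plus scheduling and runtime configuration). Python's standard `difflib` stands in for the platform's diff view; the version records are invented examples.

```python
# Minimal sketch of version comparison and one-click rollback,
# assuming each release stores a complete task snapshot.
import difflib

versions = [
    {"id": 1, "code": "SELECT id FROM orders;", "cron": "0 2 * * *"},
    {"id": 2, "code": "SELECT id, amount FROM orders;", "cron": "0 1 * * *"},
]

def diff_code(a: dict, b: dict) -> list:
    """Line-level diff of two stored code versions."""
    return list(difflib.unified_diff(
        a["code"].splitlines(), b["code"].splitlines(),
        fromfile=f"v{a['id']}", tofile=f"v{b['id']}", lineterm=""))

def rollback(history: list, target_id: int) -> dict:
    """'One-click rollback': re-publish the stored snapshot of a version."""
    snapshot = next(v for v in history if v["id"] == target_id)
    return dict(snapshot)

print(rollback(versions, 1)["cron"])  # scheduling config is restored too
```

Storing the scheduling and parameter configuration alongside the code is what makes rollback a single action rather than a code revert plus a manual reconfiguration.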
Access and rights management
Among the multiple environments within an enterprise, the production environment generally has the strictest requirements, while the development and test environments are relatively relaxed. Users' authentication and access information therefore needs to be managed per environment. In practice, for convenience of development and testing, and because these stages hold no sensitive data, ordinary users generally have full data permissions and access to all tools in these two environments; in the production environment, however, users must only access data within the scope of their own permissions.
Depending on the engine, DataStack supports a variety of data rights management methods, including:
● Hadoop engine
Kerberos-based authentication security plus Ranger/LDAP-based data security, supporting data permission control at the database, table, and field levels; data masking is also supported.
● JDBC-type engines
In some scenarios, customers build their data platform not on Hadoop but on JDBC databases (such as TiDB, Doris, or Greenplum). DataStack does not itself manage the permissions of the JDBC database; instead, it uses account binding to distinguish the permissions of different accounts, for example:
· DataStack account A is bound to the database root account
· DataStack account B is bound to the database admin account
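Account binding can be sketched as a simple mapping: the platform records which database account each platform user executes as, and the database itself enforces what that account may do. The user and database names below are invented for illustration.

```python
# Illustrative account-binding sketch for JDBC-type engines:
# the platform delegates permission enforcement to the database
# by resolving each platform user to a bound database account.
BINDINGS = {
    "stack_user_a": {"db_user": "root",  "db": "tidb_prod"},
    "stack_user_b": {"db_user": "admin", "db": "tidb_prod"},
}

def connection_identity(platform_user: str) -> str:
    """Resolve the database account a platform user's SQL runs under."""
    b = BINDINGS[platform_user]
    return f"{b['db_user']}@{b['db']}"

print(connection_identity("stack_user_b"))  # -> admin@tidb_prod
```

The design trade-off is clear: the platform gains engine-agnostic simplicity, but fine-grained control (table- or column-level) remains the responsibility of the database's own account system.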
Task orchestration, testing and monitoring
DataStack connects all of the links above: starting from the development stage, a user can release to the test environment with one click, verify there by observing task-instance runs and data output, and then publish to the production environment once everything is correct.
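The end-to-end flow just described can be sketched as a small promotion pipeline: publish to test, run data checks against the test environment, and promote to production only when the checks pass. The stage names and the single row-count check are illustrative stand-ins for the platform's real testing and monitoring.

```python
# Hedged sketch of the dev -> test -> prod promotion flow, with a
# data check gating the production release. Checks are illustrative.
def run_checks(row_count: int) -> bool:
    """Stand-in data test: the task's output table must not be empty."""
    return row_count > 0

def promote(task: str, test_rows: int) -> str:
    """Promote a task through test toward production, gated by checks."""
    stages = ["publish_to_test"]
    if not run_checks(test_rows):
        stages.append("blocked_before_prod")   # catch data errors before production
        return " -> ".join(stages)
    stages += ["verified_in_test", "publish_to_prod"]
    return " -> ".join(stages)

print(promote("dwd_orders", test_rows=1024))
# -> publish_to_test -> verified_in_test -> publish_to_prod
```

This is exactly the agility criterion stated earlier: data errors are identified before publishing to production, and the happy path still completes quickly.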
In closing
DataOps is a best-practice concept, but it is still at a relatively early stage in China. DataStack has accumulated some practical experience here, yet much remains to be optimized: for example, data quality rules also need to be released across environments, and the export of task code and task templates needs to support more task types. We look forward to more DataOps best practices emerging in the industry.