Data warehouses, data marts, data Pipeline, ODS, big data platforms, and data lakes

rabbi khan
13 min readDec 8, 2020

Why build a real-time data platform?

Around 2000 or even earlier, our trading system and analysis system were not separated. With the continuous improvement of business requirements and the requirements for 7*24 hours online transactions, the pressure on trading system services is increasing. In order to prevent the analysis system from affecting the transaction system, the analysis system was gradually separated from the business system, and two system concepts of OLTP (online transaction processing) and OLAP (online analytical processing) were born. Two concepts were produced at the same time, one is ODS (copy all the original data in the trading system, and then do various processing, processing and analysis on ODS); the other is Data Mart (data mart, according to business The actual demand is to take out part of the data to be analyzed from the trading system for sorting).

Both ODS and data marts are based on the data model and data specifications of the core business system/transaction system. With the continuous development of the business, the trading system must be iterated continuously. When the trading system is upgraded, ODS and data set The city will be overthrown and rebuilt. In the face of high construction costs and severe system shocks, everyone found that establishing a relatively independent and comprehensive data warehouse is a very effective solution.

As the cost of storage and computing resources is getting lower and lower, computing power and computing requirements are constantly evolving, and questions about whether a centralized data warehouse is still needed are rampant. Because data warehouses usually use T+1 batch loading data to process data, the timeliness is not high enough, and it is difficult to meet the increasingly high timeliness requirements in the business. In addition, a large amount of external data cannot be integrated, and the big data platform It came into being. With the rapid growth of data in various industries, the concept of data lake has gradually formed [the concept of data lake has also been formed]. Data can be advanced to the data lake and accessed on demand.

With the evolution of technology, data warehouses, data marts, ODS, big data platforms, and data lakes are all classified into non-real-time data processing and analysis systems. In recent years, due to the increasing demand for timeliness of business, distributed computing and streaming computing have emerged, and real-time data fusion has gradually been promoted. The current data acquisition mode requires real-time, accurate and comprehensive data acquisition without affecting the normal operation of the business system. The data can be processed, fused and calculated on the same platform, and then pushed to the downstream real-time application system.

The above content is why we need to build a real-time data platform development concept.

Three common problems in real-time data

Around 2000, a large enterprise used relatively few types of databases. From a brand perspective, Oracle, IBM DB2, Sybase, and MS SQL Server were more commonly used. However, even with multiple brands, they were basically relational databases. . With the development of data technology today, from a global perspective, there are more than 200 technology brands that can be classified into databases, including traditional relational databases, time series databases, graph databases, search engines, message queues, big data platforms, and There are more than 40 mainstream databases such as object storage.

With the continuous development of business, in order to cope with different application scenarios, transaction systems, accounting systems, management systems, etc. will adopt different database technologies, invisibly building a large number of technical barriers. The data itself is unique within an enterprise domain and needs to be integrated with each other. In the process of continuous development of data technology and the gradual increase of the differences between each technology, how to break the technical barriers so that data will not hinder the release of its value due to the choice of technology stack is a major issue before us.

Both technicians and Internet practitioners can clearly feel that the user’s interaction time is getting shorter and shorter, and the attention economy is becoming more and more prominent. Whoever can catch the user’s attention can get the corresponding traffic and return. In this process, how to capture the user’s attention in a short interactive time, the entire real-time data link is crucial. But this has some natural contradictions with actual R&D management and IT data management. R&D management requires a complete set of processes such as development, testing, and launch, while business requires higher agility in data. Most IT management systems hinder the support of agile business scenarios, data fusion, or underlying data integration. An end-to-end real-time link, the general delivery cycle is in months. At the same time, more than a dozen or even dozens of data technologies are used in a mixed manner. How can the data stored in them quickly build links? The ability to effectively integrate incremental data and full data has become a core issue for IT departments to solve.

After breaking through the different technical barriers, a data fusion platform needs to be built immediately. The real-time data link plays the role of business operation and background business analysis and management. It needs to have very high stability and fault tolerance to cope with changes in the external organization structure and internal requirements for the platform. When data fusion itself is not centralized, it will definitely be affected by data links, upstream systems, and downstream systems. The upstream system is a more important business system. Changes in upstream data structure and large-scale data processing will not take too much into consideration the actual conditions of downstream data links. For example, a simple update operation upstream may cause millions or tens of millions of incremental data to the downstream system. The stability of the downstream system does not only stem from its own stability, but also effectively responds to the influence of the upstream system through some preset rules. When the upstream and downstream systems are stable, the underlying systems, such as the network environment, storage environment, and CPU memory, will also affect the stability of the entire system. At this time, it is necessary to consider how to shield the above unstable factors for inter-network transmission/large-scale data links.

In summary, there are three main problems that companies encounter in the process of implementing data management. The first question is that when more and more database technologies are applied in enterprises, a large number of technical barriers appear. How can we break these technical barriers and effectively integrate data to drive business development. The second question is that business departments have higher and higher requirements for the timeliness of data processing, but the process of building real-time applications for data processing still requires a scientific and rigorous construction logic, business departments’ requirements for data timeliness and IT department construction The balance between the efficiency of high-quality data links. Finally, after the real-time data link is constructed, because it has both business operation and management support requirements, the requirements for stability and fault tolerance are very high, and this process is restricted by the upstream and downstream systems and the system environment. How to ensure Efficient and stable operation ensures high fault tolerance to deal with various emergencies.

The main problems of real-time data management and how to deal with them

The overall architecture of a standard enterprise-level real-time data platform for the financial industry. Its upstream is data stored in different database technologies or external data nodes. Data Pipeline can integrate these data into the platform through different technology stacks, and then push it to various downstream business systems.

In the past two decades, the types of data sources have undergone tremendous changes. Most of the early integrated data is business system data, and there will be more data in the enterprise domain. Now, the data that needs to be integrated has not only increased a large amount of unstructured data, but also a large amount of external sources.

In addition to business system data, client data such as customer behavior data, electronic devices, APP, cameras, sensors, etc. will also enter the real-time data link, and the analysis value of this type of real-time data is very high.

Now every company pays attention to the upstream and downstream of its entire industrial chain. A large number of partners, in addition to cooperation at the business level, there are also cooperation between IT systems. This requires a real-time data processing platform that can cope with the integration of real-time incremental and full data from external business systems.

Enterprises are still collecting a large amount of external data, such as weather data, information data, etc. How these data can be effectively integrated into the enterprise domain for integration, and how to enter the real-time data link to play a role, is also a company that needs to pay attention to and solve problems in building a real-time data platform The problem.

The database technology/data processing technology used by each data source may be different, so it involves multi-source heterogeneous data processing issues. How to obtain global real-time [full-time] data without affecting the normal operation of the system. Here we are going to talk about the Log Base Change Data Capture concept, which is a log-based incremental data acquisition technology independently developed by Data Pipeline. We are now also working with IBM to integrate IBM InfoSphere Data Replication products to collect database logs including mainframes and medium-sized computers (AS400 systems). For mainstream MySQL, MS SQL Server and other databases, data can be obtained by log analysis. Of course, log-based real-time incremental acquisition is not a single type. For example, MS SQL Server has two real-time incremental acquisition modes: CT mode and CDC mode.

After the problem of accurate acquisition of multi-source heterogeneous incremental data is solved, the second problem to be solved next is to quickly support the agile development of IT systems.

We deconstruct the entire real-time data fusion platform into four logical concepts: data nodes, data links, fusion tasks, and system resources. Through the hierarchical aspects of data node registration, data link configuration, fusion task construction, and system resource allocation, etc. Management, on the premise of effectively meeting the needs of system operation and maintenance management, improve the efficiency of real-time data acquisition and management in all links.

Data node, data node is the original carrier of data. Data nodes can be databases, file systems, data warehouses, files, applications, and all carriers that store data can become data nodes. In the process of data fusion, a data node can be used as a data source node or a data destination node. When a data node is registered through the data node management, it is not assigned a role, and the existence of the data link will give the relevant The role attributes of the data node “data source node” and “data destination node”.

Data link, data link is a logical description of the real-time data fusion process. It not only describes the data object relationships involved in specific business scenarios, but also describes the restrictions and strategies that need to be followed when performing specific data fusion tasks.

In the data fusion process, the data link serves as the logical layer separating data management and data application, which not only protects the safety, stability of data nodes and the consistency, accuracy, and integrity of data semantics, but also ensures the monitoring and log in the data fusion process The basic operation and maintenance work such as early warning and early warning follows the overall information management mechanism of the enterprise.

Fusion task, fusion task is the smallest management unit for real-time data fusion. The fusion task determines the specific data fusion content and basic rules to be executed by selecting the data link. Under the premise of ensuring the controllable data resources, the fusion task provides more autonomy for data applications. The business department or application developer who actually uses the data can independently choose the scope of data acquisition, the life cycle of the integration task, and the amount of system resource investment. On the basis of following the data link configuration, you can define configuration options such as automatic restart strategy, early warning strategy, log strategy, read limit, write limit, and transmission queue limit.

System resources, system resources are the carrier of the system platform and the execution of integrated tasks. System resources refer to the resource group used by the fusion engine that performs the fusion task, and also refer to the system resources used by the various functional modules that ensure the normal operation of the entire system. In the data fusion process, the fusion task only needs to care about whether the system resources required to perform the task meet the performance, efficiency and other system resources that affect the timeliness of the data, and the message queue, error data storage, log storage, early warning storage and even the platform configuration are persistent The system resources used by functional modules such as transformation can be managed by the data engineer who manages the data link and the operation and maintenance engineer who manages the system resources.

Why split the data task from the data link? Business departments/data users do not care about how the data is mapped, but more concerned about what kind of data can be obtained at what point in time and what method, whether full data or real-time changing data can be obtained, whether it is in timing mode or monitoring mode, Do you want to clear it for me during execution? These are the things that data users are more concerned about. Between the DBA and the maintainer of the data link and the user of the data, the three logical concepts of data task, data link and data node can be used to separate clearly what they are responsible for.

The DBA can quickly define all the data nodes in the entire enterprise domain used daily to the platform, and the data group can configure the corresponding data links based on these data nodes according to the needs of the business department or the planning of the IT system, such as Mapping rules, link execution strategies, log strategies, warning strategies, and configuration strategies. The data user can obtain the appropriate data at the appropriate time and in the appropriate manner based on the data link that has been configured. In this way, all parties involved in real-time data fusion can quickly complete their tasks and management control requirements through code-free configuration.

After supporting the agile and multi-speed development requirements of IT systems, let’s take a look at how the system is stable and highly fault-tolerant.

After the data link is successfully constructed, how to ensure the fault tolerance and stability of its operation? The real-time data link is affected by the upstream and downstream nodes and runtime of the data link. We need to provide real-time data links with corresponding processing strategies and rules that are effective and in line with user expectations. For example, if the upstream data structure changes, what mode should be implemented for the downstream of the data link? If it is very important data, the structure should not be lost and the task can be stopped. But not every data task should be stopped. Stopping data tasks may affect business progress. At this time, you can preset some rules for the link. For example, when the upstream database table increases or decreases the field, you can Synchronize these changes downstream and provide users with an option to pause the task, ignore the change, or apply the change.

In addition, about 5% of IT system failures have no cause, and can be recovered naturally by restarting. The Data Pipeline platform also supports automatic restart mode. But in a distributed system, frequent restarts in some cases can make the situation worse. Therefore, Data Pipeline will preset some rules to tell the user that it is not recommended to restart automatically when encountering certain failures, and should stop to deal with the failure.

There are dozens of preset policy configurations and restriction configurations on data nodes, data links, and data tasks, which can effectively help users deal with various known or unknown situations that may occur during the execution of data links.

In addition to improving fault tolerance in the system preset strategy, the Data Pipeline real-time data fusion platform uses a distributed engine. The system components and computing components have undergone rigorous high-availability testing, supporting component-level high availability, ensuring overall fault tolerance, and at the same time. Dynamically and flexibly expand and shrink, allowing different computing tasks to be redistributed to different machines without affecting the operation of other parts.

When a node has a problem, the remaining nodes can take over the corresponding tasks and continue to run. When the middle node is re-registered when it is restored, two different strategies can be selected at this time. If the load of the other two machines is already very high and a new node is registered, a rebalance is required. These six tasks are redistributed evenly. Another strategy is that the burden of running the two nodes is still within the ideal range, and it can continue to run without rebalancing, but waiting until new tasks are generated before being allocated, because rebalancing also consumes system resources. .

On the basis of distributed clusters, Data Pipeline supports the method of dividing independent resource groups to ensure that high-priority tasks can run stably and efficiently. For example, some data tasks related to customers have high requirements for data processing efficiency. You can divide a resource group independently, and prevent other integration tasks from occupying system resources.

Based on log CDC technology, Data Pipeline breaks all kinds of complicated data technical barriers, solves multi-source heterogeneous problems through various mappings of data semantics, and helps companies break the barriers to data integration caused by the adoption of different database technologies. Through data nodes, data links, data tasks and system resources to ensure that different roles can effectively divide and cooperate on the platform. Through the low-code configuration method, the agility of the data link, especially the real-time data link, can be improved, and a real-time data link can be constructed within 5 minutes. Finally, through various preset rules and the high fault tolerance of the distributed architecture, the stable and normal operation of the entire system is guaranteed.