After leaving Alibaba in 2019, I have been running my own business for more than three years. The main product I have developed is called CloudCanal. Counting from the start of my career in data integration, it has now been six or seven years, and I have gained some insights and experience along the way. Recently, thanks to a change in work, I finally have time to summarize them. This series is divided into three chapters:
- Industry Insights: Analyze the software that makes up the data ecosystem, rethink what problems it really solves and what role data integration software plays within it, and share some observations on competition in this field and where it is headed.
- Technical Chapter: Both at Alibaba and since starting my own business, I have worked in data integration for many years. In this chapter, I summarize some key technical issues in data integration and considerations about code implementation.
- Entrepreneurial Insights: After three years of entrepreneurship, I have learned a great deal. This chapter covers thoughts and reflections on infra-software entrepreneurship.
The Data Software Ecosystem
Why should entrepreneurs in the field of data integration pay attention to the entire data software ecosystem?
Data integration software is essentially a member of the data software ecosystem. Focusing only on data integration itself brings several problems:
- It is easy to form a one-sided view: Data integration software is not only a member of the data software ecosystem but also has very close connections with other software in it. Focusing only on data integration and ignoring the influence of other data software (such as databases) leads to one-sided, short-sighted views.
- Unable to see essential issues or future trends: History is a mirror that helps us understand change over time. Starting from the entire data software ecosystem and asking what problems all of this software is trying to solve helps us grasp the essential issues and make predictions about the future.
From the law of entropy increase to the orderly organization of data.
In the past, people often defined data software with classification labels (such as AP/TP databases, search engines, data governance tools, data development tools, ETL tools, etc.). Such classification easily traps us in concrete concepts and keeps us from thinking about the essence of the problem. If we can set these labels aside and look at the essence, we both deepen our understanding of this data ecosystem software and break through the limitations of label-based thinking. Let us use first-principles thinking to re-examine the issue.
The essence of many things can be explained by basic natural laws. The law of entropy increase in thermodynamics tells us that, without external energy intervention, an isolated system always increases in entropy until it reaches a completely disordered state. Data can be understood as the matter of the digital world: objective “materials” that exist in this large digital system. Without intervention, data also tends toward complete disorder. In his book “What is Life?”, Erwin Schrödinger wrote that life feeds on negative entropy; in a sense, the meaning of human existence is to constantly resist entropy increase. We establish laws so that the world operates in an orderly way, and we reorganize raw materials into new materials, tools, and equipment that improve our quality of life. The same applies to the digital world: we must fight entropy increase so that disordered data returns to order, because only well-organized, ordered data brings real value to humanity. The essential work of the modern data software ecosystem is to fight entropy increase by organizing data in an orderly way, which I call data normalization.
The relationship between the data software ecosystem and data normalization.
We fight the increase of data entropy so that we can extract value from well-organized data; this process can be understood as data normalization. The development of the data software ecosystem is in fact a history of the development of data normalization. Next, let us use an analogy to understand how data normalization has developed.
The initial state of data can be understood as a primitive tribe. Primitive tribes were inefficient at obtaining food, making weapons, and hunting. This is like the early organization of data, when people recorded data on paper: the data was persisted, but information spread slowly and the data had only limited value.
As humans developed, primitive tribes evolved into tribal clans. Compared with primitive tribes, clans gathered members of the same clan together and had more rules among different tribes: young men were trained to assemble weapons uniformly before hunting, while women were responsible for raising children. Clans thus developed far more efficiently than primitive tribes. This stage of data normalization is analogous: we use tables and Excel instead of paper to record data, producing more orderly datasets in which similar features are grouped together, allowing greater value to be extracted than ever before.
After long evolution, tribal clans became modern villages with much higher production efficiency. In a modern village there are administrative units responsible for administrative work, and weapon stores or breakfast shops run by villagers, bringing great improvements in production efficiency, living standards, and happiness. For data normalization, the birth of relational databases in the 1970s was the equivalent of forming modern villages: queries and operations on data gained good abstractions, and relational databases let people manage and use data far more easily and efficiently than ever before.
As time went by, modern villages began urbanizing and many modern cities appeared: some industrial cities, some financial centers. Data normalization developed similarly. Databases were no longer limited to relational databases; many new databases targeting different usage scenarios appeared, and these various databases became the main means of achieving data normalization.
More about data software and data views
The main urban area is the core and important component of a city, but a modern metropolis definitely has more than just one main urban area. Suburbs, satellite cities, surrounding industrial areas, and economic development zones are all important components.
Data normalization essentially aims to obtain a reorganized data view. In this article, “data view” is a more general concept than a database view: it means the new view of the data after normalization. It can be the result of a SQL query or a chart produced by data visualization. To obtain this final data view quickly and effectively, it is often impossible to rely on the database alone. Other software in the data ecosystem fills these gaps very well, making the entire end-to-end process of data normalization smoother. If you are an entrepreneur in data ecosystem software, it is particularly important to consider what blank or underdeveloped areas remain in this end-to-end process. The following summarizes how non-database software in the data ecosystem assists data normalization:
- Data Development and Governance: Through a series of governance standards, rules, and supporting tools, the efficiency, accuracy, and security of data normalization are improved. Data development and modeling tools reduce the difficulty of normalization, while data governance and data security ensure the safety and accuracy of the resulting data view.
- Data Application and Visualization: The value of the normalized data view is realized through better presentation. Data visualization software presents the results of normalization, while BI tools and other applications maximize the final value of the new data views.
- Data Processing: To obtain the expected results from a normalized dataset, proper computation is crucial. Higher demands on processing have given rise to many more specialized computing engines, in forms ranging from databases to independent products. The paradigms for this work keep evolving, from map-reduce to stream computing, but they all essentially serve faster data normalization.
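As a toy illustration of the map-reduce paradigm mentioned above (purely a sketch of the idea, not tied to any real engine's API):

```python
# Minimal in-process map-reduce sketch: count words across "partitions".
# Illustrative only; a real engine distributes these phases across machines.
from collections import Counter
from functools import reduce

partitions = [["a", "b", "a"], ["b", "c"]]   # data split into chunks

mapped = [Counter(p) for p in partitions]     # map: local counts per chunk
total = reduce(lambda x, y: x + y, mapped)    # reduce: merge partial counts
# total == Counter({"a": 2, "b": 2, "c": 1})
```

The same shape applies at any scale: independent local work in the map phase, then an associative merge in the reduce phase.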
At root, data ecosystem software ultimately serves data normalization. Understanding what problems each piece solves in terms of data normalization helps us see the direction of the future, and for entrepreneurs in this field, that understanding can guide their actions.
The position of data integration software in the data software ecosystem.
Let us continue the city analogy to discuss the role of data integration software in the data software ecosystem. Data integration plays the role of the bridges and roads between cities. Bridges and roads connect cities with different characteristics, accelerating development through trade and communication and generating greater value. In terms of data normalization, data integration software acts as the bridges and roads that connect isolated islands of data, so they can be organically combined for exchange and fusion, providing richer and more accurate data views.
In today’s new era, when we discuss data integration software, we are actually most concerned with its Extract and Load capabilities, including historical data migration as well as CDC. This core responsibility is fundamental to the existence of data integration software.
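The two core duties named above — a full historical migration followed by applying CDC change events — can be sketched in a few lines. This is a hypothetical illustration (the dict-based "target" and the event shape are my own invention, not CloudCanal's or any product's API):

```python
# Sketch of the two core duties of data integration software:
# (1) full historical migration (Extract + Load), then
# (2) replaying CDC change events to keep the target in sync.
# All structures here are illustrative, not any real product's API.

def full_migration(source_rows, target):
    """Copy an existing snapshot of the source into the target."""
    for row in source_rows:
        target[row["id"]] = row

def apply_cdc_events(events, target):
    """Replay captured change events (insert/update/delete) in order."""
    for ev in events:
        if ev["op"] in ("insert", "update"):
            target[ev["row"]["id"]] = ev["row"]
        elif ev["op"] == "delete":
            target.pop(ev["row"]["id"], None)

source = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
target = {}
full_migration(source, target)
apply_cdc_events(
    [{"op": "update", "row": {"id": 2, "name": "b2"}},
     {"op": "delete", "row": {"id": 1}}],
    target,
)
# target now holds only the up-to-date row with id 2
```

Real systems must additionally handle the handover point between the snapshot and the change stream, ordering guarantees, and failure recovery — that is where most of the engineering difficulty lives.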
The rise of data integration software.
Overall, data integration software has developed slowly compared with the database ecosystem. It was only after 2010, driven by big data, new databases, the mobile internet, the digital economy, and other forces, that software focused on data integration capabilities gradually came into view. Domestically there are DataPipeline, DSG, Tapdata, CloudCanal, and others, while overseas there are Debezium, Fivetran, StreamSets, Stitch, Airbyte, and others. There are reasons why data integration software developed slowly in the past but has become increasingly important in recent years.
- Database software itself needed to develop first: Since databases emerged, data normalization has been their top priority, and for decades countless enterprises and computer scientists have worked to improve their theoretical foundations and engineering practice. Data integration naturally matters less before databases have developed to a certain extent, just as proper roads and bridges are not built before villages have formed. Now that database software is relatively mature, data integration software can catch up as the weak link in the data software ecosystem.
- The rich database ecosystem promotes data integration: Data normalization is a big problem; the sheer volume of heterogeneous data and the variety of application scenarios mean that no one-size-fits-all database can emerge. Many directions of data application even conflict directly in their technical implementation, to say nothing of the many competitors in the market. Hoping for a one-size-fits-all database is like hoping Bolt could sprint world records and also play football like Messi. The database ecosystem is rich precisely because different databases excel at different things: a small business can manage its data with standalone MySQL; processing geographic information may require a database with good support for geographic data types; searching detailed information calls for a search engine such as Elasticsearch. Moreover, as the era develops, databases tailored to new scenarios will very likely keep emerging. These objective factors are opportunities for data integration software, and the time is now ripe.
- Private domain data: In the past, data may have lived only in databases. With the development of the internet, the forms data takes are now very diverse: there is a great deal of private domain data across the internet, such as personal data in social media software, enterprises’ own data inside advertising systems, and even IoT device data. Data integration software is indispensable for breaking down these private data islands and integrating them.
- Digitalization and cloud computing promote the circulation of data: I have always believed the future belongs to the cloud. The digital economy has driven a large number of enterprises through digital transformation, and the steadily rising cloud penetration rate has made data integration software an indispensable part of the modern data stack.
The genes of data integration software
The field of data integration has only begun to develop rapidly in recent years, so many entrepreneurs here are still feeling their way forward, and their understanding of and focus on the industry differ with their backgrounds, or “genes.” Different genes affect how products grow and whether they ultimately succeed in the market. For commercial products in a given field, the core functions are generally similar; at a superficial glance many features look the same, but deep use reveals they are completely different, because different companies invest resources differently even in the same features. In the long run, these differences accumulate in each product’s subsequent capability layout. For basic software in data integration, therefore, it is important to have industry insight and observe these genes. Below I share some important genetic differences I have seen among data integration software.
Seeing T differently in ETL.
Fivetran and Airbyte are both integration software that weaken the role of T, making the entire data integration process as automated, smooth, and user-friendly as possible. They proposed the concepts of ELT and EL respectively, which have received very positive feedback from the market. This is critical, because how you view T makes a big difference in the features and user experience of data integration. If you look closely at StreamSets’ capabilities, you will find its product strength is actually quite strong; so why is it sometimes less popular with investors than Airbyte, which was founded only in 2020? The main reason is different genes: StreamSets has not given up on T and provides a canvas for orchestrating processors, strengthening T’s role in the product. In product strength, StreamSets certainly does not lose to Fivetran or Airbyte, yet customers and investors may prefer the others. Financing can be seen in funding rounds and valuations, and market popularity can be glimpsed through Google Trends.
Customers and investors appear to approve of the weakening of T. Of course, weakening T is only one of the important factors behind the trust Fivetran and Airbyte have earned from customers and investors, not the only one. Weakening T essentially means giving up some capabilities and some customers so that the data integration product can focus on where they believe the real value lies. Subtraction has never been easy; people usually pursue “more.” Personally, I also believe weakening T is the future, mainly because:
- T raises the threshold for using data integration software: giving up T may lose some users, but users who want EL will have an excellent experience because the product is simple, stable, and efficient. Less is more.
- Introducing T increases system complexity and error probability and reduces performance: abandoning T allows the technical architecture of data integration to be simplified further, with more room to optimize performance, making the EL process smoother and more stable.
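The complexity argument above can be made concrete with a toy contrast: an EL pipeline is a straight copy, while inserting a T stage adds a failure mode that can break the whole load. This is an illustrative sketch of the trade-off, not any product's design:

```python
# Toy contrast: ETL transforms inside the pipeline; EL lands raw data
# and leaves transformation to the target system. Hypothetical example.

def etl(rows, transform, target):
    """Transform in the pipeline: one bad row can abort the whole batch."""
    out = [transform(row) for row in rows]   # may raise mid-batch
    target.extend(out)

def el(rows, target):
    """Straight copy: simpler, fewer failure modes, easier to optimize."""
    target.extend(rows)

rows = [{"amount": "3"}, {"amount": "oops"}]

raw = []
el(rows, raw)        # always succeeds; bad data is handled downstream

clean = []
try:
    etl(rows, lambda r: {"amount": int(r["amount"])}, clean)
except ValueError:
    pass             # the in-pipeline transform aborted the entire load
```

After the failed `etl` call, `clean` is still empty, while `raw` holds everything: the EL side moved the data regardless, which is exactly the stability and simplicity the bullet points describe.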
The core objectives are different.
The different core objectives result in completely different product forms. Some examples will be listed below.
Fivetran, for example, firmly embraces cloud data warehouses and SaaS, mainly targeting data consumers (data analysts, data scientists). Under this objective, its gains and losses are as follows:
- Can take advantage of the cloud and be more efficient: By focusing solely on SaaS, it can fully leverage the advantages of cloud infrastructure, which is more flexible and cost-effective. Its own implementation can also be deeply integrated with the cloud, using serverless services, ECS, and high-performance cloud storage.
- Reduced support costs and resource investment: With no privately deployed or open-source version, there are fewer headaches from those channels, and no need to maintain a community or provide extra support for customers’ various heterogeneous network environments.
- Can focus more on the cloud experience: Embracing cloud data warehouses enables better collaboration with cloud vendors and a focus on making the connectors for cloud data warehouses as smooth as possible, while support for some open-source and non-cloud data warehouses can be dropped or reduced.
- Beneficial for business: Focusing on SaaS and cloud data warehouses is advantageous for commercialization. Identifying commercially valuable users early allows targeted sales and technical support, and users willing to adopt SaaS and cloud data warehouses often have good payment habits.
- Insufficient connector validation: There are many corner cases in connecting heterogeneous data sources. Without a large number of users helping to validate them, connectors take a very long time to mature.
- The abandoned users and markets may hold huge value: this era is still in a transitional period of digital transformation, and abandoning non-cloud users and integration scenarios outside the data warehouse may mean missing many business opportunities.
- Additional compliance investment for SaaS: Enterprises care deeply about data privacy and security, and SaaS requires more resources to stay compliant.
Airbyte’s core ideas are very clear, mainly:
- Fully embrace open source and become an open-source EL(T) standard
- Target users include both data consumers and data engineers
- Support both private deployment and SaaS
- Abandon the volume-based pricing strategy of the past, making it more affordable.
As a latecomer, Airbyte clearly has deep insight into the industry. The market Fivetran gave up, and the customers who are not yet firmly committed, are exactly what Airbyte aims to capture. With the power of open source and developers, Airbyte can expand its connectors at lower cost, and customers who care especially about data privacy and security, as well as data engineers, will lean toward Airbyte.
The degree of focus on the track is different.
How much different enterprises focus on the data integration track is also an important genetic difference. Whether a company focuses on data integration itself reflects the entrepreneur’s conviction about the track. As far as I can observe, overseas companies are relatively steadfast in this pursuit. This of course includes me: I still firmly believe data integration is an indispensable part of future data ecosystem software and infrastructure. I classify the level of focus into several categories:
- Low: No dedicated commercial data integration product, but their commercial products include data integration capabilities. Such enterprises obviously will not invest all their effort in data integration; it is more of an auxiliary feature. Popular domestic data platform solutions, standalone databases, and data management software all have some data integration capability, but their makers may specialize in databases or data platform solutions, and in the data integration field they are not yet real players.
- Medium: There is a dedicated commercial data integration product, but it is only one product in a larger matrix. These entrepreneurs have seen the value of data integration software itself but have not gone all-in on the track.
- High: These companies are all-in on data integration, often sharing their name with their data integration product. These entrepreneurs believe the time is ripe and the data integration track is still an underdeveloped blue ocean: modern data infrastructure must include data integration software, in a very important position.
The level of focus on the track and the degree of conviction about its future constantly influence a company’s decisions as it grows, ultimately affecting its success or failure. This is why many software products on the market have data integration capabilities, yet there are not actually many real players.
User profiles are different.
Players in the data integration track profile their users differently, and different user profiles shape the capability layout and user experience of the whole product. The contrast between Fivetran and Airbyte makes this especially clear. Generally, the users of data integration software fall into two groups:
- Data consumers: These users are end-users without strong technical or development skills; they are generally data analysts, data scientists, finance staff, and so on. For them, the barrier to using the product must be low enough. On the other hand, they are not very sensitive to data latency: Fivetran, which mainly targets data consumers, can relax CDC latency to minutes without hurting their actual experience.
- Data engineers: These users usually have technical and development skills and demand more from the functionality, flexibility, and performance of data integration software. A data engineer may operate the integration software while also building more business-oriented real-time platforms on top of it. This group cares more about controlling the data software itself, so private deployment or an open-source kernel wins their favor.
In the domestic market, the data engineer group is relatively large; whether there is a market for data consumers in China still needs to be thought through and explored.
The current state of the data integration track.
- The track still has huge potential: As discussed earlier in the development of the data software ecosystem, the data integration track only began to gain momentum after 2010. Many problems remain to be solved in this field, and a large number of innovations are waiting to be discovered.
- There is still plenty of room for innovation in data integration software: The data software ecosystem has kept changing over the past decade. The rise of big data, cloud computing, cloud-native technology, and AI, and the emergence of new databases, have injected new vitality and uncertainty into the ecosystem. Data integration software has also kept evolving, from offline batch processing to real-time CDC (change-data-capture), and from ETL (extract-transform-load) to ELT (extract-load-transform) and reverse ETL. Playing the role of bridges and roads, data integration develops in step with the whole data ecosystem, and environmental change creates the conditions for its innovation.
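The ELT and reverse-ETL patterns named above can be pictured with a toy sketch: land raw events in the "warehouse" without transformation, derive a view inside it, then sync that view back to an operational system. All structures and names here (the dict-based warehouse, the CRM target) are hypothetical illustrations:

```python
# Toy illustration of ELT followed by reverse ETL.
# "Warehouse" and "CRM" are plain dicts; all names are hypothetical.

warehouse = {"raw_events": [], "user_spend": {}}
crm = {}  # operational system that reverse ETL writes back to

def el_load(events):
    """EL: land raw events in the warehouse without transformation."""
    warehouse["raw_events"].extend(events)

def transform_in_warehouse():
    """The T of ELT: derive a per-user spend view inside the warehouse."""
    spend = {}
    for ev in warehouse["raw_events"]:
        spend[ev["user"]] = spend.get(ev["user"], 0) + ev["amount"]
    warehouse["user_spend"] = spend

def reverse_etl():
    """Reverse ETL: push the derived view back into the operational CRM."""
    for user, total in warehouse["user_spend"].items():
        crm[user] = {"lifetime_spend": total}

el_load([{"user": "u1", "amount": 10}, {"user": "u1", "amount": 5},
         {"user": "u2", "amount": 7}])
transform_in_warehouse()
reverse_etl()
# crm now holds lifetime spend for u1 and u2
```

The point of the shape, not the code, is what matters: transformation moves out of the pipeline and into the warehouse, and the pipeline gains a new return direction from the warehouse back to operational tools.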
The future of data integration track
Data integration software will certainly be an important component of the modern data stack in the future. If you disagree, discussion is welcome.