Full text: 7,198 words; estimated reading time: 19 minutes.
In 2021, we saw a considerable acceleration around the rise of modern data stacks. We now have a tsunami of newsletters, influencers, investors, dedicated websites, conferences, and events to promote it. The concept around the modern data stack, albeit still in its early stages, is closely tied to the explosive growth of data tools in the cloud. Cloud computing brings a new infrastructure model that will help us build these data stacks quickly, programmatically, and on-demand, using cloud-native technologies like Kubernetes, infrastructure as code like Terraform, and cloud best practices for DevOps. As a result, infrastructure becomes a key factor in building and implementing a modern data stack.
As we enter 2022, it is clear that best practices from software engineering have begun to infuse data work: data quality monitoring and observability, specialization across the ETL layers, data exploration, and data security all thrived in 2021 and will continue to do so as data-driven companies, from early-stage startups to multibillion-dollar Fortune 500 firms, keep moving data storage and processing into databases, cloud data warehouses, data lakes, and data lakehouses.
Below you’ll find 5 data trends we forecast to establish or accelerate in 2022.
If 2020 and 2021 were about the rise of the data engineer (according to Dice’s tech job report, the fastest-growing job in tech in 2020), then in 2022 the analytics engineer will clearly step into the spotlight.
The rise of cloud data platforms has changed everything. Traditional architectures, such as OLAP cubes and monolithic data warehouses, are giving way to more flexible and scalable data models. Moreover, transformation can now take place on all data inside the cloud platform itself: ETL has largely been replaced by ELT. And who owns this transformation logic? The analytics engineer.
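To make the ELT idea concrete, here is a minimal, hedged sketch in Python (in practice this logic would typically live in SQL models managed by dbt; the table and column names here are invented): raw records are loaded first, and the transformation into an analytics-ready model happens afterwards, inside the platform.

```python
# Minimal sketch of an ELT-style transformation: raw data is loaded
# first (the "EL" step), then transformed into an analytics-ready
# model. In practice this logic would usually be a SQL model managed
# by dbt; the table and column names are invented for illustration.

raw_orders = [  # as loaded: untyped strings, unfiltered rows
    {"order_id": "1", "amount": "19.99", "status": "completed"},
    {"order_id": "2", "amount": "5.00",  "status": "cancelled"},
    {"order_id": "3", "amount": "42.50", "status": "completed"},
]

def transform(rows):
    """The 'T' of ELT: cast types, filter, and derive metrics."""
    completed = [r for r in rows if r["status"] == "completed"]
    return {
        "completed_orders": len(completed),
        "total_revenue": round(sum(float(r["amount"]) for r in completed), 2),
    }

orders_summary = transform(raw_orders)
print(orders_summary)  # {'completed_orders': 2, 'total_revenue': 62.49}
```

The point of the pattern is that the raw data lands in the warehouse untouched, so the transformation can be re-run, versioned, and tested like any other piece of software.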
The rise of this role can be attributed directly to the rise of cloud data platforms and of dbt (data build tool). dbt Labs, the company behind dbt, effectively created the role. The dbt community started with five users in 2018; as of November 2021, it counted 7,300.
The analytics engineer is a natural evolution: data engineering will likely split into several more specialized engineering roles, driven by engineers who build self-service data platforms rather than pipelines or reports.
Analytics engineers first appeared at cloud-native companies and startups such as Spotify and Deliveroo, but have recently begun to gain a foothold at enterprises such as JetBlue. You can read an article here by the Deliveroo engineering team on the emergence and evolution of analytics engineering in their organization.
We are seeing more and more modern data teams add analytics engineers as they become increasingly data-driven and build self-service data pipelines. According to LinkedIn job postings, typical must-have skills for an analytics engineer include SQL, dbt, Python, and tools from the modern data stack (such as Snowflake, Fivetran, Prefect, and Astronomer).
Job Postings as of December 1, 2021
According to LinkedIn, demand for data scientists is about 2.6 to 2.7 times that for analytics engineers, and the gap continues to close.
In 2022, we expect this gap to narrow even further, as the demand for analytics engineers continues to grow, approaching the demand for data scientists (once known as the sexiest job in tech).
The showdown between data warehouses and data lakes
Few in the data community missed the very public showdown between Databricks and Snowflake at the end of 2021. It all started when Databricks claimed the TPC-DS benchmark record for its lakehouse technology, citing a study showing it was 2.5 times faster than Snowflake. Snowflake responded that the study was flawed and that Databricks’ claims lacked integrity.
We don’t have to go back many years to when Snowflake and Databricks were emerging cloud software startups so friendly that their sales teams often passed customer leads to each other. That has all changed now, with Snowflake accusing Databricks of using shady marketing to win attention. Tens of billions of dollars in potential future revenue are at stake. Ali Ghodsi, CEO and co-founder of Databricks, noted in a statement how Snowflake and Databricks coexist in many customers’ data stacks.
“What we’re seeing is that more and more people now feel like they can actually use their data in the data lake for data warehousing workloads with us. And these could be workloads that would otherwise go to Snowflake.”
Data warehouse vendors are moving from existing patterns to convergence of data warehouse and data lake patterns. Similarly, vendors that started their journey at the data lake are now expanding into the data warehouse space. We can see the convergence of both sides happening.
So just as Databricks has made its data lake look more like a data warehouse, Snowflake has been making its data warehouse look more like a data lake. In short, a data lakehouse is a platform designed to combine the benefits of both: in marketing terms, it offers converged workloads for data science and analytics use cases. Databricks leans on the term in its marketing materials, while Snowflake prefers the term “data cloud.”
But does the data lakehouse mean the end of the data warehouse? The lakehouse is a new, open data management architecture that combines the flexibility, cost-effectiveness, and scale of a data lake with the data management and ACID transactions of a data warehouse, enabling business intelligence and ML on all of an organization’s data.
We have heard such predictions before. Back in 2012, experts at Strata + Hadoop World claimed that data lakes would kill data warehouses (startups rejected SQL and used Hadoop; SQL was considered somewhat inferior at the time, for reasons that seem ridiculous today). That death never happened.
Will the data warehouse be scrapped in 2022, now that newer concepts are paired with technological innovations in cloud computing and converged workloads?
Time will tell, but the field is heating up and we expect more open showdowns in 2022. Other startups in the space, such as Firebolt, Dremio, and ClickHouse, have all recently raised significant rounds, pushing their valuations above $1 billion.
As Ali Ghodsi puts it, the evolving data storage and warehousing space is not going to be a winner-take-all market.
“I think Snowflake is going to be very successful, I think Databricks is going to be very successful… You’ll see other top companies pop up, I’m sure, in the next three to four years. It’s just a huge market, and it makes sense for a lot of people to focus on pursuing it.”
According to Bill Inmon, long considered the father of data warehousing, data lakehouses offer an opportunity similar to the early days of the data warehouse market. A lakehouse can “combine the data science focus of the data lake with the analytics capabilities of the data warehouse.”
Data Warehouse vs. Data Lake vs. Data Lakehouse, by Striim
Data lakehouse vs. data warehouse (vs. data lake) is still an ongoing debate. The choice of data architecture should ultimately depend on the type of data the team is working with, where that data comes from, and how stakeholders will use it.
As the data warehouse vs. data lakehouse debate intensifies in 2022, it’s important to separate hype and marketing jargon from reality.
As Matt Turck observed in his 2021 MAD Landscape analysis, real-time feels like the technological paradigm that is perpetually about to explode. As we enter 2022, the trade-offs we hear about still come down to cost and complexity. If a company is building a cloud data warehouse and needs impact within 4-6 weeks, the prevailing sentiment still seems to be that a streaming pipeline is overkill compared to a batch pipeline; and if a company is at the beginning of its data journey, streaming is pure overkill.
At Validio, we expect this perception to change in the coming years as technology in the real-time space continues to mature and cloud hosting continues to evolve. Many use cases, such as fraud detection and dynamic pricing, struggle to get value without real-time processing.
As cloud service providers continue to improve their streaming tools, data-led organizations are moving toward building large-scale streaming platforms. It’s a shift Ali Ghodsi alludes to as well.
“If you don’t have a real-time stream processing system, you have to handle things like: well, the data is coming in every day, I’m going to put it here, I’m going to add it over there. So how do I check it? What if some data is late? I need to join two tables, but that table isn’t here yet. So maybe I’ll wait a bit and run it again.” (Ali Ghodsi on a16z)
Kafka has been a solid streaming engine for the past 10 years. Heading into 2022, we’re seeing companies increasingly turn to cloud-hosted engines like Amazon’s Kinesis and Google’s Pub/Sub.
Zombie dashboards are a very concrete example of why this streaming/real-time movement is happening only gradually. They seem to have become a very real thing in modern data-driven companies, as Ananth Packkildurai (founder of Data Engineering Weekly) discusses in this Twitter thread.
For many companies, operational analytics is a good starting point to start their move towards real/near real-time analytics. As Bucky Moore, a partner at Kleiner Perkins, discussed in his recent blog post:
“Cloud data warehouses are designed to support business intelligence use cases, which amount to large queries that scan an entire table and summarize the results. This is ideal for analyzing historical data, but queries that ask ‘what is happening now?’ are increasingly popular for driving real-time decision making. This is what operational analytics refers to. Examples include in-app personalization, churn prediction, inventory forecasting, and fraud detection. Unlike business intelligence, operational analytics queries join many different data sources, require real-time data ingestion and query performance, and must be able to handle many queries concurrently.”
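As a rough illustration of the difference Moore describes, an operational query keeps an incrementally updated aggregate over a recent time window rather than rescanning full history. The event shape, window size, and the fraud-style counting rule below are all invented for this sketch:

```python
from collections import deque

# Sketch of an operational-analytics style computation: a sliding
# 60-second window of events per user, updated on every arrival --
# the kind of always-fresh aggregate a fraud-detection rule needs.
# Event fields and the window size are invented for illustration.

WINDOW_SECONDS = 60

class SlidingCounter:
    def __init__(self):
        self.events = deque()  # (timestamp, user_id), oldest first

    def add(self, ts, user_id):
        self.events.append((ts, user_id))
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= ts - WINDOW_SECONDS:
            self.events.popleft()

    def count(self, user_id):
        return sum(1 for _, u in self.events if u == user_id)

counter = SlidingCounter()
for ts, user in [(0, "a"), (10, "a"), (30, "b"), (65, "a")]:
    counter.add(ts, user)

# At t=65 the event from t=0 has expired out of the 60-second window.
print(counter.count("a"))  # 2
```

A batch BI query would instead scan the whole event table on a schedule; the operational version answers “what is happening now?” on every new event.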
As McKinsey noted back in 2020, the cost of real-time data messaging and streaming pipelines has dropped dramatically, paving the way for mainstream use. McKinsey further predicts that by 2025, the way end users generate, process, analyze, and visualize data will be vastly altered by new and more pervasive technologies, such as Kappa or Lambda architectures for real-time analytics, leading to faster and more powerful insights. They argue that with the falling cost of cloud computing and the introduction of more powerful in-memory data tools (e.g., Redis, Memcached), even the most sophisticated advanced analytics can reasonably be made available to all organizations.
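The Lambda architecture McKinsey mentions can be sketched, under simplified assumptions, as a batch layer of precomputed historical aggregates plus a speed layer covering events since the last batch run, with the two views merged at query time. The metric names and numbers below are invented:

```python
# Minimal sketch of a Lambda architecture: a batch layer holds
# precomputed historical aggregates, a speed layer holds counts for
# events that arrived since the last batch run, and queries merge
# the two views. Metric names and numbers are invented.

batch_view = {"page_views": 10_000, "signups": 120}  # nightly batch job
speed_view = {"page_views": 37, "signups": 2}        # since last batch

def query(metric):
    """Serve a metric by merging the batch and speed layers."""
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(query("page_views"))  # 10037
```

A Kappa architecture, by contrast, drops the batch layer entirely and recomputes everything from a replayable event log, which is why it needs a durable streaming backbone such as Kafka.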
It can’t be said objectively whether streaming data is becoming more critical than batch data as we enter 2022, because companies and use cases differ enormously. Chris Riccomini, for example, designed a hierarchy of data pipeline progression: he believes data-driven organizations move through this evolutionary sequence as their pipelines mature.
We make no prediction about whether progression through these pipeline maturity levels will become more common; some argue that streaming pipelines are almost always overkill.
However, we are seeing more and more companies invest in real-time infrastructure as they move from making decisions on historical data alone to making decisions on both real-time and historical data. Good indicators of this trend are Confluent’s explosive IPO and newer products such as ClickHouse and Materialize, as well as Apache Hudi, which brings real-time capabilities to data lakes.
The timeliness of data, that is, moving from batch-based periodic architectures to more real-time ones, will become an increasingly important competitive factor as every modern company becomes a data company. We expect this to accelerate further in 2022.
In the data infrastructure space, the product-led growth (PLG) trend has been underway for several years, as usage-based pricing, open source, and software affordability have pushed purchasing decisions down to end users. However, compared to the traditional sales-led go-to-market model, product-led growth and usage-based pricing can be complex to implement and execute from a business-model and product perspective. Cloud marketplaces run by AWS, GCP, and Azure are becoming the best first step for businesses moving toward this digital-sales future.
As developer-tools companies, including startups in the modern data stack, make PLG initiatives at various levels (freemium plans, free tiers, and free trials) more or less the norm, we are also witnessing the rise of cloud marketplaces as the preferred channel for modern data teams to adopt new technologies. This is largely due to the frictionless, consumer-like buying experience they offer (think the Apple App Store or Google Play Store), and because data teams can draw on spend already committed to their cloud vendors when adopting new technologies through the marketplace.
For the world’s leading cloud companies, the cloud marketplace is now a go-to-market necessity, not an option. These figures, both realized and predicted, illustrate why.
Enterprise committed spending through the big three cloud providers exceeds $250 billion per year – and that number is climbing fast.
In 2021 alone, independent software vendors generated more than $3 billion in revenue through cloud marketplaces, a figure Bessemer predicts will grow many times over in the next few years.
Forrester had projected that 17% of the $13 trillion in global B2B spend would flow through e-commerce and marketplaces by 2023, but that number may already have been reached in 2021.
A 2020 Tackle survey found that 70% of software vendors said they had increased their focus on and investment in marketplaces as a go-to-market channel in the wake of COVID-19.
More than 45% of the Forbes Cloud 100 companies actively use cloud marketplaces as a distribution channel for their software.
The explosive growth of the cloud market is mainly due to the mutual advantages they offer to modern data teams and data infrastructure technology vendors.
A recent Gartner study predicts that by 2025, nearly 80% of sales interactions will take place through digital channels. Distributing technology through the GCP, AWS, or Azure marketplaces is becoming a natural entry point for modern data teams. Modern data stack companies such as Astronomer and Fivetran have found success as early marketplace adopters. Others, such as CrowdStrike, have seen sales cycles shorten by nearly 50 percent.
Buying behavior has changed completely, and modern data teams expect consumer-grade experiences in their work lives. They want to discover, try, and even buy new data infrastructure technologies in a low-touch, product-led way. The cloud marketplace is becoming these teams’ access point for exploring new technologies, just as the Apple App Store and Google Play Store are our access points for exploring new everyday services and entertainment.
Startups providing modern data infrastructure tools can borrow obvious patterns from our consumer lives to remove friction, scale sales more effectively, and help data teams get to value faster.
We expect that in 2022, cloud marketplaces will become the preferred way for modern data teams to adopt modern data stack technologies. Given how much of the modern data stack concept itself emerged from the explosive growth of the cloud and its new infrastructure, it makes sense that the cloud marketplace would be the natural entry point.
It has been incredible to watch the data quality space within the modern data stack go from a niche category in 2020 to a full explosion over the last 18 months, with a total of $200 million flowing into the space in 2021. Even G2, in its recent “What Is Happening in the Data Ecosystem in 2022” article, pointed out that 2022 will be the year of data quality, noting a sharp and unusual increase in traffic to the data quality category in 2021.
In the context of modern cloud data infrastructure, the rise of the data quality category makes a great deal of sense. Data quality is the foundation of any modern data-driven company, whether for plain reporting, business intelligence, operational analytics, or advanced machine learning. Moreover, according to the 2022 State of Data Engineering survey, data quality and validation were the number-one challenge cited by respondents, primarily data engineers. 27% of respondents were unsure what data quality solution, if any, their organization uses; for organizations with low DataOps maturity, that number jumps to 39 percent.
However, the explosive growth of data quality technologies has also had negative side effects. With the rapid proliferation of modern data quality tools, we also see a lot of inconsistent and overlapping terminology in the field. As Bessemer’s authors point out, players in the data quality space have coined terms borrowed from application performance monitoring, such as “data downtime” (a play on “application downtime”) and “data reliability engineering” (a play on “site reliability engineering”).
There are now countless ways to describe the important but somewhat complex processes that can be defined as data quality validation and monitoring. We see terms such as data observability, data reliability, data reliability engineering, data quality monitoring, “Datadog for data,” real-time data quality monitoring, data downtime, unknown data failures, and silent data failures used interchangeably and inconsistently.
In their current state, most data quality tools in the modern data stack focus on monitoring pipeline metadata or running SQL queries against static data in the warehouse, some tied to varying levels of data context or root cause analysis.
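Whatever label a vendor uses, the underlying checks are often simple in principle. Below is a minimal sketch, with invented column names and thresholds, of the kinds of rules such tools evaluate against warehouse data: null rate, freshness, and row volume.

```python
from datetime import datetime, timedelta, timezone

# Sketch of the kinds of checks a data quality / observability tool
# runs against a warehouse table: null rate, freshness, and row count.
# Column names and thresholds are invented for illustration.

def null_rate(rows, column):
    """Fraction of rows where the column is missing."""
    return sum(1 for r in rows if r.get(column) is None) / len(rows)

def is_fresh(latest_ts, now, max_lag=timedelta(hours=1)):
    """True if the newest record is within the allowed lag."""
    return now - latest_ts <= max_lag

rows = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": None},
    {"user_id": 3, "email": "c@example.com"},
    {"user_id": 4, "email": "d@example.com"},
]

now = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)
latest = datetime(2022, 1, 1, 11, 30, tzinfo=timezone.utc)

report = {
    "email_null_rate": null_rate(rows, "email"),  # 0.25
    "fresh": is_fresh(latest, now),               # True: 30 min lag
    "row_count_ok": len(rows) >= 4,
}
print(report)
```

Where vendors differ is mostly in what these checks run against (pipeline metadata vs. the data itself), how thresholds are learned, and how alerts are routed, which is exactly where the terminology gets muddled.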
A piece of software now marketed as a data observability tool might focus only on data lineage, or only on monitoring pipeline metadata. A tool that provides real-time data quality alerts but cannot monitor streaming pipelines may be marketed as a real-time data quality monitoring tool. A tool that only runs SQL queries against warehouse data might be labeled an end-to-end data reliability tool, while a tool that monitors pipeline metadata might be labeled a data quality monitoring tool (and vice versa). The list goes on, and the inconsistency breeds confusion in the market and among end users.
The data quality category in the 2020 MAD landscape compared to the 2021 landscape
This terminology problem goes beyond data quality and extends to the entire modern data stack.
One of the strongest early indicators of an emerging industry is a proliferation of new terminology used inconsistently. As a concrete example, most of us think of Shopify or WordPress when someone says “ecommerce platform” or “CMS platform,” and we have a clear idea of what such a tool does. But when someone working in the data world hears terms like “operational analytics,” “data lakehouse,” or “data observability,” they may find it difficult to spell out exactly what those terms mean and/or include. This is often directly related to the fact that many of them were coined, through category creation, by companies breaking new ground with specific technologies. Interestingly, even the hottest data terms, such as “modern data stack,” lack a consistent definition in the data world; likewise, terms such as “data mesh” and “data fabric” are often used to describe new data architectures.
As actual users layer the technology onto their stacks and build use cases, the industry will eventually help shape the definition of specific tools and architectural patterns.
In 2022, as modern data stacks and data quality categories mature, we also expect to see harmonization and consistency in how terminology is used.