Problems with scaling
While we were able to scale our data systems, we previously didn't pay enough attention to important data issues that became even more pressing as we scaled up, including:
Data duplication: Some key data and metrics lacked a single source of truth, which led to duplication, inconsistency, and a great deal of confusion when using them. Consumers must take time away from solving business problems to do due diligence to compensate for this. The problem is exacerbated by the hundreds of thousands of datasets created with self-service tools, since we can't easily tell which of them matters most.
Discovery problems: Without rich metadata and faceted search, discovering data among hundreds of thousands of datasets is difficult. Poor discovery results in duplicate datasets, duplicate work, and inconsistent answers (depending on which data is used to answer a question).
Disconnected tools: Data flows through many tools, systems, and organizations, but our tools didn't integrate with each other, resulting in duplicated effort and a poor developer experience. For example, documentation and owner information had to be copied and pasted between multiple tools, and developers couldn't confidently modify a data schema because it was unclear how it was being used downstream.
Logging inconsistencies: Logging on mobile devices was done manually; there was no uniform structure for logs, so we couldn't measure actual user behavior in a simple, consistent way, only infer it (which is inefficient and error-prone).
Missing processes: The lack of cross-team data engineering processes resulted in varying levels of maturity across teams and no consistent data quality definitions or metrics shared between them.
Missing ownership and SLAs: Datasets didn't have clear owners; they often had no quality guarantees, SLAs for bug fixes were inconsistent, and incident management and on-call support were far from the way we manage our services.
These problems aren't unique to Uber; based on our conversations with engineers and data scientists at other companies, they're common, especially for companies that are growing very fast. Because service failures and outages are immediately visible, we tend to focus more on services and service quality and less on data and the tools around it. But at scale, it's extremely important to address these issues with the same rigor we apply to service tooling and management, especially if data plays a key role in product functionality and innovation, as it does at Uber.
Holistic data solutions
The following diagram shows the high-level data flow from mobile applications and services to the data warehouse and the final consumption surfaces. We had initially only tackled the symptoms of data problems where they surfaced in the data stream, without addressing the underlying issues. We recognized the need for a holistic approach that would address these problems at their root causes once and for all. Our goal was to reorganize the data logging systems, tools, and processes to step-change data quality across Uber. We brought together teams spanning the end-to-end data flow stack, including engineers and data scientists from all parts of it, and ultimately modified more than 20 existing systems.
To keep the holistic thinking concrete, we took key "slices" of data, trip and session information from the Rider app, and tried to build a source of truth (SoT) for them: fixing the logging in the app, the data processing tools, and the data itself, and establishing the processes needed to maintain them as SoTs.
Basic principles of data processing
Unlike services, which try to hide data and expose narrow interfaces, offline data in a warehouse is about exposing data from related services and domains so it can be analyzed together. One of our key realizations was that to do this well, we need to solve not only the data tooling problem, but also the people and process sides of data. So we propose the following guidelines:
Data as code: Data should be treated as code. The creation, deprecation, and critical changes of data artifacts should go through a design review process, with appropriate written documentation produced from the consumer's perspective. Schema changes must have assigned reviewers who approve them before they land. Reuse or extension of existing schemas takes precedence over creating new ones. Data artifacts have tests associated with them, which should run continuously. These are the practices we already apply to service APIs, and we need to be equally rigorous about data.
Data has an owner: Data is code, and all code must have an owner. Every data artifact must have a clear owner and a clear purpose, and should be deprecated when it has served that purpose.
Data quality is known: Data artifacts must have data quality SLAs, as well as incident reporting and management, just as we have for our services. The owner is responsible for upholding these SLAs.
Accelerate data productivity: Data tools must be designed to improve collaboration between producers and consumers, with owners, documentation, and reviewers where necessary. Data tools must integrate seamlessly with related tools so that the necessary metadata flows without extra thought. Data tools should meet the same developer-grade bar as service tools: the ability to write and run tests before changes land, to test changes in a staging environment before production, and good integration with the existing monitoring and alerting ecosystem.
Organize for data: Teams should aim for a "full-stack" setup, so the necessary data engineering talent can take a long-term view of the entire data lifecycle. While core teams may own the more complex datasets, most teams that generate data should aim for local ownership. We should have the necessary training materials and prioritize training engineers to be reasonably proficient in data production and consumption practices. Finally, team leaders should be accountable for the ownership and quality of the data their teams produce and use.
Problems addressed
In the rest of this article, we'll highlight some of the most useful and interesting takeaways from our experience with this program.
Data quality and tiering
We have endured a lot of toil due to poor data quality. We have seen examples where inaccurate experiment measurements required substantial manual labor to validate and correct the data, greatly reducing productivity. As it turns out, this problem is becoming more common with the adoption of big data: according to studies by IBM and the Harvard Business Review (HBR), businesses suffer large negative impacts from bad data.
To reduce this toil and its adverse business impact, we wanted to develop a common language and framework for discussing data quality, so that anyone could produce or consume data with consistent expectations. To achieve this, we developed two main concepts: standard data quality checks and dataset tiers.
Data quality is a complex topic with many aspects that deserve in-depth study, so we'll limit the discussion to the areas where we've made significant progress and leave the others for later. The environment in which Uber generates and uses data played an important role in choosing which areas of data quality to focus on; some of these carry over to other settings, and some do not. At Uber, common questions for data producers and consumers include: How do you trade off analyzing the latest data against analyzing complete data? If pipelines run in parallel in different datacenters, how do we reason about data consistency across datacenters? What semantic quality checks should run on a given dataset? We wanted a set of checks that provide a framework for answering these questions.
Data Quality Checks
After several iterations, we arrived at the five main types of data quality checks described below. Every dataset must come with these checks and a default SLA configured:
Freshness: The time delay between data generation and when the data reaches 99.9% completeness in the target system, including a completeness watermark (the default setting is 39 seconds), because optimizing freshness alone without considering completeness leads to low-quality decisions.
Completeness: The ratio of the number of rows in the target system to the number of rows in the source system.
Duplicates: The percentage of rows with a duplicate primary or unique key, defaulting to 0% duplicates in raw data tables while allowing a small number of duplicates in modeled tables.
Cross-datacenter consistency: The percentage of data loss when comparing a copy of a dataset in the current datacenter to a replica in another datacenter.
Semantic checks: Capture key properties of a data field, such as null/not-null, uniqueness, the percentage of distinct values, and the range of values.
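To make the first few checks concrete, here is a minimal sketch (not Uber's actual implementation) of how completeness, duplicates, and cross-datacenter consistency can be computed as simple ratios over row counts and keys:

```python
from collections import Counter

def completeness(target_rows: int, source_rows: int) -> float:
    """Ratio of rows that arrived in the target system vs. the source."""
    return target_rows / source_rows if source_rows else 1.0

def duplicate_pct(primary_keys: list) -> float:
    """Percentage of rows whose primary key appears more than once."""
    if not primary_keys:
        return 0.0
    counts = Counter(primary_keys)
    dup_rows = sum(c for c in counts.values() if c > 1)
    return 100.0 * dup_rows / len(primary_keys)

def cross_dc_loss_pct(dc_a_keys: set, dc_b_keys: set) -> float:
    """Percentage of rows present in one datacenter but missing in the other."""
    union = dc_a_keys | dc_b_keys
    if not union:
        return 0.0
    return 100.0 * len(dc_a_keys ^ dc_b_keys) / len(union)

# A raw table defaults to a 0% duplicates SLA; this one would fail it.
print(duplicate_pct(["a", "b", "c", "c"]))  # 50.0 (2 of 4 rows share a key)
```

In practice these ratios would be computed by warehouse queries rather than in-memory, but the SLA comparison is the same: each check yields a number that is compared against the dataset's configured threshold.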
Dataset owners can choose to provide different SLAs, with appropriate documentation and justification for consumers. For example, depending on the nature of a dataset, one may sacrifice completeness for freshness (as with streaming datasets). Similarly, consumers can choose how to consume datasets based on these metrics, for instance running pipelines on completeness-based triggers rather than simple time-based triggers.
We are continuing to investigate more complex checks, including consistency across related datasets and anomaly detection on top of the time-series checks described above.
In addition to quality measures, we needed a way to tag datasets with different levels of business importance, so the most important data is easy to highlight. For services we do this by assigning tiers based on business importance, and we applied the same idea to data. Tiers help determine the impact of an outage and provide guidelines on which data may be used for which purposes. For example, data that impacts compliance, revenue, or branding should be labeled Tier 1 or Tier 2. Temporary data created by users for ad-hoc, less important queries is marked Tier 5 by default and can be deleted after a fixed period if unused. A dataset's tier also determines the severity of incidents filed against it and the SLA for fixing bugs. A by-product of tiering is a systematic inventory of the data assets we rely on for business-critical decisions. Another benefit is the explicit deduplication of datasets that are similar or are no longer a source of truth. Finally, the visibility that tiering provides helped us restructure datasets for better modeling, consistent data granularity, and appropriate normalization levels.
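The link between a dataset's tier and how incidents and lifecycle are handled can be sketched as a small policy table. The tier numbers follow the text above, but the priority labels, SLA hours, and deletion rule are illustrative assumptions, not Uber's actual values:

```python
# Hypothetical tier policy table; values are illustrative only.
TIER_POLICY = {
    1: {"incident_priority": "P0", "bug_fix_sla_hours": 24, "auto_delete_unused": False},
    2: {"incident_priority": "P1", "bug_fix_sla_hours": 72, "auto_delete_unused": False},
    5: {"incident_priority": "P4", "bug_fix_sla_hours": None, "auto_delete_unused": True},
}

def policy_for(tier: int) -> dict:
    """Look up handling rules for a dataset's tier; unknown tiers
    fall back to the strictest policy rather than the loosest."""
    return TIER_POLICY.get(tier, TIER_POLICY[1])

# Ad-hoc Tier 5 data is eligible for automatic cleanup when unused.
print(policy_for(5)["auto_delete_unused"])  # True
```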
We developed automated methods to generate "tiering reports" for organizations, showing which datasets need to be tiered, the usage of tiered data, and so on, as a measure of an organization's "data health." We also track these metrics as an "engineering excellence" standard. As adoption and feedback grow, we iterate on the precise definitions and measurement methods to improve them further.
Data quality tools
Having these definitions is not enough if we don't automate them and make them easy to apply. We consolidated several existing data quality tools into a single tool that implements these definitions. Where it makes sense, we generate tests automatically (for raw data, i.e., data dumped into the warehouse from Kafka topics, we can auto-generate four of the check types, all except semantic checks), and we simplify test creation by minimizing the input needed from the dataset owner. These standard checks provide a minimal test suite for every dataset, and the tool gives producers the flexibility to create new tests with just a SQL query. We learned many interesting lessons along the way, including how to scale these tests with low overhead, how to simplify the abstractions for building a test suite for a dataset, when to schedule tests to reduce false positives and noisy alerts, and how to apply these tests to streaming datasets, which we hope to cover in future articles.
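The "new test from just a SQL query" idea can be sketched as follows: the producer supplies a query that returns violating rows, and the test passes when the query returns nothing. This is an assumed shape, shown here with sqlite3 and a hypothetical `trips` table, not the actual tool's interface:

```python
import sqlite3

def run_sql_test(conn, name: str, violation_query: str) -> bool:
    """A test is a SQL query returning the rows that violate an
    expectation; the test passes when the result set is empty."""
    violations = conn.execute(violation_query).fetchall()
    if violations:
        print(f"FAIL {name}: {len(violations)} violating rows")
        return False
    print(f"PASS {name}")
    return True

# Demo with an in-memory table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_id TEXT, fare REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)",
                 [("t1", 12.5), ("t2", 7.0), ("t3", -1.0)])

# A semantic check the dataset owner writes as a single SQL query:
run_sql_test(conn, "fare_non_negative",
             "SELECT trip_id FROM trips WHERE fare < 0")
```

Framing tests as "queries for violations" keeps the abstraction uniform: the standard checks and owner-written semantic checks can share the same scheduling, alerting, and reporting machinery.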
Databook and metadata
As mentioned earlier, we have hundreds of thousands of datasets and thousands of users. If we count other data assets (reports, machine learning features, metrics, dashboards), the number of assets we manage is much larger. We want to ensure that: a) consumers use the right data to make decisions, and b) producers make informed decisions to improve their data, prioritize bug fixes, and so on. To do this, we need a single catalog that collects metadata for all data assets and serves users the right information based on their needs. In fact, we realized that poor discovery had created a vicious cycle in which producers and consumers built duplicate, redundant datasets that were later abandoned.
We want to provide users with detailed metadata about each data artifact (table, column, metric):
Basic metadata: such as documentation, ownership information, pipelines, the source code that generated the data, sample data, lineage, and the artifact's tier
Usage metadata: statistics about who uses this artifact and when, popular queries, and artifacts used together
Quality metadata: which tests exist for the artifact, when they ran, which passed, and the aggregated SLA the data provides
Cost metadata: the resources used to compute and store the data, including monetary cost
Bugs and SLAs: bugs filed against the artifact, incidents, recent alerts, and the overall SLA for responding to issues
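One way to picture a catalog entry that bundles these facets is a single record per artifact. The field names below are assumptions for illustration, not Databook's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ArtifactMetadata:
    """Illustrative shape of one catalog entry; field names are
    hypothetical, not Databook's real schema."""
    name: str
    owner: str                 # basic metadata
    tier: int
    documentation: str = ""
    upstream_lineage: list = field(default_factory=list)
    popular_queries: list = field(default_factory=list)   # usage metadata
    tests_passed: Optional[int] = None                    # quality metadata
    tests_total: Optional[int] = None
    monthly_cost_usd: Optional[float] = None              # cost metadata
    open_bugs: int = 0                                    # bugs and SLAs

entry = ArtifactMetadata(name="trips", owner="rider-data-team", tier=1)
print(entry.name, entry.tier)
```

Keeping all facets on one record is what lets a single search index answer both consumer questions ("is this data trustworthy?") and producer questions ("what does this cost, and who depends on it?").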
Building this single metadata catalog, keeping it accurate enough to answer questions about ownership, and providing a powerful user interface with context-based search and discovery are critical to enabling collaboration between producers and consumers, reducing the effort of using data, and improving overall data quality.
To achieve this goal, we’ve made a radical overhaul of the backend and UI of the internal metadata catalog Databook. We’ve standardized the metadata vocabulary to make it easy to add new metadata properties to existing entities, designed extensibility to easily define new entity types with minimal effort, and integrated most of our key tools into the system and published their metadata to this central location, connecting various data assets, tools, and users. The improved user interface is clearer and makes it easier for users to filter and narrow down the data they need. As a result of these improvements, tool usage has increased dramatically. We cover these changes in detail in this blog: Turning Metadata Into Insights with Databook.
Application context logs
To understand and improve the product, it is crucial that our applications log the actual user experience. We wanted to measure the user experience rather than infer it, but each team had its own custom logging, resulting in inconsistencies in how the user experience was measured. We wanted to standardize logging across teams and across the application, and even "platformize" logging, so that developers can build product features without having to think about how to log the necessary information: what was shown to the user, the state of the application during the interaction, the type of interaction, and its duration.
After digging deeper into the mobile frameworks Uber uses to build apps, we realized that our mobile app development framework, RIB (previously open-sourced), already has a natural structure built in that can provide critical information about the state of the app as users interact with it. Automatically capturing the RIB hierarchy tells us the state of the application and which RIBs (which can roughly be treated as components) are currently active; different screens in the application map to different RIB hierarchies.
Based on this insight, we developed a library to capture the current RIB hierarchy, serialize it, and automatically attach it to every analytics event fired by the application. In the back-end gateway that receives these events, we implemented a lightweight mapping from the RIB hierarchy to a flexible set of metadata, such as screen names and the names of stages in the app. This metadata can evolve independently, allowing producers and consumers to add more information without depending on mobile app changes (which are slow and costly due to weeks-long build and release cycles). The gateway appends this extra metadata to the serialized state on analytics events before writing them to Kafka. The mapping on the gateway is also available through an API so that warehouse jobs can backfill data as the mapping evolves.
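A rough sketch of this flow, with made-up RIB names and an invented mapping table (the real serialization format and gateway API are not shown in the source):

```python
# Hedged sketch: serialize an active component chain into a compact
# path string, then enrich events on the gateway side. RIB names and
# the mapping table are illustrative, not Uber's actual ones.

def serialize_hierarchy(active_ribs: list) -> str:
    """Join the active RIB chain, root to leaf, into one string."""
    return "/".join(active_ribs)

# Gateway-side mapping from serialized hierarchy to flexible metadata;
# this table can evolve without a mobile release.
SCREEN_MAPPING = {
    "Root/LoggedIn/Home": {"screen_name": "home", "app_stage": "browse"},
    "Root/LoggedIn/Home/RequestRide": {"screen_name": "request", "app_stage": "request"},
}

def enrich_event(event: dict) -> dict:
    """Attach screen metadata to an analytics event by its RIB path."""
    meta = SCREEN_MAPPING.get(event.get("rib_path"), {})
    return {**event, **meta}

event = {"type": "tap",
         "rib_path": serialize_hierarchy(["Root", "LoggedIn", "Home"])}
print(enrich_event(event)["screen_name"])  # "home"
```

Because the mapping lives server-side, renaming a screen or adding a funnel stage is a config change on the gateway, and the same mapping can be replayed over historical events to backfill the warehouse.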
Beyond the core work above, there were other problems we had to address that we won't detail here, such as optimizing the serialized RIB hierarchy to reduce the analytics payload size, making the mapping efficient, keeping the mapping correct as the application changes (via a custom test framework), correctly mapping RIB trees to states, standardizing screen and state names, and so on.
While this library doesn’t completely solve all the logging problems we’re trying to solve, it does provide a structure for logs that makes a lot of analysis easier, as described below. We are iterating on this library to solve other problems raised.
Rider Funnel Analysis
Using the data generated by the logging framework above, we were able to greatly simplify funnel analysis of rider behavior. We set up a dashboard in a matter of hours, which could have taken weeks in the past. This data is currently supporting many experiment monitoring and other dashboards that allow us to understand user behavior.
Metric standardization
When we started Data180, there were many metric codebases in the company. We evaluated the pros and cons of these solutions and standardized on a codebase called uMetric. In fact, it's more than a codebase: it has advanced features such as letting users focus on a YAML-format definition while it saves a lot of work by generating queries for different query systems such as Hive/Presto/Spark, building streaming and batch pipelines for metrics, automatically creating data quality tests, and more. This system is gaining wider adoption, and we are investing to strengthen it further. We are automating the detection of duplicate and near-duplicate metrics, integrating the system with Databook and other data consumption interfaces so consumers can consume metrics directly instead of copying and running metric SQL (hand-tuned SQL is error-prone and leads to metric duplication), improving the self-service experience, detecting errors before incidents occur, and more. This standardization has greatly reduced duplication and confusion on the consumption side. The system is described in detail in this blog: The Journey Towards Metric Standardization.
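The core idea of "define once, generate queries for every engine" can be sketched like this. The metric fields and the per-engine differences shown are assumptions for illustration, not uMetric's actual design:

```python
# Hedged sketch of metric standardization: one declarative definition
# rendered as SQL for multiple engines. Field names and dialect
# handling are illustrative assumptions.

METRIC = {
    "name": "completed_trips",
    "table": "trips",
    "aggregation": "COUNT",
    "filter": "status = 'completed'",
}

def to_sql(metric: dict, engine: str) -> str:
    """Render one metric definition as SQL; in this toy version the
    engines differ only in identifier quoting."""
    quote = "`" if engine == "hive" else '"'
    return (f"SELECT {metric['aggregation']}(*) AS "
            f"{quote}{metric['name']}{quote} "
            f"FROM {metric['table']} WHERE {metric['filter']}")

for engine in ("hive", "presto", "spark"):
    print(engine, "->", to_sql(METRIC, engine))
```

Because every consumer renders from the same definition, two dashboards can never silently disagree on what "completed_trips" means, which is exactly the duplication problem the standardization targets.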
Other tool and process changes
In addition to the changes listed above, we implemented several other tool and process changes to improve our data culture. Here's a quick overview:
Shared Data Model: To avoid redefining schemas of the same concept (which is common), we’ve improved the schema definition tool to allow importing and sharing existing types and data models. We are building additional features and processes to drive the adoption of shared data models and reduce the creation of duplicate and nearly duplicate data models.
Mandatory code reviewers and unit tests for mobile analytics: We reorganized the schemas of mobile analytics events and allow producers and consumers to add themselves as mandatory reviewers, so changes cannot land without proper review and notification. We also built a mobile log testing framework to ensure that data tests run at build time.
Enforcing ownership: We improved the underlying data tools and interfaces for data generation (schema definition, Kafka topic creation, pipeline creation, metric creation, dashboard creation, and more) to require ownership information whenever the owner cannot be inferred automatically. Ownership information is further standardized in a single company-wide service that tracks teams and organizations, not just individual creators. This change prevents new ownerless data from being added. We also run heuristics to assign owners to "orphaned" datasets whose owner is missing or has left the company, putting us on track for 100% ownership coverage.
Cross-tool integration: We integrated tools so that once documentation, ownership, and other critical metadata are set in the source tool, they flow seamlessly to all downstream tools. We also integrated pipeline tools with the standard alerting and monitoring tools, so alerts for services and data pipelines are generated and managed consistently.
We started with the hypothesis that thinking holistically about data, considering the end-to-end data flow across people and systems, would improve overall data quality. We believe this effort has produced strong evidence in support of that hypothesis. However, the initial work was only the beginning of our journey toward a better data culture. Building on its success, we rolled the program out to different Uber organizations and apps: project teams focus on tiering, building sources of truth, and improving data quality and data SLAs, while platform teams continue to improve the tools described above and more. The two groups work together to improve processes and build a strong data culture at Uber. For example, some of the work in progress includes:
Fundamental improvements to the tools, enabling more automation to support the different data quality checks and more integrations to reduce toil
Enhancing the logging framework to capture more visual information about what users actually "see" and "do" in the application
Improving processes and tools to better support collaboration between producers and consumers
Implementing lifecycle management of data assets to remove unused and unnecessary artifacts from the system
Applying the above principles further to the daily data development workflows of engineers and data scientists
We hope to share more lessons as we progress toward a better data culture.
About the author
Krishna Puttaswamy is a senior engineer at Uber. He works on various data and experimentation problems on the Marketplace team. The work described in this blog addresses real problems he faced when applying data to improve Uber's apps and services. He currently leads DataNG and a project to rewrite the experimentation platform. He previously worked on data/machine learning at Airbnb and LinkedIn.
Suresh Srinivas is an architect for the data platform, focused on enabling users to successfully realize value from Uber's data; the work described in this blog post is part of that effort. Prior to Uber, he co-founded Hortonworks, a company built around Apache open source projects to bring the Hadoop ecosystem to the enterprise. Suresh is a long-time contributor to Apache Hadoop and related projects and a member of the Hadoop PMC.
This article is reproduced from: AI Frontline
Original link: https://eng.uber.com/ubers-journey-toward-better-data-culture-from-first-principles/