Authors: Krishna Puttaswamy, Suresh Srinivas
Powered by data

Uber is revolutionizing how the world moves by powering billions of rides and deliveries, connecting millions of riders, businesses, restaurants, drivers, and couriers. At the heart of this massive transportation platform are the big data and data science that underpin everything Uber does, such as better pricing and matching, fraud detection, more accurate estimated times of arrival (ETAs), and experimentation. Every day, petabytes of data are collected and processed, and thousands of users rely on this data to analyze and make decisions that build and improve these products.
Problems brought by scale

While we were able to scale our data systems, we previously did not pay enough attention to several important data problems, and as we grew they became even more pressing. They include:
Data duplication: Some key data and metrics lack a single source of truth, which leads to duplication, inconsistency, and a great deal of confusion for consumers. Consumers must take time away from solving business problems to do extensive due diligence to compensate. The problem is exacerbated by the hundreds of thousands of datasets created with self-service tools, since we cannot easily tell which of them matter most.
Discovery problems: Without rich metadata and faceted search, discovering data among hundreds of thousands of datasets is difficult. Poor discovery results in duplicate datasets, duplicate work, and inconsistent answers, depending on which data is used to answer a question.
Disconnected tools: Data flows through many tools, systems, and organizations, but our tools did not integrate with one another, resulting in duplicated effort and a poor developer experience. For example, documentation and ownership information had to be copied and pasted between tools, and developers could not confidently modify a data schema because it was unclear how it was consumed downstream.
Inconsistent logging: Logging on mobile devices was done manually, and logs had no uniform structure, so we could not measure actual user behavior in a simple, consistent way; we could only infer it, which is inefficient and error-prone.
Missing processes: The lack of cross-team data engineering processes resulted in varying levels of maturity across teams and no consistent definitions of, or metrics for, data quality.
Missing ownership and SLAs: Datasets had no clear owners. They often came with no quality guarantees, and their SLAs for bug fixes, on-call support, and incident management fell far short of how we manage our services.
These problems are not unique to Uber. Based on our conversations with engineers and data scientists at other companies, they are common, especially at companies that are growing very fast. Because service failures and outages are immediately visible, we tend to focus on services and their quality and pay less attention to data and the tools around it. But at scale, it is extremely important to address these problems with the same rigor we apply to service tooling and management, especially when data plays a key role in product functionality and innovation, as it does at Uber.
A holistic data solution is needed
The following diagram shows the high-level data flow from mobile apps and services into the data warehouse and on to the final consumption surfaces. Initially we treated only the symptoms of data problems where they appeared in the data flow, without addressing the underlying causes. We recognized the need for a holistic approach that solves these problems at their root, once and for all. Our goal was to reorganize the data logging systems, tools, and processes to gradually improve data quality across Uber. We brought together teams spanning the end-to-end data flow stack, including engineers and data scientists from every part of it, and ultimately modified more than 20 existing systems.
Principles for handling data
Data as code: Data should be treated as code. The creation, deprecation, and critical changes of data artifacts should go through a design review process, with appropriate written documentation written from the consumer's perspective. Schema changes must be assigned reviewers who approve them before they land. Reusing or extending existing schemas takes precedence over creating new ones. Data artifacts have tests associated with them, and we run those tests continuously. These are the practices we already apply to API services, and we should be equally rigorous about data.
Data has an owner: Data is code, and all code must have an owner. Each data artifact must have a clear owner and a clear purpose, and must be deprecated when it has outlived its use.
Data quality is known: Data artifacts must have data quality SLAs, as well as incident reporting and management, just like our services. Owners are responsible for upholding these SLAs.
Accelerate data productivity: Data tools must be designed to improve collaboration between producers and consumers, with owners, documentation, and reviewers where necessary. Data tools must integrate seamlessly with related tools so that the necessary metadata flows without extra effort. Data tools should meet the same developer-grade bar as service tools: the ability to write and run tests before changes land, to test changes in a pre-production environment before promoting them to production, and good integration with the existing monitoring and alerting ecosystem.
Organize for data: Teams should aim for a "full-stack" setup, so the necessary data engineering talent can take a long-term view of the entire lifecycle of the data. While core teams own the more complex datasets, most teams that generate data should aim for local ownership. We should have the necessary training materials and prioritize training engineers to be reasonably proficient in data production and consumption practices. Finally, team leaders should be accountable for the ownership and quality of the data their teams produce and use.
Uber's data governance practices

Data must have well-defined quality metrics. We standardized on the following:
Freshness: The time delay between data generation and the data reaching 99.9% completeness in the target system, including a completeness watermark (set to 39 seconds by default), because optimizing freshness alone without regard to completeness leads to low-quality decisions.

Completeness: The ratio of the number of rows in the target system to the number of rows in the source system.

Data duplication: The percentage of rows with a duplicate primary key or unique key; raw data tables default to 0% duplication, while a small amount of duplication is allowed in modeled tables.

Cross-datacenter consistency: The percentage of data loss when comparing a copy of a dataset in the current data center with its replica in another data center.

Semantic checks: Key properties of a data field, such as null/not-null, uniqueness, the percentage of distinct values, and the range of values.
We developed automated ways to generate "tiering reports" for organizations, showing which datasets need to be tiered, usage of tiered data, and so on, as a measure of an organization's "data health." We also track these metrics as an "engineering excellence" criterion. As adoption and feedback grow, we iterate on the exact definitions and measurement methods to improve them further.
These definitions are not enough on their own; we must automate them and make them easy to use and apply. We therefore combined several existing data quality tools into a single tool that implements these definitions. Where it makes sense, we generate tests automatically (for raw data, i.e., data dumped into the warehouse from Kafka topics, we can automatically generate four of the test categories, all except semantic tests), and we simplify test creation by minimizing the input required from dataset owners. These standard checks provide a minimal set of tests for every dataset, and the tool gives producers the flexibility to create new tests with just one SQL query. We learned many interesting lessons, including how to run these tests at scale with low overhead, how to simplify the abstractions for building a suite of tests for a dataset, when to schedule tests to reduce false positives and noisy alerts, and how to apply these tests to streaming datasets; we hope to cover them in future articles.
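To make the standard checks concrete, here is a minimal sketch, in Python, of three of the quality metrics defined above as pure functions. This is our illustration, not Uber's actual tool; the function names, signatures, and the use of in-memory rows (rather than SQL over warehouse tables) are assumptions for the example.

```python
from collections import Counter

def completeness(target_rows: int, source_rows: int) -> float:
    """Completeness: ratio of rows in the target system to rows in the source system."""
    if source_rows == 0:
        return 1.0  # nothing to copy counts as fully complete
    return target_rows / source_rows

def duplication(keys: list) -> float:
    """Duplication: fraction of rows whose primary/unique key appears more than once."""
    if not keys:
        return 0.0
    counts = Counter(keys)
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / len(keys)

def is_fresh(lag_seconds: float, watermark_seconds: float = 39.0) -> bool:
    """Freshness: lag until the data is 99.9% complete must stay under the watermark."""
    return lag_seconds <= watermark_seconds

# Raw tables default to 0% duplication; modeled tables may tolerate a little.
assert duplication(["a", "b", "c"]) == 0.0
assert duplication(["a", "a", "b", "c"]) == 0.5
```

In the real tool these checks would be compiled into SQL and scheduled against each dataset, but the semantics are the same as these small functions.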
1. Databooks and metadata
Basic metadata: for example, documentation, ownership information, pipelines, the source code that generated the data, sample data, lineage, and artifact tier

Usage metadata: statistics about who uses this data and when, popular queries, and artifacts used together

Quality metadata: tests on the data, when they ran, which tests passed, and the aggregate SLAs the data provides

Cost metadata: the resources used to compute and store the data, including monetary cost

Bugs and SLAs: bugs filed against artifacts, incidents, recent alerts, and the overall SLA in responding to issues raised to owners
To achieve this, we radically overhauled the backend and UI of Databook, our internal metadata catalog. We standardized the metadata vocabulary to make it easy to add new metadata properties to existing entities, designed for extensibility so new entity types can be defined with minimal effort, and integrated most of our key tools with the system, publishing their metadata to this central place and connecting data assets, tools, and users. The improved UI is clearer and makes it easier for users to filter and narrow down to the data they need. As a result of these improvements, tool usage has increased dramatically.
After digging deeper into the mobile frameworks, we built a standardized logging framework that captures, in a uniform structure, what users actually see and do in the app.
Using the data generated by this logging framework, we greatly simplified funnel analysis of rider behavior. We set up a dashboard in a matter of hours that previously could have taken weeks. This data now backs many experiment-monitoring and other dashboards that help us understand user behavior.
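The kind of funnel analysis that uniformly structured logs enable can be sketched as follows. This is a toy illustration under our own assumptions (not Uber's pipeline); the event names and the per-user event sets are hypothetical.

```python
# Hypothetical ordered funnel steps for a rider session.
FUNNEL = ["app_open", "destination_entered", "ride_requested", "trip_completed"]

def funnel_counts(events_by_user: dict) -> list:
    """events_by_user maps user id -> set of logged event names.
    Returns (step, users_reaching_step), counting a user for a step
    only if they also logged every earlier step."""
    counts = []
    for i, step in enumerate(FUNNEL):
        n = sum(
            1 for evts in events_by_user.values()
            if all(s in evts for s in FUNNEL[: i + 1])
        )
        counts.append((step, n))
    return counts

users = {
    "u1": {"app_open", "destination_entered", "ride_requested", "trip_completed"},
    "u2": {"app_open", "destination_entered"},
    "u3": {"app_open"},
}
assert funnel_counts(users)[0] == ("app_open", 3)
assert funnel_counts(users)[-1] == ("trip_completed", 1)
```

Without a uniform log structure, each step would require bespoke parsing per app surface; with it, the funnel is a simple aggregation like this.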
When we started Data180, there were many metrics codebases in the company. We evaluated their pros and cons and standardized on a codebase called uMetric. In fact, it is more than a codebase: it has advanced features such as letting users focus on a YAML-format definition while it saves work by generating queries for different query systems such as Hive, Presto, and Spark, building streaming and batch pipelines for metrics, automatically creating data quality tests, and more. This system is gaining wider adoption, and we are investing to strengthen it further. We are automating the detection of duplicate and near-duplicate metrics, integrating the system with Databook and other data consumption surfaces so consumers can consume metric results directly instead of copying and running metric SQL (hand-tuned SQL is more error-prone and leads to duplicated metrics), improving self-service, catching errors before they become incidents, and more. This standardization has greatly reduced duplication and confusion on the consumption side.
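The "define once in YAML, generate queries per engine" idea can be sketched minimally. This is our assumption of what such a generator might look like, not uMetric's real API; the metric fields, the `to_sql` function, and the per-engine date expressions are illustrative (a dict stands in for the parsed YAML).

```python
# Stand-in for a parsed YAML metric definition (hypothetical fields).
METRIC = {
    "name": "completed_trips",
    "table": "trips",
    "aggregation": "COUNT(*)",
    "filter": "status = 'completed'",
    "time_column": "request_at",
}

def to_sql(metric: dict, engine: str) -> str:
    """Render one declarative metric definition into SQL for a given engine.
    The engines share the core query; only the date expression differs here."""
    day_expr = {
        "hive": "to_date({c})",
        "presto": "date({c})",
        "spark": "to_date({c})",
    }[engine].format(c=metric["time_column"])
    return (
        f"SELECT {day_expr} AS day, {metric['aggregation']} AS {metric['name']} "
        f"FROM {metric['table']} WHERE {metric['filter']} GROUP BY 1"
    )

print(to_sql(METRIC, "presto"))
```

The payoff is that consumers never hand-copy metric SQL: one definition yields consistent queries for every engine, which is exactly what removes the duplication described above.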
Cross-tool integration: We integrated our tools so that once documentation, ownership, and other key metadata are set in a source tool, they flow seamlessly to all downstream tools. We also integrated pipeline tools with the standard alerting and monitoring tools, so alerts for services and data pipelines are generated and managed consistently.
Our future work includes:

Making fundamental improvements to the tools, enabling more automation for the different data quality checks and adding more integrations to reduce manual effort

Enhancing the application logging framework to capture still more of what users actually "see" and "do" in the app

Improving the processes and tools that support collaboration between producers and consumers

Implementing lifecycle management for data assets, so unused and unneeded artifacts are removed from the system

Applying the above principles further in the daily data development workflows of engineers and data scientists
Krishna Puttaswamy is a senior engineer at Uber. He has worked on various data and experimentation problems on the Marketplace team. The work described in this blog solves real problems he faced when applying data to improve Uber's apps and services. He currently leads DataNG and a project to rewrite the experimentation platform. He previously worked on data and machine learning at Airbnb and LinkedIn.