OLAP Technical Architecture
The Impala architecture shown in the figure above illustrates the core modules of OLAP technology: the data model, the storage format, and the data processing architecture.
The data model layer mainly solves the problem of data transmission. It serializes and deserializes data and provides remote invocation (such as RPC), which enables cross-platform, multi-language data transmission and communication between client and server.
Traditional cross-language communication schemes:
WebService based on the SOAP message format
RESTful services based on the JSON message format
Distributed, large-data-volume scenarios call for a different kind of cross-language communication scheme.
The following figure shows the protocol stack structure: structured data is transmitted by being parsed, sent, and received.
You may wonder why there are so many communication protocols and communication frameworks. Isn’t JSON alone good enough? In practice, when you want to store data in a file or send it over the network, you tend to go through several evolutionary stages:
1. You use the serialization built into your programming language, such as Java serialization, Ruby’s marshal, or Python’s pickle.
2. You realize that being locked into one programming language is bad, so you switch to a widely supported, language-agnostic format such as JSON.
3. You find that JSON is too verbose, too slow to parse, and unable to distinguish integers from floating-point numbers, and that you would like binary strings as well as Unicode strings. So you move to a binary format that resembles JSON.
4. You find that people put all sorts of random fields with inconsistent types into their objects, and that you really need a schema and some documentation. Maybe you are also using a statically typed language and want to generate model classes from the schema. You also realize that JSON-like binary formats are not actually that compact, because the field names are stored over and over again. With a schema, you can avoid storing the field names in each object, which saves more bytes.
If you reach stage four, your choice will generally be Thrift, Protocol Buffers, or Avro. All three provide efficient, cross-language data serialization using a schema, plus code generation for Java developers.
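The byte savings that a schema brings can be illustrated with a minimal Python sketch. This is not Thrift, Protocol Buffers, or Avro: the record, the names `SCHEMA_FIELDS` and `SCHEMA_FORMAT`, and the helper functions are all hypothetical, and the standard `struct` module merely stands in for a schema-driven binary encoding.

```python
import json
import struct

# Hypothetical record used only for illustration.
record = {"user_id": 1234, "score": 98.5, "active": True}

# Schema agreed on by writer and reader ahead of time (an assumption of this
# sketch): little-endian int64, float64, bool, always in this field order.
SCHEMA_FIELDS = ("user_id", "score", "active")
SCHEMA_FORMAT = "<qd?"


def encode_json(rec):
    """Self-describing encoding: every record carries its own field names."""
    return json.dumps(rec).encode("utf-8")


def encode_with_schema(rec):
    """Schema-based encoding: only the values are written, in schema order."""
    return struct.pack(SCHEMA_FORMAT, *(rec[name] for name in SCHEMA_FIELDS))


def decode_with_schema(payload):
    """The reader takes the field names from the schema, not from the payload."""
    return dict(zip(SCHEMA_FIELDS, struct.unpack(SCHEMA_FORMAT, payload)))


json_bytes = encode_json(record)
schema_bytes = encode_with_schema(record)

# Compare the two payload sizes for the same record.
print(len(json_bytes), len(schema_bytes))
```

Real schema-based formats add things this sketch leaves out, such as variable-length integers, optional fields, and rules for evolving the schema over time.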
Storage format refers to how data is laid out on the storage medium. Traditional relational databases such as Oracle, DB2, MySQL, and SQL Server use row-based storage: the row is the basic logical storage unit, and the values in a row are stored contiguously on the storage medium. With the development of big data, columnar storage and columnar databases have emerged, and they differ considerably from traditional row-based databases. Newer distributed databases such as HBase, HP Vertica, and EMC Greenplum use column-oriented storage (strictly speaking, HBase groups data by column family, and Greenplum also supports row-oriented tables). In a column-based database, the column is the basic logical storage unit, and the values in a column are stored contiguously on the storage medium. The following figure compares the two main columnar storage formats, Parquet and ORC.
The choice of storage format therefore largely determines whether schema evolution, ACID, and update operations are supported, as well as query performance, data compression capability, and other data characteristics.
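The difference between the two layouts can be sketched in a few lines of Python. The table, the column names, and the helper functions below are hypothetical, and the lists only imitate on-disk contiguity. Real formats such as Parquet and ORC add row groups, encodings, and metadata on top of this idea.

```python
# Hypothetical table with columns (order_id, region, amount).
rows = [
    (1, "east", 10.0),
    (2, "west", 20.0),
    (3, "east", 30.0),
    (4, "east", 40.0),
]
COLUMNS = ("order_id", "region", "amount")


def to_row_layout(table):
    """Row-based: all values of one row sit next to each other."""
    return [value for row in table for value in row]


def to_column_layout(table):
    """Column-based: all values of one column sit next to each other."""
    return {name: [row[i] for row in table] for i, name in enumerate(COLUMNS)}


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


row_layout = to_row_layout(rows)
column_layout = to_column_layout(rows)

# An aggregate over one column reads only that column's contiguous values.
total_amount = sum(column_layout["amount"])

# In the row layout the same query has to step over the other columns.
total_from_rows = sum(row_layout[2::len(COLUMNS)])

# Values in one column share a type and often repeat, which helps compression.
region_runs = run_length_encode(column_layout["region"])
```

Both totals are the same. What differs is how much unrelated data a scan has to pass over, which is why columnar layouts suit queries that aggregate a few columns over many rows.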
Data processing frameworks are the OLAP engines or OLAP tools, such as Presto, Doris, Kylin, ClickHouse, and Impala. With these frameworks, the client generally sends a SQL query, the engine nodes parse and analyze the statement and execute the query task, and the result set is sent back to the client. The following diagram shows the Impala execution process.
What to select
With all that said, what exactly are we selecting in an OLAP technology selection? Obviously, we are selecting the OLAP data processing framework, but the storage format used by that framework plays a decisive role in many aspects of OLAP. As shown in the figure below, if we select Impala (with Parquet as the storage format), we cannot expect good support for schema evolution. As another example, if Druid is selected (together with the storage format it uses), queries cannot be executed in a vectorized way on the underlying storage engine, so performance is expected to be worse than Impala and Presto for aggregation queries over large data volumes and a small number of columns.
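To make the term concrete, here is a minimal Python sketch of vectorized versus row-at-a-time execution for a sum over one column. It does not reflect how Impala, Presto, or Druid are implemented: the data, `BATCH_SIZE`, and both functions are hypothetical, and pure Python only mimics the shape of the two approaches, not their performance.

```python
# Hypothetical column of values, and the same data as row tuples.
amounts = [float(n) for n in range(1, 11)]
rows = [(n, "east", amount) for n, amount in enumerate(amounts, start=1)]

BATCH_SIZE = 4  # assumed batch size, chosen only for this example


def sum_row_at_a_time(table, column_index):
    """Row-at-a-time execution: the operator handles one row per step."""
    total = 0.0
    for row in table:
        total += row[column_index]
    return total


def sum_vectorized(column, batch_size=BATCH_SIZE):
    """Vectorized execution: the operator handles a batch of column values per step."""
    total = 0.0
    steps = 0
    for start in range(0, len(column), batch_size):
        batch = column[start:start + batch_size]
        total += sum(batch)  # one tight loop over contiguous values
        steps += 1
    return total, steps


row_total = sum_row_at_a_time(rows, 2)
batch_total, batch_steps = sum_vectorized(amounts)
```

The vectorized version needs the column values to be available as contiguous batches, which is why the storage format affects whether this style of execution is possible.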
By looking at the storage format used by each data processing framework and combining it with our requirements, we can greatly narrow the range of candidates. How do we then choose, within this smaller range, the OLAP technology that fits our scenario? The next article covers the classification of OLAP technologies, which lets us narrow the selection further and finally pick the appropriate technology.