1. What is Flink? Briefly describe it.

2. Explain the concepts of data flow, stream-batch integration, fault tolerance, etc.

3. What is the difference between Flink and Spark Streaming?

4. What does Flink’s architecture include?

5. What is the parallelism of Flink? Briefly introduce it.

6. How to set the parallelism of Flink?

7. Do you understand the Flink programming model?

8. Introduce the DataStream and Transformation in Flink jobs.

9. Do you understand Flink’s partitioning strategy?

10. Describe the steps involved in implementing Flink WordCount.

11. What are the commonly used operators of Flink?

12. How does Flink calculate real-time topN?

1. What is Flink? Briefly describe it.

In a nutshell, Flink is a highly available, high-performance distributed computing engine with streaming at its core. It features integrated stream and batch processing, high throughput, low latency, fault tolerance, and large-scale complex computation, and it provides data distribution, communication, and other capabilities on top of data streams.

2. Explain the concepts of data flow, stream-batch integration, fault tolerance, and so on.

Data flow: all generated data naturally carries the concept of time, and arranging events in chronological order forms an event stream, also known as a data flow.

Stream-batch integration:

First of all, you need to understand what bounded data and unbounded data are.

[Figure: bounded and unbounded streams]

Bounded data is a data flow within a fixed time range: it has a beginning and an end, and once determined it does not change. Batch processing is generally used for bounded data, as shown by the bounded stream in the figure above.

Unbounded data is a continuously generated data flow: the data is infinite, with a beginning but no end. Stream processing is generally used for unbounded data, as shown by the unbounded stream in the figure above.

Flink's design philosophy is stream-first: batch is a special case of streaming, and Flink is good at processing both unbounded and bounded data. It provides precise time control and stateful computation to handle unbounded data flows, and windows to handle bounded data flows. This is why it is described as stream-batch integrated.

Fault tolerance:

In distributed systems, hardware failures, process exceptions, application exceptions, network failures, and other faults are everywhere. The Flink engine must not only be able to restart an application after a failure, but also keep its internal state consistent and resume from the last correct point in time.

Flink provides cluster-level fault tolerance and application-level fault tolerance.

Cluster-level fault tolerance: Flink integrates closely with cluster managers such as YARN and Kubernetes, which automatically start a new process when a process dies; its high-availability setup eliminates all single points of failure.

Application-level fault tolerance: Flink uses lightweight distributed snapshots, designed as checkpoints, for reliable fault tolerance.

Through the checkpointing feature, Flink provides exactly-once semantics at the framework level, i.e. end-to-end consistency: data is processed exactly once, neither duplicated nor lost, and even after a failure it is guaranteed to be written only once.
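
As a minimal sketch, checkpointing is enabled on the execution environment (the 5-second interval and the trivial pipeline are illustrative choices, not recommendations):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Take a checkpoint every 5 seconds with exactly-once guarantees
        env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);
        // A trivial pipeline so the job is runnable
        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-example");
    }
}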

3. What is the difference between Flink and Spark Streaming?

The biggest difference between Flink and Spark Streaming is that Flink is a true real-time processing engine: event-driven, with streaming at its core. Spark Streaming's DStream is actually a collection of small-batch RDDs, which makes it a micro-batch model with batch at its core.

Below we introduce the main differences between the two frameworks:

1. Runtime roles

The main runtime roles of Spark Streaming include: Master / YARN ApplicationMaster (cluster and resource management); Worker / NodeManager; Driver (task scheduling); and Executor (task execution).

Flink's runtime mainly includes: the Client, the JobManager (job management), and the TaskManager (task management).

2. Task scheduling

Spark Streaming continuously generates tiny batches of data and builds a directed acyclic graph (DAG); it creates the DStreamGraph and the JobScheduler in turn.

Flink generates a StreamGraph from the code submitted by the user, optimizes it into a JobGraph, and submits it to the JobManager for processing. The JobManager generates an ExecutionGraph from the JobGraph; the ExecutionGraph is the core data structure of Flink scheduling. The JobManager schedules the job according to the ExecutionGraph and deploys it to TaskManagers according to the physical execution graph, forming concrete Task executions.

3. Time mechanism

Spark Streaming supports a limited time mechanism and only supports processing time.

Flink supports three definitions of time for stream programs: event time (EventTime), ingestion time (IngestionTime), and processing time (ProcessingTime). It also supports a watermark mechanism to handle late data.
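
A minimal sketch of event-time processing with a watermark, using the pre-1.12 API that matches the timeWindow() calls later in this article (the Event class, the stand-in source, and the 5-second out-of-orderness bound are assumptions for illustration):

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeExample {
    // Hypothetical event type carrying its own timestamp
    public static class Event {
        public String user;
        public long ts; // event time in milliseconds
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        DataStream<Event> events = env.fromElements(new Event()); // stand-in source

        // Tolerate events arriving up to 5 seconds out of order
        DataStream<Event> withWatermarks = events.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(Event e) {
                        return e.ts; // tell Flink where the event time lives
                    }
                });

        withWatermarks.print();
        env.execute("event-time-example");
    }
}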


4. Fault tolerance mechanism

For a Spark Streaming task we can set a checkpoint, and if the task fails and restarts it can recover from the last checkpoint. However, this only guarantees that data is not lost; data may still be processed more than once, so it cannot achieve exactly-once processing semantics.

Flink uses a two-phase commit protocol to solve this problem.

4. What does Flink’s architecture include?

The Flink architecture is divided into two parts: the technical architecture and the runtime architecture.

1. Technical architecture

The Flink technical architecture is as follows:

As a distributed computing engine integrating stream and batch, Flink must provide an API layer for developers; it needs connectors to interact with external data storage; once a job is developed and tested, it must be submitted to a cluster for execution, which requires a deployment layer; operations staff must be able to manage and monitor it; and it also provides graph computing, machine learning, SQL, and so on, which requires an application framework layer.

2. Runtime architecture

The Flink runtime architecture is as follows:

[Figure: Flink runtime architecture (Master-Slave)]

A Flink cluster adopts a Master-Slave architecture. The Master role is the JobManager, responsible for cluster and job management; the Slave role is the TaskManager, responsible for executing computing tasks. Flink also provides a client to manage the cluster and submit tasks. The JobManager and the TaskManager are the processes of the cluster.

(1) Client

The Flink client is a CLI command-line tool provided by Flink for submitting Flink jobs to the Flink cluster; the StreamGraph and the JobGraph are constructed on the client side.

(2) JobManager

The JobManager decomposes a Flink application submitted by the client into subtasks according to the parallelism, requests the required computing resources from the ResourceManager, and once the resources are available distributes the tasks to TaskManagers for execution. It is also responsible for application fault tolerance: it tracks the execution status of the job and recovers the job when an exception is found.

(3) TaskManager

The TaskManager receives the subtasks distributed by the JobManager and, according to its own resources, manages subtask life-cycle stages such as starting, stopping, destroying, and recovering from exceptions. A Flink program must have at least one TaskManager.

5. What is the parallelism of Flink? Briefly introduce it.

When a Flink program is executed, it is mapped to a Streaming Dataflow. A Streaming Dataflow is made up of a set of Streams and Transformation Operators; it starts with one or more Source Operators and ends with one or more Sink Operators.

Flink programs are parallel and distributed by nature. During execution, a stream contains one or more stream partitions, and each operator contains one or more operator subtasks. Operator subtasks are independent of one another and execute in different threads, possibly even on different machines or in different containers.

The number of operator subtasks is the parallelism of that particular operator. Different operators in the same program may have different degrees of parallelism.

A Stream can be divided into multiple Stream partitions, known as Stream Partitions. An Operator can also be divided into multiple Operator Subtasks.

As shown in the figure above, Source is divided into Source1 and Source2, which are the Operator Subtasks of Source. Each Operator Subtask executes independently in a different thread. The parallelism of an operator is equal to the number of its Operator Subtasks.

The parallelism of Source in the figure above is 2, and the parallelism of a Stream is equal to the parallelism of the operator that produces it.

There are two modes when data is passed between two operators:

(1) One-to-one mode: when data is passed between two operators in this mode, the number of partitions and the ordering of the data are maintained. For example, Source1 to Map1 in the figure above preserves Source's partitioning and the processing order of elements within each partition.

(2) Redistributing mode: this mode changes the number of partitions of the data. Each operator subtask sends data to different target subtasks depending on the selected transformation: for example, keyBy() repartitions by hash code, broadcast() sends each record to all downstream partitions, and rebalance() repartitions in a round-robin fashion.

6. How to set the parallelism of Flink?

In the actual production environment, the parallelism can be set at four different levels: the operator level, the execution environment level, the client level, and the system level.

Note the priority order: operator level > environment level > client level > system level.
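
A sketch of the first two levels (the parallelism values are arbitrary examples). The client level is set when submitting, e.g. flink run -p 10, and the system level via parallelism.default in flink-conf.yaml:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // execution environment level: default for all operators

        env.fromElements("a", "b", "c")
           .map(String::toUpperCase)
           .setParallelism(2) // operator level: overrides the environment default
           .print();

        env.execute("parallelism-example");
    }
}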

7. Do you understand the Flink programming model?

A Flink application consists of three main parts: sources, transformations, and sinks. These streaming dataflows form a directed graph that starts with one or more sources and ends with one or more sinks.

8. Introduce the DataStream and Transformation in Flink jobs.

The Flink job consists of two basic blocks: DataStream and Transformation.

DataStream is a logical concept that provides an API interface for developers, while Transformation is an abstraction of processing behavior, covering reading, computing, and writing data. So the DataStream API calls in a Flink job actually build multiple data processing pipelines composed of Transformations.

9. Do you understand Flink's partitioning strategy?

At present, Flink supports 8 partitioning strategies. The data partitioner hierarchy is as follows:

[Figure: Flink data partitioner class hierarchy]

(1) GlobalPartitioner

Data is distributed to the first instance of the downstream operator for processing.

(2) ForwardPartitioner

At the API level, the ForwardPartitioner is applied to a DataStream and generates a new DataStream.

This partitioner is special: it is used to forward data between upstream and downstream operators within the same OperatorChain. The data is in effect passed directly downstream, which requires the upstream and downstream parallelism to be the same.

(3) ShufflePartitioner

Randomly partitions elements so that downstream tasks receive data evenly. Usage:

dataStream.shuffle();


(4) RebalancePartitioner

Distributes elements to partitions in a round-robin manner so that downstream tasks receive data evenly, avoiding data skew. Usage:

dataStream.rebalance();


(5) RescalePartitioner

Partitions data based on the number of upstream and downstream tasks, selecting a downstream task in a round-robin manner. For example, with 2 Sources upstream and 6 Maps downstream, each Source is assigned 3 fixed downstream Maps and never writes data to partitions not assigned to it. This differs from ShufflePartitioner and RebalancePartitioner, which write to all downstream partitions.

Run the code as follows:

dataStream.rescale();


(6) BroadcastPartitioner

Broadcasts each record to all partitions: with N partitions, the data is copied N times, one copy per partition. Usage:

dataStream.broadcast();


(7) KeyGroupStreamPartitioner

At the API level, the KeyGroupStreamPartitioner is applied to a KeyedStream and generates a new KeyedStream.

The KeyedStream partitions by key-group index and routes data to downstream operator instances according to the hash value of the key. This partitioner is not intended for direct use by users.

When constructing its Transformation, a KeyedStream uses key-group partitioning by default, which is what supports the job rescale function at the bottom layer.

(8) CustomPartitionerWrapper

A user-defined partitioner. Users implement the Partitioner interface themselves to define their own partitioning logic.
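
A minimal sketch of a custom partitioner (the input data and the modulo-hash routing are examples for illustration):

import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CustomPartitionExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Integer>> input =
                env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2));

        // Route each record by a hash of its string key (tuple field 0)
        DataStream<Tuple2<String, Integer>> partitioned = input.partitionCustom(
                new Partitioner<String>() {
                    @Override
                    public int partition(String key, int numPartitions) {
                        return Math.abs(key.hashCode()) % numPartitions;
                    }
                },
                0); // partition by tuple field 0

        partitioned.print();
        env.execute("custom-partition-example");
    }
}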

10. Describe the steps involved in implementing Flink WordCount.

It mainly includes the following steps:

(1) Get the execution environment StreamExecutionEnvironment.

(2) Connect the data source.

(3) Apply transformation operations such as map(), flatMap(), keyBy(), and sum().

(4) Output to a sink, such as print().

(5) Execute the job.

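A minimal WordCount sketch following the five steps above (the socket host and port are hypothetical):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // (1) Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (2) Connect the source (hypothetical host and port)
        DataStream<String> text = env.socketTextStream("localhost", 9999);

        // (3) Transformations: flatMap -> keyBy -> sum
        DataStream<Tuple2<String, Integer>> counts = text
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.split("\\s+")) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                })
                .keyBy(0) // group by the word (tuple field 0)
                .sum(1);  // sum the counts (tuple field 1)

        // (4) Sink: print to stdout
        counts.print();

        // (5) Execute the job
        env.execute("WordCount");
    }
}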

11. What are the commonly used operators of Flink?

They fall into two parts:

(1) Data-reading operators, the starting point of a Flink streaming application. Commonly used ones are: reading from memory with fromElements, reading from files with readTextFile, reading from a socket with socketTextStream, and custom sources with addSource (mainly used to read data from Kafka).

(2) Data-processing operators, mainly used in the transformation stage. Commonly used operators include Map (single input, single output), FlatMap (single input, multiple outputs), Filter (filtering), KeyBy (grouping), Reduce (aggregation), Window (windowing), Connect (connecting), Split (splitting), etc.
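
A short sketch chaining several of these operators, assuming an existing DataStream&lt;String&gt; named lines (the counting logic is just an example):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

DataStream<Tuple2<String, Integer>> counts = lines
        .filter(line -> !line.isEmpty())               // Filter: keep non-empty lines
        .map(line -> Tuple2.of(line.toLowerCase(), 1)) // Map: single input, single output
        .returns(Types.TUPLE(Types.STRING, Types.INT)) // type hint for the lambda's Tuple2 output
        .keyBy(0)                                      // KeyBy: group identical lines
        .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1)); // Reduce: running count per key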

12. How does Flink calculate real-time topN?

To implement the TopN function, Flink mainly performs the following operations:

Receive the Kafka data source.

Process based on EventTime and specify the watermark by calling DataStream's assignTimestampsAndWatermarks method to extract the timestamp and set the watermark.

Convert Kafka's JSON-formatted data into entity class objects.

Group by the user's username and use a sliding window to compute the TopN in real time. Set the window length to 10 s and the slide to 5 s, i.e. every 5 seconds update the ranking over the past 10 s:

.keyBy(“username”)

.timeWindow(Time.seconds(10), Time.seconds(5))

.aggregate(new CountAgg(), new WindowResultFunction())

Using .aggregate(AggregateFunction af, WindowFunction wf) performs incremental aggregation: the AggregateFunction pre-aggregates data as it arrives, reducing the storage pressure on the state.

CountAgg implements the AggregateFunction interface and counts the number of records in the window, i.e. adds one each time a record is encountered.

WindowFunction outputs the aggregated result of each key and each window along with other information. The WindowResultFunction implemented here encapsulates the username, the window, and the access count into a UserViewCount for output.
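
A sketch of what these pieces might look like (the UserBehavior input type and the UserViewCount field names are assumptions; each class would live in its own file):

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Hypothetical entity parsed from the Kafka JSON
public class UserBehavior {
    public String username;
    public long timestamp;
}

// Username, window end timestamp, and access count (assumed field names)
public class UserViewCount {
    public String username;
    public long windowEnd;
    public long viewCount;

    public UserViewCount() {}

    public UserViewCount(String username, long windowEnd, long viewCount) {
        this.username = username;
        this.windowEnd = windowEnd;
        this.viewCount = viewCount;
    }
}

// Incremental aggregation: the accumulator is just a running count
public class CountAgg implements AggregateFunction<UserBehavior, Long, Long> {
    @Override public Long createAccumulator() { return 0L; }
    @Override public Long add(UserBehavior value, Long acc) { return acc + 1; } // +1 per record
    @Override public Long getResult(Long acc) { return acc; }
    @Override public Long merge(Long a, Long b) { return a + b; }
}

// Wraps the pre-aggregated count with the key and window metadata
public class WindowResultFunction implements WindowFunction<Long, UserViewCount, Tuple, TimeWindow> {
    @Override
    public void apply(Tuple key, TimeWindow window, Iterable<Long> input, Collector<UserViewCount> out) {
        String username = ((Tuple1<String>) key).f0; // keyBy("username") yields a Tuple1 key
        Long count = input.iterator().next();        // one pre-aggregated value per window
        out.collect(new UserViewCount(username, window.getEnd(), count));
    }
}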

To rank the active users within each window, we need to group by window again, performing a keyBy() on the windowEnd field of UserViewCount. Then we use ProcessFunction to implement a custom TopN function, TopNHotUsers, which computes the top 3 users by clicks and formats the ranking result into a string for subsequent output:

.keyBy(“windowEnd”)

.process(new TopNHotUsers(3))

.print();

ProcessFunction is a low-level API provided by Flink; the key feature used here is the timer. We use a timer to determine when the access data of all users for a given window has been collected. Since watermark progress is global, in the processElement method we register a timer for windowEnd + 1 whenever a UserViewCount record is received. When the windowEnd + 1 timer fires, it means the watermark for windowEnd + 1 has arrived, i.e. all the per-user window statistics for windowEnd have been collected. Then in onTimer() we sort all the collected users and their click counts, select the TopN, and format the ranking information into a string for output.

ListState is used to store each received UserViewCount message, guaranteeing that the state data is not lost and stays consistent in the event of a failure. ListState is a state API provided by Flink that resembles the Java List interface; it integrates with the framework's checkpoint mechanism to guarantee exactly-once semantics.
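
A sketch of TopNHotUsers as a KeyedProcessFunction, following the description above (it assumes the UserViewCount fields from the previous sketch):

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class TopNHotUsers extends KeyedProcessFunction<Tuple, UserViewCount, String> {
    private final int topSize;
    private transient ListState<UserViewCount> itemState;

    public TopNHotUsers(int topSize) {
        this.topSize = topSize;
    }

    @Override
    public void open(Configuration parameters) {
        // ListState survives failures via the checkpoint mechanism
        itemState = getRuntimeContext().getListState(
                new ListStateDescriptor<>("user-view-count", UserViewCount.class));
    }

    @Override
    public void processElement(UserViewCount value, Context ctx, Collector<String> out) throws Exception {
        itemState.add(value);
        // Fires once the watermark passes windowEnd, i.e. all data for this window has arrived
        ctx.timerService().registerEventTimeTimer(value.windowEnd + 1);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        List<UserViewCount> all = new ArrayList<>();
        for (UserViewCount uvc : itemState.get()) {
            all.add(uvc);
        }
        itemState.clear();

        // Sort descending by view count and keep the top N
        all.sort((a, b) -> Long.compare(b.viewCount, a.viewCount));

        StringBuilder sb = new StringBuilder("window end: ").append(timestamp - 1).append('\n');
        for (int i = 0; i < Math.min(topSize, all.size()); i++) {
            UserViewCount u = all.get(i);
            sb.append("No.").append(i + 1).append(": ").append(u.username)
              .append(", views = ").append(u.viewCount).append('\n');
        }
        out.collect(sb.toString());
    }
}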

Original address: https://blog.csdn.net/qq_32727095/article/details/122680277
