Source: https://xie.infoq.cn/article/19e95a78e2f5389588debfb1c

IM Core Concepts

User: User message of the system:

refers to the content of communication between users. Usually in the IM system, messages will have the following categories: text messages, emoji messages, picture messages, video messages, file messages, etc.

Conversations: usually refers to the association group established by

chat between two users: usually refers to the association

established

by chat between multiple users

Terminal : refers to the machine where the user uses the IM system. Usually there are Android, iOS, Web, etc.

unread: refers to the

number of messages that the user has not readUser

status: refers to the state relationship chain such as whether the user is currently online, offline or suspended

 : refers to the relationship between users, usually one-way friend relationship, two-way friend relationship, attention relationship, and so on. The difference from conversation needs to be noted here, where a user only generates a conversation when they initiate a chat, but relationships do not need to be established in a chat. For the storage of relational chains, you can use a graph database (Neo4j, etc.), which can naturally express relationships in the real world, and it is easy to model

single chat: one-to-one chat group chat

: multi-person

chat

Customer service: In the field of e-commerce, it is usually necessary to provide users with pre-sales consultation, after-sales consultation and other services. At this time, it is necessary to introduce customer service to deal with the user’s

consultation message

split: in the field of e-commerce, a store usually has multiple customer services, and at this time, it is the message distribution to decide which customer service will handle the user’s consultation. Usually message triage will determine which customer service the message will be diverted to according to a series of rules, such as whether the customer service is online (if the customer service is not online, it needs to be re-triaged to another customer service), whether the message is pre-sales consultation or after-sales consultation, the busyness of the current customer service, etc.

Mailbox: The mailbox in this article refers to a Timeline, a queue for sending and receiving messages

Read diffusion vs write diffusion

read diffusion

Let’s start with reading diffusion. As shown in the figure above, A has a mailbox (some blog posts will be called Timeline) with each chat person and group, and A needs to read all the mailboxes with new messages when viewing chat messages. Here reading diffusion needs to pay attention to the difference with the Feeds system, in the Feeds system, everyone has a writing box, writing only needs to write to their own mailbox once, reading needs to read from the mailbox of all concerned people. But read diffusion in IM systems usually has one mailbox for every two associated people, or one mailbox per group.

Advantages of reading diffusion:

    write

  • operation (sending messages) is very lightweight, whether it is a single chat or a group chat, you only need to write to the corresponding mailbox once Each
  • mailbox is naturally the chat record of two people, which can facilitate the viewing of chat records and the search of chat records

Disadvantages of Read

Diffusion: Write Diffusion Next look at

Write Diffusion

.

In write diffusion, everyone only reads messages from their own mailboxes, but when writing (sending messages), the handling of single chat and group chat is as follows

:

    > single chat: write a message to both your own mailbox and the other party’s mailbox, and at the same time, if you need to view the chat history of two people, you need to write another one (of course, If you can also trace back all the chat history of two people from the personal mailbox, but this will be very inefficient).
  • Group chat: You need to write a message to the email address of all group members, and if you need to view the chat history of the group, you need to write another copy. It can be seen that write diffusion greatly amplifies write operations for group chats.

Write diffusion advantages: Write diffusion disadvantages:

Note that in the feeds system

:

  • write diffusion is also called: Push, Fan-out, or Write-fanout
  • Read
    diffusion is also called: Pull, Fan-in, or Read-fanout

Unique

ID design

Usually, ID design mainly has the following categories:

  • UUID
  • is based on Snowflake’s ID generation method
  • Based on the generation method of

  • applying for DB step,
  • the specific implementation method and advantages and disadvantages of

  • generating

    unique

  • IDs based on the special rules of Redis or DB auto-increment ID generation method

can refer to a previous blog post: distributed unique ID Parsing

the

need for a unique ID in the IM system is mainly:

message ID

Let’s take a look at three issues that need to be considered when designing message IDs.

Is it okay if the message ID is not incremented

, let’s first see what happens if

it is not incremented

:

    use

  • strings, waste storage space, and cannot use the characteristics of the storage engine to store adjacent messages together, reducing the write and read performance of messages
  • Use numbers,

  • but the numbers are random, and you cannot use the characteristics of the storage engine to store adjacent messages together, which will increase random IO and reduce performance; And random IDs are not good at ensuring the uniqueness of IDs

, so it is best to increment the message ID.

Global increment vs

user-level increment vs session-level increment Global increment

: means that the message ID is incremented over time throughout the IM system. For global increments, you can generally use Snowflake (of course, Snowflake is only a worker-level increment). At this time, if your system is read and spread, in order to prevent message loss, then each message can only carry the ID of the previous message, and the front end determines whether there is a missing message according to the previous message, and if there is a message loss, it needs to be pulled again.

User-level increment: means that the message ID is only guaranteed to be incremented in a single user, and does not affect and may be duplicated between different users. Typical representative: WeChat. If you write a diffusion system, the mailbox timeline ID and message ID need to be designed separately, the mailbox timeline ID user level is incremented, and the message ID is increased globally. If you read the diffusion system, you feel that it is not very necessary to use user level increments.

Session level increment: means that the message ID is only guaranteed to be incremented in a single session, and it does not affect and may be duplicated between different sessions. Typical representative: QQ.

Continuous increment vs monotonic

increment

Continuous increment means that the ID is generated in the manner of 1,2,3…n; Monotonic increment means that as long as the ID generated later is larger than the ID generated earlier, it does not need to be continuous.

As far as I know, QQ’s message ID is used at the session level in continuous increments, which has the advantage that if a message is lost, when the next message comes and finds that the ID is not continuous, it will go to the server to avoid losing the message. At this time, some people may think, can’t I use the way of timing pull to see if any messages are lost? Of course not, because if the message ID only increases continuously at the session level, if a person has thousands of conversations, how many times will it have to be pulled, the server will definitely not be able to resist.

For read diffusion, using continuous increments of message IDs is a good approach. If monotonic increments are used, the current message needs to be accompanied by the ID of the previous message (i.e. the chat message forms a linked list) in order to determine whether the message is missing.

To summarize

:

  • write diffusion: the mailbox timeline ID is incremented using the user level, and the message ID is incremented globally, at this time, as long as the monotonic increment is
  • guaranteed, it can be read and spread: the message ID can be incremented at the session level and preferably continuously

Let’s take a look at the problems that need to be paid attention to when designing session

IDs:

among them, session IDs have a relatively simple way to generate (special rules generate unique IDs): concatenate from_user_id and to_user_id:

  1. If both from_user_id and to_user_id are 32-bit shaped data, it can be easily stitched together into a 64-bit session ID by bit arithmetic, i.e.: conversation_id = ${from_user_ id} << 32 | ${to_user_id} (make sure that the user ID with a relatively small value is from_user_id before concatenation, so that any two users starting a session can easily know the session ID)
  2. if from_user_id followed by to_user_id  If it is all 64-bit shaping data, it can only be spliced into a string, and if it is spliced into a string, it will hurt, and the performance of wasting storage space is not good.

The

former owner is using the first method above, and the first method has a hard flaw: with the global expansion of the business, if the 32-bit user ID is not enough to expand to 64 bits, it needs to be drastically changed. 32-bit plastic IDs seem to be able to hold 2.1 billion users, but usually in order to prevent others from knowing the real user data, the IDs used are usually not continuous, and at this time, 32-bit user IDs are completely insufficient. Therefore, the design relies entirely on the user ID and is not a desirable design approach.

Therefore, the session ID can be designed to use a global increment method, add a mapping table, and save the relationship between from_user_id, to_user_id and conversation_id.

Push mode vs pull mode

vs push and pull combination

mode

In the IM system, there are usually three possible ways to get new messages

:

  • push mode: when there is a new message, the server actively pushes to all ends (iOS, Android, PC, etc.).
  • Pull mode: The front-end

  • actively initiates the request to pull the message, in order to ensure the real-time nature of
  • the message, the push mode is generally used, and the pull mode is generally used to obtain the historical message push and pull

  • combination mode: when there is a new message, the server will first push a notification with a new message to the front-end, and the front-end will pull the message to the server after receiving the notification

The simplified diagram of the push mode is as follows:

As shown in the figure above, under normal circumstances, the message sent by the user will be pushed to all ends of the receiver after operations such as server storage. However, push may be lost, the most common situation is that the user may be pseudo-online (meaning that if the push service is based on a persistent connection, and the persistent connection may have been disconnected, that is, the user has been disconnected, but generally needs to go through a heartbeat cycle before the server can perceive, at this time the server will mistakenly think that the user is still online; Pseudo-online is a concept that I thought of myself, but I did not think of the right words to explain). Therefore, if you use the push mode alone, you may lose messages.

The simplified diagram of the push-pull combination mode is as follows:


You can use the push and pull combination mode to solve the problem that push mode may lose messages. When the user sends a new message, the server pushes a notification, and then the front end requests the latest news list, in order to prevent the loss of messages, you can actively request it every once in a while. It can be seen that it is best to use write diffusion to use the push-pull combination mode, because write diffusion only needs to pull a timeline of personal mailboxes, while read diffusion has N timelines (one for each mailbox), and the performance will be poor if it is also pulled regularly.

Industry Solutions

Earlier to understand the common design problems of IM systems, let's look at how the industry designs IM systems. Studying the industry's mainstream solutions helps us to gain a deeper understanding of IM system design. The following studies are based on the information that has been made public on the Internet, which is not necessarily correct, and it is good for your reference only.

Although many of WeChat's

basic frameworks are self-developed, this does not prevent us from understanding WeChat's architecture design. It can be seen from the article "[From 0 to 1: The Evolution of WeChat Background System] (#" published by WeChat that WeChat mainly adopts: writing diffusion + push and pull combination. Since group chat also uses write diffusion, and write diffusion consumes resources, there is a maximum number of people in WeChat groups (currently 500). So this is also an obvious disadvantage of writing about diffusion, which is more difficult if you need ten thousand people.

It can also be seen from the text that WeChat adopts a multi-data center architecture:

Each data center of WeChat is autonomous, each data center has a full amount of data, and the data is synchronized between data centers through self-developed message queues. In order to ensure data consistency, each user belongs to only one data center, and can only read and write data in his own data center, and if the user is connected to other data centers, the user will be automatically guided to access the data center to which he belongs. If you need to access other users' data, you only need to access your own data center. At the same time, WeChat uses the three-park disaster recovery architecture and Paxos to ensure data consistency.

It can be seen from the article "Trillion-level Call System: WeChat Serial Number Generator Architecture Design and Evolution" published by WeChat that WeChat's ID design adopts: generation method based on application DB step size + user level increment. As shown in the following figure:

WeChat's serial number generator generates a routing table by the quorum service (the routing table holds the full mapping of uid segments to AllocSvr), and the routing table is synchronized to AllocSvr and Client. If AllocSvr goes down, the quorum service will reschedule the uid segment to another AllocSvr.

DingTalk

does not have much public information, from the article "Ali DingTalk Technology Sharing: Enterprise IM King - DingTalk's Excellence in Backend Architecture" we can only know that DingTalk initially used the write diffusion model, in order to support tens of thousands of people, it later seemed to be optimized into read diffusion.

But when it comes to Ali's IM system, I have to mention Ali's self-developed Tablestore. Under normal circumstances, the IM system will have a self-increasing ID generation system, but Tablestore creatively introduces primary key column auto-increase, that is, the generation of IDs is integrated into the DB layer and supports user-level increments (traditional MySQL and other DBs can only support table-level auto-increment, that is, global auto-increment). For details, you can refer to: "How to optimize the architecture of high-concurrency IM system"

Twitter

what? Isn't Twitter a feeds system? Isn't this article about IM? Yes, Twitter is a feeds system, but the feeds system and the IM system actually have a lot of design commonalities, and studying the feeds system helps us to refer to it when designing the IM system. Besides, there is no harm in studying the Feeds system, expanding the technical horizon.

Twitter's self-incrementing ID design is probably familiar to everyone, that is, the famous Snowflake, so the ID is globally incremented.


As can be seen from the video sharing "How We Learned to Stop Worrying and Love Fan-In at Twitter", Twitter initially used the write diffusion model, and the Fanout Service was responsible for diffusion writing to Timelines Cache (using Redis), Timeline The Service is responsible for reading the Timeline data, which is then returned to the user by API Services.

However, because write diffusion is too expensive for big V, Twitter later used a combination of write diffusion and read diffusion. As shown in the following figure:

For users with a small number of followers, if Twitter uses a writing diffusion model, the Timeline Mixer service integrates the user's Timeline, the writing Timeline of the big V and the system recommendation, and finally the API Services returns

to the user

IM to solve the problem

In the choice of communication protocol, we mainly have the following choices

:

  1. use TCP socket communication, design their own protocol: 58 to home, etc.
  2. Use UDP Socket communication: QQ and so on
  3. Use HTTP long round-robin: WeChat web version and so on

Either way, we're able to notify messages in real time. But what affects the real-time nature of our messages may be in the way we process them. For example, if we use MQ to process and push a message of 10,000 people when pushing, and it takes 2ms to push one person, then it takes 20s to push 10,000 people, then the subsequent messages will block 20s. If we need to push within 10ms, then the concurrency of our push should be: number of people: 10000 / (total push duration: 10 / single person push duration: 2) = 2000

Therefore, when we choose a specific implementation solution, we must evaluate the throughput of our system, and every link of the system must be evaluated and stress-tested. Only by evaluating the throughput of each link can we ensure the real-time nature of message push.

How to ensure message timing

The following situations may cause messages to

be out of order:

  • sending messages may be out of order if you use HTTP instead of a persistent connection. Because the backend is generally a cluster deployment, the request may hit different servers using HTTP, and due to network latency or different server processing speeds, the later messages may be completed first, resulting in message out-of-order. Solution:
  • The front-end processes the messages sequentially, sending one message and then sending the next message. This approach degrades the user experience and is generally not recommended.
  • Bring a front-end generated sequential ID and let the receiver sort according to that ID. This way of front-end processing is a bit more cumbersome, and during the chat, a message may be inserted in the middle of the receiver's historical message list, which is strange, and the user may miss the message. However, this situation can be solved by rearranging when the user switches windows, and the receiver appends the message to the back each time it receives it.
  • Usually in order to optimize the experience, some IM systems may adopt an asynchronous sending confirmation mechanism (e.g. QQ). That is, as long as the message arrives at the server, and then the server sends it to MQ, it is sent successfully. If the sending fails due to permissions and other issues, the backend will push another notification. In this case, MQ needs to choose the appropriate Sharding
  • strategy:

  • Sharding by to_user_id: If you need to do multi-terminal synchronization using this strategy, the synchronization of multiple ends of the sender may be out of order, because the processing speed of different queues may be different. For example, the sender sends m1 first and then m2, but the server may process m2 first and then process m1, where the other side will receive m2 and then m1, and the session list of the other ends will be messed up.
  • Sharding by conversation_id: Using this strategy will also cause multi-terminal synchronization to be out of order.
  • Sharding by from_user_id: In this case, it is a better choice to use this strategyUsually
  • in

  • order to optimize performance, you may push to MQ before pushing, in which case to_user_id is used is a better choice.
How to do user online status

Many IM systems need to show the status of users: whether they are online, whether they are busy, etc. Redis or distributed consistency hashing can mainly be used to store the user's online status.

Looking at the picture above, some people may wonder, why do you need to update Redis every heartbeat? If I'm using a TCP persistent connection, won't I have to update every heartbeat? Indeed, under normal circumstances, the server only needs to update Redis when a new connection is created or disconnected. However, due to server exceptions, or network issues between the server and Redis, event-based updates can go wrong, resulting in incorrect user status. Therefore, if you need the user's online status to be accurate, it is best to update the online status through a heartbeat.

Since Redis

is stored on a stand-alone machine, in order to improve reliability and performance, we can use Redis Cluster or Codis.

When using distributed consistency hashing, you need to pay attention to the migration of user state when scaling or scaling down Status Server Cluster, otherwise the user state will be inconsistent during the initial operation. Virtual nodes are also required to avoid data skew.

How

to do multi-terminal synchronization

Read diffusion has also been mentioned earlier, for read

diffusion

, the

synchronization of messages is mainly based on push mode, the message ID of a single session is increased sequentially, and the front-end receives the push message if it finds that the message ID is not continuous and requests the back-end to re-retrieve the message. But this may still lose the last message of the session, in order to increase the reliability of the message, you can bring the ID of the last message in the session of the historical session list, the front end will first pull the latest conversation list when receiving a new message, and then judge whether the last message of the conversation exists, if it does not exist, the message may be lost, the front end needs to pull the message list of the conversation again; If the last message ID of the conversation is the same as the last message ID in the message list, the frontend will no longer process it. The performance bottleneck of this approach will be to pull the historical session list, because each new message needs to be pulled from the backend once, if you look at the magnitude of WeChat, the message alone may have 200,000 QPS, if the historical session list is placed in traditional DB such as MySQL, it will definitely not be resisted. Therefore, it is best to store the list of historical sessions in a Redis cluster with AOF (data may be lost if you use RDB). Here we can only lament that performance and simplicity cannot be combined.

Write diffusion

For write diffusion, multi-terminal synchronization is simpler. The front end only needs to record the last synchronization site, bring the synchronization site when synchronizing, and then the server will return all the data behind the site to the front end, and the front end can update the synchronization site.

How to deal with unread

In an IM system, the handling of unread is very important. Unreads are generally divided into session unreads and total unreads, and if not handled properly, session unreads and total unreads may be inconsistent, seriously reducing the user experience.

Read diffusion For read

diffusion

, we can store both session unread and total unread in the

backend, but the backend needs to ensure the atomicity and consistency of the two unread updates, which can generally be achieved by the following two methods:

  1. using Redis' multi-transaction function, transaction update failure can be retried. However, please note that transactions are not supported if you are using a Codis cluster.
  2. Use Lua to embed scripts. In this way, you need to ensure that the session unread and the total unread are on the same Redis node (Hotis can use a hashtag). This approach leads to logical fragmentation and increases maintenance costs.

Write

Diffusion For write diffusion, the server usually weakens the concept of sessions, that is, the server does not store a list of historical sessions. The calculation of unread can be handled by the front end, marking read and marking unread can record only one event to the mailbox, and each end handles session unread by replaying the event. Using this method may cause inconsistent readings on all ends, at least WeChat will have this problem.

If the write diffusion also stores unreadable readings through the historical session list, then the user timeline service is tightly coupled with the session service, and if you need to ensure atomicity and consistency, you can only use distributed transactions, which will greatly reduce the performance of the system.

How to store historical message

read

diffusion For read diffusion

, you only need to store one copy by session ID.

Write

Diffusion For write diffusion, two copies need to be stored: a message list with the user as the Timeline and a message list with the session as the Timeline. Message lists with user Timeline can be Sharding with user IDs, and message lists with session Timeline can be Sharding with session IDs.

For IM, the storage of historical messages has strong time series characteristics, and the longer the time, the lower the probability of the message being accessed and the lower the value.

If we need to store historical messages for several years or even permanent (more common in e-commerce IM), then it is very necessary to separate the hot and cold of historical messages. The separation of hot and cold data is generally an HWC (Hot-Warm-Cold) architecture. For the messages just sent, you can put them on the Hot storage system (you can use Redis) and the Warm storage system, and then the Store Scheduler will periodically migrate the cold data to the Cold storage system according to certain rules. When getting messages, you need to access the Hot, Warm, and Cold storage systems in turn, and the data is integrated by the Store Service and returned to the IM Service.

How to

achieve load balancing of the access layer is mainly done in the following ways

:

  1. hardware load balancing: for example, F5, A10, and so on. Hardware load balancing has strong performance and high stability, but the price is very expensive, and it is not recommended to use it for local tycoon companies.
  2. Use DNS to achieve load balancing: Using DNS to achieve load balancing is relatively simple, but using DNS to implement load balancing

  3. will be slow to take effect if you need to switch or expand capacity, and the number of supported IP addresses supported by using DNS to implement load balancing is limited, and the supported load balancing strategy is relatively simple.
  4. DNS + Layer 4 load balancing +

  5. Layer 7 load balancing architecture: for example, DNS + DPVS + Nginx or DNS + LVS + Nginx. Some people may wonder why you should join Layer 4 load balancing? This is because Layer 7 load balancing is CPU-intensive, and often needs to be scaled up or down, and for large websites, you may need many Layer 7 load balancing servers, but only a small number of Layer 4 load balancing servers are needed. Therefore, the architecture is useful for large applications with short connections such as HTTP. Of course, if the traffic is not large, only use DNS + 7 layer load balancing. But for persistent connections, adding Layer 7 load balancing Nginx is not very good. Because Nginx often needs to change the configuration and reload the configuration, the TCP connection will be disconnected when reloading, resulting in a large number of disconnections.
  6. DNS + Layer 4 load

  7. balancing: Layer 4 load balancing is generally stable, rarely changed, and is more suitable for long-term connections.

For the access layer of persistent connections, if we need a more flexible load balancing strategy or need to do grayscale, we can introduce a scheduling service, as shown in the following figure

Access Schedule Service can be implemented to assign Access Services according to various policies, for example:

    according to

  • grayscale policies to allocate according
  • to the principle of proximity
  • Allocated based on the minimum number of connections

Finally, share the architecture experience of doing large-scale applications:

    >grayscale! Grayscale! Grayscale!
  1. Monitor! Monitor! Monitor!
  2. Alarm! Alarm! Alarm!
  3. Cache! Cache! Cache!
  4. Throttling! Fusing! Demote!
  5. Low coupling, high cohesion!
  6. Avoid monolithic points and embrace statelessness!
  7. Assess! Assess! Assess!
  8. Stress test! Stress test! Stress test!

end

 


public number (zhisheng) reply to Face, ClickHouse, ES, Flink, Spring, Java, Kafka, Monitoring < keywords such as span class="js_darkmode__148"> to view more articles corresponding to keywords.

like + Looking, less bugs 👇