SMS was one of the earliest forms of electronic messaging. Will it always be around?

Ran Xing | technical author

IMMENSE | content editing

In 1992, the world’s first text message was successfully sent.

In 2001, China’s mobile SMS volume reached 18.9 billion messages, and by 2008 it had soared to more than 590 billion.

Before the Internet became widespread, SMS was an indispensable instant messaging tool, and sending volume spiked even higher around major holidays.

2012, however, was a turning point: the number of SMS messages sent by mobile phone users in China peaked at 900 billion and then began to decline.

With the rise of the mobile Internet and successive generations of instant messaging tools, “SMS” was abruptly knocked off its pedestal.

Is “text messaging” a thing of the past?

The data tells us: No.

From routine user login and verification-code identity authentication to SMS products for specific scenarios such as finance, e-commerce, logistics, and education, enterprise demand for SMS is extremely large.

Before the concept of “cloud” appeared, enterprises’ external communication was limited to exchanging information and transmitting data; none of it happened in real time, let alone interactively.

The advent of cloud communication solved this problem.

Cloud communication is a voice and data communication service built on Internet cloud services. Through resource integration, technological innovation, risk control, and similar means, it packages the basic communication capabilities provided by operators as PaaS and SaaS offerings, giving customers unified, reliable, convenient, secure, and innovative communication services.

As the communication industry keeps iterating, new business tracks have changed the business, and the growing scale of ecosystem partnerships and channels has brought new models to the system while also doubling the pressure on it.

At the same time, the regional environments served by the international site, along with local policies and regulations, bring new opportunities and challenges to building a global service.

This article explores gateway technology in the cloud-native era: against a background of globalization, platformization, and refinement, how to seize the opportunity for self-transformation, how to shed heavy technical debt and be reborn, and how to evolve toward a high-performance, high-availability, low-cost architecture with real technological breakthroughs.

It also draws on practical experience with gateway technology under peak Double 11 traffic over the years, in the hope of being helpful to readers.


New trends and challenges in the development of cloud communication gateways

Alibaba Cloud Communication SMS Gateway is a cloud-native gateway built on a leading communication architecture and large-scale distributed gateway processing technology. It provides stable communication services with redundant, recoverable, and switchable high-availability guarantees, meets customers’ SLA requirements, and ultimately maximizes resource utilization and profit.

High performance, high availability, and low cost: each is both a trend and a challenge.

❖ High performance: 100,000-level concurrency, delivery within seconds

Alibaba Cloud Communication started in 2017. It was first incubated in Alibaba Communications and then integrated with Alibaba Cloud, and after just a few years of development it is now one of the most popular cloud service products on Alibaba Cloud. In 2019 it entered a period of large-scale growth, reaching a historical peak on the day of the Tmall Double 11 event and covering more than 200 countries around the world.

From a technical point of view, the cloud communication SMS gateway handles traffic distribution at 100,000 QPS during Double 11. These are not simple queries: each request has to interact with operators or other third-party systems. With this much service traffic and resource scheduling, the system must not only stay up but also keep transmission and response latency low, so that messages reach users worldwide within seconds. That is a big challenge.

The requirement is to deliver both high concurrency and high performance.

So what are the main bottlenecks today?

1. The current gateway architecture mainly trades scale for performance: it needs a large, distributed cluster deployment to provide high concurrency.

2. For transmission over the communication network, the gateway has to rely on standard communication protocols and long-lived connections carried over the Internet.

❖ High availability: minute-level fault isolation and recovery

As the business grows, the number of cloud communication resource nodes will reach the 10,000 level, and keeping 10,000-level nodes stable under 100,000-level concurrency is a very hard problem. In addition, cloud communication has burst-traffic scenarios similar to flash sales. Marketing text messages, for example, generate a large number of SMS requests within a few minutes, and this instantaneous traffic often forms a flood peak that hits the system.

From a technical point of view, the cloud communication SMS gateway uses a distributed microservice architecture, split and deployed by domain, together with a heavily asynchronous, multi-threaded concurrent scheduling model, so the system complexity is considerable. With such a large cluster and such a dense communication network, we need full fault-monitoring coverage and 100% alarm accuracy, and we also need fault isolation and rapid recovery to make the overall system highly available. That is another big challenge.

So where do the hidden systemic risks lie today?

1. The current gateway architecture is mainly a multi-center, multi-group deployment, which requires isolated deployments along different dimensions: service, scenario, and customer.

2. In terms of data storage resources, we need to focus on the stability of the database.

❖ Low cost: elastically scalable container resources

As computing scale grows exponentially, and especially with hundreds of servers deployed during Double 11, costs rise as traffic and resources double again. After the big promotion, when the tide recedes, container resources have to be scaled back in. For stateful services, however, the cost and difficulty of migrating resources when scaling out and in is far from trivial.

From a technical point of view, stateful services are bound to their resources because SMS uses a long-lived, asynchronous, full-duplex communication mode. The essential conflict is resource utilization under tidal traffic. Faced with stateful services and expensive resources, we need to match traffic to resources as closely as possible, reduce the cost of idle resources, and improve CPU utilization, and we also need stateless, elastically scalable container resources to further reduce O&M costs. That is another big challenge.

So what are the technical difficulties now?

1. The current gateway is mainly deployed through a DevOps model: resources have to be requested in advance before the container image can be deployed.

2. For resource connection management, connections have to be preallocated so that each resource connection is bound to a container IP.


Breaking the deadlock: a cloud-native edge gateway architecture

Below, let’s look at the technical advantages the SMS gateway has built by drawing on cloud-native technology.

❖ Easy deployment and wide coverage: minute-level service deployment

Cloud native is a set of cloud-based technical methodologies, and Alibaba Cloud has centers and edge nodes all over the world. The question is how to use cloud-native technologies such as containerization, service mesh, and microservices, combined with the edge cloud, to build a lightweight deployment architecture of edge gateways and cloud networking that is easy to deploy and offers nearby access and distribution worldwide, thereby improving gateway performance and reducing O&M costs.

Achieving heterogeneous deployment depends on two things: a system architecture that supports easy cloud-native deployment, and a DevOps platform that supports easy deployment of the application environment.

First, at the system architecture level, the SMS gateway uses a two-tier architecture to support the service. The resulting lightweight gateway is easier to deploy in each region, so customers can connect to a nearby access point and get a low-latency SMS sending experience. This is shown in the following figure:

In summary, the edge gateway is built on Alibaba Cloud edge nodes, which lets services move down to within about 10 kilometers of users, reduces latency and bandwidth costs, and makes it possible to cut technology costs and roll out nodes quickly around the world while maintaining stability.

❖ Easy scheduling and low latency: millisecond response

As mentioned above, the SMS gateway uses a multi-group deployment: groups are deployed independently in areas close to users so they can connect to suppliers with low latency and high quality. This raises a question: how are edge nodes at this scale scheduled, and how complex is that scheduling?

For complex traffic scheduling scenarios, the aim is to reduce the complexity of the business architecture: an architecture upgrade decouples business logic from traffic control logic and turns complex scheduling into a unified traffic scheduling model that is observable and controllable, making scheduling easy and latency low.

Easy scheduling also depends on two things: a system architecture that supports easy cloud-native scheduling, and a communication network architecture that supports easy scheduling of the application environment.

First, at the system architecture level, a routing, addressing, and scheduling algorithm based on a three-level policy establishes the data links between nodes, between nodes and resources, and between resources and connections. A dynamic sensing algorithm based on multi-factor, multi-weight cooperative routing control keeps route addressing stable and reliable in abnormal situations.
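The article does not give the details of the routing algorithm, so the following is only a minimal sketch of how a three-level lookup (node, resource, connection) with multi-factor weighted scoring might look. The factor names (`success_rate`, `latency_ms`, `load`), the weights in `WEIGHTS`, and all class and function names are assumptions made for illustration, not the gateway's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical factor weights; the real factors and weights are not published.
WEIGHTS = {"success_rate": 0.5, "latency": 0.3, "load": 0.2}


@dataclass
class Connection:
    name: str
    healthy: bool = True
    success_rate: float = 1.0  # 0..1, higher is better
    latency_ms: float = 50.0   # lower is better
    load: float = 0.0          # 0..1, lower is better

    def score(self) -> float:
        # Map latency into (0, 1] so every factor points the same way.
        latency_score = 1.0 / (1.0 + self.latency_ms / 100.0)
        return (WEIGHTS["success_rate"] * self.success_rate
                + WEIGHTS["latency"] * latency_score
                + WEIGHTS["load"] * (1.0 - self.load))


@dataclass
class Resource:
    name: str
    connections: list = field(default_factory=list)

    def best_connection(self):
        # Unhealthy connections are skipped, so routing degrades gracefully.
        healthy = [c for c in self.connections if c.healthy]
        return max(healthy, key=Connection.score, default=None)


@dataclass
class Node:
    name: str
    resources: list = field(default_factory=list)


def route(nodes):
    """Return the (node, resource, connection) triple with the best score, or None."""
    best = None
    for node in nodes:
        for resource in node.resources:
            conn = resource.best_connection()
            if conn is None:
                continue
            if best is None or conn.score() > best[2].score():
                best = (node, resource, conn)
    return best
```

In this sketch, marking a connection as unhealthy (for example after repeated send failures) is enough to move traffic to the next-best connection on the following call to `route`, which is one simple way to read "stable and reliable route addressing in abnormal situations".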

In addition, SMS scenarios such as verification codes, notifications, and marketing have very high timeliness requirements. We have implemented an adaptive elastic flow control algorithm based on scenario priority: the message queues are no longer isolated from one another, and the rate limit of each queue is affected by how the other queues are running. The higher a queue’s priority, the higher its rate limit; the lower its priority, the lower its rate limit. The limits are adjusted dynamically as the system runs, so the algorithm adapts quickly. Whichever algorithm is used, the main goal is to make traffic smoother and delivery more timely.
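One way to picture priority-based flow control in which queues influence each other is a weighted share with redistribution: each queue gets a slice of the total rate in proportion to its priority, and capacity a queue cannot use is handed to the others. The sketch below assumes this interpretation; the function name `allocate_rates`, the priority weights, and the per-second demand figures are hypothetical and not taken from the gateway.

```python
def allocate_rates(total_qps, queues):
    """Split total_qps across queues by priority weight, capped by demand.

    queues maps a queue name to (priority_weight, demand), where demand is
    the number of messages that queue wants to send in the next second.
    Capacity a queue cannot use is offered to the remaining queues, so each
    queue's rate depends on how the other queues are doing.
    """
    rates = {name: 0.0 for name in queues}
    remaining = float(total_qps)
    active = {n: (w, d) for n, (w, d) in queues.items() if w > 0 and d > 0}

    while active and remaining > 1e-9:
        total_weight = sum(w for w, _ in active.values())
        # Queues whose demand fits inside their weighted share are fully served.
        satisfied = {
            name: demand
            for name, (w, demand) in active.items()
            if demand <= remaining * w / total_weight
        }
        if not satisfied:
            # Everyone wants more than their share: split strictly by weight.
            for name, (w, _) in active.items():
                rates[name] = remaining * w / total_weight
            break
        for name, demand in satisfied.items():
            rates[name] = float(demand)
            remaining -= demand
            del active[name]
    return rates


# Example: verification codes outrank notifications, which outrank marketing.
rates = allocate_rates(1000, {
    "verification": (5, 100),
    "notification": (3, 2000),
    "marketing": (1, 5000),
})
```

Re-running the allocation every second (or on every queue-depth change) gives the dynamic adjustment the text describes: when verification traffic is light, its unused share flows to the lower-priority queues, and when it spikes, their rates shrink.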

Second, at the communication network architecture level, we mainly use open source and cloud middleware products such as Nacos, Redis, and MNS. In VPC networking we also make heavy use of network acceleration technologies such as EIP, NAT, SLB, VPN, and IPSec to keep communication latency low.

Cloud services are usually deployed in independent VPCs, and VPC access has to go through SLB or NAT: traffic from public network users to resources on the cloud is forwarded through SLB, while traffic from resources on the cloud to the public network is forwarded through NAT. For cross-region access on the cloud, calls are routed through cross-region gateways, first to the gateway in the local region and then to the gateway in the remote region, so that network transmission performance is guaranteed.

❖ Easy O&M and cost savings: second-level auto scaling

As mentioned above, the SMS gateway has a huge cluster and nodes around the world. Besides scheduling, this raises another question: how do we control cost across edge nodes at this scale, and how do we handle auto scaling and maintenance under tidal traffic?

In essence, the core difficulty in operating the SMS gateway is that connections are stateful. Statefulness causes all kinds of complex problems, the biggest being that stateful containers cannot be scaled elastically, so solving this is also key to saving cost. Easy O&M depends on two things: a system architecture that supports cloud-native operations, and observability technology that supports data-driven, intelligent operations.

First, at the system architecture level, we rebuilt the traditional communication gateway for the cloud using a distributed, loosely coupled gateway architecture that decouples the service processing module from the communication protocol session module. The service processing layer does not need to care about connection state and can scale out and in dynamically with traffic, while a self-developed data connector provides route discovery and scheduling.
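The decoupling can be illustrated with a small sketch in which stateless workers never hold a carrier connection themselves and instead ask a connector which session node owns the connection for a given channel. All names here (`SessionNode`, `DataConnector`, `Worker`, the channel identifiers) are hypothetical, and the carrier connection is simulated with an in-memory list; the real data connector is not described in enough detail to reproduce.

```python
class SessionNode:
    """Stateful side: owns long-lived carrier connections (simulated here)."""

    def __init__(self, name, channels):
        self.name = name
        self.channels = set(channels)
        self.sent = []  # stands in for messages written to a carrier connection

    def submit(self, channel, message):
        if channel not in self.channels:
            raise KeyError(f"{self.name} holds no connection for {channel}")
        self.sent.append((channel, message))
        return f"{self.name}:{channel}"


class DataConnector:
    """Hypothetical registry that maps a channel to the session nodes holding it."""

    def __init__(self):
        self._routes = {}

    def register(self, node):
        for channel in node.channels:
            self._routes.setdefault(channel, []).append(node)

    def discover(self, channel):
        nodes = self._routes.get(channel)
        if not nodes:
            raise LookupError(f"no session node for channel {channel}")
        return nodes[0]  # a real connector would apply a scheduling policy here


class Worker:
    """Stateless side: keeps no connection state, so instances can come and go."""

    def __init__(self, connector):
        self.connector = connector

    def handle(self, channel, message):
        return self.connector.discover(channel).submit(channel, message)


# Scaling the service layer is just creating or dropping Worker instances.
connector = DataConnector()
session = SessionNode("session-1", ["carrier-a"])
connector.register(session)
workers = [Worker(connector) for _ in range(3)]
results = [w.handle("carrier-a", f"msg-{i}") for i, w in enumerate(workers)]
```

Because only `SessionNode` carries connection state, the number of `Worker` instances can follow traffic up and down without any connection having to be migrated, which is the property the paragraph above relies on.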

To keep deployment and design lightweight, we split the cloud network architecture into independent domain modules, each of which solves the problems of its own domain. For business services that are closely related, we use service integration and extension for inter-service communication rather than building that logic into the local network gateway. This keeps the local network gateway lightweight and single-purpose, and therefore easier to operate and maintain.

Second, at the level of data-driven, intelligent operations, the first questions to ask are: why invest deeply in observability? How much does the observable data cover? Is the data isolated or aggregated? What does the network structure look like?

“Observability” is a broad concept that includes application performance metrics, distributed tracing, container monitoring, system monitoring, log monitoring, and more. Each of these is a single point; for a business application system, what we want is a comprehensive observability system.

In terms of layers, the top layer is “seeing”: metrics are visible and alerts can fire. The next layer is “analyzing”: you can follow the call chain, analyze response time (RT), and find where an exception occurred. For well-understood scenarios, orchestration-based root cause analysis and automatic fault location are carried out by system automation.

To sum up, observability should be multi-faceted. The problem we actually solve is how to aggregate and analyze this observable data and feed the results back to the service gateway, so that O&M can be controlled automatically in an AIOps fashion.

Through this evolution, the gateway has stayed committed to scale, edge deployment, and data-driven intelligence:

Global multi-site, multi-node network topology is deployed through a cloud network architecture of cloud gateways and edge gateways;

The edge-oriented architecture makes large-scale gateways quick and convenient to deploy, and the rebuilt cloud network communication mode gives cloud gateways elastic horizontal scaling.

Finally, global gateway nodes are monitored and instrumented with metrics and traces using observability technology, and an orchestration-based root cause analysis capability has been built.

I hope the content above gives everyone a fresh understanding of cloud communication. Interested readers are welcome to get in touch and exchange ideas.
