I. Preface

Our company’s cluster was perpetually on the verge of collapse. After nearly three months of getting to grips with it, I found the causes of its instability to be the following:

1. The release process is unstable

2. Lack of a monitoring platform [the most important reason]

3. Lack of a log system

4. Extreme lack of relevant operation documents

5. The request route is unclear

In general, the primary cause of the problems was the lack of a predictive monitoring platform: issues were only discovered after the fact. The secondary causes were the unclear roles of the servers and the instability of the release process.

II. Solutions

1. The release process is unstable

Refactor the release process: move the business fully onto K8S and build a CI/CD pipeline with Kubernetes at its core.

1) Release process

The release process is as follows:

Analysis: R&D personnel submit code to the developer branch (which is always kept on the latest code); the developer branch is merged into the branch corresponding to the release environment, which triggers an enterprise WeChat notification and the gitlab-runner pod deployed in the k8s cluster; the runner pod then starts and performs the CI/CD work. This process has three steps: run the test cases, package the image, and update the pod.
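The three steps above map naturally onto GitLab CI stages. A minimal sketch of such a `.gitlab-ci.yml` follows; the job names, runner tag, registry host, and script bodies are illustrative assumptions, not the article's actual pipeline:

```yaml
# Hypothetical three-stage pipeline: test the case, package the image, update the pod.
stages:
  - test
  - build
  - deploy

test_case:
  stage: test
  tags: [k8s-runner]        # picked up by the gitlab-runner pod in the cluster
  script:
    - make test

build_image:
  stage: build
  tags: [k8s-runner]
  script:
    # push to the Alibaba Cloud registry via its VPC endpoint (host is a placeholder)
    - docker build -t registry-vpc.example.com/app:$CI_COMMIT_SHORT_SHA .
    - docker push registry-vpc.example.com/app:$CI_COMMIT_SHORT_SHA

update_pod:
  stage: deploy
  tags: [k8s-runner]
  script:
    - bash kubernetes.sh $CI_COMMIT_SHORT_SHA   # applies/updates the K8S manifests
```

The branch-to-environment mapping would be expressed with `rules:` or `only:` clauses on each job.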

The first deployment of a service into the K8S cluster may require: creating a namespace, an imagepullsecret, a pv (storageclass), a deployment (pod controller), an svc, an ingress, and so on. The image is packaged and pushed to the Alibaba Cloud registry, and images are pulled from that registry over a VPC endpoint, so traffic never traverses the public network and is not subject to bandwidth limits. When the pipeline finishes, the runner pod is destroyed and gitlab returns the result.
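The first-deployment steps listed above can be sketched as a sequence of kubectl commands; every name here (namespace, registry host, credentials, manifest files) is a placeholder, and configmaps/secrets are deliberately absent, as the next paragraph explains:

```shell
# One-time bootstrap for a new service (all names are illustrative).
kubectl create namespace myapp
kubectl -n myapp create secret docker-registry imagepullsecret \
  --docker-server=registry-vpc.example.com \
  --docker-username=deploy --docker-password='***'
kubectl -n myapp apply -f pv.yaml          # or rely on a StorageClass + PVC
kubectl -n myapp apply -f deployment.yaml  # the pod controller
kubectl -n myapp apply -f svc.yaml
kubectl -n myapp apply -f ingress.yaml
```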

It should be emphasized that the resource manifests here contain no configmap or secret: for security reasons these must not appear in the code repository. Our company uses Rancher as its multi-cluster k8s management platform, and these sensitive objects are maintained by O&M through Rancher’s dashboard.

2) Service deployment logic diagram

The service deployment logic diagram is as follows:

Analyzing the release process and then following the logic diagram makes the flow clear. Note that our company uses Kong instead of nginx for authentication, authorization, and proxying, and the SLB’s IP is bound to Kong. Steps 0, 1, and 2 belong to the test job; 3 belongs to the build job; 4, 5, 6, and 7 belong to the pod-update phase. Not every service needs storage; that has to be decided case by case, so the judgment logic is written into kubernetes.sh.
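The per-service judgment in kubernetes.sh might look like the sketch below; the service names, the `NEEDS_STORAGE` list, and the manifest file names are hypothetical, since the article does not show the actual script:

```shell
#!/usr/bin/env bash
# Sketch: only services listed in NEEDS_STORAGE get a storage manifest applied.
NEEDS_STORAGE="upload-svc media-svc"   # illustrative list of stateful services

manifests_for() {
  local svc="$1"
  local list="deployment.yaml service.yaml ingress.yaml"
  for s in $NEEDS_STORAGE; do
    if [ "$s" = "$svc" ]; then
      list="pvc.yaml $list"            # prepend the storage manifest only when needed
    fi
  done
  echo "$list"
}

# In the real job each file would be fed to `kubectl apply -f`:
# for f in $(manifests_for "$CI_PROJECT_NAME"); do kubectl apply -f "k8s/$f"; done
manifests_for "upload-svc"
manifests_for "auth-svc"
```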

Here I tried to use one set of CI scripts for all environments, which requires a lot of conditional logic in kubernetes.sh and makes .gitlab-ci.yml rather bloated. My recommendation is to use a CI template and apply it to all environments; in short, whatever saves effort. Also take your own branching model into account.
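The template approach recommended above could be wired up with GitLab's `include` mechanism; the project path, file name, and variable below are assumptions for illustration:

```yaml
# Per-repository .gitlab-ci.yml that pulls in one shared template
# (ops/ci-templates and k8s-deploy.yml are hypothetical names).
include:
  - project: ops/ci-templates
    file: /k8s-deploy.yml

variables:
  DEPLOY_ENV: $CI_COMMIT_BRANCH   # the branch name selects the target environment
```

Each repository then only overrides what differs, instead of duplicating the whole pipeline and pushing every decision into kubernetes.sh.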

2. Lack of monitoring and early warning platform

Build a reliable federated monitoring platform suited to our cluster environment, monitoring several clusters simultaneously and raising alarms before faults occur, so that we can intervene in advance.

1) Monitoring and early warning logic chart

The monitoring and early warning logic diagram is as follows:

Overall, the monitoring scheme I use here is prometheus + shell/go scripts + sentry, with alarms delivered via WeChat or enterprise email. The three colored lines in the figure above represent three monitoring paths to note. The scripts mainly handle backup alarms, certificate-expiry alarms, and similar checks. Prometheus here uses resource manifests adapted from prometheus-operator, with the data stored on NAS. Strictly speaking, Sentry belongs to the log-collection category, but I classify it as monitoring because what I value is its ability to capture crash information from the application’s underlying code. It serves as business-logic monitoring, collecting and aggregating the error logs produced while the business systems run, and alarming on them.
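As an example of the script-based path, a certificate-expiry alarm might be sketched as below; the article only says shell scripts feed certificate alarms into WeChat, so the threshold, webhook, and endpoint here are all assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical certificate-expiry alarm (GNU date assumed).
THRESHOLD_DAYS=14

# Days from now until the given expiry date string.
days_until() {
  local expiry_epoch now_epoch
  expiry_epoch=$(date -d "$1" +%s) || return 1
  now_epoch=$(date +%s)
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

check_cert() {
  # Extract notAfter from a live endpoint, e.g. check_cert example.com:443
  local end left
  end=$(echo | openssl s_client -connect "$1" -servername "${1%%:*}" 2>/dev/null \
        | openssl x509 -noout -enddate | cut -d= -f2)
  left=$(days_until "$end")
  if [ "$left" -lt "$THRESHOLD_DAYS" ]; then
    # A real script would POST to the enterprise WeChat webhook here, e.g.:
    # curl -s -X POST "$WECHAT_WEBHOOK" -d "{\"msgtype\":\"text\",...}"
    echo "ALERT: $1 certificate expires in $left days"
  fi
}

days_until "2099-01-01"
```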

Note that what is used here is a federated monitoring platform, not an ordinary single-cluster monitoring deployment.

2) Federated monitoring and early warning platform logic diagram

The multi-cluster federated monitoring and early warning platform is as follows:

Because our company has several K8S clusters, deploying a separate monitoring and early-warning platform on each would be too inconvenient to manage. So the strategy here is to federate the per-cluster monitoring platforms and manage them through a single unified visual interface. I implement monitoring at three levels: operating system, application, and business. Traffic monitoring can be done directly against Kong, using dashboard template 7424.
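The federation described above is supported natively by Prometheus: each member cluster's Prometheus exposes a `/federate` endpoint and a central Prometheus scrapes selected series from all of them. A minimal sketch of the central scrape config, with placeholder cluster endpoints:

```yaml
# Central Prometheus pulling from per-cluster Prometheus instances.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true            # keep the member cluster's own labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'           # in practice, narrow this selector to what you need
    static_configs:
      - targets:
          - 'prometheus.cluster-a.example.com:9090'
          - 'prometheus.cluster-b.example.com:9090'
```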

3. Lack of logging system

As the move of all services onto K8S progresses, the need for a log system becomes more pressing, since a characteristic of K8S is that service fault logs are hard to obtain. Establishing an observable, filterable log system makes failure analysis much easier.

The logic diagram for the logging system is as follows:


Analysis: after the business moves fully onto K8S, management and maintenance become easier, but log management becomes somewhat harder. We know that pod restarts are multi-factorial and uncontrollable, and every restarted pod starts logging afresh; that is, the logs from before the new pod are not visible. There are of course several ways to keep logs long-term: remote log storage, locally mounted logs, and so on. For visualization, analysis, and more, I chose to build the log collection system on Elasticsearch.
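A common way to feed container logs into Elasticsearch is a node-level shipper running as a DaemonSet. The article does not name the shipper it uses, so the sketch below (Fluent Bit, with a placeholder image tag, env var, and ES host) is only one plausible shape:

```yaml
# Hypothetical node-level log collector tailing /var/log on every node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels: {app: fluent-bit}
  template:
    metadata:
      labels: {app: fluent-bit}
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:1.9
          env:
            - name: FLUENT_ELASTICSEARCH_HOST   # illustrative; real config lives in a config file
              value: elasticsearch.logging.svc
          volumeMounts:
            - {name: varlog, mountPath: /var/log, readOnly: true}
      volumes:
        - name: varlog
          hostPath: {path: /var/log}
```

Because the collector runs on the node rather than in the pod, the logs survive pod restarts, which addresses exactly the problem described above.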

4. Extreme lack of relevant operation documents

Establish a document center, centered on Yuque, for O&M-related materials, recording relevant operations, problems, scripts, and so on in detail so they can be consulted at any time.

For security reasons, it is not appropriate for too many colleagues to have access. O&M work is rather special: both security and documentation must be guaranteed. Whether you do operations or operations development, I think writing documentation is a skill that must be mastered, for your own sake and for others’. Documents can be brief, but they must cover the core steps. I still believe every O&M operation should be documented.

5. The request route is unclear

Following the new thinking behind the cluster rebuild, reorganize the cluster-level traffic routing to build traffic management that integrates authentication, authorization, proxying, connectivity, protection, control, and observability, effectively containing the blast radius of faults.

The request route logic diagram is as follows:


Analysis: a customer request (e.g. to https://www.cnblogs.com/zisefeizhu) passes Kong gateway authentication and enters a specific namespace (projects are distinguished by namespace). Because the services have been split into microservices, inter-service communication is authenticated and authorized by istio: services that need the database go to the database, services that need to write or read storage go to the PV, services that need conversion go to the conversion service, and so on; then a response is returned.
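The istio-enforced inter-service authentication mentioned above can be switched on per namespace. A minimal sketch (the namespace name is a placeholder) that requires mutual TLS between sidecars:

```yaml
# Reject any plaintext service-to-service traffic inside the namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: myapp
spec:
  mtls:
    mode: STRICT    # only mutual-TLS traffic between sidecars is accepted
```

Finer-grained authorization (which service may call which) would then be layered on with `AuthorizationPolicy` resources.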

III. Summary

To sum up: with a CI/CD release pipeline built around Kubernetes, a federated monitoring and early-warning platform built around Prometheus, a log collection system built around Elasticsearch, a document management center built around Yuque, and north-south and east-west traffic services integrated around Kong and Istio, high concurrency and high reliability are well guaranteed.

Attached: Overall architecture logic diagram


Note: please read the diagram by following the arrows and colors.

Analysis: the figure above may look chaotic, but if you calmly break it down module by module as above, it is still clear. Different colored lines represent the paths of different modules, and following the arrows makes each one quite clear.

At our current business traffic levels, the functional modules above can in theory keep the cluster stable. I personally believe this solution can ensure stable operation of the business on the k8s cluster for some time; any remaining problems belong at the code level. No message middleware is used here yet; redis is used as a cache but is not drawn. Once the above is complete, I plan to add middleware such as kafka or rq to the logging system and the conversion service.

Author丨zisefeizhu