The current situation of the online education industry
With the arrival of the Internet in the 1990s, online education products were born on top of it. As Internet technology developed, new models of online education emerged, and course content evolved from simple text to pictures and audio. Online education has in turn pushed digitization forward: content, the core asset of education companies, keeps growing in both the degree and the scale of its digitization. At the same time, rising user time provides a large amount of source data for educational AI. According to statistics, the average daily online time of online education users in March this year exceeded 2 million. Such massive data provides fertile soil for the intelligent development of the industry and promotes intelligent analysis of teaching content, course marketing, teacher management, and quality evaluation.
Liulishuo Company Introduction
Liulishuo is a world-leading, technology-driven education company and an advocate of intelligent education, with one of the leading artificial intelligence teams in the industry. After years of accumulation, Liulishuo has built a huge "Chinese English pronunciation database" containing about 3.7 billion minutes of dialogue and 50.4 billion sentence recordings. In 2013, Liulishuo launched its first product, "English Liulishuo", which integrates core technologies such as speech recognition, scoring, and adaptive learning. It offers rich content such as contextual dialogues and pronunciation guidance lessons, together with an AI English teacher and a gamified learning experience that make learning English more fun. This engaging and effective product quickly won market share and high recognition from users. However, as the business grew rapidly, the number of platform users rose from one million to more than 100 million, and the swings in data traffic between peak and off-peak hours, the growing business complexity, and the difficulty of analysis brought great challenges to the IT architecture.
As a company without a separate operations department, Liulishuo relies on the cloud-infra team to develop the unified monitoring platform for its basic platform. The team's core requirements cover not only SLA, performance monitoring, alerting, and data for locating problems, but also the technical value operation of cloud-infra, such as utilization, cost savings, and the business relationship network. These core requirements place high demands on the unified monitoring platform:
1. Collect and monitor heterogeneous data sources, including machine metrics on K8s and ECS, utilization rates, Istio call logs, metrics from self-built middleware, metrics provided by cloud services, and business trace data, as well as real-time collection of various cost data.
2. Dynamically discover and collect all kinds of resources. Organizational relationships and other department-related data also need to be updated in real time, so that the most accurate metrics and ownership relationships are always available.
3. Store and analyze data at large scale. Because of the size of Liulishuo's business, the cloud resources in use and the data generated by the business are huge, tens of terabytes per day, and the solution must support real-time analysis and presentation at this scale.
4. Be stable itself. The monitoring platform is responsible for stability problems, so every part of it must be free of single points of failure and able to recover quickly from anomalies.
Technology selection
The unified monitoring platform deals with more than time-series data: core business availability data has to be calculated and analyzed from various logs, so two data schemes are needed, Logs and Metrics. There are different community and commercial solutions for these two types of data, such as ES, Loki, SLS, Prometheus, OpenTSDB, and InfluxDB. In the end, Alibaba Cloud SLS was selected as the log solution and Prometheus + SLS as the time series solution, mainly for the following reasons:
1. SLS can store and analyze all kinds of data in a unified manner, and Metrics and Logs data can be correlated on SLS, which other platforms do not offer.
2. The SLS platform handles very large data volumes, performs much better than ES, and is an O&M-free service, which removes the burden of keeping ES highly reliable.
3. The time series scheme is based on Prometheus. The Prometheus ecosystem is very mature, and PromQL is concise to use. The SLS time series store can serve as highly reliable remote storage for Prometheus, which solves the reliability problem of Prometheus.
4. SLS provides data processing that can be combined with external data sources, so it can handle complex logs better and enrich logs with catalog information about the overall architecture.
The architecture of the current Liulishuo unified monitoring platform is shown in the figure above:
1. To achieve automation, we developed a dynamic discovery mechanism for IaaS and PaaS resources suited to cloud scenarios. It adds newly purchased and newly created resources to monitoring and collection in real time, which avoids most manual operations.
2. Log related
• Logs from different services are collected directly into different Logstores through SLS Logtail.
• Not all logs need to be stored and indexed for a long time, so we classify them. Logs kept for audit requirements are sent to OSS for long-term storage. Troubleshooting logs are stored for only 2 weeks, with full-text indexing enabled. AccessLog enables indexing on only some fields, which saves a lot of indexing cost.
• For NGINX access logs that are used to calculate SLA and PXX metrics, data processing maps the URLs in the logs to the corresponding departments, applications, and methods. The mapping rules and the catalog information about departments, applications, and so on are stored in RDS.
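The URL-to-catalog mapping described above can be sketched in plain Python. This is only an illustration under stated assumptions: on the platform the mapping runs inside SLS data processing with rules held in RDS, whereas the rule table, the field name `request_uri`, and the `enrich` helper below are all hypothetical.

```python
import re

# Hypothetical catalog rules (the article keeps the real ones in RDS).
# Each rule: (URL pattern, department, application, method).
CATALOG_RULES = [
    (re.compile(r"^/api/v1/courses(/|$)"), "learning", "course-service", "courses"),
    (re.compile(r"^/api/v1/payments(/|$)"), "commerce", "pay-service", "payments"),
]


def enrich(log: dict) -> dict:
    """Return a copy of an access-log record with catalog fields attached."""
    # Match on the path only, so query strings do not affect the mapping.
    path = log.get("request_uri", "").split("?", 1)[0]
    enriched = dict(log)
    for pattern, department, application, method in CATALOG_RULES:
        if pattern.search(path):
            enriched.update(
                department=department, application=application, method=method
            )
            return enriched
    # No rule matched: tag the record so unmapped URLs can be reviewed later.
    enriched.update(department="unknown", application="unknown", method="unknown")
    return enriched
```

Tagging unmatched URLs as `unknown` instead of dropping them keeps the per-department totals consistent with the raw request count.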
3. Monitoring related
• Prometheus was selected as the monitoring solution. For the Liulishuo scenario, we developed several Exporters that obtain Metrics from various cloud products and self-built components.
• To make better use of Prometheus and integrate it with the internal CI/CD system, we added a sidecar to Prometheus that watches for changes in the Git repository. To improve query speed, various Recording Rules are configured on Prometheus and managed in Git.
• AlertManager alerts are sent directly to the internal alarm center, which provides advanced functions such as on-call scheduling and escalation.
• To remove the Prometheus single point of failure and to enable later correlation analysis with the Catalog, we use the SLS time series store and let Prometheus Remote Write directly into it.
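A sidecar that watches a Git repository could look roughly like the sketch below. The article does not describe how the sidecar works, so everything here is an assumption: the sidecar is taken to compare the checked-out commit with the last one it saw and, on change, to ask Prometheus to reload its configuration through the `/-/reload` HTTP endpoint (which Prometheus only serves when started with `--web.enable-lifecycle`). Pulling new commits is left out, and the function names are hypothetical.

```python
import subprocess
import urllib.request
from typing import Callable, Optional


def git_head(repo_dir: str) -> str:
    """Return the commit hash currently checked out in repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def reload_prometheus(base_url: str) -> None:
    """Ask Prometheus to re-read its config and rule files."""
    request = urllib.request.Request(f"{base_url}/-/reload", data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=10):
        pass  # any non-2xx response raises an HTTPError


def sync_once(
    last_seen: Optional[str],
    current_head: Callable[[], str],
    reload: Callable[[], None],
) -> str:
    """Trigger a reload only when the repository head has changed."""
    head = current_head()
    if head != last_seen:
        reload()
    return head
```

A real sidecar would call `sync_once` in a loop with a sleep, for example `sync_once(last, lambda: git_head("/etc/prometheus/rules"), lambda: reload_prometheus("http://localhost:9090"))`, where the path and URL are placeholders.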
4. Indicator calculation
• Some core indicators are calculated from the NGINX AccessLog. At the entry point we can obtain the QPS, error rate, and latency (average, PXX, etc.) of each business without any intrusion into the business itself.
• Resource utilization, middleware, infrastructure, and other indicators come from the time series store written by Prometheus. Based on the Catalog, the relevant indicators can be aggregated and calculated for each department and business.
• Because the calculated indicator data is very small, it can easily be stored in MySQL and ES, and a copy can be sent to OSS as a backup.
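The entry-point indicators named above (QPS, error rate, average and PXX latency) can be computed from access-log records along these lines. This is a minimal sketch, not the platform's SLS query: the record fields `application`, `status`, and `request_time`, the choice of counting only 5xx responses as errors, and the nearest-rank percentile are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(records, window_seconds):
    """Aggregate per-application QPS, error rate, and latency for one time window."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["application"]].append(record)

    summary = {}
    for application, rows in grouped.items():
        latencies = [row["request_time"] for row in rows]
        # Assumption: only 5xx responses count against the error rate.
        errors = sum(1 for row in rows if row["status"] >= 500)
        summary[application] = {
            "qps": len(rows) / window_seconds,
            "error_rate": errors / len(rows),
            "latency_avg": sum(latencies) / len(latencies),
            "latency_p99": percentile(latencies, 99),
        }
    return summary
```

Grouping by a field that was attached during log enrichment is what lets these numbers be reported per business without instrumenting the services themselves.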
5. FinOps: In the Cloud Infra department, the issue we are challenged on most is cost, so cost optimization is one of our core tasks. The main practice is to calculate the resource utilization of each department and team, including the average utilization and the utilization at various PXX percentiles (as shown in the table below), so as to judge each department's resource usage and push each department to optimize its costs.
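A per-department utilization report of this kind can be sketched with Python's standard `statistics` module. The input shape (a mapping from department to CPU utilization samples in percent), the chosen percentiles, and the sort order are assumptions for illustration; the article does not say how its table is produced.

```python
from statistics import mean, quantiles


def utilization_report(samples):
    """Build average / P50 / P95 utilization rows per department.

    samples: {department: [utilization percent, ...]}, at least two
    samples per department (required by statistics.quantiles).
    """
    report = []
    for department, values in samples.items():
        # n=100 yields 99 cut points; index k-1 holds the k-th percentile.
        cuts = quantiles(values, n=100, method="inclusive")
        report.append({
            "department": department,
            "avg": mean(values),
            "p50": cuts[49],
            "p95": cuts[94],
        })
    # Lowest P95 first: even at their busiest these departments use the
    # least of what they hold, so they are the first optimization candidates.
    return sorted(report, key=lambda row: row["p95"])
```

Looking at a high percentile alongside the average helps separate resources that are idle all the time from resources that are sized for short peaks.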
The technology behind Liulishuo unified monitoring
Liulishuo unified monitoring is built on Alibaba Cloud SLS, which is positioned as a cloud-native observability and analysis platform that provides large-scale, low-cost, real-time platform services for Log/Metric/Trace and other data. It offers one-stop data collection, processing, analysis, alerting, visualization, and delivery, and comprehensively improves digital capabilities in R&D, operations and maintenance, business operations, and security. Unified monitoring uses several core functions of SLS, mainly including:
All-round log collection
SLS supports unified collection of Log/Metric/Trace data, access from servers, Kubernetes, applications, mobile devices, web pages, IoT, and other data sources, and access to log data from Alibaba Cloud products, open source systems, and cloud and off-cloud environments. The core features are:
• Rich: 40+ mature access solutions, with multi-client unified collection and support for intranet, public network, global accelerated transmission, and other transmission methods.
• Reliable: runs on infrastructure that Alibaba uses itself, tested by many Double 11 and Spring Festival Gala events. Supports resumable upload and scales elastically with business traffic.
• Open: seamless multi-protocol access (HTTP/Syslog/Prometheus/OpenTelemetry) and full integration with the open source ecosystem.
Prometheus time series solution
SLS time series storage was designed from the beginning to meet the time series storage needs of Alibaba itself and of many leading enterprise customers. Backed by years of technology accumulation within Alibaba, it can meet most enterprise-level time series monitoring and analysis requirements. Its main characteristics are:
1. Rich upstream and downstream: SLS supports many collection methods for data access, including various open source agents and Alibaba Cloud internal monitoring data channels. The stored time series data can also be connected to various stream computing and offline computing engines, so the data is completely open.
2. High performance: the SLS architecture separates storage from computing and gives full play to cluster capabilities, especially in end-to-end speed on large data volumes.
3. O&M-free: SLS time series storage is fully service-based, so users do not operate or maintain instances. All data is stored in 3 replicas with high reliability, so there is no need to worry about data reliability.
4. Open source friendly: SLS time series storage natively supports Prometheus writes and queries as well as SQL92 analysis, and connects natively to visualization solutions such as Grafana.
5. Intelligence: SLS provides a variety of AIOps time series algorithms, such as multi-period estimation, forecasting, anomaly detection, and time series classification. Based on these algorithms, a company can quickly build an intelligent alerting and diagnosis platform suited to its business.
Real-time data analysis
Query and analysis supports keywords, SQL92, AIOps functions, and other methods, enabling real-time query and analysis of text and structured data, anomaly inspection, and intelligent analysis. The main features are as follows:
1. High performance: second-level analysis of billions of records, with full support for analysis interfaces such as SQL and PromQL and for protocols such as HTTP, Kafka, JDBC, and Prometheus.
2. Stable and reliable: enterprise-grade design with multi-tenant isolation and petabyte-level capacity, chosen by tens of thousands of enterprise users.
3. Intelligence: AIOps capabilities proven in practice across the Alibaba ecosystem support intelligent anomaly inspection and root cause analysis.
Through flexible syntax, data processing supports complex data extraction, parsing, enrichment, distribution, and other requirements without writing code, and supports structured analysis. The main features of data processing are as follows:
1. Flexible: provides rich operators and out-of-the-box scenario-based UDFs (Syslog, non-standard JSON, AccessLog UA/URI/IP parsing, etc.), with extensible syntax to cope with various complex formats.
2. O&M-free: a fully managed cloud service that needs no additional O&M resources and scales automatically with traffic.
3. Scalability: supports logic such as multi-level nesting and traffic splitting, and meets complex data dispatch and orchestration requirements.
In the cloud native era, digitalization is driving business innovation in every industry. Only by improving the user experience, accelerating innovation, updating infrastructure and architecture, and making use of diverse data can a company stand out. The intelligent O&M platform launched by Alibaba Cloud is meant not only to reduce engineers' workload, but also to free O&M engineers from all kinds of mechanical work. We take care of all the "dirty work" and greatly shorten failure time, so that O&M personnel can devote more creativity to digital innovation and business innovation and make the enterprise more competitive.