vivo Internet Security Team – Xie Peng

With the advent of the Internet's big data era, web crawlers have become an important part of the industry. A crawler is a program that automatically collects data from web pages and is a core component of search engines. Crawlers can fetch the data you need and assist day-to-day work, thereby reducing costs, raising business success rates, and improving efficiency.

This article explains, from both the crawling and anti-crawling perspectives, how open data on the network can be crawled efficiently; it also introduces common anti-crawling techniques and offers some suggestions for preventing external crawlers from overloading a server while collecting data in bulk.

A crawler is a program that automatically fetches information from the World Wide Web according to certain rules. This article gives a brief introduction to two aspects: the technical principles and implementation of crawlers, and anti-crawling and anti-anti-crawling techniques. The cases presented are for security research and learning only and will not be used for large-scale crawling or applied to business.

I. Technical principles and implementation of crawlers

1.1 Definition of a crawler

Crawlers fall into two categories: general-purpose crawlers and focused crawlers. The former aim to crawl as many sites as possible while maintaining a certain level of content quality; search engines such as Baidu are crawlers of this type. Figure 1 shows the basic architecture of a general search engine:

First, select a subset of pages on the Internet and use their link addresses as seed URLs;

Put these seed URLs into the queue of URLs to be crawled, from which the crawler reads them in turn;

Resolve each URL through DNS, converting the link address into the IP address of the corresponding website server;

The page downloader downloads the page from the website server, in the form of a web document;

Extract URLs from the downloaded documents and filter out those that have already been crawled;

Keep looping over the URLs that have not yet been crawled until the queue of URLs to be crawled is empty.

Figure 1: Infrastructure of a common search engine

A crawler typically starts from one or more URLs and, during crawling, keeps adding new URLs that meet its rules to the queue to be crawled, until the program's stop condition is met.
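A minimal sketch of this loop in Python (the seed URL is hypothetical and the page limit is arbitrary), assuming the requests and beautifulsoup4 packages:

```python
# Minimal sketch of the crawl loop described above (not production code).
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

seeds = ["https://example.com/"]          # seed URLs (hypothetical)
to_crawl = deque(seeds)                   # queue of URLs to be crawled
crawled = set()                           # URLs that have already been crawled

while to_crawl and len(crawled) < 100:    # stop condition: queue empty or page limit reached
    url = to_crawl.popleft()
    if url in crawled:
        continue
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        continue
    crawled.add(url)
    # Extract new URLs from the downloaded document and filter out crawled ones
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if link.startswith("http") and link not in crawled:
            to_crawl.append(link)
```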

The crawlers we encounter every day are mostly of the latter type, focused crawlers, whose goal is to crawl a small number of sites while keeping the crawled content as accurate as possible. A typical example, shown in Figure 2, is ticket-snatching software, which uses a crawler to log in to a ticketing site and crawl information, thereby assisting its business.

Figure 2: Ticket snatching software

Now that we know what a crawler is, how do we write one to crawl the data we want? We can start by looking at the commonly used crawler frameworks. A framework implements the generic parts of a crawler and exposes a set of interfaces, so for a specific project we only need to hand-write the small amount of code that varies with the target site and call those interfaces as needed to get a working crawler.

1.2 Introduction to crawler frameworks

Figure 3 compares commonly used crawler frameworks. Nutch is a crawler designed specifically for search engines and is not well suited to precise, focused crawling. Pyspider and Scrapy are both crawler frameworks written in Python, and both support distributed crawling. Pyspider is more user-friendly thanks to its visual web interface, whereas Scrapy is operated entirely from the command line but is more powerful.

Figure 3: Comparison of crawler frameworks
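As a rough illustration of the "framework plus a little custom code" idea, here is a minimal Scrapy spider sketch; the start URL and CSS selectors are hypothetical:

```python
# Minimal Scrapy spider sketch: the framework handles scheduling, downloading and
# deduplication, and we only write the extraction logic for our specific site.
import scrapy


class AppListSpider(scrapy.Spider):
    name = "app_list"
    start_urls = ["https://example.com/rank"]        # hypothetical start URL

    def parse(self, response):
        for item in response.css("li.app-item"):     # hypothetical selector
            yield {
                "name": item.css("p.name::text").get(),
                "category": item.css("p.category::text").get(),
            }
        # Follow pagination links, if any (hypothetical selector)
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Such a spider can be run with "scrapy runspider spider.py -o apps.json" without writing any scheduling or downloading code of our own.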

1.3 A simple example of a crawler

In addition to using a crawler framework, you can also write a crawler from scratch; the steps are shown in Figure 4:

Figure 4: The basic principle of crawlers

Next, let's walk through the steps above with a simple example: crawling the leaderboard of an application market. We use this site as the example because it has no anti-crawling measures at all, so the content can be crawled easily by following the steps above.

Figure 5: Web page and its corresponding source code

The web page and its corresponding source code are shown in Figure 5. For the data on this page, suppose we want to crawl the name and category of every app on the leaderboard.

We first analyze the page source and find that an app name such as "Douyin" can be searched for directly in the HTML, and that the app name, category and other fields all sit inside a tag pair. So we only need to request the page address, take the returned HTML source, run a regular-expression match over it, extract the data we want and save it, as shown in Figure 6.

    Figure 6: The code of the crawler and the result
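Since Figure 6 only appears as an image, the following is a rough sketch of what such a crawler might look like; the URL, headers and regular expressions are assumptions rather than the original code:

```python
# Rough reconstruction of the kind of crawler shown in Figure 6.
import re
import requests

url = "https://example-market.com/rank"          # hypothetical app-market leaderboard URL
headers = {"User-Agent": "Mozilla/5.0"}          # present ourselves as a normal browser

html = requests.get(url, headers=headers, timeout=10).text

# Assume each app sits in tag pairs like:
#   <p class="name">Douyin</p> ... <p class="category">Video</p>
names = re.findall(r'<p class="name">(.*?)</p>', html)
categories = re.findall(r'<p class="category">(.*?)</p>', html)

for name, category in zip(names, categories):
    print(name, category)
```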

II. Anti-crawler techniques

Before going into specific anti-crawler measures, let's first define anti-crawling: any behavior that restricts a crawler's access to server resources and data is called anti-crawling. A bot's access rate and purpose differ from those of normal users, and most crawlers fetch from their target application without restraint, putting heavy pressure on its servers; operators often refer to the network requests made by bots as "junk traffic". To keep servers running normally, or to reduce their load and operating costs, developers have to use various technical means to restrict crawlers' access to server resources.

Why anti-crawl at all? The answer is obvious: crawler traffic increases server load, and excessive crawler traffic can disrupt normal service and cause revenue loss. On the other hand, leakage of core data makes its owner less competitive.

Common anti-crawler tactics are shown in Figure 7. They mainly include text obfuscation, dynamic page rendering, captcha verification, request signature verification, big-data risk control, JS obfuscation and honeypots. Text obfuscation in turn covers CSS offsets, disguising text as images and custom fonts, while risk-control strategies are usually built on parameter verification, access frequency and abnormal behavior patterns.

    Figure 7: Common anti-crawler means

    2.1 CSS offset anti-crawler

When building a web page, CSS is used to control where each character is displayed, and this very capability can be exploited: the text can be stored out of order in the HTML and rearranged by CSS into the order a human reads in the browser, which defeats naive crawlers. CSS offset anti-crawling is exactly this technique. The idea is easier to grasp by comparing two pieces of text:

Text in the HTML source: My student ID is 1308205, I am studying at Peking University.

Text displayed in the browser: My student ID is 1380205, I am studying at Peking University.

The browser displays the correct information, but if we follow the crawler steps described earlier, analyzing the page source and extracting the text with a regular expression, we will end up with the wrong student ID.

Look at the example in Figure 8. Suppose we want to crawl the ticket information on this page; we first need to analyze it. The price 467 in the red box corresponds to the Civil Aviation of China ticket from Shijiazhuang to Shanghai, yet the page source contains 3 pairs of b tags. The 1st pair of b tags contains 3 pairs of i tags, and the digit in each i tag is 7, so on its own the 1st pair of b tags would display 777. With the digit 6 in the 2nd pair of b tags and the digit 4 in the 3rd pair, a straightforward regular-expression match cannot recover the correct fare.

    Figure 8. CSS offset anti-crawler example

    2.2 Image camouflage anti-crawler

Image camouflage anti-crawling, shown in Figure 9, essentially replaces the original text with an image so that a crawler cannot read it directly. The principle is simple: content that should be plain text is rendered as an image on the front-end page. In this case the text in the image can be recognized with OCR to bypass the protection, and because the image exists precisely to replace text it is usually clean and free of noise, so the OCR result is very accurate.

    Figure 9. Image camouflage anti-crawler example
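As a sketch of the OCR bypass just described, assuming the pillow and pytesseract packages with a locally installed Tesseract engine, and a hypothetical image file:

```python
# Minimal OCR sketch for the "image camouflage" case.
from PIL import Image
import pytesseract

# The image that replaces the text (e.g. a phone number rendered as a PNG); path is hypothetical
img = Image.open("phone_number.png")

# Such images are clean renderings with little noise, so plain OCR is usually accurate
text = pytesseract.image_to_string(img, config="--psm 7")  # psm 7: treat image as a single text line
print(text.strip())
```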

    2.3 Custom font anti-crawler

Since CSS3, developers have been able to specify fonts for a web page with @font-face: they place a font file of their choice on the web server and reference it from the CSS. When a user visits the page, the browser downloads the corresponding font to the user's machine and renders the text correctly. A crawler, however, has no access to the font's mapping relationship, so crawling the raw HTML directly yields no valid data.

As shown in Figure 10, the number of reviews, average spend per person, taste score, environment score and other fields of every store on the page appear as garbled characters that a crawler cannot read directly.

    Figure 10. Custom font anti-crawler example

    2.4 Page dynamic rendering anti-crawler

Web pages can be roughly divided into server-side rendered and client-side rendered, depending on how they are rendered.

Server-side rendering: the page is rendered by the server before being returned, so the valid information is already contained in the requested HTML and the data can be seen directly in the page source;

Client-side rendering: the main content of the page is rendered by JavaScript, and the real data is fetched through Ajax interfaces or similar means, so viewing the page source reveals no valid data.

The key difference between the two is who assembles the complete HTML file: if it is assembled on the server and then returned to the client, it is server-side rendering; if the front end does most of the work of stitching the HTML together, it is client-side rendering.

    Figure 11: Client-side rendering example

    2.5 Captcha anti-crawler

Almost every application pops up a captcha when an operation touches user-information security, to verify that the operation is performed by a human rather than a machine running at scale. Why does a captcha appear? In most cases it is because the site is being visited too frequently or the behavior looks abnormal, or simply to restrict certain automated actions outright. The situations fall roughly into the following categories:

In many scenarios, such as login and registration, captchas are almost unavoidable; their purpose is to limit malicious registration, brute-force attempts and similar behavior, and this is itself a form of anti-crawling.

Some websites, when they detect an unusually high access frequency, pop up a login window directly and require the visitor to log in before continuing, with the captcha bound to the login form; this is forcing login as an anti-crawling measure once an anomaly is detected.

More conventional websites, when the access frequency is only slightly high, proactively pop up a captcha for the visitor to solve and submit, to verify that the visitor is a real person and thereby restrict machine behavior and achieve anti-crawling.

Common captcha forms include graphic captchas, behavioral captchas, SMS codes and QR-code verification, as shown in Figure 12. Whether a captcha can be passed depends not only on completing the required click, selection or input accurately, but also, crucially, on passing the captcha's risk control. For a slider captcha, for example, the risk control may inspect the sliding trajectory; if the trajectory is judged not to be human, the attempt is rated high-risk and cannot pass.

    Figure 12: Captcha anti-crawler means

    2.6 Request signature check anti-crawler

Signature verification is one of the effective ways to protect a server against malicious requests and data tampering, and one of the most common protections for back-end APIs. A signature is computed or encrypted from the request data; for the same data a legitimate client always produces the same, unique string, which serves as its identity when accessing the server. Thanks to this consistency and uniqueness, the server can reject forged or tampered data before it ever enters normal processing.

The website discussed in section 2.4 renders the page on the client side and obtains its data through Ajax requests, which already makes crawling harder. Analyzing the Ajax request, as shown in Figure 13, we find that it carries a request signature, with analysis being the encrypted parameter; to call the interface directly, one would need to crack how this parameter is generated, which further increases the difficulty.

    Figure 13. An ajax request for list data
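The signing scheme used by the site in Figure 13 is unknown, so the following is only a generic illustration of the idea in this section: the client signs the sorted request parameters plus a timestamp with HMAC-SHA256, and the server recomputes the signature to reject forged or tampered data. The key, parameter names and algorithm are all assumptions.

```python
# Generic illustration of request signing (not the algorithm of any particular site).
import hashlib
import hmac
import time

SECRET_KEY = b"shared-secret"            # hypothetical key embedded in the client

def sign(params: dict) -> dict:
    params = dict(params, ts=str(int(time.time())))
    # Canonical string: parameters sorted by key, joined as k=v&k=v
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    params["sig"] = hmac.new(SECRET_KEY, canonical.encode(), hashlib.sha256).hexdigest()
    return params

def verify(params: dict) -> bool:
    received = params.pop("sig", "")
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    expected = hmac.new(SECRET_KEY, canonical.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(received, expected)

signed = sign({"city": "shanghai", "page": "1"})
print(verify(dict(signed)))   # True; changing any parameter breaks the signature
```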

2.7 Honeypot anti-crawler

A honeypot anti-crawler hides, inside the web page, links that are used to detect crawlers. The hidden links are not displayed on the page and a normal user can never visit them, but a crawler may add them to its queue and request them, and developers can use this difference to tell normal users and bots apart. As shown in Figure 14, the page displays only 6 products, yet the source contains 8 pairs of col-md-3 tags; CSS hides two of them, so a user sees only 6 items while a bot extracts the URLs of 8 products.

Figure 14: Honeypot anti-crawler example
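A tiny sketch of the honeypot idea under an assumed markup (the actual page from Figure 14 is not reproduced here): the hidden col-md-3 blocks are invisible to a real user, but a naive crawler extracts every link, including the trap, while a more careful crawler has to filter out links hidden by CSS.

```python
# Honeypot illustration with simplified, assumed markup.
from bs4 import BeautifulSoup

html = """
<div class="col-md-3"><a href="/product/1">visible product</a></div>
<div class="col-md-3" style="display: none"><a href="/trap/1">hidden trap</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Naive crawler: follows every link, including the trap the server uses to flag bots
all_links = [a["href"] for a in soup.find_all("a", href=True)]
print(all_links)                      # ['/product/1', '/trap/1']

# More careful crawler: skip links inside elements hidden by inline CSS
visible_links = [
    a["href"]
    for div in soup.select("div.col-md-3")
    if "display:none" not in (div.get("style") or "").replace(" ", "").lower()
    for a in div.find_all("a", href=True)
]
print(visible_links)                  # ['/product/1']
```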

III. Anti-anti-crawling techniques

For the anti-crawling techniques described in the previous section, the corresponding anti-anti-crawling techniques include CSS offset anti-anti-crawling, custom font anti-anti-crawling, dynamic page rendering anti-anti-crawling, captcha cracking and so on. They are introduced in detail below.

3.1 CSS offset anti-anti-crawling

    3.1.1 Introduction to CSS offset logic

So, for the CSS offset anti-crawler example in section 2.1, how do we obtain the correct ticket price? A closer look at the CSS shows that every tag containing a digit has a style: the i tag pairs inside the 1st pair of b tags all have width: 16px, and the outermost span tag pair has width: 48px.

Following the CSS styles, the 3 pairs of i tags in the 1st pair of b tags fill the three slots of the span tag pair, as shown in Figure 15, so at this point the page would display 777. But the 2nd and 3rd pairs of b tags also contain digits, and their positions have to be taken into account. The 2nd pair of b tags has the position style left: -32px, so its digit 6 overwrites the 2nd digit 7 from the 1st pair of b tags, and the page would now display 767.

By the same rule, the 3rd pair of b tags has the position style left: -48px, so its digit 4 overwrites the 1st digit 7, and the fare finally displayed is 467.

    Figure 15: Offset logic

3.1.2 CSS offset anti-anti-crawling code implementation

Next we write code that crawls the page according to the CSS rules above; the code and its result are shown in Figure 16.

Figure 16: CSS offset anti-anti-crawling code and its result
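Since Figure 16 is shown only as an image, here is a hedged sketch of what such deobfuscation code might look like. The HTML snippet is an assumption built from the structure described in sections 2.1 and 3.1.1 (three 16px-wide i digits inside the 1st b tag, plus two offset b digits), not the airline's real markup.

```python
# Sketch: recover the real fare from CSS-offset digits.
import re
from bs4 import BeautifulSoup

html = """
<span style="width: 48px">
  <b style="width: 48px"><i style="width: 16px">7</i><i style="width: 16px">7</i><i style="width: 16px">7</i></b>
  <b style="left: -32px">6</b>
  <b style="left: -48px">4</b>
</span>
"""
soup = BeautifulSoup(html, "html.parser")
b_tags = soup.find_all("b")

# Base digits: the <i> tags in the 1st <b> fill the three 16px slots in order -> "777"
digits = [i.get_text() for i in b_tags[0].find_all("i")]

# Each following <b> is shifted left by a multiple of 16px and overwrites one slot
for b in b_tags[1:]:
    offset_px = int(re.search(r"left:\s*(-?\d+)px", b["style"]).group(1))
    slot = len(digits) + offset_px // 16      # e.g. 3 + (-32 // 16) = 1
    digits[slot] = b.get_text()

print("".join(digits))                        # 467
```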

3.2 Custom font anti-anti-crawling

For the custom font anti-crawler described in section 2.3, the solution is to extract the custom font file (usually a WOFF file) referenced by the page and include its mapping relationship in the crawler code; with that mapping, valid data can be recovered. The steps are as follows (a code sketch follows the list):

Identify the problem: inspect the page source and see that key characters have been replaced by encodings such as &#xefbe

Analyze: examine the page and find that a CSS custom character set is used to hide the real characters

Look up: find the CSS file URL and obtain the URL of the character set it references, such as PingFangSC-Regular-num

Download: fetch the character-set file from that URL

Compare: compare the characters in the character set with the encodings in the page source, and find that the last four hex digits of each encoding correspond to a character; the taste score in the page source, for example, turns out to be 8.9
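A minimal sketch of the mapping step with fontTools, under several assumptions: the file name is the one mentioned above, and the font happens to name its glyphs after the real characters they draw. When the glyph names are meaningless, the mapping has to be built by visual comparison, as the steps above describe.

```python
# Sketch of decoding custom-font entities with fontTools (assumptions noted above).
from fontTools.ttLib import TTFont

font = TTFont("PingFangSC-Regular-num.woff")     # the character-set file found in the CSS
cmap = font["cmap"].getBestCmap()                # maps code points -> glyph names

# Lookup table for the case where glyph names reveal the real character
glyph_to_char = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9", "period": ".",
}

def decode(entities):
    """Translate codes like '&#xefbe;' scraped from the page source into readable text."""
    out = []
    for ent in entities:
        code_point = int(ent.strip("&#x;"), 16)
        glyph = cmap.get(code_point, "")
        out.append(glyph_to_char.get(glyph, "?"))
    return "".join(out)

print(decode(["&#xefbe;", "&#xe5d2;"]))          # hypothetical codes; output depends on the actual font
```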

3.3 Dynamic page rendering anti-anti-crawling

For client-side rendering anti-crawlers, the page content is not visible in the raw HTML source, so the crawler needs to render the page or otherwise obtain the post-render result. There are several ways to deal with this:

In the browser, use the developer tools to inspect the Ajax requests directly, including their methods and parameters;

Use selenium to simulate a real person operating the browser and obtain the rendered result; after that, the steps are the same as for server-side rendering (see the sketch after this list);

If the rendered data is embedded in a JS variable inside the HTML response, it can be extracted directly with a regular expression;

If there are encrypted parameters generated by JS, locate the code of the encryption part and use PyExecJS to execute the JS and return the result.
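A minimal sketch of the selenium option above, assuming a local Chrome/ChromeDriver installation and a hypothetical URL:

```python
# Render a client-side page with selenium and read the post-render DOM.
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # render without opening a visible browser window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/list")      # hypothetical client-rendered page
time.sleep(3)                               # crude wait for the JS/Ajax data to render;
                                            # waiting on a concrete element is better in practice

html = driver.page_source                   # the DOM after rendering, not the bare template
driver.quit()

# From here on, extraction works the same as for a server-rendered page
print(len(html))
```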

    3.4 Captcha cracking

The following example shows how a slider captcha can be recognized. Figure 17 shows the result of using an object-detection model to locate the gap in a slider captcha; this way of cracking the slider corresponds to simulating a real person. Interface-level cracking is not used here because, on the one hand, the encryption algorithm is hard to crack and, on the other, the algorithm may change every day, so the time cost of keeping the crack working would be high.

    Figure 17. Identify gaps in the slider verification code through the object detection model

    3.4.1 Crawl the slider captcha image

Because the object-detection model used, yolov5, is supervised, we need to crawl slider-captcha images, label them, and feed them into the model for training. Some captchas from one scenario were crawled by simulating a real person.

    Figure 18. Crawled slider verification code image
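A hedged sketch of that collection step with selenium; the URL and the element locator are assumptions about the target page, not a real interface:

```python
# Collect slider-captcha background images by driving a real browser.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
for i in range(100):
    driver.get("https://example.com/login")                   # hypothetical page with a slider captcha
    time.sleep(2)                                              # wait for the captcha widget to load
    captcha = driver.find_element(By.CSS_SELECTOR, "canvas.captcha-bg")  # hypothetical locator
    captcha.screenshot(f"captcha_{i:03d}.png")                 # save the rendered captcha image
driver.quit()
```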

    3.4.2 Manual marking

This time labelImg was used to label the images by hand. Manual labeling is time-consuming: 100 images generally take about 40 minutes. Writing automatic-labeling code is more complicated: it would need to extract all the background images and gap images of the captcha separately, randomly generate a gap position as the label, paste the gap image at that position, and generate a composite image as input.

    Figure 19. Tag the captcha image and the xml file generated after tagging

    3.4.3 Object detection model yolov5

Clone the official yolov5 code directly from GitHub; it is implemented in PyTorch.

    The next steps to use it are as follows:

Data format conversion: convert the manually labeled images and label files into the data format yolov5 accepts, giving 1,100 images and 1,100 labels in yolov5 format;

Create the dataset: create a custom.yaml file describing the dataset, including the directories of the training and validation sets, the number of classes and the class names;

Train and tune: after modifying the model configuration file and the training file, run training and tune the hyperparameters based on the results.

Part of the script that converts the xml files to yolov5 format:
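The original conversion script appears only as an image, so the following is a reconstructed sketch under assumed paths and a single "gap" class. It converts Pascal VOC xml annotations produced by labelImg into the yolov5 label format: one line per box, with class id, x center, y center, width and height, all normalized to the range 0 to 1.

```python
# Convert labelImg (Pascal VOC) xml annotations to yolov5 txt labels.
import glob
import os
import xml.etree.ElementTree as ET

classes = ["gap"]                                   # one class: the notch in the slider captcha

def convert(xml_path, out_dir):
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls_id = classes.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        x_c, y_c = (xmin + xmax) / 2 / w, (ymin + ymax) / 2 / h
        bw, bh = (xmax - xmin) / w, (ymax - ymin) / h
        lines.append(f"{cls_id} {x_c:.6f} {y_c:.6f} {bw:.6f} {bh:.6f}")
    name = os.path.splitext(os.path.basename(xml_path))[0]
    with open(os.path.join(out_dir, name + ".txt"), "w") as f:
        f.write("\n".join(lines))

os.makedirs("labels", exist_ok=True)                # assumed output directory
for xml_file in glob.glob("annotations/*.xml"):     # assumed annotation directory
    convert(xml_file, "labels")
```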

    Training parameter settings:

    3.4.4 Training results of the object detection model

After about 50 iterations the model's precision, recall and mAP had basically plateaued. The predictions also show some problems: most gaps are framed accurately, but a small number of boxes are wrong, frame two gaps, or frame no gap at all.

    Figure 20. Top: Chart of the training results of the model;

    Bottom: The model’s prediction for a partial validation set

    IV. Summary

This article has briefly introduced the technical means of crawlers and anti-crawlers. The technologies and cases presented are for security research and learning only and will not be used for large-scale crawling or applied to business.

For crawlers: when crawling public data on the network for analysis or similar purposes, we should respect the website's robots protocol and crawl in a way that does not affect the site's normal operation and complies with the law. For anti-crawlers: as long as a human can access a web page normally, a crawler with the same resources can certainly crawl it. The purpose of anti-crawling is therefore to keep crawlers from overloading the server while collecting site data in bulk, so that crawling does not degrade real users' experience and users remain satisfied with the service.
