REDUCE: a practical use case of Map-Reduce ideas in ABAP programming

ABAP is an enterprise application programming language. Its release 7.40, shipped in 2013, introduced many new syntax elements and keywords.

One of the highlights is the newly introduced REDUCE keyword. It behaves like the Reduce operation of the Map-Reduce programming model, which is widely used in parallel computing on large-scale datasets, and the keyword can be taken quite literally: it reduces a dataset to a smaller result.

What is the Map-Reduce idea?

Map-Reduce is a programming model and related implementation for generating and processing large-scale datasets using parallel distributed algorithms on a cluster.

A Map-Reduce program consists of a Map procedure and a Reduce procedure. The Map step performs filtering and sorting, such as sorting students by name into queues, one queue per name.

The Reduce step performs summary operations, such as counting the number of students in each queue. The Map-Reduce system orchestrates distributed servers to run tasks in parallel, manages all communication and data transfer between the parts of the system, and provides data redundancy for fault tolerance.

The following figure shows the working steps of the Map-Reduce framework for counting word occurrences in a massive input dataset (e.g., larger than 1 TB): Splitting, Mapping, Shuffling and Reducing produce the final output.
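As a rough, self-contained sketch (my own illustration, not part of the original figure), the word-count example can be expressed with ABAP 7.40 grouping expressions once the input has already been split into individual words:

```abap
REPORT zword_count_sketch.

TYPES: BEGIN OF ty_word_count,
         word  TYPE string,
         count TYPE i,
       END OF ty_word_count.
TYPES tt_word_count TYPE STANDARD TABLE OF ty_word_count WITH EMPTY KEY.

" "Splitting" is assumed done: the input arrives as a table of words
DATA(lt_words) = VALUE string_table(
  ( `deer` ) ( `bear` ) ( `river` )
  ( `car` )  ( `car` )  ( `river` )
  ( `deer` ) ( `car` )  ( `bear` ) ).

" "Mapping"/"Shuffling": FOR GROUPS collects equal words into groups;
" "Reducing": the inner REDUCE counts the members of each group
DATA(lt_counts) = VALUE tt_word_count(
  FOR GROUPS <word> OF <w> IN lt_words
      GROUP BY <w> ASCENDING
  ( word  = <word>
    count = REDUCE i( INIT n = 0
                      FOR m IN GROUP <word>
                      NEXT n = n + 1 ) ) ).
" lt_counts now holds bear = 2, car = 3, deer = 2, river = 2
```

Here the framework's distributed steps are of course collapsed into a single program; only the grouping-then-reducing structure of the model is illustrated.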

The Map-Reduce programming model has been widely used in tools and frameworks in the field of big data processing, such as Hadoop.

A practical application of Map-Reduce in a CRM system

Let's look at an actual task from the author's work. I needed to produce a statistic on a CRM test system: count the rows of the database table CRM_JSTO that share the same values in the OBTYP (Object Type) and STSMA (Status Schema) columns. You can compare this requirement with the repeated words counted in the figure above.

The following illustration shows some of the rows of the database table CRM_JSTO in the system:

The following figure shows the statistical result the author finally obtained:

The total number of rows in the database table on the test system exceeded 550,000. In 90,279 of them only OBTYP was maintained, with the value TGP, while STSMA was empty.

In second place was the combination of COH and CRMLEAD, which appeared 78,722 times.

How is the result in the figure above calculated?

Anyone who has done even a little ABAP development will immediately write code along the following lines:

Use SELECT COUNT to let the database layer do the aggregation. This is also SAP's recommended practice, the so-called code pushdown guideline: whatever can be computed at the HANA database level should be pushed down there, to take full advantage of HANA's computing power. As long as the database can perform the calculation logic, avoid placing it in the NetWeaver ABAP application layer.
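With the new Open SQL syntax available since ABAP 7.40, such a pushdown might look roughly like this (a sketch; the alias and result variable names are my own):

```abap
" Code pushdown: the HANA database groups and counts the rows,
" and only the aggregated result is transferred to the ABAP layer
SELECT obtyp, stsma, COUNT( * ) AS cnt
  FROM crm_jsto
  GROUP BY obtyp, stsma
  ORDER BY cnt DESCENDING
  INTO TABLE @DATA(lt_result).
```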

However, we also need to be aware of the limitations of this approach. An SAP CTO once famously said:

There is no future with ABAP alone; there is no future in SAP without ABAP.

The future of ABAP will move toward an open, interconnected path. Going back to the requirement itself: if the input data to be processed comes not from an ABAP database table but from an HTTP request, or from an IDoc sent by a third-party system, we can no longer use Open SQL's SELECT COUNT and must solve the problem in the ABAP application layer instead.

Here are two ways of accomplishing this in the ABAP programming language.

The first way is more traditional, implemented in the method get_result_traditional_way:

ABAP's LOOP AT ... GROUP BY keyword combination seems almost tailor-made for this requirement: give GROUP BY the two columns obtyp and stsma, and LOOP AT automatically groups the rows of the input internal table by the values of these two columns. The number of rows in each group is computed automatically by the GROUP SIZE addition, and the obtyp and stsma values of each group, together with the group's row count, are stored in the variable group_ref specified after REFERENCE INTO. All the ABAP developer needs to do is store these results in the output table.
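Based on this description, the body of get_result_traditional_way presumably looks something like the following sketch (the parameter names and the use of the result type are assumptions on my part, modeled on the REDUCE listing in this article):

```abap
METHOD get_result_traditional_way.
  " it_raw contains the CRM_JSTO rows to be grouped;
  " rt_result collects one line per ( obtyp, stsma ) combination
  LOOP AT it_raw INTO DATA(ls_raw)
       GROUP BY ( obtyp = ls_raw-obtyp
                  stsma = ls_raw-stsma
                  size  = GROUP SIZE )
       ASCENDING
       REFERENCE INTO DATA(group_ref).
    " GROUP SIZE has already counted the rows of this group for us
    APPEND VALUE #( obtyp = group_ref->obtyp
                    stsma = group_ref->stsma
                    count = group_ref->size ) TO rt_result.
  ENDLOOP.
ENDMETHOD.
```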

The second approach, as the title of this article suggests, uses the REDUCE keyword introduced in ABAP 7.40:

REPORT zreduce1.

DATA: lt_status TYPE TABLE OF crm_jsto.

SELECT * INTO TABLE lt_status FROM crm_jsto.

DATA(lo_tool) = NEW zcl_status_calc_tool( ).

lo_tool = REDUCE #(
    INIT o = lo_tool
         local_item = VALUE zcl_status_calc_tool=>ty_status_result( )
    FOR GROUPS <group_key> OF <wa> IN lt_status
        GROUP BY ( obtyp = <wa>-obtyp stsma = <wa>-stsma ) ASCENDING
    NEXT
      local_item = VALUE #(
          obtyp = <group_key>-obtyp
          stsma = <group_key>-stsma
          count = REDUCE i( INIT sum = 0
                            FOR m IN GROUP <group_key>
                            NEXT sum = sum + 1 ) )
      o = o->add_result( local_item ) ).

DATA(ls_result) = lo_tool->get_result( ).

The code above may seem a bit obscure at first glance, but on closer reading it turns out to use the same grouping strategy as LOOP AT ... GROUP BY: the rows are grouped by obtyp and stsma, each subgroup is identified by the group key <group_key>, and the inner REDUCE then counts the entries of each group by manual accumulation. A large input set is reduced, according to the conditions given in GROUP BY, to smaller subsets, which are then computed separately. This is exactly the processing idea that the REDUCE keyword conveys, quite literally, to the ABAP developer.

To summarize and compare the three implementations: when the data source to be counted is an ABAP database table, the Open SQL approach must be preferred, so that the calculation logic runs at the database layer for the best performance.

When the data source is not an ABAP database table and the grouped statistic is a simple count, use LOOP AT ... GROUP BY ... GROUP SIZE, so that the counting is done by the ABAP kernel via GROUP SIZE for better performance.

When the data source is not an ABAP database table and the grouped statistic requires custom logic, use the REDUCE solution, the third one described in this article, and write the custom statistical logic after the NEXT keyword.
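For instance, if the statistic were something other than a plain row count, only the expression after NEXT would need to change. A hypothetical variant (my own example, not from the article's requirement) that counts only the group members whose STSMA is actually maintained could replace the inner REDUCE like this:

```abap
" Fragment of the outer REDUCE expression; hypothetical custom logic:
" count only those rows of the group whose STSMA is non-initial
count = REDUCE i( INIT n = 0
                  FOR m IN GROUP <group_key>
                  NEXT n = COND #( WHEN m-stsma IS NOT INITIAL
                                   THEN n + 1
                                   ELSE n ) )
```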

Performance evaluation of the three solutions

I wrote a simple report for the performance evaluation:

DATA: lt_status TYPE zcl_status_calc_tool=>tt_raw_input.

SELECT * INTO TABLE lt_status FROM crm_jsto.

DATA(lo_tool) = NEW zcl_status_calc_tool( ).

zcl_abap_benchmark_tool=>start_timer( ).
DATA(lt_result1) = lo_tool->get_result_traditional_way( lt_status ).
zcl_abap_benchmark_tool=>stop_timer( ).

zcl_abap_benchmark_tool=>start_timer( ).
lo_tool = REDUCE #(
    INIT o = lo_tool
         local_item = VALUE zcl_status_calc_tool=>ty_status_result( )
    FOR GROUPS <group_key> OF <wa> IN lt_status
        GROUP BY ( obtyp = <wa>-obtyp stsma = <wa>-stsma ) ASCENDING
    NEXT
      local_item = VALUE #(
          obtyp = <group_key>-obtyp
          stsma = <group_key>-stsma
          count = REDUCE i( INIT sum = 0
                            FOR m IN GROUP <group_key>
                            NEXT sum = sum + 1 ) )
      o = o->add_result( local_item ) ).
DATA(lt_result2) = lo_tool->get_result( ).
zcl_abap_benchmark_tool=>stop_timer( ).

ASSERT lt_result1 = lt_result2.

The test data is as follows:

The performance of the three solutions decreases in the order listed above, while their range of application and flexibility increase in the same order.

On the ABAP test server I worked on, the LOOP AT ... GROUP BY ... GROUP SIZE solution processed the 550,000 records in 0.3 seconds, while REDUCE took 0.8 seconds; the performance of the two solutions is within the same order of magnitude.

Summary

Map-Reduce is a programming model and associated implementation for generating and processing large-scale datasets with parallel, distributed algorithms on a cluster. The ABAP programming language supports reduce operations on large datasets at the language level. This article shared a real-world example of how I used the Map-Reduce approach to process a large dataset in my work and compared it with two more traditional solutions. With performance comparable to the traditional solutions, the Map-Reduce-based solution offers a wider range of applications and better extensibility. I hope the content shared here is helpful when you tackle similar problems in ABAP; thank you for reading.