Editor-in-charge: Le Le | From: blog.csdn.net/weixin_44848900/article/details/117701981
In project development you often need to import and export data: importing means reading an Excel file into the DB, and exporting means querying data from the DB and writing it out to Excel with POI.
This article grew out of a big-data import/export task I ran into at work. Since the problem couldn't be avoided, it was better to kill it outright — once solved, the same problem stays solved.
No more preamble; let's get started!
When you think of importing and exporting data, you naturally think of Apache's POI and the question of Excel versions.
Since we want to import and export, let's first look at the traditional POI implementations and their advantages and disadvantages.
The interface we are most familiar with in POI is Workbook, and successive POI versions have shipped several implementation classes for it:
HSSFWorkbook: the implementation class we used most in the early days. It handles all Excel versions up to and including Excel 2003, when the file extension was still .xls.
XSSFWorkbook: still found in use in many companies today. It handles the Excel 2007 format onward, with the .xlsx extension.
SXSSFWorkbook: available since POI 3.8, it can operate on all Excel versions from Excel 2007 onward, with the .xlsx extension.
Now that we roughly know these three implementation classes and the Excel versions and extensions each one handles, let's analyze them in terms of advantages and disadvantages.
HSSFWorkbook is the most commonly used class in older POI code, though it keeps the whole workbook in memory and the .xls format caps each sheet at 65,536 rows:
Starting with POI version 3.8, SXSSF — a streaming variant built on XSSF — offers a low memory footprint by keeping only a sliding window of rows in memory and flushing the rest to disk:
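The streaming approach can be sketched as follows (a minimal sketch; the file name, row count, and window size of 100 are placeholder assumptions, not values from the article):

```java
import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class SxssfExportSketch {
    public static void main(String[] args) throws Exception {
        // Keep only 100 rows in memory; older rows are flushed to temp files on disk.
        SXSSFWorkbook workbook = new SXSSFWorkbook(100);
        try {
            Sheet sheet = workbook.createSheet("data");
            for (int i = 0; i < 100_000; i++) {
                Row row = sheet.createRow(i);
                row.createCell(0).setCellValue("row-" + i);
            }
            try (FileOutputStream out = new FileOutputStream("export.xlsx")) {
                workbook.write(out);
            }
        } finally {
            workbook.dispose(); // delete the temp files backing the flushed rows
        }
    }
}
```

This is why SXSSF trades random access for memory: rows outside the window have already been written out and can no longer be modified.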
Having understood the advantages and disadvantages of these three workbook types, which one to use depends on the situation.
I generally choose based on the following cases:
1. When the data we routinely import or export stays under 70,000 rows, we can use HSSFWorkbook or XSSFWorkbook;
2. When the data volume exceeds 70,000 rows and the exported Excel involves no styles, formulas, or formatting, SXSSFWorkbook is recommended;
3. When the data volume exceeds 70,000 rows and we also need headers, styles, formulas, and so on, we can use XSSFWorkbook combined with batch queries, writing to Excel batch by batch.
With the groundwork done, let's talk about the million-plus-row import and export I encountered at work.
To solve a problem, we must first understand the problem we are facing:
1. The data volume is huge; completing the import/export the traditional POI way will obviously overflow memory, and efficiency will be very low;
2. Fetching everything with `select * from tableName` is out of the question — pulling 3 million rows in one query will certainly be very slow;
3. Exporting 3 million rows into a single sheet is also out: efficiency would be terrible, and the file would likely take minutes just to open;
4. Exporting 3 million rows to Excel row by row is out too — that much frequent I/O is unacceptable;
5. On import, inserting 3 million rows into the DB one by one in a loop is a non-starter;
6. Using MyBatis batch insert for 3 million rows on import won't work either, because MyBatis's batch insert is really a SQL loop — just as slow.
The first problem is really memory overflow; we just need to use the POI approach described above, but solving it with raw POI is quite troublesome.
After digging through the material I found Alibaba's POI wrapper, EasyExcel, and with it the problem above was readily solved.
We can't query all the data at once, so we query in batches; that just raises the question of paginated queries, and plenty of paging plugins are available, so this is easily solved.
We can write the 3 million rows across multiple sheets, with up to one million rows per sheet.
We can't write to Excel one row at a time, so we write each batch of query results to Excel in one go.
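Paged queries plus multi-sheet batch writes can be sketched with EasyExcel roughly as follows (a hedged sketch: the `queryPage` helper, the headerless `List<String>` row shape, and the batch sizes are assumptions for illustration, not the article's actual code):

```java
import java.util.ArrayList;
import java.util.List;

import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.ExcelWriter;
import com.alibaba.excel.write.metadata.WriteSheet;

public class MultiSheetExport {

    static final int SHEETS = 3;          // 3,000,000 rows / 1,000,000 rows per sheet
    static final int PAGES_PER_SHEET = 5; // 1,000,000 rows per sheet / 200,000 rows per query

    public static void export(String fileName) {
        ExcelWriter excelWriter = EasyExcel.write(fileName).build();
        try {
            for (int s = 0; s < SHEETS; s++) {
                WriteSheet sheet = EasyExcel.writerSheet(s, "sheet" + (s + 1)).build();
                for (int p = 0; p < PAGES_PER_SHEET; p++) {
                    // One paged DB query per write: 15 queries -> 15 writes in total.
                    List<List<String>> page = queryPage(s * PAGES_PER_SHEET + p);
                    excelWriter.write(page, sheet); // successive writes append to the same sheet
                }
            }
        } finally {
            excelWriter.finish(); // flushes buffers and closes the underlying stream for us
        }
    }

    // Hypothetical paged query; in reality this would hit the DB with ROWNUM/LIMIT paging.
    static List<List<String>> queryPage(int pageNo) {
        return new ArrayList<>();
    }
}
```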
On import, we accumulate the rows read from Excel in a collection and, once it reaches a certain size, batch-insert them directly into the DB.
Instead of MyBatis, we use JDBC batch insert wrapped in a transaction to complete the bulk insert into the DB. In short: batched Excel reads + JDBC batch insert + manual transaction control.
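The JDBC batch insert with manual transaction control can be sketched like this (the table name, columns, and row shape are hypothetical; only the pattern — addBatch/executeBatch inside an explicit transaction — is the point):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class JdbcBatchInsert {

    // Insert one batch (e.g. the 200,000 rows buffered from an Excel read) in one transaction.
    public static void insertBatch(Connection conn, List<String[]> rows) throws Exception {
        String sql = "INSERT INTO t_user (name, phone) VALUES (?, ?)"; // hypothetical table
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // manual transaction control
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch(); // one round trip per batch instead of one per row
            conn.commit();
        } catch (Exception e) {
            conn.rollback(); // all-or-nothing per batch
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

For MySQL, adding `rewriteBatchedStatements=true` to the JDBC URL lets the driver rewrite the batch into multi-value INSERTs, which speeds this up considerably.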
GitHub address: https://github.com/alibaba/easyexcel
The tutorial and instructions there are very detailed and include read and write demo code, so I won't repeat them here.
How EasyExcel is implemented under the hood remains to be studied.
With EasyExcel chosen to complete the 3-million-row export, the technical difficulties are known; the next step is a solution for each of them.
Export solution for 3 million rows:
1. We need to calculate the number of sheets and the number of write loops — in particular, the number of rows written to the last sheet.
We don't know in advance how many rows the last sheet will hold: it might be 1,000,000 or 250,000, because the 3 million here is only simulated data, and a real export might have more or fewer rows.
2. We need to calculate the number of writes. Because we use paged queries, we have to keep track of how many writes we make.
In fact, each query of the database corresponds to one write to Excel.
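The counting above is simple ceiling arithmetic; a small sketch (the one-million-per-sheet and 200,000-per-query sizes come from the article, the class and method names are mine):

```java
public class ExportPlan {
    static final int SHEET_LIMIT = 1_000_000; // rows per sheet
    static final int PAGE_SIZE = 200_000;     // rows per paged query = rows per write

    // Number of sheets: ceiling of total / SHEET_LIMIT.
    public static int sheetCount(int total) {
        return (total + SHEET_LIMIT - 1) / SHEET_LIMIT;
    }

    // Rows that end up in the last sheet (it may not be full).
    public static int lastSheetRows(int total) {
        int rem = total % SHEET_LIMIT;
        return rem == 0 ? SHEET_LIMIT : rem;
    }

    // One paged query per write, so writes = ceiling of total / PAGE_SIZE.
    public static int writeCount(int total) {
        return (total + PAGE_SIZE - 1) / PAGE_SIZE;
    }
}
```

For 3,000,000 rows this gives 3 sheets and 15 queries/writes; for 2,250,000 rows it gives 3 sheets with 250,000 rows in the last one.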
The tester configuration is as follows.
The database is Oracle 19c. From what I found online, below about 100 million rows MySQL and Oracle actually perform similarly; only beyond that do Oracle's advantages become obvious across the board.
So the database's impact on timing can be ignored here — you can run the test with MySQL without installing Oracle separately.
For the query side of this test I simulate fetching the 3 million rows with ROWNUM-based paging, which is actually not very efficient, and there is a lot of room to speed the queries up.
For example, selecting specific columns instead of the asterisk and indexing frequently queried columns would improve query efficiency as much as possible, so the times could well be shorter.
Below is the time it took to export the 3 million rows from the DB to Excel.
As the results show, exporting 3 million rows took 2 minutes 15 seconds — and that is without using entities as the mapping; if entity mapping is used so rows need no per-loop wrapping, it would be faster (and this is also without setting headers or other table styles).
All in all, the speed is acceptable.
Searching online, I found one blogger's test that exported 1.02 million rows with EasyExcel in 105 seconds; see the link:
A look at the exported file: it is quite large, at 163 MB.
Testing confirms that EasyExcel is very fast and quite convenient to use; the author also provides a dedicated stream-closing method, so we don't have to close the stream manually — which also avoids the string of problems caused by forgetting to close it.
That's it for the export test. Data volumes under 3 million rows can be exported in a single sheet; I won't demonstrate that here.
The code is not the important part — first and foremost is the idea.
Import solution for 3 million rows:
1. First, read the 3 million rows from Excel in batches. EasyExcel has its own solution for this; we can refer to its demo and simply adjust the batch-size parameter from 3,000. I used 200,000 (the code is self-explanatory at a glance).
2. Next, insert into the DB. How do we insert these 200,000 rows? Certainly not in a loop — they should be batch-inserted, and again not with MyBatis batch insert, since its efficiency is also poor. You can refer to [MyBatis batch insert vs. JDBC batch insert performance comparison].
3. Use JDBC batch operations plus a transaction to insert the data into the database. (Batched reads + JDBC batch insert + manual transaction control.)
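The batched-read side can be sketched with an EasyExcel listener like this (a sketch under assumptions: the `UserRow` model and the flush body are hypothetical; the 200,000 buffer mirrors the batch size used in the test):

```java
import java.util.ArrayList;
import java.util.List;

import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.context.AnalysisContext;
import com.alibaba.excel.event.AnalysisEventListener;

public class UserImportListener extends AnalysisEventListener<UserRow> {

    private static final int BATCH_SIZE = 200_000;
    private final List<UserRow> buffer = new ArrayList<>(BATCH_SIZE);

    @Override
    public void invoke(UserRow row, AnalysisContext context) {
        buffer.add(row); // EasyExcel calls this once per parsed row
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    @Override
    public void doAfterAllAnalysed(AnalysisContext context) {
        flush(); // insert whatever is left after the last full batch
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // JDBC batch insert inside a manually controlled transaction goes here.
        buffer.clear();
    }
}

// Usage sketch:
// EasyExcel.read("import.xlsx", UserRow.class, new UserImportListener()).sheet().doRead();
```

Because the listener buffers at most one batch at a time, memory stays flat no matter how large the Excel file is.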
As shown in the figure
Here are the timings for reading and writing the 3 million rows:
Total time from the start of reading, through the interleaved batch imports, to the end of the program: (1623127964725 − 1623127873630) / 1000 = 91.095 seconds.
The 3 million rows were inserted in exactly 15 batches, totaling 8,209 milliseconds, i.e. 8.209 seconds.
So the pure read time for the 3 million rows works out to 91.095 − 8.209 = 82.886 seconds.
The results are clear:
EasyExcel read the 3 million rows in batches in just 82.886 seconds.
JDBC batching plus transactions inserted the 3 million rows in a combined 8.209 seconds.
Let's verify that the 3 million rows really landed in the database.
As you can see, the row count is 3 million higher than before the import — the test succeeded.
I haven't looked closely at other people's benchmarks — few people bother to test this — but this speed was more than enough to solve my company's big-data import/export at the time. Of course, the company's business logic is very complex, the data volumes larger, and the tables wider, so real imports and exports run somewhat slower than this test, yet still within acceptable bounds.
The problems I hit in this piece of work left a deep impression on me and added to my career.
At the very least, my résumé can now say I've handled imports and exports of millions of rows.
Finally, a word on how the company did it before.
The previous approach was to limit concurrent downloads — at most four users downloading at a time — and to cap each user's export at 200,000 rows. They also used JDBC batch import, but without manual transaction control.
Limiting the number of simultaneous downloaders I can understand, but capping exports at 200,000 rows seems rather half-baked.
That is also the problem I wanted to solve later.
Well, at this point I'm sure many readers have done this better than me; how EasyExcel works internally remains to be studied (I'll dig into it when I have time).