The problem with reading large files
A large file is any plain text or binary file that is too big to fit in JVM memory at once. For example, if a Java application is allocated 256 MB of memory and tries to load a file whose size is close to or larger than that, it may throw an OutOfMemoryError.

Points to be remembered
- Never read the whole file at once.
- Read the file line by line or in chunks, e.g. a few lines at a time from a text file or a few bytes at a time from a binary file.
- Do not keep all the data in memory, e.g. by reading every line and holding it as one string.
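The points above can be sketched as a minimal line-by-line reader that never holds more than one line in memory. The class name, temporary file, and line contents here are illustrative assumptions, not part of the article's application:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Minimal sketch: count lines without ever loading the whole file.
public class ChunkedRead {

    // Each line is read, used, and discarded before the next one is read.
    static long countLines(Path path) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(path)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Create a small temporary file so the sketch is self-contained.
        Path tmp = Files.createTempFile("sample", ".txt");
        Files.write(tmp, List.of("line1", "line2", "line3"));
        System.out.println(countLines(tmp)); // prints 3
        Files.delete(tmp);
    }
}
```

The same idea scales to the 500 MB file below: memory use stays flat no matter how large the file grows, because only the current line is ever held.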
CSV file
In this example I am going to read a CSV file that is around 500 MB in size. A sample is given below.
```
"year_month","month_of_release","passenger_type","direction","citizenship","visa","country_of_residence","estimate","standard_error","status"
"2018-06","2019-08","Long-term migrant","Arrivals","non-NZ","Student","Andorra",1,0,"Provisional"
"2019-06","2019-08","Long-term migrant","Arrivals","non-NZ","Student","Andorra",0,0,"Provisional"
```
File reading and counting year-wise
We will read the CSV file and count rows per year using the first column. We will do this in two different ways: synchronously, and asynchronously using CompletableFuture. The code follows in the next sections.

Instance Variables
```java
private final long mb = 1024 * 1024;
private final String file = "/Users/Downloads/sample.csv";
```
Common Methods
The method below collects the year-wise count in a map.

```java
public void yearCount(String line, Map<String, Integer> countMap) {
    String key = line.substring(1, 5); // e.g. "2018" from "\"2018-06\",..."
    if (countMap.containsKey(key)) {
        countMap.put(key, countMap.get(key) + 1);
    } else {
        countMap.put(key, 1);
    }
}
```
The method below is annotated with @EventListener so that it is invoked automatically once the application is up and ready. It also measures memory consumption and execution time.
```java
@EventListener(ApplicationReadyEvent.class)
public void testLargeFile() throws Exception {
    long premem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    long start = System.currentTimeMillis();
    System.out.println("Used memory pre run (MB): " + (premem / mb));
    // PLEASE UNCOMMENT ONE OF THE TWO LINES BELOW AT A TIME TO TEST
    // THE DESIRED FUNCTIONALITY
    // System.out.println("Year count: " + simpleYearCount(file)); // process file synchronously
    // System.out.println("Year count: " + asyncYearCount(file));  // process file asynchronously
    long postmem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    System.out.println("Used memory post run (MB): " + (postmem / mb));
    System.out.println("Memory consumed (MB): " + (postmem - premem) / mb);
    System.out.println("Time taken in MS: " + (System.currentTimeMillis() - start));
}
```
Synchronous processing
Below is the code that reads the file using the NIO API and computes the year count synchronously. The code is short and simple. Note that the stream returned by Files.lines holds an open file handle, so it is wrapped in try-with-resources.

```java
public Map<String, Integer> simpleYearCount(String file) throws IOException {
    Map<String, Integer> yearCountMap = new HashMap<>();
    try (Stream<String> lines = Files.lines(Paths.get(file))) {
        lines.skip(1) // skip the header line
             .forEach(s -> yearCount(s, yearCountMap));
    }
    return yearCountMap;
}
```
Output
```
Used memory pre run (MB): 41
Year count: {2019=1178560, 2018=2775136, 2017=559632, 2016=250144, 2015=248192, 2014=144656}
Used memory post run (MB): 304
Memory consumed (MB): 262
Time taken in MS: 1971
```
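As a side note, the same synchronous count can also be expressed with Collectors.groupingBy, which removes the manual map bookkeeping. This is an alternative sketch, not the article's code: the class name and sample lines are made up, the substring(1, 5) key extraction mirrors yearCount above, and the counts come out as Long rather than Integer.

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Alternative sketch of the synchronous count using Collectors.groupingBy.
// With a real file you would pass Files.lines(Paths.get(file)) as the stream.
public class GroupingYearCount {

    static Map<String, Long> yearCount(Stream<String> lines) {
        return lines.skip(1) // skip the header line
                    .collect(Collectors.groupingBy(
                            line -> line.substring(1, 5), // "\"2018-06\",..." -> "2018"
                            Collectors.counting()));
    }

    public static void main(String[] args) {
        Stream<String> sample = Stream.of(
                "\"year_month\",\"month_of_release\"", // header row
                "\"2018-06\",\"2019-08\"",
                "\"2018-07\",\"2019-08\"",
                "\"2019-06\",\"2019-08\"");
        System.out.println(yearCount(sample)); // counts per year
    }
}
```

Because groupingBy still processes the stream one line at a time, the memory profile stays comparable to the loop-based version.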
Asynchronous processing
Here we read the file using the NIO API and then process it asynchronously using CompletableFuture: we read 10,000 lines, hand them off for asynchronous processing, then read the next 10,000, and so on. See the code below.

```java
public Map<String, Integer> asyncYearCount(String file)
        throws IOException, InterruptedException, ExecutionException {
    try {
        List<CompletableFuture<Map<String, Integer>>> futures = new ArrayList<>();
        List<String> items = new ArrayList<>();
        try (Stream<String> lines = Files.lines(Paths.get(file))) {
            lines.skip(1) // skip the header line
                 .forEach(line -> {
                     items.add(line);
                     if (items.size() % 10000 == 0) {
                         // add a completable task for each batch of 10000 rows
                         futures.add(CompletableFuture.supplyAsync(
                                 yearCountSupplier(new ArrayList<>(items), new HashMap<>())));
                         items.clear();
                     }
                 });
        }
        if (!items.isEmpty()) {
            // add a completable task for the remaining rows
            futures.add(CompletableFuture.supplyAsync(
                    yearCountSupplier(items, new HashMap<>())));
        }
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[futures.size()]))
                .thenApply($ -> {
                    // join all tasks to collect the results once every task has completed
                    return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
                })
                .thenApply(maps -> {
                    Map<String, Integer> yearCountMap = new HashMap<>();
                    maps.forEach(map -> {
                        // merge the results of all the tasks
                        map.forEach((key, val) -> {
                            if (yearCountMap.containsKey(key)) {
                                yearCountMap.put(key, yearCountMap.get(key) + val);
                            } else {
                                yearCountMap.put(key, val);
                            }
                        });
                    });
                    return yearCountMap;
                })
                .get();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return new HashMap<>();
}
```
```java
// Supplier method to count the years in the given rows
public Supplier<Map<String, Integer>> yearCountSupplier(List<String> items, Map<String, Integer> map) {
    return () -> {
        items.forEach(line -> yearCount(line, map));
        return map;
    };
}
```
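The merging step in asyncYearCount, where each task's partial map is folded into one total, can be condensed with Map.merge. A minimal standalone sketch of just that step, using made-up partial results in place of the real CompletableFuture outputs:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the merge step from asyncYearCount, condensed with Map.merge:
// each partial map produced by a task is folded into one total map.
public class MergeCounts {

    static Map<String, Integer> mergeAll(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            // merge() adds val to an existing count, or inserts it if the key is new
            partial.forEach((key, val) -> total.merge(key, val, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = Map.of("2018", 2, "2019", 1); // hypothetical partial result
        Map<String, Integer> b = Map.of("2018", 1);            // hypothetical partial result
        System.out.println(mergeAll(List.of(a, b)));
    }
}
```

Map.merge performs the same containsKey/put logic as the code above in a single call, which makes the intent of the reduction step easier to see.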
Output
```
Used memory pre run (MB): 120
Year count: {2019=1178560, 2018=2775136, 2017=559632, 2016=250144, 2015=248192, 2014=144656}
Used memory post run (MB): 262
Memory consumed (MB): 142
Time taken in MS: 1549
```
Conclusion
We have now seen how to read and process a huge file, both synchronously and asynchronously. Comparing the two outputs shows the difference in memory consumption and execution time: the async execution is faster, but it may use more memory because multiple threads are processing at the same time. Async execution pays off most on heavier files, where the difference becomes significant.

If you have limited memory, go with synchronous execution; otherwise use async execution for better performance. You can also use async execution with less memory, but the benefit shrinks because of the small chunks and the large number of threads involved.