Skip to main content

How to read a large file in Java

Problem with reading large file

Large file can be any plain text or binary file which is huge in size and can not fit in JVM memory at once. For example if a java application allocated with 256 MB memory and it tries to load a file completely which is more or close to that memory in size then it may throw out of memory error.

Points to be remembered

  • Never read the whole file at once.
  • Read file line by line or in chunks, like reading few lines from text file or reading few bytes from binary file.
  • Do not store the whole data in memory, like reading all lines and keeping as string. 
Java has many ways to read the file in chunks or line by line, like BufferedReader, Scanner, BufferedInputStream, NIO API. We will use NIO API to read the file and Java stream to process it. We will also see how to span the processing with multiple threads to make the processing faster.

CSV file

In this example I am going to read a CSV file which is around 500 MB in size.  Sample is as given below.
"year_month","month_of_release","passenger_type","direction","citizenship","visa","country_of_residence","estimate","standard_error","status"
"2018-06","2019-08","Long-term migrant","Arrivals","non-NZ","Student","Andorra",1,0,"Provisional"
"2019-06","2019-08","Long-term migrant","Arrivals","non-NZ","Student","Andorra",0,0,"Provisional"

File reading and counting year wise

We will read CSV file and provide the count year wise using the first column in this CSV file. We will see it in two different ways, one is synchronous way and another is asynchronous way using CompletableFuture. We will see the code in next sections.

Instance Variables

    private final long mb = 1024*1024;
    private final String file = "/Users/Downloads/sample.csv";

Common Methods

Below method is used to collect the count year wise in map.
    public void yearCount(String line, Map<String, Integer> countMap){
        String key = line.substring(1, 5);
        if(countMap.containsKey(key)) {
            countMap.put(key, countMap.get(key)+1);
        }else
            countMap.put(key, 1);
    }

Below method I have annotated with EventListener to get it invoked itself when application is up and ready. This method also calculates memory consumption and execution time.
    @EventListener(ApplicationReadyEvent.class)
    public void testLargeFile() throws Exception{
        long premem = Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory();
        long start = System.currentTimeMillis();

        System.out.println("Used memory pre run (MB): "+(premem/mb));
        //PLEASE UNCOMMENT THE 1 LINE OUT OF BELOW 2 LINES AT A TIME TO TEST
        //THE DESIRED FUNCTIONALITY
//        System.out.println("Year count: "+simpleYearCount(file));//process file synchronously and print details
//        System.out.println("Year count: "+asyncYearCount(file));//process file asynchronously and print details
        
        long postmem = Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory();

        System.out.println("Used memory post run (MB): "+(postmem/mb));
        System.out.println("Memory consumed (MB): "+(postmem-premem)/mb);
        System.out.println("Time taken in MS: "+(System.currentTimeMillis()-start));
    }

Synchronous processing 

Below is the code which reads the file using NIO API and calculates year count synchronously. Code looks pretty simple and small.
    public Map<String, Integer> simpleYearCount(String file) throws IOException {
        Map<String, Integer> yearCountMap = new HashMap<>();
        
        Files.lines(Paths.get(file))
        .skip(1)//skip first line
        .forEach((s)->{
                yearCount(s, yearCountMap);
        });

        return yearCountMap;
    }

Output

Used memory pre run (MB): 41
Year count: {2019=1178560, 2018=2775136, 2017=559632, 2016=250144, 2015=248192, 2014=144656}
Used memory post run (MB): 304
Memory consumed (MB): 262
Time taken in MS: 1971

Asynchronous processing 

Here we are going to read the file using NIO API and then will process it asynchronously using CompletableFuture. For example will read 10000 lines and then process them asynchronously, then next 5000 and so on. See the below code.
    public Map<String, Integer> asyncYearCount(String file) throws IOException, InterruptedException, ExecutionException {
        try {
            List<CompletableFuture<Map<String, Integer>>> futures = new ArrayList<>();
            
            List<String> items = new ArrayList<>();
            Files.lines(Paths.get(file))
            .skip(1)//skip first line
            .forEach(line->{
                items.add(line);
                if(items.size()%10000==0) {
                    //add completable task for each of 10000 rows
                    futures.add(CompletableFuture.supplyAsync(yearCountSupplier(new ArrayList<>(items), new HashMap<>())));
                    items.clear();
                }
            });
            if(items.size()>0) {
                //add completable task for remaining rows
                futures.add(CompletableFuture.supplyAsync(yearCountSupplier(items, new HashMap<>())));
            }
            return CompletableFuture.allOf(futures.toArray(new CompletableFuture[futures.size()]))
                .thenApply($->{
                    //join all task to collect result after all tasks completed
                    return futures.stream().map(ftr->ftr.join()).collect(Collectors.toList());
                })
                .thenApply(maps->{
                    Map<String, Integer> yearCountMap = new HashMap<>();
                    maps.forEach(map->{
                        //merge the result of all the tasks
                        map.forEach((key, val)->{
                            if(yearCountMap.containsKey(key)) {
                                yearCountMap.put(key, yearCountMap.get(key)+val);
                            }else
                                yearCountMap.put(key, val);
                        });
                    });
                    return yearCountMap;
                })
                .get();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return new HashMap<>();
    }
    //Supplier method to count the year in given rows
    public Supplier<Map<String, Integer>> yearCountSupplier(List<String> items, Map<String, Integer> map){
        return ()->{
            items.forEach((line)->{
                yearCount(line,map);
            });
            return map;
        };
    }

Output

Used memory pre run (MB): 120
Year count: {2019=1178560, 2018=2775136, 2017=559632, 2016=250144, 2015=248192, 2014=144656}
Used memory post run (MB): 262
Memory consumed (MB): 142
Time taken in MS: 1549

Conclusion

Now we have seen how to read and process the huge file. We also learnt to do it synchronously and asynchronously. If we compare the output of both execution we can notice the memory consumption and execution time difference which says that async execution is faster however it may use more memory as multiple threads are processing at same time. Async execution may be more useful when you have more heavy files then difference is significant.
I would suggest if you have less memory then go with synchronous execution otherwise use async execution for better performance. You may use async execution also with less memory but it may not be that much beneficial due to small chunks and too many threads. 

Comments

  1. i'm dazzled. I don't assume Ive met each body who knows as an incredible arrangement simply extra or considerably less this present circumstance as you reach. you are in goal of truth handily proficient and colossally sparkling. You thought of something that individuals ought to get and made the trouble energizing for us all. absolutely, satisfying website you have came.. Razer Surround Pro Key Generator

    ReplyDelete
  2. basically I experience your page besides practicing and helpful appraisal. it is defended colossally advantageous calculation when a combination of our assets.thanks for part. I participate in this make perceived. Tally GsT Crack

    ReplyDelete

Post a Comment

Popular Posts

Setting up kerberos in Mac OS X

Kerberos in MAC OS X Kerberos authentication allows the computers in same domain network to authenticate certain services with prompting the user for credentials. MAC OS X comes with Heimdal Kerberos which is an alternate implementation of the kerberos and uses LDAP as identity management database. Here we are going to learn how to setup a kerberos on MAC OS X which we will configure latter in our application. Installing Kerberos In MAC we can use Homebrew for installing any software package. Homebrew makes it very easy to install the kerberos by just executing a simple command as given below. brew install krb5 Once installation is complete, we need to set the below export commands in user's profile which will make the kerberos utility commands and compiler available to execute from anywhere. Open user's bash profile: vi ~/.bash_profile Add below lines: export PATH=/usr/local/opt/krb5/bin:$PATH export PATH=/usr/local/opt/krb5/sbin:$PATH export LDFLAGS=&

Why HashMap key should be immutable in java

HashMap is used to store the data in key, value pair where key is unique and value can be store or retrieve using the key. Any class can be a candidate for the map key if it follows below rules. 1. Overrides hashcode() and equals() method.   Map stores the data using hashcode() and equals() method from key. To store a value against a given key, map first calls key's hashcode() and then uses it to calculate the index position in backed array by applying some hashing function. For each index position it has a bucket which is a LinkedList and changed to Node from java 8. Then it will iterate through all the element and will check the equality with key by calling it's equals() method if a match is found, it will update the value with the new value otherwise it will add the new entry with given key and value. In the same way it check for the existing key when get() is called. If it finds a match for given key in the bucket with given hashcode(), it will return the value other

Entity to DTO conversion in Java using Jackson

It's very common to have the DTO class for a given entity in any application. When persisting data, we use entity objects and when we need to provide the data to end user/application we use DTO class. Due to this we may need to have similar properties on DTO class as we have in our Entity class and to share the data we populate DTO objects using entity objects. To do this we may need to call getter on entity and then setter on DTO for the same data which increases number of code line. Also if number of DTOs are high then we need to write lot of code to just get and set the values or vice-versa. To overcome this problem we are going to use Jackson API and will see how to do it with minimal code only. Maven dependency <dependency> <groupId>com.fasterxml.jackson.core</groupId> <artifactId>jackson-databind</artifactId> <version>2.9.9</version> </dependency> Entity class Below is

Multiple data source with Spring boot, batch and cloud task

Here we will see how we can configure different datasource for application and batch. By default, Spring batch stores the job details and execution details in database. If separate data source is not configured for spring batch then it will use the available data source in your application if configured and create batch related tables there. Which may be the unwanted burden on application database and we would like to configure separate database for spring batch. To overcome this situation we will configure the different datasource for spring batch using in-memory database, since we don't want to store batch job details permanently. Other thing is the configuration of  spring cloud task in case of multiple datasource and it must point to the same data source which is pointed by spring batch. In below sections, we will se how to configure application, batch and cloud task related data sources. Application Data Source Define the data source in application properties or yml con