I need to read a file one character at a time and I'm using the read() method from BufferedReader. *

I found that read() is about 10x slower than readLine(). Is this expected? Or am I doing something wrong?

Here's a benchmark with Java 7. The input test file has about 5 million lines and 254 million characters (~242 MB) **:

The read() method takes about 7000 ms to read all the characters:

@Test
public void testRead() throws IOException, UnindexableFastaFileException{

    BufferedReader fa= new BufferedReader(new FileReader(new File("chr1.fa")));

    long t0= System.currentTimeMillis();
    int c;
    while( (c = fa.read()) != -1 ){
        //
    }
    long t1= System.currentTimeMillis();
    System.err.println(t1-t0); // ~ 7000 ms

}

The readLine() method takes only ~700 ms:

@Test
public void testReadLine() throws IOException{

    BufferedReader fa= new BufferedReader(new FileReader(new File("chr1.fa")));

    String line;
    long t0= System.currentTimeMillis();
    while( (line = fa.readLine()) != null ){
        //
    }
    long t1= System.currentTimeMillis();
    System.err.println(t1-t0); // ~ 700 ms
}

* More specifically, I need to know the length of each line, including the newline characters \n or \r\n. Since EOL chars are not returned by BufferedReader.readLine(), I'm resorting to the read() method.
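
To make that concrete, here is a rough sketch of the per-line accounting I'm after (variable names are illustrative, not my actual indexing code; fa is the BufferedReader from the benchmarks above):

int c;
long lineLength = 0;
while ((c = fa.read()) != -1) {
    lineLength++;              // every character counts, \r and \n included
    if (c == '\n') {
        // record lineLength for the index here, then start the next line
        lineLength = 0;
    }
}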

** The gzipped file is here http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chr1.fa.gz. For those who may be wondering, I'm writing a class to index fasta files.

Please read up on how to write accurate Java benchmarks. – Louis Wasserman 5 hours ago
@Louis Wasserman Admittedly I didn't care too much about being accurate in my benchmarks. JUnit and currentTimeMillis() are not ideal, but I figured that an 8-10x time difference on a fairly big file is large enough to ask the question. – dariober 5 hours ago
You're asking why reading a file a character at a time is slower than reading it a line at a time? It's because you're reading it a character at a time instead of a line at a time. – pvg 5 hours ago
@davmac Calls are not free. The call ratio is 51:1. – pvg 5 hours ago
After a quick check: The test is probably (!) not only flawed, I assume it is totally flawed. Try running the readLine test before the read test and see whether the timings are different. This might just be related to HDD caches or the JIT (for me, the time difference on an old, slow HDD is 1:7 during the first run, but about 1:2 in subsequent runs). So in fact, try running testRead();testReadLine();testRead();testReadLine();testRead();testReadLine(); and tell us about the results... – Marco13 4 hours ago

The important thing when analyzing performance is to have a valid benchmark before you start. So let's begin with a simple JMH benchmark that shows the expected performance after warmup.

One thing we have to consider: since modern operating systems like to cache file data that is accessed regularly, we need some way to clear the caches between tests. On Windows there's a small utility that does just this (EmptyStandbyList.exe, used below); on Linux you can do it by writing to the /proc/sys/vm/drop_caches pseudo-file.
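
For reference, a hypothetical Linux equivalent of the cache-clearing helper in the benchmark below could shell out to that pseudo-file; this is a sketch under the assumption that you run as root, not something measured here:

private void clearFileCachesLinux() throws IOException, InterruptedException {
    // sync flushes dirty pages; writing 3 to drop_caches then evicts the
    // page cache, dentries and inodes. Requires root privileges.
    ProcessBuilder pb = new ProcessBuilder(
            "sh", "-c", "sync && echo 3 > /proc/sys/vm/drop_caches");
    pb.inheritIO();
    pb.start().waitFor();
}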

The code then looks as follows:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

@BenchmarkMode(Mode.AverageTime)
@Fork(1)
public class IoPerformanceBenchmark {
    private static final String FILE_PATH = "test.fa";

    @Benchmark
    public int readTest() throws IOException, InterruptedException {
        clearFileCaches();
        int result = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
            int value;
            while ((value = reader.read()) != -1) {
                result += value;
            }
        }
        return result;
    }

    @Benchmark
    public int readLineTest() throws IOException, InterruptedException {
        clearFileCaches();
        int result = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result += line.chars().sum();
            }
        }
        return result;
    }

    private void clearFileCaches() throws IOException, InterruptedException {
        // EmptyStandbyList.exe (a Windows utility) flushes the standby list,
        // so each invocation reads from disk instead of the OS file cache.
        ProcessBuilder pb = new ProcessBuilder("EmptyStandbyList.exe", "standbylist");
        pb.inheritIO();
        pb.start().waitFor();
    }
}

and if we run it with

chcp 65001 # set codepage to utf-8
mvn clean install; java "-Dfile.encoding=UTF-8" -server -jar .\target\benchmarks.jar

we get the following results (it takes about 2 seconds to clear the caches for me, and I'm running this on an HDD, which is why it's a good deal slower than for you):

Benchmark                            Mode  Cnt  Score   Error  Units
IoPerformanceBenchmark.readLineTest  avgt   20  3.749 ± 0.039   s/op
IoPerformanceBenchmark.readTest      avgt   20  3.745 ± 0.023   s/op

Surprise! As expected, there's no performance difference here at all once the JVM has settled into a stable state. But there is one outlier in the readTest method:

# Warmup Iteration   1: 6.186 s/op
# Warmup Iteration   2: 3.744 s/op

which is exactly the problem you're seeing. The most likely reason I can think of is that OSR isn't doing a good job here, or that the JIT kicks in too late to make a difference on the first iteration.

Depending on your use case this might be a big problem or negligible (if you're reading a thousand files it won't matter, if you're only reading one this is a problem).

Solving such a problem is not easy and there is no general solution, although there are ways to handle it. One easy test to see whether we're on the right track is to run the code with the -Xcomp option, which forces HotSpot to compile every method on its first invocation. And indeed, doing so makes the large delay at the first invocation disappear:
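
For example, reusing the invocation from above, the flag is simply added to the java command line:

java -Xcomp "-Dfile.encoding=UTF-8" -server -jar .\target\benchmarks.jar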

# Warmup Iteration   1: 3.965 s/op
# Warmup Iteration   2: 3.753 s/op
Good news! I was able to reproduce your results, but I think the noise introduced makes them largely invalid - you're measuring the cache clearing and you've added non-equivalent processing that overwhelms the thing actually being measured. I have two general criticisms - one (the more 'opinion-based' one) is that this is not really a microbenchmark, so the methodology itself is non-representative. The other is that, even if we accept the methodology, it's not hard to come up with performance differences between 50% and 300% - i.e. these specific measurements are non-representative. – pvg 3 mins ago
    
I'll try to write up my results tomorrow and post them. – pvg 3 mins ago

The Java JIT optimizes away empty loop bodies, so your loops actually look like this:

while((c = fa.read()) != -1);

and

while((line = fa.readLine()) != null);

I suggest you read up on Java benchmarking and on how the JIT optimizes loops.
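
On that note, a common guard against dead-code elimination in a hand-rolled benchmark is to make the loop's work observable. A minimal sketch (file name taken from the question, not a full benchmark harness):

long sum = 0;
try (BufferedReader fa = new BufferedReader(new FileReader("chr1.fa"))) {
    int c;
    while ((c = fa.read()) != -1) {
        sum += c; // accumulate so the reads cannot be treated as dead code
    }
}
System.err.println(sum); // printing the result keeps the loop's work live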


As to why the time taken differs:

  • Reason one (this only applies if the bodies of the loops contain code): in the first example you're doing one operation per character, and in the second one operation per line. This adds up the more characters/lines you have.

    while((c = fa.read()) != -1){
        //One operation per character.
    }
    
    while((line = fa.readLine()) != null){
        //One operation per line.
    }
    
  • Reason two: In the class BufferedReader, the method readLine() doesn't use read() behind the scenes - it uses its own code. readLine() performs fewer operations per character to read a line than it would take to read the same line via repeated read() calls - this is why readLine() is faster at reading an entire file.

  • Reason three: It takes far more iterations to read a file character by character than line by line (unless each line contains a single character); for the file in the question, read() is called roughly 51 times for every readLine() call (~254 million characters vs. ~5 million lines).

If Java optimized away these loops, there would be no timing difference. – pvg 4 hours ago
    
@pvg Please see the edit. read() and readLine() read the file differently. And they're still being called in the loops. – Luke Melaia 4 hours ago
    
I don't think the empty loop matters. I put if(line.contains(">")){ System.out.println(line); } inside the loop of the readLine() test and if(c == '>'){ System.out.println(c); }; inside the read(). Results stay the same. – dariober 4 hours ago
    
The empty loop (actually, a loop in general) is only part of the problem. – Luke Melaia 4 hours ago
I think you need to be more careful with terminology. To say a loop is "optimized away" normally means the entire loop, termination condition check included. The code in the OP's question has an empty body, but the loop has a side effect that cannot be optimized away without changing the semantics of the code. It is meaningless to say that the "loop's body is optimized away" since the loop's body is empty to begin with; there is nothing to optimize away. – davmac 4 hours ago

It is not surprising to see this difference if you think about it. One test iterates over the lines of a text file, while the other iterates over its characters.

Unless each line contains just one character, readLine() is expected to be way faster than read() (although, as pointed out in the comments above, this is arguable, since a BufferedReader buffers the input and the physical file read may not be the only expensive operation).

If you really want to test the difference between the two, I would suggest a setup where you iterate over each character in both tests. E.g. something like:

void readTest(BufferedReader r) throws IOException
{
    int c;
    StringBuilder b = new StringBuilder();
    while((c = r.read()) != -1)
        b.append((char)c);
}

void readLineTest(BufferedReader r) throws IOException
{
    String line;
    StringBuilder b = new StringBuilder();
    while((line = r.readLine()) != null)       // was b.readLine(); readLine() belongs to the reader
        for(int i = 0; i < line.length(); i++) // length() is a method on String, not a field
            b.append(line.charAt(i));
}

Besides the above, please use a Java performance diagnostic tool to benchmark your code, and read up on how to microbenchmark Java code.

This is not really a microbenchmark. The poster's approach, however primitive, is not unreasonable for the timescales and time ratios involved. You can use the Unix time command for this with decent confidence that you're seeing a significant effect. – pvg 4 hours ago

Using the read() method on BufferedReader is not a good idea. It won't cause you any harm, but it defeats the purpose of the class.

The whole purpose of BufferedReader is to reduce I/O by buffering the content; see the Java tutorials. You may also notice that read() is declared on Reader, while readLine() is BufferedReader's own method.

If you want to use the read() method, then I would say you'd be better off with FileReader, which is meant for that purpose; again, see the Java tutorials.

So I think the answer to your question is very simple (without going into benchmarking and all those explanations):

  • Each read() is handled by the underlying OS and triggers disk access, network activity, or some other operation that is relatively expensive.
  • When you use readLine() you save all these overheads, so readLine() will always be faster than read() - maybe not substantially for small data, but faster.
    
As already mentioned in the comments: The goal behind the Buffered (!) reader is that it buffers some data. So repeated read() calls will not cause the bytes to be read from the disk one by one. Instead, it regularly reads "chunks" of data. You can even trace this to see that in both the read and the readLine approach, the underlying FileReader is doing the same read calls, each reading 8192 bytes. – Marco13 35 mins ago
    
@Marco13 There are a hell of a lot of comments on this post and I didn't read a few of them, though I did read the answers. If your point is that read() also does some buffering, then I am not sure; I cannot rule out that there could be some optimization. But the basics remain the same about the purpose of the BufferedReader and FileReader classes, and about why read is slower than readLine - because more I/O is involved. – hagrawal 20 mins ago
