Improve efficiency of Query for large result sets - using Stream
Description
Dear all, I pray that you are well.
Executive Summary
Convenience and performance of Query could be improved by adding a stream() method and implementing Iterable.
Detail
I was looking at how the core PO code works recently. Forgive me if I have overlooked or misunderstood something, but I think that there is a way to improve performance and convenience on large datasets.
At the moment there are three main methods on Query for handling multiple results:
list() - which buffers all of the POs in memory;
iterate() - which first loads (and buffers) all of the primary keys for the results in memory, and then loads the individual POs.
scroll() - which returns a POResultSet.
Each of these has advantages and disadvantages:
list()
list() is perhaps the most convenient to use, because it implements a common java.util.Collection interface which allows you to use all the familiar collections-related functions. However, it does this at the expense of loading and buffering the entire result set in memory (in an ArrayList<>()) before returning. This has two disadvantages when working with large datasets:
The most obvious is the memory usage.
Not so obvious, but also important, is the latency - because the method buffers the entire result set before returning, you have to wait until all the results are processed before you can start working on the first result.
The List interface is definitely convenient, as it can be used in a foreach construct. But the List interface also specifies random access to the entire resultset, which is usually not required - most consumers of large datasets will operate one element at a time and have no need to refer back to an element once it has been processed. So buffering the entire resultset is an expensive way of achieving the convenience of the foreach construct.
iterate()
iterate() improves the memory situation by only buffering the primary key and fetching the rest of the data on-demand. But it still has the following disadvantages:
Memory usage could still be an issue on very large datasets.
Latency, while much improved, is still unnecessarily high as you have to fetch all IDs before you even start working on the first one.
Performance is worse because an additional db fetch is required at each iteration to fetch the data for the PO.
Iterators cannot be directly used in a foreach construct. You need to use a lambda to convert them to an Iterable.
So this method makes some gains in memory and latency, but overall slows down performance.
scroll()
scroll() addresses the memory and performance limitations of list() and iterate() by using a custom POResultSet, which is basically a thin wrapper over the ResultSet. But it does this at the expense of convenience - it does not implement any of the common collection interfaces, so using it results in more verbose and ugly looking code.
Having our cake and eating it
I believe that it ought to be possible to combine the strengths of each of the above approaches - convenience, low memory footprint and low latency.
I think the best way is to that Query() should have a stream() method. Streams are more flexible than iterators -and in particular, they implement AutoCloseable (which iterators do not) which helps with resource management. Moreover, I often find myself using the list() method only so that I can call stream() on the result - which means I'm buffering the entire result set for no reason.
There doesn't seem to be a need to buffer the IDs (as the current iterator implementation does) - the stream could be layered directly on top of the result set.
The list() and iterator() (and perhaps POResultSet) implementations could be easily layered on top of the stream() method - the Stream implementation has an iterator() method to return an iterator.
Streams also support parallelism - I'm not sure if this will work out-of-the-box, but it gives a possible way of future enhancement.
I also think that it would improve usability if Query directly implemented the Iterable interface so that Query could be used directly in a foreach construction.
I can fix the Javadoc for stream() if the current PR gets merged so that I can branch off of it.
Carlos Ruiz
May 18, 2022 at 2:20 PM
Thanks @Heng Sin Low - pushed the pull request.
This part is still pending:
I think that perhaps I should also update the Javadoc for stream() and iterable() to make these resource-handling implications more explicit for the users.
Dear all, I pray that you are well.
Executive Summary
Convenience and performance of Query could be improved by adding a stream() method and implementing Iterable.
Detail
I was looking at how the core PO code works recently. Forgive me if I have overlooked or misunderstood something, but I think that there is a way to improve performance and convenience on large datasets.
At the moment there are three main methods on Query for handling multiple results:
list() - which buffers all of the POs in memory;
iterate() - which first loads (and buffers) all of the primary keys for the results in memory, and then loads the individual POs.
scroll() - which returns a POResultSet.
Each of these has advantages and disadvantages:
list()
list() is perhaps the most convenient to use, because it implements a common java.util.Collection interface which allows you to use all the familiar collections-related functions. However, it does this at the expense of loading and buffering the entire result set in memory (in an ArrayList<>()) before returning. This has two disadvantages when working with large datasets:
The most obvious is the memory usage.
Not so obvious, but also important, is the latency - because the method buffers the entire result set before returning, you have to wait until all the results are processed before you can start working on the first result.
The List interface is definitely convenient, as it can be used in a foreach construct. But the List interface also specifies random access to the entire resultset, which is usually not required - most consumers of large datasets will operate one element at a time and have no need to refer back to an element once it has been processed. So buffering the entire resultset is an expensive way of achieving the convenience of the foreach construct.
iterate()
iterate() improves the memory situation by only buffering the primary key and fetching the rest of the data on-demand. But it still has the following disadvantages:
Memory usage could still be an issue on very large datasets.
Latency, while much improved, is still unnecessarily high as you have to fetch all IDs before you even start working on the first one.
Performance is worse because an additional db fetch is required at each iteration to fetch the data for the PO.
Iterators cannot be directly used in a foreach construct. You need to use a lambda to convert them to an Iterable.
So this method makes some gains in memory and latency, but overall slows down performance.
scroll()
scroll() addresses the memory and performance limitations of list() and iterate() by using a custom POResultSet, which is basically a thin wrapper over the ResultSet. But it does this at the expense of convenience - it does not implement any of the common collection interfaces, so using it results in more verbose and ugly looking code.
Having our cake and eating it
I believe that it ought to be possible to combine the strengths of each of the above approaches - convenience, low memory footprint and low latency.
I think the best way is to that Query() should have a stream() method. Streams are more flexible than iterators -and in particular, they implement AutoCloseable (which iterators do not) which helps with resource management. Moreover, I often find myself using the list() method only so that I can call stream() on the result - which means I'm buffering the entire result set for no reason.
There doesn't seem to be a need to buffer the IDs (as the current iterator implementation does) - the stream could be layered directly on top of the result set.
The list() and iterator() (and perhaps POResultSet) implementations could be easily layered on top of the stream() method - the Stream implementation has an iterator() method to return an iterator.
Streams also support parallelism - I'm not sure if this will work out-of-the-box, but it gives a possible way of future enhancement.
I also think that it would improve usability if Query directly implemented the Iterable interface so that Query could be used directly in a foreach construction.
Thoughts?