Improve efficiency of Query for large result sets - using Stream

Description

Dear all, I pray that you are well.

Executive Summary

Convenience and performance of Query could be improved by adding a stream() method and implementing Iterable.

Detail

I was looking at how the core PO code works recently. Forgive me if I have overlooked or misunderstood something, but I think that there is a way to improve performance and convenience on large datasets.

At the moment there are three main methods on Query for handling multiple results:

list() - which buffers all of the POs in memory;
iterate() - which first loads (and buffers) all of the primary keys for the results in memory, and then loads the individual POs.
scroll() - which returns a POResultSet.

Each of these has advantages and disadvantages:

list()

list() is perhaps the most convenient to use, because it implements a common java.util.Collection interface which allows you to use all the familiar collections-related functions. However, it does this at the expense of loading and buffering the entire result set in memory (in an ArrayList<>()) before returning. This has two disadvantages when working with large datasets:

The most obvious is the memory usage.
Not so obvious, but also important, is the latency - because the method buffers the entire result set before returning, you have to wait until all the results are processed before you can start working on the first result.

The List interface is definitely convenient, as it can be used in a foreach construct. But the List interface also specifies random access to the entire resultset, which is usually not required - most consumers of large datasets will operate one element at a time and have no need to refer back to an element once it has been processed. So buffering the entire resultset is an expensive way of achieving the convenience of the foreach construct.

iterate()

iterate() improves the memory situation by only buffering the primary key and fetching the rest of the data on-demand. But it still has the following disadvantages:

Memory usage could still be an issue on very large datasets.
Latency, while much improved, is still unnecessarily high as you have to fetch all IDs before you even start working on the first one.
Performance is worse because an additional db fetch is required at each iteration to fetch the data for the PO.
Iterators cannot be directly used in a foreach construct. You need to use a lambda to convert them to an Iterable.

So this method makes some gains in memory and latency, but overall slows down performance.

scroll()

scroll() addresses the memory and performance limitations of list() and iterate() by using a custom POResultSet, which is basically a thin wrapper over the ResultSet. But it does this at the expense of convenience - it does not implement any of the common collection interfaces, so using it results in more verbose and ugly looking code.

Having our cake and eating it

I believe that it ought to be possible to combine the strengths of each of the above approaches - convenience, low memory footprint and low latency.

I think the best way is to that Query() should have a stream() method. Streams are more flexible than iterators -and in particular, they implement AutoCloseable (which iterators do not) which helps with resource management. Moreover, I often find myself using the list() method only so that I can call stream() on the result - which means I'm buffering the entire result set for no reason.

There doesn't seem to be a need to buffer the IDs (as the current iterator implementation does) - the stream could be layered directly on top of the result set.

The list() and iterator() (and perhaps POResultSet) implementations could be easily layered on top of the stream() method - the Stream implementation has an iterator() method to return an iterator.

Streams also support parallelism - I'm not sure if this will work out-of-the-box, but it gives a possible way of future enhancement.

I also think that it would improve usability if Query directly implemented the Iterable interface so that Query could be used directly in a foreach construction.

Thoughts?

Environment

None

Activity

Heng Sin Low

May 19, 2022 at 2:18 AM

Created https://idempiere.atlassian.net/browse/IDEMPIERE-5294 to address the performance issue of POIterator.

Jeremy Krieg

May 18, 2022 at 3:04 PM

Javadoc fixed: [IDEMPIERE-5014] Update Javadoc for new stream() method by kriegfrj · Pull Request #1337 · idempiere/idempiere (github.com)

Carlos Ruiz

May 18, 2022 at 2:31 PM

Hi @Jeremy Krieg - it is already merged

Jeremy Krieg

May 18, 2022 at 2:27 PM

I can fix the Javadoc for stream() if the current PR gets merged so that I can branch off of it.

Carlos Ruiz

May 18, 2022 at 2:20 PM

Thanks @Heng Sin Low - pushed the pull request.

This part is still pending:

I think that perhaps I should also update the Javadoc for stream() and iterable() to make these resource-handling implications more explicit for the users.

Resize issue view side panel

Fixed

Details

Assignee

Jeremy Krieg

Reporter

Jeremy Krieg

Components

Core

Fix versions

Priority

Major

Created October 24, 2021 at 11:35 PM

Updated July 1, 2022 at 2:30 PM

Resolved May 25, 2022 at 7:36 AM