Multiple filters vs multiple retrieves

2
Hi all, Just out of curiosity: I know this question has been asked before (More specific: I've seen the answer before), but it has slipped my mind. I have a microflow in which I, let's say, retrieve a list of Book from database. The Book entity has two booleans: 1. haveRead and 2. toRead. In the end I want two seperate lists to iterate over, filtered by the booleans being true. What is the best and optimal way to achieve this performance-wise? Retrieve two seperate lists with XPaths: [haveRead] and [toRead], or retrieve all Book and filter them twice? Side note: the Book entity always has at least one of the two booleans set to true, so all Books are retrieved throughout the microflow and I never need the complete list, just the two filtered ones.
asked
5 answers
5

Hi Willem,

In short, the first option retrieve 2 lists will put a wee bit more load on the database. Because you will retrieve it as two different data sets. This also creates 2 object lists in memory (the one with haveRead and the one with toRead).

The second option will retrieve all objects at once, slightly decreasing the load on the database. However you will have 1 list in memory (containing all objects) and then create 2 more lists: haveRead and toRead in memory. Putting more load on the memory. It is advised to clear the list containing all objects (since you won't be needing it, and this allows the garbage collector to collect these objects and free memory) after you created the two filtered lists.

I'm not sure if there is one right answer, many factors can influence the speed (the hardware, mendix memory and database performance, indexes, size of the data set). I would recommend to test both cases and measure the performance. This works best if you have a production-like data set available.

With kind regards, Stephan

answered
3

A lot of things can be said about this question:

  • Only try to optimize something like this if there really is a need to. In this specific example both choices will be fine, because a database is very good at retrieving based on an indexed boolean, and a cpu is very good at filtering lists based on a boolean.

  • Benchmarking these two cases against each other may not make any sense, since it is likely that in practice you're measuring a .001% impact on total processing time for handling the complete request/scheduled event against a .002 % impact. So even if one method is twice as slow as the other in a benchmark, it may still not be relevant.

  • If you're just interested in this out of curiosity or because you are sure this really is a performance bottleneck, the only way to be sure would be to do a benchmark on an environment that is a good representation of you production environment.

  • In general, it is better to go for a model that is the best reflection of what you're intending to do, not what performs best. Real world performance issues generally do not occur because of a choice for something like this. Way more likely are additions to a model that's not well understood, or things like missing indexes, way more data then expected, or an inefficient algorithm.

  • In a well performed benchmark, my money would be on the two separate database retrieves being faster.

answered
1

I would recommend the 2 database retrieves every time. It will be the more efficient route probably more than 99% of the time.

A relational database is designed and optimized to work with sets (especially when filtering by a Boolean that is indexed), while the in-memory Java processes must use loops filter records.

answered
0

As a default, I would retrieve twice rather than retrieve once filter twice...uses less memory I believe. Curious to see what others say.

answered
0

https://forum.mendix.com/link/ideas/2000

answered