PostgreSQL SQL optimization needed for sub-select (need to add ORDER BY clause)

1
Good day. I'm the DBA/infrastructure engineer (that's what they call me, not that I'm an engineer ;)) for a Mendix application, so no, I've not yet opened the Modeler; I just deploy the .mda with m2ee and handle the PostgreSQL database(s). :)

Below are two EXPLAIN ANALYZE runs on a ~165 GB database (including all indexes) for a query from Mendix 6.10.10. The first is the query as issued by the Mendix app; the second is the same query with an ORDER BY clause added to the sub-select. As you can see, it's about 26 times faster than the query that Mendix created.

Now my questions:

1) What should I ask the developer to do to give you better information to debug/analyze this issue?

2) Is there anything the developer(s) can do to get the ORDER BY clause added to the query?

Mendix 6.10.10 query:

aquacheckweb=# explain analyze
select "field$datapoint"."soiltemperature1400", "field$datapoint"."maximumdepth", "field$datapoint"."powersavingsmode"
FROM "field$datapoint"
WHERE "field$datapoint"."id" IN (
    SELECT "c1field$datapoint_profile"."field$datapointid"
    FROM "field$datapoint_profile" "c1field$datapoint_profile"
    WHERE "c1field$datapoint_profile"."field$profileid" = 14918173765683509)
ORDER BY "field$datapoint"."datadate" DESC, "field$datapoint"."id" DESC
LIMIT 1;

                                    QUERY PLAN
----------------------------------------------------------------------------------
 Limit  (cost=227.00..820633.93 rows=1 width=29) (actual time=7471.294..7471.294 rows=1 loops=1)
   ->  Nested Loop Semi Join  (cost=227.00..11310950621.74 rows=13787 width=29) (actual time=7471.293..7471.293 rows=1 loops=1)
         ->  Index Scan Backward using "idx_field$datapoint_datadate" on "field$datapoint"  (cost=113.50..53086239.37 rows=97886374 width=29) (actual time=0.009..1540.537 rows=4257689 loops=1)
         ->  Index Only Scan using "field$datapoint_profile_pkey" on "field$datapoint_profile" "c1field$datapoint_profile"  (cost=113.50..115.01 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=4257689)
               Index Cond: (("field$datapointid" = "field$datapoint".id) AND ("field$profileid" = '14918173765683509'::bigint))
               Heap Fetches: 1
 Planning time: 0.426 ms
 Execution time: 7471.326 ms
(8 rows)

The optimized query:

aquacheckweb=# explain analyze
select "field$datapoint"."soiltemperature1400", "field$datapoint"."maximumdepth", "field$datapoint"."powersavingsmode"
FROM "field$datapoint"
WHERE "field$datapoint"."id" IN (
    SELECT "c1field$datapoint_profile"."field$datapointid"
    FROM "field$datapoint_profile" "c1field$datapoint_profile"
    WHERE "c1field$datapoint_profile"."field$profileid" = 14918173765683509
    order by "c1field$datapoint_profile"."field$datapointid")
ORDER BY "field$datapoint"."datadate" DESC, "field$datapoint"."id" DESC
LIMIT 1;

                                    QUERY PLAN
----------------------------------------------------------------------------------
 Limit  (cost=1661339.87..1661340.37 rows=1 width=29) (actual time=287.819..287.820 rows=1 loops=1)
   ->  Sort  (cost=1661339.87..1668233.37 rows=13787 width=29) (actual time=287.817..287.817 rows=1 loops=1)
         Sort Key: "field$datapoint".datadate DESC, "field$datapoint".id DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Nested Loop  (cost=27953.94..1647552.87 rows=13787 width=29) (actual time=231.228..286.999 rows=4167 loops=1)
               ->  HashAggregate  (cost=27840.44..34733.93 rows=13787 width=8) (actual time=231.152..232.457 rows=4167 loops=1)
                     Group Key: "c1field$datapoint_profile"."field$datapointid"
                     ->  Index Only Scan using "idx_field$datapoint_profile_field$profile_field$datapoint" on "field$datapoint_profile" "c1field$datapoint_profile"  (cost=113.50..14053.44 rows=13787 width=8) (actual time=1.649..229.249 rows=4167 loops=1)
                           Index Cond: ("field$profileid" = '14918173765683509'::bigint)
                           Heap Fetches: 2837
               ->  Index Scan using "field$datapoint_pkey" on "field$datapoint"  (cost=113.50..116.48 rows=1 width=29) (actual time=0.013..0.013 rows=1 loops=4167)
                     Index Cond: (id = "c1field$datapoint_profile"."field$datapointid")
 Planning time: 0.390 ms
 Execution time: 287.922 ms
(14 rows)
asked
4 answers
3

Hi Hendrik,

Thank you for providing this case. This is certainly something that we want to improve. Would it be possible for you to send us a test project so we can investigate why this SQL was generated by Mendix? (You can send it to andrej.koelewijn@mendix.com.)

Also, did you try replacing the IN clause with an EXISTS clause? What query plan does that produce?
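For clarity, a rough sketch of what I mean, reusing your table and column names (untested against your database):

    select "field$datapoint"."soiltemperature1400", "field$datapoint"."maximumdepth", "field$datapoint"."powersavingsmode"
    FROM "field$datapoint"
    WHERE EXISTS (
        SELECT 1
        FROM "field$datapoint_profile" "c1field$datapoint_profile"
        WHERE "c1field$datapoint_profile"."field$datapointid" = "field$datapoint"."id"
          AND "c1field$datapoint_profile"."field$profileid" = 14918173765683509)
    ORDER BY "field$datapoint"."datadate" DESC, "field$datapoint"."id" DESC
    LIMIT 1;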

Update: some feedback from our R&D devs. Interesting point: if you look at the estimated query cost, the query generated by Mendix is actually a lot cheaper than the second version.

Another update: it looks like the Index Scan Backward is killing the performance of the first query. Can you try replacing the index (idx_field$datapoint_datadate) with a descending index?
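For example, something along these lines (the new index name is hypothetical; build the new index first and verify the plan before dropping the old one):

    -- CONCURRENTLY avoids locking the table during the build
    CREATE INDEX CONCURRENTLY "idx_field$datapoint_datadate_desc"
        ON "field$datapoint" ("datadate" DESC);
    -- once the planner is using the new index:
    DROP INDEX CONCURRENTLY "idx_field$datapoint_datadate";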

Thanks,

Andrej

(product manager for the datastorage component)

answered
2

Hendrik,

The statements being used are created by the Mendix runtime as a translation of what the Mendix developer builds in the Modeler. So rewriting the query directly is not an option for the Mendix developer; the only possibility is, as Nikel states, writing it in Java directly. I can understand that this is something you would like to avoid, and it should only be done if all else fails.

The Mendix developer doesn't have direct influence on the translation of the actions in the Modeler to SQL queries, but they could try to find a different logical way to retrieve the information that would put less of a burden on the database. As I don't have the context of the app or the query, I can't give you specific options, but your Mendix developers might have an alternative way to get the same results in their app with different queries.

answered
1

Hi Hendrik,

I have a few thoughts here. Firstly, I agree with you that the Mendix-generated query should be faster, assuming the retrieve command and the Mendix data model are built the way I think they are. Could you have your developer share the domain model? Specifically, I'm interested in the relationship between Profile and Datapoint: is it one Profile to many Datapoints? And what indexes have been created on the Datapoint entity?
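For the index question, something like this against the standard pg_indexes catalog view will list what exists on the table:

    SELECT indexname, indexdef
    FROM pg_indexes
    WHERE tablename = 'field$datapoint';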


Secondly, in this case it seems you're retrieving the latest Datapoint for a particular Profile. The developer could implement a new (additional) 1-1 association between Profile and Datapoint and set that association on each write of a new Datapoint. A retrieve over that new association should be faster than either query here.
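I can't say exactly what SQL the runtime would generate for that retrieve, but assuming the new association is stored in a join table like the existing one (the table name below is hypothetical), it boils down to a direct lookup instead of a sort over millions of Datapoint rows:

    SELECT d."soiltemperature1400", d."maximumdepth", d."powersavingsmode"
    FROM "field$datapoint" d
    JOIN "field$latestdatapoint_profile" a  -- hypothetical join table for the new 1-1 association
      ON a."field$datapointid" = d."id"
    WHERE a."field$profileid" = 14918173765683509;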

answered
0

Since you're not running in the Mendix cloud, you actually have a few options:

- Write your own implementation of the JDBC driver you're using and replace the existing one with it. This driver would recognize queries like the first one and rewrite them into the second one. This is hard to do, but it makes for a good story.

- Implement the retrieve using your own Java actions that connect to the database directly. You would basically be writing your own ORM library. As long as you only do this for a few specific queries, this is not too hard to do.

Both of these options are definitely not something a regular Mendix developer would be able to do, and most people would rightly advise against them. A more sensible approach might be to upgrade your database server so it has more than 256 GB of RAM.

answered