Processing over 1,100 rows (Excel) fails

This is a fairly complex scenario, but I will try to lay out the important aspects as far as I have been able to determine them. I have an ETL system that loads Excel files into a simple staging entity via a scheduling system. The file itself is just saved and then imported via the template by a scheduler (to prevent interference; I will get to that). This all works fine. The problem comes with the second part of this scheduler's action: processing the resulting staging objects into the far more complex data environment of my main program. It does this by first running through all the objects and just validating them (no missing data, formats correct, etc.), and then running through them AGAIN, this time processing and committing the results (inside a single loop). So the first loop just looks through everything and picks up any data errors, and the second loop does the actual processing and committing.

So here is my problem. The first, checking loop always runs perfectly, finding no errors. And as long as I run this for files of about 1,000 rows or fewer (actually about 1.1k, as I have established by some testing), everything works fine, precisely as intended. The moment I try the same file with 500 additional rows (and I've tried various different sets of 500 additional rows), it processes about 1.1k entries and hits a sudden lag spike: logging revealed it does about 100 rows per 30 seconds before that, but then logging just stops for about 5 minutes, and then it continues as if its entire list had been cleared. It zips through the remaining 400 or so nearly instantly as soon as it recovers, because the list it is grabbing them from is suddenly empty.

Trying to track the flow via the debugger has revealed that it seems to fail at a different place in the processing each time, and it never seems to fail when I am actively stepping through the debugger at its most granular level. Rather, whole microflows fail when I try to step over them. Does anyone have any ideas or hints as to what this could be? It sounds like a memory issue, but I have specifically rewritten the process not to load everything into memory, instead dealing with a single row all the way to committing before moving to the next row, so nothing needs to be kept in memory. As it is, I am quite stumped.

Edit 1: I do make use of batches. Each of the two loops described above runs batched: the first in batches of 1,000 rows at a time, and the second I've reduced to batches of 100 at a time in an effort to get this right. It still runs into trouble. Inside each loop I use sub-microflows to do the actual processing work at the individual-entry level.

Edit 2: Before I rewrote the thing, it tried to process everything first and then commit it all afterwards. Needless to say, that was less than successful.

Edit 3: First loop: [image] Second loop: [image] It first failed somewhere at the second microflow there, and the second time I checked it failed somewhere at the first microflow.

Edit 4: To clarify, whether I ran it with batches of 1,000 or batches of 100, the same problem arises at the same approximate place. I also sort the data on a unique identifier and use an offset variable to take chunks of it. Not sure how that compares to the "Processed" flag, although I can see the merits in it. Not sure if it will solve the problem, but I will implement it tomorrow and see how the situation changes.

Edit 5: Some new information has come to light. I use two staging entities (a primary one and a second connected via association and loaded via reference). After the big lag at about 1,100 rows, that second entity appears to be empty! As in, the primary object still has a connection to some entry in the second entity and I can see the link via the debugger, but if my microflow attempts to actually retrieve the object on the other side of this association, it returns empty! The only logical explanation is that the object is getting deleted while this scheduled task is running. This still does not explain why NOTHING is getting committed, but if both entities have been wiped clean in the database... well, actually, that might explain it. It should not be possible. Is there such a thing as a "database timeout" in Mendix? When I try to shut down the database after a failed run, it tells me there is still an open JDBC connection. I make use of the "Commit in separate transaction" Java action. Thanks for the advice to use a "Processed" flag; that has allowed me to fix a smaller bug that was up to now hiding a "Cannot change object as it no longer exists" bug. I've seen something like that in the known issues for Mendix 7.1.0, however, so I am going to attempt an upgrade to 7.3.0 to see if that solves the problem.
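For readers trying to picture the setup: below is a minimal, hypothetical Java sketch of the chunked second loop described above (sort on a unique identifier, retrieve with an offset, process and commit one row at a time). The actual logic lives in microflows; this only approximates it with the Mendix Core API, and the module, entity, attribute and sub-microflow names are made up for illustration.

```java
// Hypothetical sketch of the offset-based second loop: staging rows are sorted
// on a unique identifier, fetched in chunks, and each row is processed by a
// sub-microflow and committed individually. All names are illustrative.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import com.mendix.core.Core;
import com.mendix.systemwideinterfaces.core.IContext;
import com.mendix.systemwideinterfaces.core.IMendixObject;

public class StagingBatchProcessor {

    private static final int BATCH_SIZE = 100;

    public static void processAll(IContext context) throws Exception {
        int offset = 0;
        while (true) {
            // Sort on a unique identifier so the offset walks the set deterministically.
            Map<String, String> sort = new LinkedHashMap<>();
            sort.put("RowNumber", "ASC");

            List<IMendixObject> batch = Core.retrieveXPathQuery(
                    context, "//MyModule.StagingRow", BATCH_SIZE, offset, sort);
            if (batch.isEmpty()) {
                break; // no rows left
            }

            for (IMendixObject row : batch) {
                // Delegate the per-row work to a (hypothetical) sub-microflow,
                // then commit only that row before moving on.
                Core.execute(context, "MyModule.SUB_ProcessStagingRow", row);
                Core.commit(context, row);
            }
            offset += BATCH_SIZE;
        }
    }
}
```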
asked
4 answers

Hannes,

Sounds like a memory issue to me.

One technique I have used is as follows:

  • Put a boolean on your entity called Processed
  • When you have finished processing a single row (as you describe above) you can set Processed to true
  • This enables you to use Processed in your retrieve to work on a small number of rows at a time, i.e. retrieve with a custom range and get, let's say, 100 rows where Processed is false

This would enable you to work your way through the list in chunks without running into any memory issues. You may also find that it's not a memory issue at all, and that a particular record (close to or after the 1,100th in your set) has a data problem that is causing this behaviour. A rough sketch of this retrieve pattern is shown below.
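As an illustration of the technique, here is a minimal Java sketch using the Core API, assuming a staging entity MyModule.StagingRow with a boolean Processed attribute and a per-row sub-microflow; all names are hypothetical.

```java
// Minimal sketch of the "Processed" flag retrieve: always take the first 100
// unprocessed rows, process them, flag them, commit them. Rows drop out of the
// result once Processed is true, so no offset bookkeeping is needed.
import java.util.Collections;
import java.util.List;

import com.mendix.core.Core;
import com.mendix.systemwideinterfaces.core.IContext;
import com.mendix.systemwideinterfaces.core.IMendixObject;

public class ProcessedFlagBatcher {

    public static void processInChunks(IContext context) throws Exception {
        while (true) {
            List<IMendixObject> chunk = Core.retrieveXPathQuery(
                    context,
                    "//MyModule.StagingRow[Processed = false()]",
                    100, 0, Collections.<String, String>emptyMap());
            if (chunk.isEmpty()) {
                break; // everything has been processed
            }
            for (IMendixObject row : chunk) {
                // Hypothetical sub-microflow doing the per-row work.
                Core.execute(context, "MyModule.SUB_ProcessStagingRow", row);
                row.setValue(context, "Processed", true);
                Core.commit(context, row);
            }
        }
    }
}
```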

Hope that helps,

Mike

answered

It is not clear to me whether you use batches to process the rows; how did you rewrite your process? If not, that would be the first step, because if everything is done in a single microflow, everything is still kept in memory for a rollback of the whole microflow.

And why not cut up the process so that you validate first and then use a completely different microflow to do the processing later on? You can use Community Commons to run tasks in a queue. When you build it like this you can handle large Excel imports; ours goes through over a million rows each night. A sketch of such a split is shown below.
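A rough sketch of what such a split could look like when driven from a Java action, assuming hypothetical microflow names and a Boolean return value from the validation microflow; the Community Commons queue actions could replace the second call so the processing pass runs asynchronously.

```java
// Two-phase import: validate everything first, then run the heavy processing
// as a completely separate pass. Microflow names and return types are assumptions.
import com.mendix.core.Core;
import com.mendix.systemwideinterfaces.core.IContext;

public class TwoPhaseImport {

    public static void run(IContext context) throws Exception {
        // Pass 1: validate every staging row (flagging or reporting errors only).
        Boolean valid = Core.execute(context, "MyModule.ValidateStagingRows");

        // Pass 2: only start the processing pass when validation came back clean.
        // In practice this call could be queued via Community Commons instead.
        if (Boolean.TRUE.equals(valid)) {
            Core.execute(context, "MyModule.ProcessStagingRows");
        }
    }
}
```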

Regards,

Ronald

[EDIT]

Note that it is not advisable to run a batch inside a batch, because the outer loop then still keeps all the batches of the inner loop in memory.

answered

I do make use of batches. Each of the two loops described there runs batched: the first in batches of 1,000 rows at a time, and the second I've reduced to batches of 100 at a time in an effort to get this right. It still runs into trouble. Inside each loop I use sub-microflows to do the actual processing work at the individual-entry level.

Edit: Before I rewrote the thing, it tried to process everything first and then commit it all afterwards. Needless to say, that was less than successful.

answered

I'm going to break the rules a bit and answer my own question for the sake of everyone who finds this later with the same problem. My conclusions, after a lot of trial and error:

  • Microflows running SYNCHRONOUSLY time out; switch to async. Otherwise you have a cap of roughly 10 minutes on runtime, and then you lose everything. This was the biggest problem.
  • Even committing things does not clear memory. Everything is still in memory until the transaction finishes. Using Java actions and advanced actions allows you to manage and control these transactions (see the sketch after this list).
  • Keep retrieves small(er). Big retrieves can instantly crash the system with the dreaded "out of heap space" error. Retrieve only what you need and only when you need it. Try to avoid passing big lists around microflows. Use batching where necessary.
  • Keep your microflows smaller. If a microflow has more than 10 activities, consider splitting it up; if it has more than 20, DEFINITELY split some parts off into sub-microflows.
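To make the transaction point concrete, here is a hedged sketch of a helper that could be called from a custom Java action between batches; it is similar in spirit to the Community Commons transaction actions, but verify the behaviour against your own Mendix version before relying on it.

```java
// End the current database transaction (persisting committed work) and start a
// fresh one, so earlier batches no longer have to be kept in memory for a
// possible rollback. Intended to be invoked from a custom Java action.
import com.mendix.systemwideinterfaces.core.IContext;

public class TransactionHelper {

    public static void flushTransaction(IContext context) {
        // Unwind any (nested) open transactions for this context.
        while (context.isInTransaction()) {
            context.endTransaction();
        }
        // Start a clean transaction for the next batch of work.
        context.startTransaction();
    }
}
```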
answered