Best practice for killing long running Atlas jobs

Ajit_Londhe · October 26, 2021, 2:54pm

Hello,

When Atlas jobs get stuck in “PENDING” or “STARTING” status for many hours, is there a best practice for killing them, other than restarting Tomcat?

Thanks,
Ajit

Chris_Knoll · October 26, 2021, 4:11pm

I don’t have a good answer, but I can explain the challenge:

The framework we use for async jobs is ‘Spring Batch’. One thing it does is use thread pools to limit how many async tasks execute concurrently. When you see ‘Pending’ or ‘Starting’ in the UI, the job is queued, waiting for a free thread in the pool, and I’m not aware of any interface into Spring Batch that lets you cancel pending (thread-waiting) tasks.

The ‘cancel’ function in WebAPI only works once the queries are already executing: between execution steps, we check whether the request was canceled, and if so we stop executing the remaining queries and mark the job canceled. In other words, we don’t go back to the database server and cancel an executing statement; we wait for an opportunity to stop the batch (i.e., between queries). The technical reason is that JDBC doesn’t provide a guaranteed way to cancel a running statement.
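To make the “cancel between queries” behaviour concrete, here is a minimal Java sketch. It is not WebAPI code: `CooperativeBatch`, `Status`, and `requestCancel` are hypothetical names, and the database calls are simulated by a `Consumer<String>` so the example runs without JDBC. The cancel flag is only checked before each statement starts, so a statement already in flight always runs to completion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/**
 * Hypothetical sketch (not WebAPI code): run SQL statements in order,
 * checking a cancel flag only between statements, never during one.
 */
public class CooperativeBatch {
    public enum Status { COMPLETED, CANCELED }

    // Set from another thread, e.g. when a user clicks "cancel" in the UI.
    private final AtomicBoolean cancelRequested = new AtomicBoolean(false);

    public void requestCancel() {
        cancelRequested.set(true);
    }

    /**
     * @param statements SQL to run in order
     * @param executor   stand-in for executing one statement (e.g. via JDBC)
     */
    public Status run(List<String> statements, Consumer<String> executor) {
        for (String sql : statements) {
            // The only cancellation point: before the next statement starts.
            if (cancelRequested.get()) {
                return Status.CANCELED;
            }
            // Once started, a statement runs to completion; nothing here interrupts it.
            executor.accept(sql);
        }
        return Status.COMPLETED;
    }

    public static void main(String[] args) {
        CooperativeBatch batch = new CooperativeBatch();
        List<String> ran = new ArrayList<>();
        Status status = batch.run(List.of("q1", "q2", "q3"), sql -> {
            ran.add(sql);
            if (sql.equals("q2")) {
                batch.requestCancel(); // cancel arrives while q2 is "executing"
            }
        });
        System.out.println(status + " " + ran); // CANCELED [q1, q2]
    }
}
```

Running `main` prints `CANCELED [q1, q2]`: the cancel request arrives while `q2` runs, `q2` still completes, and `q3` is never started.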

So, getting back to your question: we’d need to know whether the Spring Batch API offers a way to take a job ID and cancel that job while it’s ‘pending’ in the thread queue.
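For comparison, at the plain `java.util.concurrent` level a task that is still waiting in a thread pool’s queue can be cancelled before it ever runs. The sketch below is JDK-only and is not the Spring Batch API: `QueuedCancelDemo` is a hypothetical name, and a one-thread `ThreadPoolExecutor` stands in for the job executor. Whether WebAPI’s job launcher exposes the corresponding `Future` for a given job ID is the open question above, and is not assumed here.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * JDK-only illustration (hypothetical, not the Spring Batch API): with a pool of
 * one thread, a second task waits in the queue and can be cancelled before it starts.
 */
public class QueuedCancelDemo {
    public static final class Result {
        public final boolean cancelled;       // did Future.cancel succeed?
        public final boolean queuedTaskRan;   // did the cancelled task ever execute?
        public final int queueSizeAfterPurge; // tasks left in the queue after purge()

        Result(boolean cancelled, boolean queuedTaskRan, int queueSizeAfterPurge) {
            this.cancelled = cancelled;
            this.queuedTaskRan = queuedTaskRan;
            this.queueSizeAfterPurge = queueSizeAfterPurge;
        }
    }

    public static Result demo() {
        // One worker thread, unbounded queue: extra submissions wait their turn.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        AtomicBoolean queuedTaskRan = new AtomicBoolean(false);

        // Occupies the only worker until released (the "running" job).
        pool.submit(() -> {
            release.await();
            return null;
        });
        // Cannot start yet, so it sits in the queue (the "pending" job).
        Future<?> pending = pool.submit(() -> queuedTaskRan.set(true));

        // false = do not interrupt; the task has not started, so there is nothing to interrupt.
        boolean cancelled = pending.cancel(false);
        pool.purge(); // remove cancelled tasks from the work queue
        int remaining = pool.getQueue().size();

        // Let the running task finish, then shut the pool down.
        release.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
        }
        return new Result(cancelled, queuedTaskRan.get(), remaining);
    }

    public static void main(String[] args) {
        Result r = demo();
        System.out.println(r.cancelled + " " + r.queuedTaskRan + " " + r.queueSizeAfterPurge); // true false 0
    }
}
```

Because the queued task never started, `cancel(false)` succeeds without interrupting anything, and `purge()` removes it from the queue so it never occupies the worker thread.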