Worker Pool: Find failed workers

At the moment the Worker Pool assumes that the worker state recorded in the database is accurate. Clouds are not infallible, especially when you're using spot-priced workers, which get terminated if the spot price goes up.

I don't think this is a serious issue, because after a while the worker will be determined to be idle and "terminated", but we could get a better handle on the situation by:

  1. Explicitly terminating workers that haven't been connected in X time, and/or
  2. Occasionally verifying the cloud workers actually exist in the cloud.
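
A minimal sketch of how the two checks above could be combined, in Python. Everything here is hypothetical and for illustration only: the `Worker` record, the `find_failed_workers` helper, the `max_silence` threshold (the "X time" above) and the `instance_exists` callback, which stands in for whatever cloud provider query would actually be used.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, Optional


@dataclass
class Worker:
    """Hypothetical stand-in for a worker row in the database."""

    name: str
    last_seen: datetime  # last time the worker was connected
    instance_id: Optional[str] = None  # cloud instance id; None if not a cloud worker


def find_failed_workers(
    workers: Iterable[Worker],
    now: datetime,
    max_silence: timedelta,
    instance_exists: Callable[[str], bool],
) -> List[Worker]:
    """Return workers that look dead according to either check."""
    failed = []
    for worker in workers:
        # Check 1: the worker has not been connected for longer than max_silence.
        silent = now - worker.last_seen > max_silence
        # Check 2: the database says this is a cloud worker, but the cloud
        # no longer knows about the instance.
        missing = worker.instance_id is not None and not instance_exists(
            worker.instance_id
        )
        if silent or missing:
            failed.append(worker)
    return failed


# Example with a fake cloud that only knows about instance "i-1".
now = datetime(2024, 1, 1, 12, 0)
workers = [
    Worker("alive", now - timedelta(minutes=1), "i-1"),
    Worker("silent", now - timedelta(hours=2), "i-1"),
    Worker("vanished", now - timedelta(minutes=1), "i-2"),
]
failed = find_failed_workers(
    workers, now, timedelta(minutes=30), lambda instance_id: instance_id == "i-1"
)
print([w.name for w in failed])
```

Running both checks in one periodic pass keeps the logic in one place, though the cloud check would presumably run less often than the silence check, since it costs a call to the provider.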

If the worker was busy executing a task when it was terminated, we should automatically retry the task on another worker. This is probably the most urgent part of this issue.
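
As a rough illustration of that retry step, here is a sketch in Python. The `Task` and `Status` types, the `requeue_tasks_of` helper and the `max_retries` cap are all assumptions made for the example, not part of the existing code; the cap is there only so that a task that keeps killing its worker is not retried forever.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"


@dataclass
class Task:
    """Hypothetical stand-in for a task row in the database."""

    id: int
    status: Status
    worker: Optional[str] = None  # name of the worker the task is assigned to
    retries: int = 0


def requeue_tasks_of(
    worker_name: str, tasks: Iterable[Task], max_retries: int = 3
) -> List[Task]:
    """Put tasks that were running on a failed worker back in the queue."""
    requeued = []
    for task in tasks:
        if task.worker != worker_name or task.status is not Status.RUNNING:
            continue  # not affected by this worker's failure
        if task.retries >= max_retries:
            continue  # left untouched; the caller decides how to fail it
        task.status = Status.PENDING
        task.worker = None  # any worker may pick it up again
        task.retries += 1
        requeued.append(task)
    return requeued


tasks = [
    Task(1, Status.RUNNING, worker="spot-1"),
    Task(2, Status.COMPLETED, worker="spot-1"),
    Task(3, Status.RUNNING, worker="spot-2"),
]
requeued = requeue_tasks_of("spot-1", tasks)
print([t.id for t in requeued])
```

Only the task that was actually running on the terminated worker is touched; completed tasks and tasks on other workers stay as they are.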

Edited by Stefano Rivera