Worker Pool: Find failed workers
At the moment the Worker Pool assumes that the worker state recorded in the database is accurate. Clouds are not infallible, especially with spot-priced workers, which get terminated if the spot price goes up.
I don't think this is a serious issue, because the worker will eventually be determined to be idle and "terminated", but we could get a better handle on the situation by either:
- Explicitly terminating workers that haven't connected in X time, and/or
- Occasionally verifying the cloud workers actually exist in the cloud.
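As a rough sketch of what both checks could look like together (this is not existing code; `WorkerRecord`, `find_failed_workers`, `live_instance_ids` and `max_silence` are hypothetical names, and the set of live instance IDs is assumed to come from whatever listing call the cloud provider offers):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WorkerRecord:
    """Hypothetical view of a pool worker as stored in the database."""

    name: str
    instance_id: str
    last_seen: datetime


def find_failed_workers(workers, live_instance_ids, now, max_silence):
    """Return (stale, missing) lists of workers.

    stale:   workers that have not connected for longer than max_silence.
    missing: workers whose instance the cloud no longer reports.
    A worker can appear in both lists.
    """
    stale = [w for w in workers if now - w.last_seen > max_silence]
    missing = [w for w in workers if w.instance_id not in live_instance_ids]
    return stale, missing


# Example: "b" has been silent for two hours, "c" has vanished from the cloud.
now = datetime(2024, 1, 1, 12, 0)
workers = [
    WorkerRecord("a", "i-1", now - timedelta(minutes=1)),
    WorkerRecord("b", "i-2", now - timedelta(hours=2)),
    WorkerRecord("c", "i-3", now - timedelta(minutes=1)),
]
stale, missing = find_failed_workers(
    workers, {"i-1", "i-2"}, now, timedelta(minutes=30)
)
```

Running something like this periodically would cover both options: `stale` feeds the explicit termination, and `missing` only needs the database record cleaned up, since there is nothing left to terminate.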
If the worker was busy executing a task when it was terminated, we should automatically retry the task on another worker. This is probably the most urgent part of this issue.
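A minimal sketch of the retry step, assuming a simplified task model (the `Task` fields, the status strings and the `max_retries` cap are all hypothetical, not the project's actual schema; the cap is my own addition so that a task which keeps killing its worker doesn't loop forever):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """Hypothetical task record; status is "pending", "running" or "failed"."""

    id: int
    status: str
    worker: Optional[str] = None
    retries: int = 0


def requeue_tasks_of(worker_name, tasks, max_retries=3):
    """Put tasks that were running on a dead worker back in the queue.

    Returns the tasks that were requeued. Tasks that already used up
    max_retries are marked "failed" instead (an assumed policy).
    """
    requeued = []
    for task in tasks:
        if task.worker != worker_name or task.status != "running":
            continue
        task.worker = None  # free the task so another worker can pick it up
        if task.retries < max_retries:
            task.retries += 1
            task.status = "pending"
            requeued.append(task)
        else:
            task.status = "failed"
    return requeued
```

This would be called once for each worker found to be terminated, before its database record is removed.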
Edited by Stefano Rivera