Redesign the Task mechanism
The current Task mechanism suffers from many flaws:
- it's not natively integrated with the database and yet most of the tasks are about updating data in the database in a coherent and atomic way
- it does not scale to multiple machines (for very slow tasks that we want to run elsewhere, think of lintian runs, test rebuilds or autopkgtest runs)
- the lack of "batteries" (think Python's "batteries included" motto) for common tasks means that we have lots of duplication in all our task implementations
- its dependency/event system is brittle and uses an on-disk serialization mechanism (and not an in-database one)
- there's no locking ensuring that the same task is not running multiple times concurrently
- it's not very fault tolerant: when something fails transiently, either the failure aborts everything (including the whole database transaction), which is safe but suboptimal, or it is not treated as a failure at all, and in many cases the next run will not pick up the entry that failed transiently because it believes that the entry has already been processed
We should design a better task mechanism that overcomes those shortcomings.
Here are some design ideas that I had in mind:
- clearly separate the scheduling part (happens on the main server, runs frequently) from the actual work (can happen on external workers, runs only when there's something to do)
- create a TaskData model with fields to track the following information at least:
  - task name
  - timestamp of last successful scheduling run
  - timestamp of last successful worker run (meaningful in single worker case only)
  - result of last worker run (meaningful in single worker case only)
  - extra data (think custom data like list of packages already processed)
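To make the two ideas above more concrete, here is a minimal sketch of the proposed fields and of the scheduling/work split. It is a plain-Python stand-in, not an actual database model: the field names, the `run_pending` flag and the `schedule()`/`execute()` helpers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Optional


@dataclass
class TaskData:
    """Plain-Python stand-in for the proposed TaskData model."""
    task_name: str
    last_scheduling_run: Optional[datetime] = None  # last successful scheduling run
    last_worker_run: Optional[datetime] = None      # last successful worker run
    last_worker_result: Optional[bool] = None       # result of last worker run
    run_pending: bool = False                       # hypothetical flag set by the scheduler
    extra: Dict[str, Any] = field(default_factory=dict)  # custom per-task data


def schedule(data: TaskData, has_work: Callable[[], bool]) -> bool:
    """Scheduling step: cheap check, meant to run frequently on the main server."""
    data.run_pending = data.run_pending or has_work()
    data.last_scheduling_run = datetime.now(timezone.utc)
    return data.run_pending


def execute(data: TaskData, work: Callable[[TaskData], None]) -> bool:
    """Worker step: does the actual work, only when something is pending."""
    if not data.run_pending:
        return False
    try:
        work(data)
    except Exception:
        # Leave run_pending set so that a later run retries the work.
        data.last_worker_result = False
        return False
    data.last_worker_result = True
    data.last_worker_run = datetime.now(timezone.utc)
    data.run_pending = False
    return True
```

In a real implementation these fields would presumably live in a database table, so that the scheduler and the (possibly remote) workers share the same state and the updates can be made atomically.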
We should also offer some mixin classes implementing the bulk of the work for the usual Tasks:
- fetching a file from the network, parsing that file, creating PackageData and ActionItems
- processing every new SourcePackage in some way (e.g. to extract files)
- processing every new SourcePackageRepositoryEntry in the default Repository to generate a PackageData
- etc.
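As a rough illustration of what such a mixin could look like, the sketch below processes every not-yet-seen item once and only records the items that were processed successfully, so that a transient failure is retried on the next run. The class names, method names and the `extra` dictionary (standing in for persisted extra data) are hypothetical, and the package names are made up.

```python
from typing import Any, Dict, Iterable, List


class ProcessItemsMixin:
    """Hypothetical mixin: process each item once, retry transient failures."""

    def __init__(self) -> None:
        # Stand-in for the "extra data" that would be persisted in the database.
        self.extra: Dict[str, Any] = {"processed": []}

    def items_to_process(self) -> Iterable[str]:
        raise NotImplementedError

    def process_item(self, item: str) -> None:
        raise NotImplementedError

    def execute(self) -> List[str]:
        """Process the items not seen yet and return those that failed."""
        done = set(self.extra["processed"])
        failed = []
        for item in self.items_to_process():
            if item in done:
                continue
            try:
                self.process_item(item)
            except Exception:
                # Not recorded as processed, so the next run picks it up again.
                failed.append(item)
                continue
            done.add(item)
        self.extra["processed"] = sorted(done)
        return failed


class ExtractFilesTask(ProcessItemsMixin):
    """Example task built on the mixin; 'flaky' simulates transient errors."""

    def __init__(self, packages, flaky=()):
        super().__init__()
        self.packages = list(packages)
        self.flaky = set(flaky)
        self.extracted = []

    def items_to_process(self):
        return self.packages

    def process_item(self, item):
        if item in self.flaky:
            raise IOError("transient failure for %s" % item)
        self.extracted.append(item)
```

With this shape, a concrete task only has to say which items exist and how to process one of them; the bookkeeping about what has already been handled stays in one place.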