
Comments from Charles

Hello everybody,

This is a complex - and interesting! - discussion, and a difficult problem to tackle, but an important one. Addressing it adequately would be a great achievement for Debian.

On Sat, Jun 08, 2019 at 10:07:13PM -0700, Mo Zhou wrote:

1. Free datasets used to train a FreeModel are not required to be uploaded
   to our main section, for example those Osamu mentioned and the Wikipedia
   dump. We are not a scientific data archiving organization, and these
   data will blow up our infra if we upload too much.

How about storing only the data used to train the version that is released in Stable, and keeping it in a dedicated archive to avoid bloating the mirrors? There was a thread on debian-project about how to use Debian's money, and I think this could be a useful case for it.

For the versions in Unstable and Testing, the role of the package maintainer would be to ensure that the data is still available for download.
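As an illustration only, such a check could be automated along the lines of the minimal sketch below; the dataset URL and the checksum are hypothetical placeholders recorded at packaging time, not an existing Debian interface.

#!/usr/bin/env python3
# Minimal sketch: verify that the upstream training data is still
# downloadable and unchanged. URL and checksum are hypothetical placeholders.
import hashlib
import urllib.request

DATASET_URL = "https://example.org/datasets/training-corpus.tar.gz"
EXPECTED_SHA256 = "..."  # value recorded when the package was uploaded

def dataset_still_available(url, expected_sha256):
    """Stream the download and compare its SHA-256 with the recorded value."""
    digest = hashlib.sha256()
    with urllib.request.urlopen(url) as response:
        for chunk in iter(lambda: response.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256

if __name__ == "__main__":
    ok = dataset_still_available(DATASET_URL, EXPECTED_SHA256)
    print("data available and unchanged" if ok else "data missing or modified")

A maintainer could run something like this periodically, or at upload time, and flag the package when the upstream data disappears or changes.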

2. It's not required to re-train a FreeModel with our infra, because
   the outcome/cost ratio is impractical. The outcome is nearly zero
   compared to directly using a pre-trained FreeModel, while the cost
   is increased carbon dioxide in our atmosphere and wasted developer
   time. (Deep learning is producing much more carbon dioxide than we
   thought).

Optionally, we could even consider re-training the release candidate as the Freeze approaches, for the sake of demonstrating that the training process still works.
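For that demonstration to mean something, the packaged training recipe should also be reproducible. A minimal sketch of such a check, using a toy NumPy model as a stand-in for a real package's training pipeline, would retrain twice with a fixed seed and compare the resulting weights:

#!/usr/bin/env python3
# Minimal sketch: retrain a toy model twice with the same seed and check
# that the weights match, i.e. the training recipe still works and is
# deterministic. The model and data are placeholders, not a real pipeline.
import hashlib
import numpy as np

def train(seed):
    """Train a toy logistic-regression model with plain gradient descent."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(200, 5))
    y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) > 0).astype(float)
    w = np.zeros(5)
    for _ in range(500):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= 0.1 * (X.T @ (p - y)) / len(y)
    return w

def weights_digest(w):
    return hashlib.sha256(w.tobytes()).hexdigest()

if __name__ == "__main__":
    first = weights_digest(train(seed=42))
    second = weights_digest(train(seed=42))
    print("training reproducible:", first == second)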

Stable point updates might not need retraining, depending on what the patches address.

Have a nice day,

Charles