
Stricter availability requirements for Free Model training datasets

This issue may sound nitpicky, but it would be nice to address it. Currently, these are the requirements for the Free Model training datasets (a.k.a. "External-Data"):

the dataset is always publicly and anonymously available through network

These requirements are of course necessary but, IMO, not sufficient: the dataset itself has to be Type-H Reproducible. For example, a link to the latest Wikipedia dump should not satisfy the requirements, because its content changes over time. Instead, a link to a dump with a specific timestamp should be provided, since anyone trying to reproduce the model will almost certainly want exactly the same training data.
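To make the idea concrete, here is a rough sketch of what a "pinned dataset" check could look like. Everything in it is an assumption for illustration only: the helper names, the list of "moving" path segments, and the example.org URLs are hypothetical, and nothing here is part of ML-Policy. It pairs a crude URL heuristic with a SHA-256 check of the downloaded file, since a timestamped link alone does not prove the content is unchanged.

```python
import hashlib
import re
from pathlib import Path

# Hypothetical heuristic: path segments that usually denote a moving target.
MOVING_SEGMENTS = {"latest", "current", "master", "main", "head"}


def looks_pinned(url: str) -> bool:
    """Return False if any URL segment names a moving target such as 'latest'."""
    segments = re.split(r"[/?#&=]", url.lower())
    return not any(seg in MOVING_SEGMENTS for seg in segments)


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path: Path, expected_sha256: str) -> bool:
    """Check that a downloaded dataset matches the digest recorded at training time."""
    return sha256_of(path) == expected_sha256.lower()


# Illustrative URLs only (example.org is a placeholder, not a real dump host).
print(looks_pinned("https://example.org/dumps/latest/articles.xml.bz2"))
print(looks_pinned("https://example.org/dumps/20190101/articles.xml.bz2"))
```

The URL check is only a first filter and is easy to fool; the recorded checksum is what would actually tie a model to one specific copy of the data.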

However, I am not sure how to phrase this concisely in ML-Policy.