Commit 402f7627 authored by Bastian Blank

Add salsa-postmortem-docker-registry to be reviewed

parent d5faa7b4
Title: Postmortem of failed Docker registry move
Slug: salsa-postmortem-docker-registry
Date: 2019-08-15 00:00
Author: Bastian Blank
Tags: debian, salsa
Status: draft
GitLab uses the registry tool out of [Docker distribution]( to provide a Docker image registry.
  • I would add a sentence to clarify the relationship between Salsa and Docker. E.g.

    "... to provide a Docker image registry. This way, the projects hosted in Salsa can enable CI/CD settings and run test builds or blah blah..."


    "... to provide a Docker image registry. Docker images can be used in Salsa for X, Y or Z if project owner enables feature T...."

    This way the end user can quickly find out if they are affected, are part of the problem, or could be affected by the measures taken. (After reading the postmortem I am still not sure if this is something totally internal to GitLab or is about the CI features for projects, and I also don't know if the recent mail about CI being disabled is related to this or whether they were different issues.)

  • Gitlab uses the Docker registration toolset in order to provide the registry for Docker images which are used for $foo in Debian

This software supports multiple backends for file storage, including a local filesystem, S3 and Google Cloud Storage (GCS).
As [Salsa]( already uses GCS for data storage, we decided to move all the Docker registry data off to it too.
  • the Salsa Administrator team? Salsa team? (We need to be explicit here because "we" in bits.d.o usually means 'the whole Debian community').

  • No need for "including".

    9 This system supports multiple backends for file storage: Local, Amazon Simple Storage Service (Amazon S3), and Google Cloud Storage (GCS) 10 ... the Salsa administration team ... decided to move all of the Docker registry data there as well for $reason.

  • Oxford comma, really?

## Migration and rollback
On 2019-08-06 we started the migration process.
  • Same as above. Other option is to add a paragraph at top explaining "The Salsa admins provide the following report about..." and then keep all the "we" as is.

The migration itself went fine, even if it took a bit longer than anticipated.
Everything looked fine and user access worked fine.
However, as not all parts of the migration had been properly tested,
a test of the garbage collection triggered a [bug]( in the software.
On 2019-08-10 we started to see problems with garbage collection.
The job running it timed out after one hour.
Within this timeframe it did not even manage to collect information about all used layers to see what it could clean up.
A source code analysis showed that this can't be fixed.
On 2019-08-13 we switched back to storing data on the filesystem.
## Docker registry data storage
The Docker registry stores all data in a file-system-like structure.
There is no sort of index of the contents.
There isn't anything that would make searching for stuff easy.
Everything is in the file system.
Within this structure it saves four kinds of information.
First are the manifests, which make up images and show what they contain.
It saves tags that provide a name to manifests.
There are deduplicated layers or blobs, storing the real data.
Links show which deduplicated blobs belong to an image.
All of that is stored without any reverse references.
The whole structure is built as append-only.
You can add blobs, you can also add manifests.
You can add, change and delete tags.
However, cleaning up anything apart from tags is not really a thing.
  • I'm not native speaker but this sentence looks too informal IMHO, maybe "is not considered"?

  • 17 However, (comma)

    20 The job running garbage collection started to time out after one hour, within that time frame it also did not collect the layered information required for cleanup. A source code analysis showed that the error(?) could not be fixed.

    29 Docker stores all of the registry data sans indexing or reverse references in a file system-like structure comprised of 4 separate types of information: Manifests of images and contents, tags for the manifests, deduplicated layers (or blobs) which store the actual data, and lastly links which show which deduplicated blobs belong to their respective images, all of this does not allow for easy searching within the data.

    41 The file system structure is built as append-only which allows for adding blobs and manifests, addition, modification, or deletion of tags, however cleanup of items other than tags is not achievable within the maintenance tools.

    --apologies, I didn't read the entire document first and edited in-line. I pray this makes sense.

There is a garbage collection process.
According to the documentation you must only use it while the registry is read-only.
It can clean up unreferenced blobs.
Since the last release it can also clean up unreferenced manifests.
However, it can't clean up links.
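
To illustrate why missing reverse references make cleanup expensive, here is a minimal sketch of a mark-and-sweep pass over a toy model of the storage layout described above. This is not the registry's actual code: `ToyRegistry`, `collect_garbage` and the digests are hypothetical names invented for the example, and it only covers unreferenced blobs, not manifests or links.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyRegistry:
    """Hypothetical in-memory stand-in for the registry's storage layout."""

    # blob digest -> layer data (the "real data")
    blobs: dict[str, bytes] = field(default_factory=dict)
    # manifest digest -> digests of the blobs it references
    manifests: dict[str, list[str]] = field(default_factory=dict)
    # tag name -> manifest digest (not consulted by this sketch)
    tags: dict[str, str] = field(default_factory=dict)


def collect_garbage(reg: ToyRegistry) -> set[str]:
    """Delete blobs that no manifest references and return their digests.

    There are no reverse references, so the mark phase has to read
    every manifest before any blob can be declared unused.
    """
    referenced: set[str] = set()
    for layers in reg.manifests.values():  # mark: full scan of all manifests
        referenced.update(layers)
    unused = set(reg.blobs) - referenced  # sweep: everything never marked
    for digest in unused:
        del reg.blobs[digest]
    return unused


# Toy data: two blobs used by one manifest, one orphaned blob.
reg = ToyRegistry(
    blobs={"sha256:a": b"base", "sha256:b": b"app", "sha256:c": b"orphan"},
    manifests={"sha256:m1": ["sha256:a", "sha256:b"]},
    tags={"latest": "sha256:m1"},
)
removed = collect_garbage(reg)
```

In this toy model the scan is a cheap in-memory loop. On real storage every step of the mark phase is a read from the storage backend, which is where the cost comes from.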
## Docker registry garbage collection on external storage
For a garbage collection the registry tool needs to read a lot of information.
Remember, there is no index it could use to see what's in there.
So it goes out to the storage and downloads … everything, or at least lists every object.
The registry attached to Salsa contains around 110k files.
  • 46 There is a garbage collection process which can be used to clean up unreferenced blobs, however according to the documentation the process can only be used while the registry is set to read-only and unfortunately it cannot be used to clean up erroneous? [suggestion] links.

    54 For the garbage collection the registry tool needs to read a lot of information as there is no indexing of the data. The tool connects to the storage medium and proceeds to download ... everything, every single manifest and referenced blob which take up over 1 second of processing time. We have not had the chance to count the currently available manifests, though we do note that the registry attached to Salsa contains around 110K files which adds considerably to the task.

It has to download every manifest to get the referenced blobs.
This process somehow takes over a second for each manifest.
I haven't counted the currently available manifests,
but it is clear that this may take a lot of time.
So in the configuration we used, with external storage, it is simply impossible to run any cleanup.
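
As a rough back-of-the-envelope check, the numbers above can be put into a few lines of Python. The one second per manifest and the one-hour timeout come from the observations above; the manifest count is a purely hypothetical placeholder, since the real number has not been measured.

```python
# Observed: the job timed out after one hour.
JOB_TIMEOUT_SECONDS = 60 * 60
# Observed: "over a second" per manifest, so 1.0 is a lower bound.
SECONDS_PER_MANIFEST = 1.0


def max_manifests_within_timeout(
    timeout: float = JOB_TIMEOUT_SECONDS,
    seconds_per_manifest: float = SECONDS_PER_MANIFEST,
) -> int:
    """Upper bound on the manifests one job can read before timing out."""
    return int(timeout // seconds_per_manifest)


def hours_needed(
    manifest_count: int,
    seconds_per_manifest: float = SECONDS_PER_MANIFEST,
) -> float:
    """Lower bound on the hours needed to read the given number of manifests."""
    return manifest_count * seconds_per_manifest / 3600


budget = max_manifests_within_timeout()

# ASSUMPTION: illustrative value only, not a measured manifest count.
hypothetical_manifest_count = 20_000
estimate = hours_needed(hypothetical_manifest_count)
```

So a one-hour job can read at most a few thousand manifests. If the registry holds more than that, the collection can never finish inside the timeout, and the registry would have to stay read-only for the whole run.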
## Lessons learned
The Docker registry is a data storage tool that can only properly be used in append-only mode.
If you never clean up, it works well.
  • I would remove this sentence and reword the next one:

    As soon as you want to actually remove data when performing cleanup tasks, it won't work well.

    (Not very happy with my own wording, either, though...)

  • 62-63 ...this will take up a significant amount of time which in the current configuration of external storage would make the clean up nearly impossible.

As soon as you want to actually remove data, it goes bad.
For Salsa we actually want to remove stuff, as the registry currently grows by about 20 GB per day.
  • I think we need another paragraph to properly close the article, about what are the plans from now on (e.g. the team will study available options, put measures in place to reduce the growth of the registry...)
