Make SALSA_CI_BUILD_TIMEOUT_ARGS automatically 10-15 minutes less than CI_JOB_TIMEOUT

In GitLab CI (or at least in how the Salsa CI pipeline is currently configured), when a job times out, the GitLab CI cache is not saved. For example, in https://salsa.debian.org/otto/mariadb-server/-/jobs/7566665/viewer one can see:

/usr/lib/ccache/c++ -DBTR_CUR_ADAPT -DBTR_CUR_HASH_ADAPT -DHAVE_CONFIG_H -DHAVE_FALLOC_PUNCH_HOLE_AND_KEEP_SIZE=1 -DHAVE_LIBNUMA=1 -DHAVE_PMEM -DHAVE_SYSTEM_REGEX -DHAVE_URING -DPCRE_STATIC=1 -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE=1 -Dmariadb_backup_EXPORTS -I/builds/otto/mariadb-server/debian/output/source_dir/wsrep-lib/include -I/builds/otto/mariadb-server/debian/output/source_dir/wsrep-lib/wsrep-API/v26 -I/builds/otto/mariadb-server/debian/output/source_dir/builddir/include -I/builds/otto/mariadb-server/debian/output/source_dir/include/providers -I/builds/otto/mariadb-server/debian/output/source_dir/storage/innobase/include -I/builds/otto/mariadb-server/debian/output/source_dir/storage/innobase/handler -I/builds/otto/mariadb-server/debian/output/source_dir/libbinlogevents/include -I/builds/otto/mariadb-server/debian/output/source_dir/tpool -I/builds/otto/mariadb-server/debian/output/source_dir/include -I/builds/otto/mariadb-server/debian/output/source_dir/sql -I/builds/otto/mariadb-server/debian/output/source_dir/storage/maria -I/builds/otto/mariadb-server/debian/output/source_dir/extra/mariabackup/quicklz -I/builds/otto/mariadb-server/debian/output/source_dir/extra/mariabackup -g -O2 -ffile-prefix-map=/builds/otto/mariadb-server/debian/output/source_dir=. 
-fstack-protector-strong -fstack-clash-protection -Wformat -Werror=format-security -fcf-protection -Wdate-time -D_FORTIFY_SOURCE=2 -Wdate-time -D_FORTIFY_SOURCE=2 -pie -fPIC -fstack-protector --param=ssp-buffer-size=4 -Wconversion -Wno-sign-conversion -O3 -g -DNDEBUG -g -fno-omit-frame-pointer -fno-strict-aliasing -Wno-uninitialized -fno-omit-frame-pointer -D_FORTIFY_SOURCE=2 -DDBUG_OFF -Wall -Wenum-compare -Wenum-conversion -Wextra -Wformat-security -Wmissing-braces -Wno-format-truncation -Wno-init-self -Wno-nonnull-compare -Wno-unused-parameter -Wnon-virtual-dtor -Woverloaded-virtual -Wvla -Wwrite-strings -std=gnu++17   -Wdate-time -D_FORTIFY_SOURCE=2 -DHAVE_OPENSSL -DOPENSSL_API_COMPAT=0x10100000L -UMYSQL_SERVER -DHAVE_OPENSSL -DOPENSSL_API_COMPAT=0x10100000L -MD -MT extra/mariabackup/CMakeFiles/mariadb-backup.dir/backup_copy.cc.o -MF CMakeFiles/mariadb-backup.dir/backup_copy.cc.o.d -o CMakeFiles/mariadb-backup.dir/backup_copy.cc.o -c /builds/otto/mariadb-server/debian/output/source_dir/extra/mariabackup/backup_copy.cc
Terminated
make[4]: *** [extra/mariabackup/CMakeFiles/mariadb-backup.dir/build.make:319: extra/mariabackup/CMakeFiles/mariadb-backup.dir/encryption_plugin.cc.o] Error 143
make[4]: *** Waiting for unfinished jobs....
make[4]: Leaving directory '/builds/otto/mariadb-server/debian/output/source_dir/builddir'
make[3]: *** [CMakeFiles/Makefile2:7250: extra/mariabackup/CMakeFiles/mariadb-backup.dir/all] Error 2
make[3]: Leaving directory '/builds/otto/mariadb-server/debian/output/source_dir/builddir'
make[2]: *** [Makefile:169: all] Error 2
make[2]: Leaving directory '/builds/otto/mariadb-server/debian/output/source_dir/builddir'
make[1]: *** [debian/rules:133: override_dh_auto_build] Error 2
make[1]: Leaving directory '/builds/otto/mariadb-server/debian/output/source_dir'
make: *** [debian/rules:242: binary-indep] Error 2
dpkg-buildpackage: error: debian/rules binary-indep subprocess returned exit status 2
WARNING: step_script could not run to completion because the timeout was exceeded. For more control over job and script timeouts see: https://docs.gitlab.com/ee/ci/runners/configure_runners.html#set-script-and-after_script-timeouts
ERROR: Job failed: execution took longer than 2h0m0s seconds

This is not ideal, because for C/C++ builds a re-run could be much faster if the ccache from the failed build were used to prime the next build. Therefore, the Salsa CI pipeline's SALSA_CI_BUILD_TIMEOUT_ARGS variable was designed to be passed to the timeout command, which aborts a build before it hits the CI job timeout, leaving time to save the cache.

However, there is no automation to ensure that SALSA_CI_BUILD_TIMEOUT_ARGS is actually less than the job timeout. As seen in the log linked above, the job does run

$ su salsaci -c "timeout ${SALSA_CI_BUILD_TIMEOUT_ARGS} ${BUILD_COMMAND} && if [ "${BUILD_TWICE}" = "true" ]; then ${BUILD_COMMAND}; fi" |& OUTPUT_FILENAME=${BUILD_LOGFILE} filter-output

but the environment variables are:

  • SALSA_CI_BUILD_TIMEOUT_ARGS= -v 2.75h
  • CI_JOB_TIMEOUT=7200 (2.0h)
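A quick sanity check on those numbers: timeout(1) is told to wait 2.75 hours, while GitLab kills the whole job at 2 hours, so the inner timeout never fires and the cache-upload step is never reached:

```shell
# 2.75 h requested via "timeout -v 2.75h" vs the 2 h enforced by GitLab
echo $(( 2 * 3600 ))               # CI_JOB_TIMEOUT: prints 7200
awk 'BEGIN { print 2.75 * 3600 }'  # -v 2.75h: prints 9900
```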

Main issue: fix the timeout logic to always stop builds early enough to upload the cache

We should fix this by either:

  1. Adding automation to ensure that SALSA_CI_BUILD_TIMEOUT_ARGS is always ~10 minutes less than CI_JOB_TIMEOUT

OR

  2. Redesigning the whole timeout feature using the built-in variables RUNNER_SCRIPT_TIMEOUT and RUNNER_AFTER_SCRIPT_TIMEOUT introduced in GitLab 16.4
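For the first option, the derivation could be as simple as the following sketch (only CI_JOB_TIMEOUT and SALSA_CI_BUILD_TIMEOUT_ARGS come from the pipeline; MARGIN and BUILD_TIMEOUT are made-up names for illustration):

```shell
# Hypothetical before_script step: derive the build timeout from the job
# timeout, leaving a 10-minute margin for cache upload. GitLab exports
# CI_JOB_TIMEOUT in seconds; it is hardcoded here only for the example.
CI_JOB_TIMEOUT=7200
MARGIN=600                                    # 10 minutes
BUILD_TIMEOUT=$(( CI_JOB_TIMEOUT - MARGIN ))
SALSA_CI_BUILD_TIMEOUT_ARGS="-v ${BUILD_TIMEOUT}s"
echo "${SALSA_CI_BUILD_TIMEOUT_ARGS}"         # prints: -v 6600s
```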

Extra: review cache name

Relatedly, we should also make sure that the cache key makes sense. We don't want projects to accumulate too many unique caches, as that is wasteful, but we also don't want a single overloaded cache where builds constantly evict cached objects that are still useful for many other builds.

Current..

.build-definition: &build-definition
  stage: build
  image: $SALSA_CI_IMAGES_BASE
  cache:
    when: always
    key: "build-${BUILD_ARCH}_${HOST_ARCH}"
    paths:
      - .ccache

..results in e.g.

Checking cache for build-amd64_-non_protected...
Downloading cache from https://storage.googleapis.com/debian-salsa-prod-runner-cache/project/74127/build-amd64_-non_protected  ETag="8fd5af663f88e9890bc428e1b6071216"

..when the build has BUILD_ARCH=amd64 and HOST_ARCH is empty, leaving a stray "_" in the key.
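One option (a sketch, not a tested fix) is to append HOST_ARCH only when it is non-empty, using shell parameter expansion, so a native build gets a key like build-amd64 instead of build-amd64_:

```shell
# ${HOST_ARCH:+_${HOST_ARCH}} expands to "_${HOST_ARCH}" only when HOST_ARCH
# is set and non-empty, avoiding the trailing "_" in the cache key.
BUILD_ARCH=amd64
HOST_ARCH=
echo "build-${BUILD_ARCH}${HOST_ARCH:+_${HOST_ARCH}}"   # prints: build-amd64
HOST_ARCH=arm64
echo "build-${BUILD_ARCH}${HOST_ARCH:+_${HOST_ARCH}}"   # prints: build-amd64_arm64
```

Note that GitLab's own variable expansion in cache:key is not a shell, so this logic would have to run in a script that exports a derived variable for the key to use.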

Extra: show info about downloaded cache

Additionally, the current zeroing of the ccache statistics before the build starts is unhelpful for debugging what the downloaded cache contained:

$ ccache -z
Statistics zeroed

It would be better to replace the above with ccache --show-stats --zero-stats, as it would list things such as "files in cache" and "cache size" while also zeroing the stats for the run that is about to start.
