Commit fdf20c6f authored by Jérémy Bobbio

Initial version of reproducible-builds.org website

Ready for reviews.
*.sw[p-z]
*~
#*
/_site
Source for the reproducible-builds.org website
==============================================
The website for reproducible-builds.org is made with [Jekyll]. It was
initially created and tested with version 2.2.0, available in Debian
unstable in the `jekyll` package.
The boilerplate CSS is provided by [Skeleton].
[Jekyll]: http://jekyllrb.com/
[Skeleton]: http://www.getskeleton.com/
Viewing the website
-------------------
It's possible to view the website while making changes:
$ jekyll serve --watch
A local web server will be started; it can be accessed with a browser
to see changes as they are made.
Build the website
-----------------
The website is built by running:
$ jekyll build
The result will be available in the `_site` directory.
News
----
News items are Jekyll blog posts. Adding a news item is a matter of
creating a new page in the `_posts` directory.
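For example, a hypothetical news item would follow the usual Jekyll naming
convention, e.g. `_posts/2015-12-01-example-news-item.md`.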
Documentation
-------------
The documentation homepage is in `docs.html`. Individual pages are kept
in the `_docs` folder.
The order of the index and the titles of the subsections are defined in
`_data/docs.yml`. The order is defined using permalinks (and not
file names).
Projects
--------
The list of projects involved in reproducible builds is kept as YAML
in `_data/projects.yml`.
Logos must be in the `images/logos` folder. Images should be 124×124
pixels.
markdown: kramdown
highlighter: pygments
permalink: /news/:year/:month/:day/:title/
# Site settings
title: reproducible-builds.org
email: contact@reproducible-builds.org
description: |
“Reproducible builds” aim to provide a verifiable path from software source code to its compiled binary form.
baseurl: ""
url: "https://reproducible-builds.org/"
collections:
docs:
output: true
events:
output: true
exclude:
- README
- .git
- .gitignore
- title: Best practices
docs:
- plans
- buy-in
- test-bench
- title: Get deterministic builds
docs:
- deterministic-build-systems
- volatile-inputs
- stable-inputs
- value-initialization
- version-information
- timestamps
- timezones
- locales
- archives
- stable-outputs
- randomness
- build-path
- title: Define a build environment
docs:
- perimeter
- recording
- definition-strategies
- proprietary-os
- title: Distribute the environment
docs:
- build-toolchain-from-source
- virtual-machine-drivers
- formal-definition
- title: Comparison protocol
docs:
- checksums
- embedded-signatures
- sharing-certifications
- name: Debian
url: https://www.debian.org/
logo: debian.png
resources:
- name: Wiki pages
url: https://wiki.debian.org/ReproducibleBuilds
- name: Continuous tests
url: https://reproducible.debian.net/
- name: Tor Browser
url: https://www.torproject.org/
logo: tor.png
resources:
- name: Building guide
url: https://trac.torproject.org/projects/tor/wiki/doc/TorBrowser/Hacking#BuildingOfficialTorBrowserReleaseBinaries
- name: Bitcoin
url: https://github.com/bitcoin/bitcoin
logo: bitcoin.png
resources:
- name: Gitian building guide
url: https://github.com/bitcoin/bitcoin/blob/master/doc/gitian-building.md
- name: FreeBSD
url: https://www.freebsd.org/
logo: freebsd.png
resources:
- name: Base system
url: https://wiki.freebsd.org/ReproducibleBuilds
- name: Ports
url: https://wiki.freebsd.org/PortsReproducibleBuilds
- name: Continuous tests
url: https://reproducible.debian.net/freebsd/
- name: Coreboot
url: http://www.coreboot.org/
logo: coreboot.png
resources:
- name: Continuous tests
url: https://reproducible.debian.net/coreboot/
- name: OpenWrt
url: https://openwrt.org/
logo: openwrt.png
resources:
- name: Continuous tests
url: https://reproducible.debian.net/openwrt/
- name: NetBSD
url: https://www.netbsd.org/
logo: netbsd.png
resources:
- name: Continuous tests
url: https://reproducible.debian.net/netbsd/
- name: Arch Linux
url: https://www.archlinux.org/
logo: archlinux.png
resources:
- name: Continuous tests
url: https://reproducible.debian.net/archlinux/
---
title: Archive metadata
layout: docs
permalink: /docs/archives/
---
Most archive formats record metadata that will capture details about the
build environment if care is not taken. File last modification time is
obvious, but file ordering, users, groups, numeric ids, and permissions
can also be concerns. Tar is going to be used as the main example but
these tips should apply to other archive formats as well.
File modification times
-----------------------
Most archive formats will, by default, record file last modification
times. Some will also record file creation times.
Tar has a way to specify the modification time that must be used for all
files:
{% highlight sh %}
$ tar --mtime='2015-10-21 00:00Z' -cf product.tar build
{% endhighlight %}
(Notice how `Z` is used to specify that time is in the UTC
[timezone]({{ "/docs/timezones/" | prepend: site.baseurl }}).)
For other archive formats, it is always possible to use `touch` to reset
the modification times to a [predefined value]({{ "/docs/timestamps/" |
prepend: site.baseurl }}) before creating the archive:
{% highlight sh %}
$ find build -print0 |
xargs -0r touch --no-dereference --date="@${SOURCE_DATE_EPOCH}"
$ zip -r product.zip build
{% endhighlight %}
In some cases, it can be preferable to keep the original times for files
that have not been created or modified during the build process:
{% highlight sh %}
$ find build -newermt "@${SOURCE_DATE_EPOCH}" -print0 |
xargs -0r touch --no-dereference --date="@${SOURCE_DATE_EPOCH}"
$ zip -r product.zip build
{% endhighlight %}
A patch has been written to make the latter operation easier with GNU
Tar. It is available in Debian since
[tar](https://packages.qa.debian.org/tar) version 1.28-1. Hopefully it
will be integrated upstream soon, but you might want to use it with
caution. It adds a new `--clamp-mtime` flag which will only set the time
when the file is more recent than the value given with `--mtime`:
{% highlight sh %}
# Only in Debian unstable for now
$ tar --mtime='2015-10-21 00:00Z' --clamp-mtime -cf product.tar build
{% endhighlight %}
This has the benefit of leaving the original file modification time
untouched.
File ordering
-------------
When asked to record directories, most archive formats will read their
content in the order returned by the filesystem which is [likely to be
different on every run]({{ "/docs/stable-inputs/" | prepend:
site.baseurl }}).
With version 1.28, GNU Tar gained the `--sort=name` option, which sorts
file names in a locale-independent manner:
{% highlight sh %}
# Works with GNU Tar 1.28
$ tar --sort=name -cf product.tar build
{% endhighlight %}
For older versions or other archive formats, it is possible to use
`find` and `sort` to achieve the same effect:
{% highlight sh %}
$ find build -print0 | LC_ALL=C sort -z |
tar --null -T - --no-recursion -cf product.tar
{% endhighlight %}
Care must be taken to ensure that `sort` is called in the context of the
C locale to avoid any surprises related to collation order.
Users, groups and numeric ids
-----------------------------
Depending on the archive format, the user and group owning the file
can be recorded. Sometimes it will be using a string, sometimes using
the associated numeric ids.
When files belong to predefined system groups, this is not a problem,
but builds are most often made by regular users. Recording the
account name or its associated ids might then be a source of
reproducibility issues.
Tar offers a way to specify the user and group owning the file. Using
`root`/`root` and `--numeric-owner` is a safe bet, as it will effectively
record 0 as values:
{% highlight sh %}
$ tar --owner=root --group=root --numeric-owner -cf product.tar build
{% endhighlight %}
Full example
------------
The recommended way to create a Tar archive is thus:
<div class="correct">
{% highlight sh %}
# requires GNU Tar 1.28+
$ tar --sort=name \
--mtime="@${SOURCE_DATE_EPOCH}" \
--owner=root --group=root --numeric-owner \
-cf product.tar build
{% endhighlight %}
</div>
Post-processing
---------------
If tools do not support options to create reproducible archives, it is
always possible to perform post-processing.
[strip-nondeterminism](https://packages.debian.org/sid/strip-nondeterminism)
already has support to normalize Zip and Jar archives. Custom scripts
like Tor Browser's
[re-dzip.sh](https://gitweb.torproject.org/builders/tor-browser-bundle.git/tree/gitian/build-helpers/re-dzip.sh)
might also be an option.
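As a minimal sketch (the file name is hypothetical), normalizing a Jar
archive with strip-nondeterminism can be as simple as:
{% highlight sh %}
# rewrite the archive in place, normalizing timestamps and other metadata
$ strip-nondeterminism product.jar
{% endhighlight %}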
Static libraries
----------------
Static libraries (`.a`) on Unix-like systems are *ar* archives. Like
other archive formats, they contain metadata, namely timestamps, UIDs,
GIDs, and permissions. None are actually required for using them as
libraries.
GNU `ar` and other tools from
[binutils](https://www.gnu.org/software/binutils/) have a *deterministic
mode* which will use zero for UIDs, GIDs, timestamps, and use consistent
file modes for all files. It can be made the default by passing the
`--enable-deterministic-archives` option to `./configure`. It is already
enabled by default for some distributions[^distros-with-default] and so
far it seems to be pretty safe, [except for
Makefiles](https://bugs.debian.org/798804) using targets like
`archive.a(foo.o)`.
When binutils is not built with deterministic archives by default, build
systems have to be changed to pass the right options to `ar` and
friends. With many build systems, `ARFLAGS` can be set to `Dcvr` to turn
on the deterministic mode. Care must also be taken to pass `-D` if
`ranlib` is used to create the function index.
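As a sketch, assuming a build system that honours the conventional
`ARFLAGS` and `RANLIB` variables, this could look like:
{% highlight sh %}
# only a sketch; the variables only take effect if the Makefiles use them
$ make ARFLAGS="Dcvr" RANLIB="ranlib -D"
{% endhighlight %}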
Another option is to do post-processing by using
[strip-nondeterminism](https://packages.debian.org/sid/strip-nondeterminism)
or `objcopy`:
objcopy --enable-deterministic-archives libfoo.a
[^distros-with-default]: Debian since [version 2.25-6](https://tracker.debian.org/news/675691), Ubuntu since version 2.25-8ubuntu1. It is the default for Fedora 22 and Fedora 23, but it seems this will be [reverted in Fedora 24](https://bugzilla.redhat.com/show_bug.cgi?id=1195883).
---
title: Build path
layout: docs
permalink: /docs/build-path/
---
Some tools will record the path of the source files in their output.
Most compilers will write the path of the source files in the debug
information in order to locate the associated sources.
Some tools have flags (like gzip's `-n`) that prevent them from writing
the path in their output. Proposing patches to add a similar feature to
other tools might be easy enough.
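For instance, with gzip (the file name is only an example):
{% highlight sh %}
# -n keeps the original file name and timestamp out of the compressed output
$ gzip -9 -n doc/changelog
{% endhighlight %}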
But in most cases, post-processing will be required to either remove the
build path or normalize it to a predefined value.
For the specific case of [DWARF
symbols](https://en.wikipedia.org/wiki/DWARF), there is currently no good
tool to
change them after a build to a pre-determined value[^debugedit]. A work-around is to
[define the build path as part of the build environment]({{
"/docs/perimeter/" | prepend: site.baseurl }}).
[^debugedit]: [debugedit](https://fedoraproject.org/wiki/Releases/FeatureBuildId) can replace the path used at build time by a predefined one but it will do it by rewriting bytes in place. As this does not reorder the hash table of strings, the resulting bytes will still be different depending on the original build path.
This work-around is also problematic because the build path will also
appear for intermediate source files that other tools generate. As these
typically use [random
file names]({{ "/docs/randomness/" | prepend: site.baseurl }}), having a
fixed build path will not be enough in such cases.
---
title: Building from source
layout: docs
permalink: /docs/build-toolchain-from-source/
---
Building the tools that make up the environment from source is one way to
allow users to reproduce it. Using the source directly makes it easier to
rely on new features, and it works on a variety of platforms. It
might not scale well for a long list of dependencies, though, and asking
users to rebuild GCC for every piece of software they use might make them
slightly unhappy.
What follows are suggestions on how to handle building the compilation
tools from source.
Building using external resources
---------------------------------
The source for the different components can be retrieved from online
repositories. Using release tarballs might be preferable as they are
easier to cache, [mirror, checksum and verify]({{
"/docs/volatile-inputs/" | prepend: site.baseurl }}). When retrieving
the source from a version control system repository, it's best to have a
precise reference to the code version. With Git, using a tag with a
verified signature or a commit hash will work best.
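For example (the repository URL and tag name are placeholders), a build
recipe could pin and verify the toolchain source like this:
{% highlight sh %}
# hypothetical example: pin the toolchain source to a signed tag
$ git clone https://example.org/toolchain.git
$ cd toolchain
$ git verify-tag v1.2.3   # check the tag signature against known keys
$ git checkout v1.2.3
{% endhighlight %}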
The compilation itself can be driven by shell scripts or an extra target
in the project `Makefile`.
Coreboot is a good example. The build documentation mandates running
`make crossgcc` before building Coreboot itself.
Check-in everything
-------------------
Another approach is to check the source of the entire toolchain into the
project's version control system.
This is how several integrated operating systems like *BSD are
developed. “Building the world” starts by building the toolchain kept in
the version control system before building the rest of the system.
It's also how it is done for Google's internal projects. They have
released [Bazel](http://bazel.io/) which is based on their
internal compilation tool. Bazel is designed to drive such large scale
builds with speed and reproducibility in mind.
Outside of fully integrated operating systems or corporate environments,
it might be hard to push the idea of adding a toolchain that is ten or
more times the size of the actual code base…
Ship the toolchain as a build product
-------------------------------------
As it might be hard to ask every user to spend time rebuilding a whole
toolchain, OpenWrt gives a good example of a middle ground: an
“SDK” that can be downloaded alongside the system images and which
contains everything that is needed to build—or rebuild—extra packages.
In that case the SDK becomes another build product, and it has to be
possible to build it reproducibly.
---
title: Buy-in
layout: docs
permalink: /docs/buy-in/
---
Working on reproducible builds might look like a lot of effort for
little gain at first. While [this applies to many types of work related
to
security](https://www.schneier.com/blog/archives/2008/09/security_roi_1.html),
there are already some good arguments and testimonies
on why *reproducible builds* matter.
Resisting attacks
-----------------
In March 2015, The Intercept
[published](https://theintercept.com/2015/03/10/ispy-cia-campaign-steal-apples-secrets/)
from the Snowden leaks the abstract of a talk at an
[internal CIA conference in
2012](https://theintercept.com/document/2015/03/10/tcb-jamboree-2012-invitation/) about
[Strawhorse: Attacking the MacOS and iOS Software Development
Kit](https://theintercept.com/document/2015/03/10/strawhorse-attacking-macos-ios-software-development-kit/).
The abstract clearly explains how unnamed researchers have been creating
modified versions of Xcode that would—without any knowledge of the
developer—watermark or insert spyware in the compiled applications.
A few months later, malware dubbed “XcodeGhost” was found
targeting developers to make them unknowingly distribute malware
embedded in iOS applications. Palo Alto Networks
[describes](http://researchcenter.paloaltonetworks.com/2015/09/novel-malware-xcodeghost-modifies-xcode-infects-apple-ios-apps-and-hits-app-store/) it as:
> XcodeGhost is the first compiler malware in OS X. Its malicious code is
> located in a Mach-O object file that was repackaged into some versions
> of Xcode installers. These malicious installers were then uploaded to
> Baidu’s cloud file sharing service for used by Chinese iOS/OS X
> developers
The purpose of reproducible builds is exactly to resist such attacks.
Recompiling these applications with a clean compiler would have made
the problem easily visible, especially given the size of the added
payload.
As Mike Perry and Seth Schoen explained during [a talk at
31C3](https://media.ccc.de/events/31c3_-_6240_-_en_-_saal_g_-_201412271400_-_reproducible_builds_-_mike_perry_-_seth_schoen_-_hans_steiner)
in December 2014, problematic changes might be more subtle, and a single bit
might be the only thing required to create a remotely exploitable
security hole. Seth Schoen also demonstrated kernel-level
malware that would compromise the source code while it was being read by
the compiler, without leaving any traces on disk. While to the best of
our knowledge such attacks have not been observed in the wild,
<strong>reproducible builds are the only way to detect them
early</strong>.
Quality assurance
-----------------
Regular tests are required to make sure that the software can be built
reproducibly in various environments. Debian and other free software
distributions consider that their users must be able to build the
software they distribute. Such regular tests help to avoid *fail to
build from source* bugs.
Build environments may evolve after a project is no longer receiving
major development. While working on Debian, testing builds in varying
environments identified several high-impact but hard-to-detect bugs.
To give some examples: [a library had a different
application binary interface for every
build](https://bugs.debian.org/773916), [garbled strings due to
encoding mismatch](https://bugs.debian.org/801855), [missing
translations](https://bugs.debian.org/778486), or [changing
dependencies](https://bugs.debian.org/778707).
The constraint of having to reflect on the build environment also
helps developers think about their relationship with external software or
data providers. Relying on external sources with no backup plan might
cause serious trouble in the long term.
Having reproducible builds also allows recreating matching [debug
symbols](https://en.wikipedia.org/wiki/Debugging_data_format) for a
distributed build, which can help when investigating issues in software
used in production.
“But how can I trust my compiler?”
----------------------------------
A common question related to reproducible builds is: how is it possible
to know that the build environment has not been compromised if everyone
is using the same binaries? Or: how can I trust that the compiler I just
built was not compromised by a backdoor in the compiler used to build it?
The latter has been known in the academic literature since the
[Reflections on Trusting
Trust](https://dx.doi.org/10.1145%2F358198.358210) paper from
Ken Thompson published in 1984. It is the paper mentioned in the
description of the “Strawhorse” talk discussed earlier.
The technique known as [Diverse
Double-Compilation](http://www.dwheeler.com/trusting-trust/), formally
defined and researched by David A. Wheeler, can answer this question.
To sum up quickly how it works: taking two compilers, one trusted and
one under test, the compiler under test is built twice,
once with each compiler. Using the compilers created by these builds,
the compiler under test is built again. If the output is the same, then
we have proof that no backdoor has been inserted during the
compilation. For this scheme to work, the output of the final
compilations needs to be the same. And that's exactly where reproducible
builds are useful.
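As a rough sketch (the compiler names are purely hypothetical):
{% highlight sh %}
# build the compiler under test (compiler.c) with a trusted and a suspect compiler
$ trusted-cc -o stage1-trusted compiler.c
$ suspect-cc -o stage1-suspect compiler.c
# rebuild the compiler under test with each stage-1 result
$ ./stage1-trusted -o stage2-a compiler.c
$ ./stage1-suspect -o stage2-b compiler.c
# identical stage-2 outputs rule out a self-reproducing backdoor
$ cmp stage2-a stage2-b
{% endhighlight %}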
---
title: Cryptographic checksums
layout: docs
permalink: /docs/checksums/
---
How can users know that the build they just made has successfully
reproduced the original build?
The easiest way is to make sure that the build outputs are always
byte-for-byte identical. Byte-for-byte comparison is a trivial operation
that can be performed in many different environments.
The other benefit of having identical bytes is that it makes it possible
to use [cryptographic
checksums](https://en.wikipedia.org/wiki/Cryptographic_hash_function).
Such checksums are really tiny compared to the full build products and
are easily exchanged even in very low bandwidth situations.
For example, it makes it possible to build a software release both on a
well-connected (but hard to trust) server and on a laptop behind a bad
mobile connection. The digital signature can be made locally on the
laptop. As the build products will be identical, the signature will be
valid for the files produced on the well-connected server.
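A minimal sketch of such an exchange (the file names are hypothetical):
{% highlight sh %}
# on the well-connected server
$ sha256sum product.tar > product.tar.sha256
# on the laptop, after rebuilding the same product.tar
$ sha256sum --check product.tar.sha256
{% endhighlight %}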
---
title: Definition strategies
layout: docs
permalink: /docs/definition-strategies/
---
There are multiple ways to define the build environment in a way that it
can be distributed. The following methods are not exclusive and multiple
aspects can be combined for a single project. One can, for example,
specify a reference Linux distribution and build a specific compiler
version from source.
Defining the build environment as part of the development process has a
very desirable aspect: changes in the build environment can be vetted
like any other changes. Updating to a new compiler version can be
subject to reviews, automatic testing, and in case things
break, rollback.
{% comment %}
XXX: maybe we want to add examples?
{% endcomment %}
Build from source
-----------------
One way to have users reproduce the tools used to perform the build
is simply to have them start by building the right version of these
tools from source.
Using `make` or any other compilation driver, the required tools will be
downloaded, built, and locally installed before compiling the software.
Like any other [inputs from the network]({{ "/docs/volatile-inputs/" |
prepend: site.baseurl }}), the content of the downloaded archives should
be backed up and verified using cryptographic checksums.
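For instance (the URL and checksum file are placeholders), the driver
could verify a downloaded toolchain tarball before using it:
{% highlight sh %}
# hypothetical sketch: check the tarball against a recorded checksum
$ wget https://example.org/releases/toolchain-1.2.3.tar.gz
$ sha256sum --check toolchain-1.2.3.tar.gz.sha256
{% endhighlight %}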
Reference distribution
----------------------
Using a specific version of a free software distribution is another
viable option for a build environment.
Ideally, it should offer stable releases (like Debian, CentOS, or
FreeBSD) to avoid having constant updates to the documentation or
building scripts.
Recording the exact versions of the installed packages might be helpful
to diagnose issues. Some distributions also keep a complete history
of source packages or binary packages available for later
reinstallation.
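On a Debian-based reference distribution, recording the package list
could be done with something like the following (a minimal sketch):
{% highlight sh %}
# record the exact package versions of the build environment
$ dpkg-query -W -f '${Package} ${Version}\n' > build-environment.txt
{% endhighlight %}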
Virtual machines / containers
-----------------------------
Some aspects of the build environment can be simplified considerably by
using virtual machines or containers. With a virtual machine, the build
can easily be performed in a more controlled environment: the build
user, system hostname, network configuration, and other aspects can be
enforced easily on all systems.
The downside is that it can introduce a lot of software that has to be
trusted somehow. For example, it's currently not possible to install
Debian in a reproducible manner[^reproducible-install]. This makes it
harder to compare different installations.
[^reproducible-install]: Some [preliminary work](https://wiki.debian.org/ReproducibleInstalls) has been done, mainly to identify the issues. Having byte-for-byte identical installations is a requirement to make *live* distributions build in a reproducible manner, so there is some interest by multiple parties in fixing the problem.
---
title: Deterministic build systems
layout: docs
permalink: /docs/deterministic-build-systems/
---
Software cannot easily be built reproducibly if its output varies
depending on factors that are hard or impossible to control, like the
ordering of files on a filesystem or the current time.
Drawing the line
----------------
Which aspects of the build system need to be made deterministic is
deeply linked to what is defined as part of the
[build environment]({{ "/docs/perimeter/" | prepend: site.baseurl }}).
For example, we assume that different versions of a compiler will
produce different output, and so a specific
compiler version is mandated as part of the build environment. The same
assumption does not necessarily hold for simpler tools like `grep`
or `sed`, where the requirement for the environment can be as loose as
“any recent Unix-like system”.
But it's hardly a good idea to mandate that the system pseudo-random
number generator be initialized with a given value before performing a
build, so it is better not to let randomness affect the build output.
Another concrete example of where to draw the line: there is no need to
make the build system give constant output when run from
different build paths if the build path is considered part of the
build environment, thus requiring rebuilds to be performed in the
same directory as the original build.
In a nutshell
-------------
The basics on how to make a build system deterministic can be summarized
as:
* Ensure stable inputs.
* Ensure stable outputs.
* Capture as little as possible from the environment.
What follows is some advice on common issues affecting source
code or build systems that make multiple builds from the exact same
source differ.
Disclaimer
----------
Not all problems currently have solutions. Some tools used
in a build process might require fixes to become deterministic. The
Debian effort keeps a list of [all issues
found](https://reproducible.debian.net/index_issues.html) while
investigating reproducibility problems in its 22,000+ source packages.
While some require changes in the package source itself, others can be
fixed by improving or fixing the tools used to perform the builds.
---
title: Embedded signatures
layout: docs
permalink: /docs/embedded-signatures/
---
Software that is distributed with embedded cryptographic signatures
can pose a challenge for users trying to reproduce identical results:
by definition, they will not be able to generate an identical signature.
This can be solved either by making the signature part of the build
process input or by offering tools to transform the distributed binaries
back into pristine build results.
Pasting signatures
------------------
One way to handle embedded cryptographic signatures is to make the
signature an (optional) input of the build process. When a signature
is available, it just gets copied to the right location.
This enables the following workflow:
1. An initial build is made by the developers who have access to the private key.
2. The build result is signed to an external file.
3. The signature is made part of the released source code.
4. The build that is going to be distributed is made from the latter source.
The `wireless-regdb` package in Debian is an example of [how this can be
implemented](https://sources.debian.net/src/wireless-regdb/latest/debian/rules/).
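As an illustration (the file names are hypothetical and the signing tool
will depend on the project), steps 2 and 4 could look like:
{% highlight sh %}
# step 2: the developer signs the first build result to an external file
$ gpg --detach-sign --output firmware.bin.sig firmware.bin
# step 4: later builds copy the released signature next to the rebuilt file
$ cp signatures/firmware.bin.sig dist/firmware.bin.sig
{% endhighlight %}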
Ignoring signatures
-------------------
A specific comparison tool can be made available that is able to compare
two builds while skipping the signatures. Ideally, it should also be able
to produce cryptographic checksums to make downloading the original build