btrfs: excessive disk writes (amplification)
The idle writes to disk need to be reduced. (Long-running processes can be seen with iotop -a, then pressing the left arrow key twice to sort.)
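As a non-interactive complement to iotop, the cumulative write counter of a device can be sampled before and after an idle period. The sketch below is a minimal example, assuming the Linux /proc/diskstats layout (field 3 is the device name, field 10 the number of 512-byte sectors written). The function name bytes_written and the device name mmcblk0 are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Print the cumulative number of bytes written to a block device.
# usage: bytes_written <device> [diskstats-file]
# Assumption: the file uses the /proc/diskstats layout, where field 3 is the
# device name and field 10 is the count of 512-byte sectors written.
bytes_written() {
    dev=$1
    stats=${2:-/proc/diskstats}
    sectors=$(awk -v dev="$dev" '$3 == dev { print $10 }' "$stats")
    # Fail if the device is not listed.
    [ -n "$sectors" ] || return 1
    echo $(( sectors * 512 ))
}

# Example (hypothetical device name): bytes written during one idle minute.
# before=$(bytes_written mmcblk0); sleep 60; after=$(bytes_written mmcblk0)
# echo "$(( (after - before) / 1024 )) KiB written in 60 s"
```

Unlike iotop, this counts everything that reaches the device, so it shows the total idle write volume but not which process caused it.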
- Mount with noatime (instead of just relatime), to avoid all writing when reading.
- For SSDs, the btrfs mount option nospace_cache (minimal speed penalty for SSDs) seems useful to reduce the "btrfs-transacti" idle-time write overhead to about 350K per minute: better, but still a lot just for no-op idling on sdcards.
- /tmp in RAM using /etc/fstab is universal and allows for more secure mount options:
echo "tmpfs /tmp tmpfs noatime,nosuid,noexec,nodev 0 0" >> /etc/fstab
mkdir -p /var/cache/apt/tmp
echo "tmpfs /var/cache/apt/tmp tmpfs noatime,nosuid,nodev 0 0" >> /etc/fstab
And point apt to it, as an only-root writable tmpfs directory that is not noexec:
echo "APT::ExtractTemplates::TempDir \"/var/cache/apt/tmp\";" > /etc/apt/apt.conf.d/01-tempdir
Alternative via systemd, see https://askubuntu.com/questions/1232004/mounting-tmp-as-tmpfs-on-ubuntu-20-04
cp -v /usr/share/systemd/tmp.mount /etc/systemd/system/ ; systemctl enable tmp.mount
- [ ] /var/cache in RAM? (Is /etc/fstab processed too late?)
echo "tmpfs /var/cache tmpfs noatime,nosuid,noexec,nodev 0 0" >> /etc/fstab
- Keep the fail2ban database in memory only:
echo -e "[Definition]\ndbfile = :memory:\n" > /etc/fail2ban/fail2ban.local
- NetworkManager seems to write frequently to some file(s) in /var/lib/NetworkManager/? Maybe some cache-like file(s) can be moved and symlinked to /var/run/NetworkManager/?
- Changing all the "persistent" lines in /etc/nscd.conf to "no" stops nscd from constantly doing cache writes, as seen in iotop.
- periodic slapd writes
- periodic python3 ... .sshd writes
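Since the echo ... >> /etc/fstab lines in the list above append a duplicate entry each time they are re-run, a guarded append may be safer in an install script. This is a minimal sketch, assuming a POSIX shell and awk; the helper name add_fstab_entry is hypothetical and not part of any existing tool.

```shell
#!/bin/sh
# Append an fstab entry unless a line for the same mount point already exists.
# usage: add_fstab_entry "<full fstab line>" [fstab-file]
add_fstab_entry() {
    entry=$1
    fstab=${2:-/etc/fstab}
    # The mount point is the second whitespace-separated field of the entry.
    mountpoint=$(printf '%s\n' "$entry" | awk '{ print $2 }')
    # Ignore comment lines and compare the mount point field exactly.
    if awk -v mp="$mountpoint" \
        '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' "$fstab"
    then
        echo "skipped: $mountpoint already listed in $fstab" >&2
    else
        printf '%s\n' "$entry" >> "$fstab"
    fi
}

# The two entries from the list above (run as root):
# add_fstab_entry "tmpfs /tmp tmpfs noatime,nosuid,noexec,nodev 0 0"
# mkdir -p /var/cache/apt/tmp
# add_fstab_entry "tmpfs /var/cache/apt/tmp tmpfs noatime,nosuid,nodev 0 0"
```

The check matches on the mount point only, so an existing /tmp entry with different options is left untouched rather than rewritten.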
Further conclusion from the discussion and experiments below:
Considering the use of sdcards (btrfs 30x write amplification + the device internal write amplification), I think it's advisable to have sdcard images with root on F2FS, and support for creating a separate btrfs filesystem:
- To hold just (central) important /home user-data (external disk).
- And optionally to also move the rootfs of the installed system from the F2FS into a btrfs subvolume on the created btrfs partition (adding nospace_cache on SSDs).
Some BTRFS workarounds:
- The process "btrfs-transacti" bursts out several MBs per minute even when idling during a day, as can be seen with iotop (pressing a for accumulation, and 2x <- to sort).
See also https://superuser.com/questions/1211324/btrfs-transacti-writes-to-disk-every-30-seconds
As auto-defragmentation stays disabled by default these days, that answer only suggests it could be due to the copy-on-write /var/log directory.
Possible fixes/improvements:
- Enable btrfs compression to reduce write-volume in general?
- Disabling all (see below) snapshots in plinth (and deleting existing ones) does reduce the hourly writes considerably.
Boot snapshots workaround:
systemctl disable snapper-boot.timer
(#2037)
- Set up chattr +C on the /var/log directory or subvolume, before the creation or copying of any log files (some move, re-create, copy-back pivot is required (#2034 (comment 226841)) to convert preexisting logfiles on upgrades).
  - Note, /var/log/journal already ships with the C attribute set. Converting the entire /var/log (the classic log files) did not seem to reduce the idle overhead writes. => So, is it not worth dropping COW, compression and checksumming for classic text log files?
- Also move other parts of the system (like the user-data) into separate subvolumes, to reduce the system snapshot size, and have these parts snapshotted separately from the system installs and rollbacks.
- Possibly mount with option commit=600 to flush all data to the disk only every 10 minutes (/etc/fstab). (To aggregate fluctuating writes, and write larger chunks.) However, not well suited for user data like /home.
- Related: Similar fedora issue: https://pagure.io/fedora-btrfs/project/issue/36
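The move, re-create, copy-back pivot for chattr +C mentioned in the list above could look roughly like the following. This is a sketch under assumptions, not the procedure from #2034: it assumes GNU coreutils (for --reference and --reflink=never), that the directory is a plain directory rather than a mounted subvolume (which cannot simply be renamed), and that services writing to it are stopped first. The function name make_nocow_dir and the temporary suffix .cow-old are hypothetical.

```shell
#!/bin/sh
# Re-create a directory with the No_COW attribute set and copy its content
# back, so that the existing files are written anew below a +C directory.
# usage: make_nocow_dir /var/log    (as root, with logging services stopped)
make_nocow_dir() {
    dir=${1%/}
    old="$dir.cow-old"    # hypothetical temporary name for the old directory
    [ -d "$dir" ] && [ ! -e "$old" ] || return 1
    mv "$dir" "$old" || return 1
    mkdir "$dir" || return 1
    # Carry over owner and mode of the original directory.
    chown --reference="$old" "$dir"
    chmod --reference="$old" "$dir"
    # New files and subdirectories created below inherit the attribute.
    chattr +C "$dir" || return 1
    # Force a real data copy rather than a reflink to the old extents.
    cp -a --reflink=never "$old/." "$dir/" || return 1
    rm -rf "$old"
}
```

Afterwards, lsattr -d on the directory should list the C flag. Given the observation above that converting /var/log did not reduce the idle writes, this is mainly useful for repeating the experiment.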