d/tests: Add initial autopkgtests.
The network-manager
packages has been running several DEP-8 integration tests on https://autopkgtest.ubuntu.com/packages/network-manager for many years. Most of them require a virtual machine environment, due to loading the mac80211_hwsim
kernel module for simulating wireless interfaces.
I took the time to refactor those tests and split them into smaller, isolated classes for improved code-reusability and allowing some of the test cases to be executed without the need for special kernel modules.
For now, it executes the integration tests for wired interfaces, using veth
devices. So this merge request copies the relevant test code for wired interface testing of NetworkManager in Ubuntu. This is probably still better than not having any DEP-8 tests.
Merge request reports
Activity
The SalsaCI pipeline passed, including the new autopkgtests, so I'm removing the "Draft" status. @biebl What do you think?
autopkgtest [17:15:53]: test nm-eth.py: ./debian/tests/prep-testbed.sh && python3 debian/tests/nm-eth.py autopkgtest [17:15:53]: test nm-eth.py: [----------------------- + dpkg-vendor --is Debian + mount -o remount,rw /sys + systemctl unmask systemd-udevd.service Removed '/etc/systemd/system/systemd-udevd.service'. + systemctl start systemd-udevd.service test_auto_ip4 (__main__.ColdplugEthernet.test_auto_ip4) ethernet: auto-connection, IPv4 ... .ok test_auto_ip6_dhcp (__main__.ColdplugEthernet.test_auto_ip6_dhcp) ethernet: auto-connection, IPv6 with DHCP ... .ok test_auto_ip6_raonly_no_pe (__main__.ColdplugEthernet.test_auto_ip6_raonly_no_pe) ethernet: auto-connection, IPv6 with only RA, PE disabled ... .ok test_manual_ip4 (__main__.ColdplugEthernet.test_manual_ip4) ethernet: manual connection, IPv4 ... .....ok test_manual_ip6_raonly_pubaddr (__main__.ColdplugEthernet.test_manual_ip6_raonly_pubaddr) ethernet: manual connection, IPv6 with only RA, preferring public address ... .....ok test_manual_ip6_raonly_tmpaddr (__main__.ColdplugEthernet.test_manual_ip6_raonly_tmpaddr) ethernet: manual connection, IPv6 with only RA, preferring temp address ... ....ok test_auto_detect_eth (__main__.HotplugEthernet.test_auto_detect_eth) new eth router is being detected automatically within 30s ... .....ok ---------------------------------------------------------------------- Ran 7 tests in 1111.328s OK autopkgtest [17:34:25]: test nm-eth.py: -----------------------] autopkgtest [17:34:25]: test nm-eth.py: - - - - - - - - - - results - - - - - - - - - - nm-eth.py PASS autopkgtest [17:34:25]: @@@@@@@@@@@@@@@@@@@@ summary nm-eth.py PASS
- Resolved by Lukas Märdian
- Resolved by Lukas Märdian
Ran 7 tests in 1111.328s
Any idea why this is so slow?
- Resolved by Lukas Märdian
- Resolved by Lukas Märdian
Thank you for your comments, I'll look into them in detail and get back to you. Especially the slowness of tests in LXC; I was observing that, too. Using an Ubuntu guest in a LXD runner is quick, same with qemu runner. I will test Ubuntu guest in LXC and Debian guest in LXD to minimize the problem space.
added 15 commits
-
91015f38...5954229e - 14 commits from branch
utopia-team:debian/master
- f66ecc65 - d/tests: Add initial autopkgtests.
-
91015f38...5954229e - 14 commits from branch
mentioned in commit slyon/network-manager@3eaeb247
added 4 commits
Toggle commit list@biebl The issues should now be addressed. PTAL.
43 43 return wrapped 44 44 45 45 def wait_nm_online(): 46 tries = 5 47 while tries > 0 and subprocess.call(['nm-online', '-qs']) != 0: 48 time.sleep(1) 49 tries = tries - 1 46 # XXX: NetworkManager does not reach state "started" in LXC, but stays 47 # stuck in "startup-pending", when checked via 'nm-online -s' 48 # even though "nm-online" clearly logs "online" and journalctl logs: 49 # "NetworkManager[1805]: <info> [...] manager: startup complete" 50 # Check if NM is already online, if not: wait for startup to complete Hm, this
startup-pending
issue is really weird. Would be great to find out why that is.I also notice a rather puzzling behaviour of
nm-online
: If I start NetworkManager inside a VM or container where say eth0 is managed by ifupdown (and up), nmcli says that the device is unmanaged, but nm-online saysConnecting............... 30s [online]
So I wonder if by running
nm-online
we actually ensure that the eth42 device is actually online.Hm, this
startup-pending
issue is really weird. Would be great to find out why that is.Would you be willing to raise this upstream (so we also have something to cross-reference)?
Edited by Michael BieblPlease try this patch:
diff --git a/debian/tests/base.py b/debian/tests/base.py index f64310f0d6..5758710193 100644 --- a/debian/tests/base.py +++ b/debian/tests/base.py @@ -342,6 +342,7 @@ class NetworkTestBase(unittest.TestCase): "/etc/NetworkManager", "/var/lib/NetworkManager", "/run/NetworkManager", + "/run/network", "/etc/netplan", ]: if os.path.exists(d):
It will overmount
/run/network/ifstate
and boom,nm-online
no longer works. It will timeout and thennm-online -s
is run, which also times out.So there is a more general underlying issue here which was only masked by that aweful old Debian patch of mine.
So, the "/run/network" overmount is something we should do to avoid any interference from an existing ifupdown setup. Would be great if you can update the MR accordingly.
You can btw. also reproduce this problem with an LXC/plucky container which does not have ifupdown. There, both
nm-online
andnm-online -s
will timeout.So, I guess this particular hunk of the diff should be reverted again, as it doesn't actually work.
@mpitt do you remember why you retried
nm-online -q
up-to 5 times. That appears rather excessive especially with the default 30s timeout. What do you both think aboutretries=3
andtimeout=10s
? This should still be plenty even for loaded servers.changed this line in version 4 of the diff
IMO 3*10 sec should be plenty enough. And I pushed a change reflecting that. Also, do you think we should additionally overmount
/etc/network
(in addition to/run/network
) to avoid any influence from and accidental modification to ifupdown?I think it would be good to file an upstream NM bug about this
nm-online -s
behaviour. But I need to figure out a clean reproducer, probably with theForce-online-state-with-unmanaged-devices.patch
dropped, to make a case for the upstream developers.A few more observations: If I start a sid lxc container, install network-manager and let it manage the provided
eth0
interface, bothnm-online
andnm-online -s
work as expected. As soon as I remount /sys rw, start udevd (*) and restart NetworkManager, bothnm-online
andnm-online -s
no longer work.Actually, simply remounting /sys rw is enough to trigger this failure.
It appears as if remounting /sys rw inside the container does not actually give a fully functioning /sys.
(*) as the prepare script does.
It appears as if remounting /sys rw inside the container does not actually give a fully functioning /sys.
A
udevadm trigger
yields a lot ofPermission denied
errors. Maybe we have to loop in someone like @stgraber to know if running systemd-udevd inside an lxc container is actually possible and if so, how it can be done.A few more observations: If I start a sid lxc container, install network-manager and let it manage the provided
eth0
interface, bothnm-online
andnm-online -s
work as expected. As soon as I remount /sys rw, start udevd (*) and restart NetworkManager, bothnm-online
andnm-online -s
no longer work.Actually, simply remounting /sys rw is enough to trigger this failure.
It appears as if remounting /sys rw inside the container does not actually give a fully functioning /sys.
That's a very good point! It explains why this only happens in LXC environments (independent of the Distro). I know that Ubuntu's systemd package contains quite some quirks to make
/sys
work OK inside LXD wrapped LXC containers. So that's why it seems to work in Ubuntu/LXD (as used in Ubuntu's armhf autopkgtests) but not properly in Debian/LXC (as used in DebCI).I'll try to see if we can get rid of the udev dependencies in the test code.. It is mostly used to ignore certain interfaces, like the veth router. Marking them unmanaged by NM, which we can probably also do through normal NM configuration.
Right, we only seem to require to set the
NM_UNMANAGED=0
property. Commenting out theudevadm control --reload
call and prep-testbed.sh fromTest-Command:
it seems to work mostly. I've also dropped the-q
from the nm-online call, to see how long it takes to reach the started state. Still a couple of errors in there, see the attached autopkgtest log log.txtEdited by Michael BieblThe same log with qemu. There
eth0
is marked as unmanaged log.txtIn this log it seems as if NetworkManager fiddles with the
eth0
interface (the one that should be managed by the system/ifupdown)Yeah, that's what we need to avoid, as it has the potential to break the DebCI test runners (been there with Netplan DEP8 tests before), should the guest become unreachable. But we might be able to put something like
unmanaged-devices=interface-name:eth0,interface-name:en*
when writing theNetworkManager.conf
inside the tests.https://networkmanager.dev/docs/api/latest/NetworkManager.conf.html#device-spec
base.py already generates a blacklist for this. I wonder why this works in the qemu case but not in the lxc case.
# down; we also blacklist any existing real interface to avoid # interfering with it, and for getting predictable results
Edited by Michael Biebldo you remember why you retried nm-online -q up-to 5 times. That appears rather excessive especially with the default 30s timeout.
@biebl absolutely, yes -- I clearly wrote this with the expectation that
nm-online
wouldn't wait at all, and just immediately respond with yes or no. That was the reason for the loop, which amounted to a total timeout of 5s (which seems reasonable for a test where everything is local, in particular the DHCP responder). Perhaps thenm-online
behaviour changed over the years to grow a timeout, or perhaps it was always like that and I just screwed it up back then -- either way, a single invocation with a 10s timeout, or (if that is somehow unstable) a loop of 5 with an 1s timeout feel appropriate. Thanks for spotting this!base.py already generates a blacklist for this. I wonder why this works in the qemu case but not in the lxc case.
haha, it appears this feature (unmanaged-devices) requires udev. So we are back to where we started :-)
So I see three options:
- Either we get systemd-udevd and /sys working properly inside an lxc container
- We get clarification from upstream that NM is supposed to work properly (inside a container) without a running udev and/or we identify missing features (such as unmanaged-devices) and find proper workarounds
- We are calling it quits and acknowledge that running NM inside a container is too much work and we upgrade the test to isolation-machine
Edited by Michael Biebl@slyon I'm a bit undecided on how to proceed. It appears NM without udev only sort-of works and udev under lxc also only sort-of works.
I would love to have some basic autopkgtests that can run under lxc i.e. all of debci but my gut feeling is, that if it's too much work to get working properly, a qemu based test (which only runs on amd64) is still better than having no autopkgtests.
@biebl Thank you for staying on top of this! I agree with your statement that "NM without udev only sort-of works and udev under lxc also only sort-of works." And that it's probably too much effort working out all the kinks between udev, LXC and NM upstream here.
So switching to
isolation-machine
sounds like a reasonable approach to at least get some coverage on DebCI/amd64 (sadly, no SalsaCI). And it would bring the additional benefit that we could port & execute more DEP-8 tests from Ubuntu, which mostly requireisolation-machine
anyway: https://git.launchpad.net/ubuntu/+source/network-manager/tree/debian/tests/controlHow can we request a qemu/amd64 runner for DebCI?
How can we request a qemu/amd64 runner for DebCI?
We need to poke @elbrus on #debci.
Edited by Michael BieblAnd it would bring the additional benefit that we could port & execute more DEP-8 tests from Ubuntu, which mostly require
isolation-machine
anyway: https://git.launchpad.net/ubuntu/+source/network-manager/tree/debian/tests/controlLooking forward to that. Keep those MRs coming!
added 1 commit
- e7ea39e9 - d/t/control: switch nm-eth.py to isolation-machine
- Resolved by Lukas Märdian
Can you rebase the MR without the intermediate commits such as
- d/t/base.py: Work around unusual nm-online waiting behaviour in LXC
- Revert "d/t/base.py: Work around unusual nm-online waiting behaviour in LXC" i.e. stuff that tried to make it work under lxc.
Instead I would update the commit message accordingly why isolation-machine is necessary.
Edited by Michael Biebl