Skip to content
Snippets Groups Projects

d/tests: Add initial autopkgtests.

Merged Lukas Märdian requested to merge slyon/network-manager:dep8-veth into debian/master

The network-manager packages has been running several DEP-8 integration tests on https://autopkgtest.ubuntu.com/packages/network-manager for many years. Most of them require a virtual machine environment, due to loading the mac80211_hwsim kernel module for simulating wireless interfaces.

I took the time to refactor those tests and split them into smaller, isolated classes for improved code-reusability and allowing some of the test cases to be executed without the need for special kernel modules.

For now, it executes the integration tests for wired interfaces, using veth devices. So this merge request copies the relevant test code for wired interface testing of NetworkManager in Ubuntu. This is probably still better than not having any DEP-8 tests.

Edited by Lukas Märdian

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Author Contributor

    The SalsaCI pipeline passed, including the new autopkgtests, so I'm removing the "Draft" status. @biebl What do you think?

    autopkgtest [17:15:53]: test nm-eth.py: ./debian/tests/prep-testbed.sh && python3 debian/tests/nm-eth.py
    autopkgtest [17:15:53]: test nm-eth.py: [-----------------------
    + dpkg-vendor --is Debian
    + mount -o remount,rw /sys
    + systemctl unmask systemd-udevd.service
    Removed '/etc/systemd/system/systemd-udevd.service'.
    + systemctl start systemd-udevd.service
    test_auto_ip4 (__main__.ColdplugEthernet.test_auto_ip4)
    ethernet: auto-connection, IPv4 ... .ok
    test_auto_ip6_dhcp (__main__.ColdplugEthernet.test_auto_ip6_dhcp)
    ethernet: auto-connection, IPv6 with DHCP ... .ok
    test_auto_ip6_raonly_no_pe (__main__.ColdplugEthernet.test_auto_ip6_raonly_no_pe)
    ethernet: auto-connection, IPv6 with only RA, PE disabled ... .ok
    test_manual_ip4 (__main__.ColdplugEthernet.test_manual_ip4)
    ethernet: manual connection, IPv4 ... .....ok
    test_manual_ip6_raonly_pubaddr (__main__.ColdplugEthernet.test_manual_ip6_raonly_pubaddr)
    ethernet: manual connection, IPv6 with only RA, preferring public address ... .....ok
    test_manual_ip6_raonly_tmpaddr (__main__.ColdplugEthernet.test_manual_ip6_raonly_tmpaddr)
    ethernet: manual connection, IPv6 with only RA, preferring temp address ... ....ok
    test_auto_detect_eth (__main__.HotplugEthernet.test_auto_detect_eth)
    new eth router is being detected automatically within 30s ... .....ok
    ----------------------------------------------------------------------
    Ran 7 tests in 1111.328s
    OK
    autopkgtest [17:34:25]: test nm-eth.py: -----------------------]
    autopkgtest [17:34:25]: test nm-eth.py:  - - - - - - - - - - results - - - - - - - - - -
    nm-eth.py            PASS
    autopkgtest [17:34:25]: @@@@@@@@@@@@@@@@@@@@ summary
    nm-eth.py            PASS
  • Lukas Märdian marked this merge request as ready

    marked this merge request as ready

  • Thanks a lot, this is really nice.

    Btw, we can ask for a qemu runner on debci (at least on x86-64).

  • Michael Biebl
  • Btw, we can ask for a qemu runner on debci (at least on x86-64).

    That said, I do think it's useful to start with those tests that can run inside a container, as this will cover more architectures and also salsa-ci.

  • Michael Biebl
    • Author Contributor
      Resolved by Lukas Märdian

      Thank you for your comments, I'll look into them in detail and get back to you. Especially the slowness of tests in LXC; I was observing that, too. Using an Ubuntu guest in a LXD runner is quick, same with qemu runner. I will test Ubuntu guest in LXC and Debian guest in LXD to minimize the problem space.

  • Michael Biebl added 15 commits

    added 15 commits

    Compare with previous version

  • Lukas Märdian added 4 commits

    added 4 commits

    • 772b446c - d/tests: Add initial autopkgtests.
    • 3eaeb247 - d/t/base.py: Work around unusual nm-online waiting behaviour in LXC
    • 8496b96e - d/t/base.py: Install backup mgmt network in /run/systemd/network/
    • 325b36a1 - d/t/prep-testbed.sh: Check actual systemd-udevd.service condition

    Compare with previous version

  • Lukas Märdian resolved all threads

    resolved all threads

  • Lukas Märdian resolved all threads

    resolved all threads

  • Author Contributor

    @biebl The issues should now be addressed. PTAL.

  • 43 43 return wrapped
    44 44
    45 45 def wait_nm_online():
    46 tries = 5
    47 while tries > 0 and subprocess.call(['nm-online', '-qs']) != 0:
    48 time.sleep(1)
    49 tries = tries - 1
    46 # XXX: NetworkManager does not reach state "started" in LXC, but stays
    47 # stuck in "startup-pending", when checked via 'nm-online -s'
    48 # even though "nm-online" clearly logs "online" and journalctl logs:
    49 # "NetworkManager[1805]: <info> [...] manager: startup complete"
    50 # Check if NM is already online, if not: wait for startup to complete
    • Hm, this startup-pending issue is really weird. Would be great to find out why that is.

      I also notice a rather puzzling behaviour of nm-online: If I start NetworkManager inside a VM or container where say eth0 is managed by ifupdown (and up), nmcli says that the device is unmanaged, but nm-online says

      Connecting...............   30s [online]

      So I wonder if by running nm-online we actually ensure that the eth42 device is actually online.

    • I wonder if this patch has the potential to break the test suite as we might not actually wait until the connection that we configured, is online.

    • Hm, this startup-pending issue is really weird. Would be great to find out why that is.

      Would you be willing to raise this upstream (so we also have something to cross-reference)?

      Edited by Michael Biebl
    • Please try this patch:

      diff --git a/debian/tests/base.py b/debian/tests/base.py
      index f64310f0d6..5758710193 100644
      --- a/debian/tests/base.py
      +++ b/debian/tests/base.py
      @@ -342,6 +342,7 @@ class NetworkTestBase(unittest.TestCase):
                   "/etc/NetworkManager",
                   "/var/lib/NetworkManager",
                   "/run/NetworkManager",
      +            "/run/network",
                   "/etc/netplan",
               ]:
                   if os.path.exists(d):

      It will overmount /run/network/ifstate and boom, nm-online no longer works. It will timeout and then nm-online -s is run, which also times out.

      So there is a more general underlying issue here which was only masked by that aweful old Debian patch of mine.

    • So, the "/run/network" overmount is something we should do to avoid any interference from an existing ifupdown setup. Would be great if you can update the MR accordingly.

      You can btw. also reproduce this problem with an LXC/plucky container which does not have ifupdown. There, both nm-online and nm-online -s will timeout.

      So, I guess this particular hunk of the diff should be reverted again, as it doesn't actually work.

      @mpitt do you remember why you retried nm-online -q up-to 5 times. That appears rather excessive especially with the default 30s timeout. What do you both think about retries=3 and timeout=10s? This should still be plenty even for loaded servers.

    • Lukas Märdian changed this line in version 4 of the diff

      changed this line in version 4 of the diff

    • Author Contributor

      IMO 3*10 sec should be plenty enough. And I pushed a change reflecting that. Also, do you think we should additionally overmount /etc/network (in addition to /run/network) to avoid any influence from and accidental modification to ifupdown?

      I think it would be good to file an upstream NM bug about this nm-online -s behaviour. But I need to figure out a clean reproducer, probably with the Force-online-state-with-unmanaged-devices.patch dropped, to make a case for the upstream developers.

    • A few more observations: If I start a sid lxc container, install network-manager and let it manage the provided eth0 interface, both nm-online and nm-online -s work as expected. As soon as I remount /sys rw, start udevd (*) and restart NetworkManager, both nm-online and nm-online -s no longer work.

      Actually, simply remounting /sys rw is enough to trigger this failure.

      It appears as if remounting /sys rw inside the container does not actually give a fully functioning /sys.

      (*) as the prepare script does.

    • It appears as if remounting /sys rw inside the container does not actually give a fully functioning /sys.

      A udevadm trigger yields a lot of Permission denied errors. Maybe we have to loop in someone like @stgraber to know if running systemd-udevd inside an lxc container is actually possible and if so, how it can be done.

    • Also, do you think we should additionally overmount /etc/network

      The test suite does not enable the ifupdown plugin, so this should not strictly be necessary.

    • Author Contributor

      A few more observations: If I start a sid lxc container, install network-manager and let it manage the provided eth0 interface, both nm-online and nm-online -s work as expected. As soon as I remount /sys rw, start udevd (*) and restart NetworkManager, both nm-online and nm-online -s no longer work.

      Actually, simply remounting /sys rw is enough to trigger this failure.

      It appears as if remounting /sys rw inside the container does not actually give a fully functioning /sys.

      That's a very good point! It explains why this only happens in LXC environments (independent of the Distro). I know that Ubuntu's systemd package contains quite some quirks to make /sys work OK inside LXD wrapped LXC containers. So that's why it seems to work in Ubuntu/LXD (as used in Ubuntu's armhf autopkgtests) but not properly in Debian/LXC (as used in DebCI).

      I'll try to see if we can get rid of the udev dependencies in the test code.. It is mostly used to ignore certain interfaces, like the veth router. Marking them unmanaged by NM, which we can probably also do through normal NM configuration.

    • Right, we only seem to require to set the NM_UNMANAGED=0 property. Commenting out the udevadm control --reload call and prep-testbed.sh from Test-Command: it seems to work mostly. I've also dropped the -q from the nm-online call, to see how long it takes to reach the started state. Still a couple of errors in there, see the attached autopkgtest log log.txt

      Edited by Michael Biebl
    • In this log it seems as if NetworkManager fiddles with the eth0 interface (the one that should be managed by the system/ifupdown)

    • The same log with qemu. There eth0 is marked as unmanaged log.txt

    • Author Contributor

      In this log it seems as if NetworkManager fiddles with the eth0 interface (the one that should be managed by the system/ifupdown)

      Yeah, that's what we need to avoid, as it has the potential to break the DebCI test runners (been there with Netplan DEP8 tests before), should the guest become unreachable. But we might be able to put something like unmanaged-devices=interface-name:eth0,interface-name:en* when writing the NetworkManager.conf inside the tests.

      https://networkmanager.dev/docs/api/latest/NetworkManager.conf.html#device-spec

    • base.py already generates a blacklist for this. I wonder why this works in the qemu case but not in the lxc case.

        # down; we also blacklist any existing real interface to avoid
        # interfering with it, and for getting predictable results
      Edited by Michael Biebl
    • do you remember why you retried nm-online -q up-to 5 times. That appears rather excessive especially with the default 30s timeout.

      @biebl absolutely, yes -- I clearly wrote this with the expectation that nm-online wouldn't wait at all, and just immediately respond with yes or no. That was the reason for the loop, which amounted to a total timeout of 5s (which seems reasonable for a test where everything is local, in particular the DHCP responder). Perhaps the nm-online behaviour changed over the years to grow a timeout, or perhaps it was always like that and I just screwed it up back then -- either way, a single invocation with a 10s timeout, or (if that is somehow unstable) a loop of 5 with an 1s timeout feel appropriate. Thanks for spotting this!

    • base.py already generates a blacklist for this. I wonder why this works in the qemu case but not in the lxc case.

      haha, it appears this feature (unmanaged-devices) requires udev. So we are back to where we started :-)

      So I see three options:

      1. Either we get systemd-udevd and /sys working properly inside an lxc container
      2. We get clarification from upstream that NM is supposed to work properly (inside a container) without a running udev and/or we identify missing features (such as unmanaged-devices) and find proper workarounds
      3. We are calling it quits and acknowledge that running NM inside a container is too much work and we upgrade the test to isolation-machine
      Edited by Michael Biebl
    • unmanaged-devices=interface-name:eth0

      Hm, that seem to be working without a running udev. unmanaged-devices=mac:XXX does not.

    • Hm, that seem to be working without a running udev. unmanaged-devices=mac:XXX does not.

      I take that back. I hard-coded interface-name:eth0 into the blacklist, but the device is still managed by NM when running the autopkgtest.

    • @slyon I'm a bit undecided on how to proceed. It appears NM without udev only sort-of works and udev under lxc also only sort-of works.

      I would love to have some basic autopkgtests that can run under lxc i.e. all of debci but my gut feeling is, that if it's too much work to get working properly, a qemu based test (which only runs on amd64) is still better than having no autopkgtests.

    • Author Contributor

      @biebl Thank you for staying on top of this! I agree with your statement that "NM without udev only sort-of works and udev under lxc also only sort-of works." And that it's probably too much effort working out all the kinks between udev, LXC and NM upstream here.

      So switching to isolation-machine sounds like a reasonable approach to at least get some coverage on DebCI/amd64 (sadly, no SalsaCI). And it would bring the additional benefit that we could port & execute more DEP-8 tests from Ubuntu, which mostly require isolation-machine anyway: https://git.launchpad.net/ubuntu/+source/network-manager/tree/debian/tests/control

      How can we request a qemu/amd64 runner for DebCI?

    • How can we request a qemu/amd64 runner for DebCI?

      We need to poke @elbrus on #debci.

      Edited by Michael Biebl
    • And it would bring the additional benefit that we could port & execute more DEP-8 tests from Ubuntu, which mostly require isolation-machine anyway: https://git.launchpad.net/ubuntu/+source/network-manager/tree/debian/tests/control

      Looking forward to that. Keep those MRs coming!

    • Please register or sign in to reply
  • Lukas Märdian added 3 commits

    added 3 commits

    • fb02f23e - d/t/base.py: Overmount /run/network/ to isolate tests from ifupdown & ifstate
    • e41c7e5f - Revert "d/t/base.py: Work around unusual nm-online waiting behaviour in LXC"
    • b109761b - d/t/base.py: decrease huge nm-online timeouts

    Compare with previous version

  • Lukas Märdian added 1 commit

    added 1 commit

    • e7ea39e9 - d/t/control: switch nm-eth.py to isolation-machine

    Compare with previous version

  • Michael Biebl
  • Can you rebase the MR without the intermediate commits such as

    • d/t/base.py: Work around unusual nm-online waiting behaviour in LXC
    • Revert "d/t/base.py: Work around unusual nm-online waiting behaviour in LXC" i.e. stuff that tried to make it work under lxc.

    Instead I would update the commit message accordingly why isolation-machine is necessary.

    Edited by Michael Biebl
  • I.e. a nice, clean patch series without our failed attempts to make it work under LXC

  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Please register or sign in to reply
    Loading