1. 26 Jul, 2022 1 commit
  2. 06 Mar, 2022 2 commits
  3. 02 Feb, 2022 2 commits
  4. 22 Nov, 2021 4 commits
  5. 10 Nov, 2021 1 commit
  6. 03 Nov, 2021 1 commit
    • totemsrp: Switch totempg buffers at the right time · e7a82370
      Jan Friesse authored
      Commit 92e0f9c7 added switching of
      totempg buffers in the sync phase. But because the buffers were
      switched too early, there was a problem when delivering recovered
      messages (messages got corrupted and/or lost). The solution is to
      switch the buffers only after the recovered messages have been
      delivered.
      
      I think it is worth describing the complete history with reproducers
      so it doesn't get lost.
      
      It all started with 40263892 (more info
      about the original problem is described in
      https://bugzilla.redhat.com/show_bug.cgi?id=820821). That patch
      solved a problem which can be reproduced with the following reproducer:
      - 2 nodes
      - Both nodes running corosync and testcpg
      - Pause node 1 (SIGSTOP of corosync)
      - On node 1, send some messages with testcpg
        (it gets no answers, but that doesn't matter; simply hitting the
        ENTER key a few times is enough)
      - Wait till node 2 detects that node 1 left
      - Unpause node 1 (SIGCONT of corosync)
      
      On node 1, newly mcasted cpg messages then got sent before the sync
      barrier, so node 2 logs "Unknown node -> we will not deliver message".
      
      The solution was to add switching of the totemsrp new messages buffer.
      
      This patch was not enough, so a new one
      (92e0f9c7) was created. The reproducer of
      the problem was similar, just with cpgverify used instead of testcpg.
      Occasionally, when node 1 was unpaused, it hung in the sync phase
      because there was a partial message in the totempg buffers. The new
      sync message had a different fragment continuation, so it was thrown
      away and never delivered.
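      The fragment-continuation failure can be illustrated with a toy
      reassembler. This is a hedged sketch, not corosync's actual code:
      the function name, the (frag_id, is_last, payload) tuple shape, and
      the drop-on-mismatch rule are illustrative assumptions.

      ```python
      # Toy sketch of the fragment-continuation problem described above.
      # NOT corosync code: names and data shapes are illustrative.

      def reassemble(fragments):
          """Deliver messages whose fragments all share one continuation id.

          A fragment whose id differs from the partial message already in
          the buffer is thrown away -- so a stale partial fragment left in
          an un-switched buffer silently kills every later message until
          the buffer is reset.
          """
          delivered = []
          buffer = []        # payload pieces of the message being assembled
          current_id = None  # continuation id of the partial message
          for frag_id, is_last, payload in fragments:
              if current_id is None:
                  current_id = frag_id
              if frag_id != current_id:
                  continue  # continuation mismatch: fragment discarded
              buffer.append(payload)
              if is_last:
                  delivered.append(b"".join(buffer))
                  buffer, current_id = [], None
          return delivered

      # Clean stream: one message in two fragments is delivered.
      assert reassemble([(8, False, b"he"), (8, True, b"llo")]) == [b"hello"]

      # A stale partial fragment (id 7) blocks the following message (id 8):
      # nothing is ever delivered.
      assert reassemble([(7, False, b"stale"),
                         (8, False, b"he"), (8, True, b"llo")]) == []
      ```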
      
      After many years, the problem fixed by this patch was found (the
      original issue is described in
      https://github.com/corosync/corosync/issues/660).
      The reproducer is more complex:
      - 2 nodes
      - Node 1 is rate-limited (used script on the hypervisor side):
        ```
        iface=tapXXXX
        # ~0.1MB/s in bit/s
        rate=838856
        # 1mb/s
        burst=1048576
        tc qdisc add dev $iface root handle 1: htb default 1
        tc class add dev $iface parent 1: classid 1:1 htb rate ${rate}bps \
          burst ${burst}b
        tc qdisc add dev $iface handle ffff: ingress
        tc filter add dev $iface parent ffff: prio 50 basic police rate \
          ${rate}bps burst ${burst}b mtu 64kb "drop"
        ```
      - Node 2 is running corosync and cpgverify
      - Node 1 keeps restarting corosync and running cpgverify in a cycle:
        - Console 1: while true; do corosync; sleep 20; \
            kill $(pidof corosync); sleep 20; done
        - Console 2: while true; do ./cpgverify; done
      
      From time to time (usually reproduced in less than 5 minutes)
      cpgverify reports a corrupted message.
      
      Signed-off-by: Jan Friesse <jfriesse@redhat.com>
      Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
  7. 25 Oct, 2021 1 commit
  8. 18 Oct, 2021 1 commit
  9. 02 Oct, 2021 1 commit
  10. 01 Oct, 2021 2 commits
  11. 30 Sep, 2021 7 commits
  12. 13 Sep, 2021 1 commit
  13. 20 Aug, 2021 1 commit
    • totem: Add cancel_hold_on_retransmit config option · cdf72925
      Jan Friesse authored
      
      
      Previously, the existence of retransmit messages canceled the holding
      of the token (and never allowed the representative to enter the token
      hold state).
      
      This made the token rotate at maximum speed and kept processors
      resending messages over and over again - overloading the network and
      reducing the chance to successfully deliver the messages.
      
      There were also reports of various Antivirus / IPS / IDS software
      which slows down the delivery of packets of certain sizes (packets
      bigger than the token), making Corosync retransmit messages over and
      over again.
      
      The proposed solution is to allow the representative to enter the
      token hold state when there are only retransmit messages. This allows
      the network to handle the overload and/or gives the Antivirus/IPS/IDS
      enough time to scan and deliver packets without corosync entering the
      "FAILED TO RECEIVE" state and adding more load to the network.
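      The new option would live in the totem section of corosync.conf. A
      minimal sketch; the option name comes from the commit title, while the
      yes/no value and its semantics are assumptions - see the
      corosync.conf(5) man page for the authoritative description:

      ```
      totem {
          version: 2
          # Assumed semantics: "yes" restores the old behavior where
          # retransmit messages cancel the token hold; "no" lets the
          # representative hold the token even with retransmit messages.
          cancel_hold_on_retransmit: no
      }
      ```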
      
      Signed-off-by: Jan Friesse <jfriesse@redhat.com>
      Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
  14. 04 Aug, 2021 2 commits
  15. 02 Aug, 2021 6 commits
  16. 29 Jul, 2021 1 commit
  17. 23 Jul, 2021 1 commit
    • main: Add support for cgroup v2 and auto mode · c9996fdd
      Jan Friesse authored
      
      
      Support for cgroup v2 is very similar to cgroup v1, just checking
      (and writing) a different file.
      
      Because of all the problems with cgroup v2 described later, a new
      "auto" mode (the new default) is added. This mode first tries to set
      RR scheduling and moves Corosync to the root cgroup only if that
      fails.
      
      Testing this feature is a bit harder than with cgroup v1, so it's
      probably worth noting in this commit message.
      
      1. Copy some service file (I've used the httpd service) and set
         CPUQuota=30% in the [Service] section.
      2. Check /sys/fs/cgroup/cgroup.subtree_control - there should be no
         "cpu"
      3. Start the modified service
      4. Check /sys/fs/cgroup/cgroup.subtree_control - there should be "cpu"
      5. Start corosync - it should be able to get RT priority
      
      When move_to_root_cgroup is disabled (applies only to kernels
      with CONFIG_RT_GROUP_SCHED enabled), the behavior differs:
      - If corosync is started before the modified service, so
        there is no "cpu" in /sys/fs/cgroup/cgroup.subtree_control,
        corosync starts without problems and gets RT priority.
        Starting the modified service later will never add "cpu" to
        /sys/fs/cgroup/cgroup.subtree_control (because corosync is holding
        RT priority and is placed in a non-root cgroup by systemd).
      
      - When corosync is started after the modified service, so "cpu"
        is in /sys/fs/cgroup/cgroup.subtree_control, corosync is not
        able to get RT priority.
      
      It's worth noting the problems that arise when cgroup v2 is used
      together with systemd and the related logging, described in the
      corosync.conf(5) man page.
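      For reference, the scheduling mode discussed above is selected in the
      system section of corosync.conf. A minimal sketch; the option name
      comes from this commit, and the exact value syntax is per the
      corosync.conf(5) man page:

      ```
      system {
          # auto (the new default): try to set RR scheduling first and
          # move corosync to the root cgroup only if that fails
          move_to_root_cgroup: auto
      }
      ```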
      
      Signed-off-by: Jan Friesse <jfriesse@redhat.com>
      Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
  18. 05 Jul, 2021 2 commits
  19. 03 Jun, 2021 1 commit
  20. 02 Jun, 2021 1 commit
  21. 21 May, 2021 1 commit