1. 12 Jan, 2022 1 commit
    • Ensure 'port' is up2date after binding:host_id · 87f15ec6
      Harald Jensås authored
      On neutron routed provider networks IP allocation is
      deferred until 'binding:host_id' is set. When ironic
      creates neutron ports it first creates the port, then
      updates the port setting binding information.
      
      When using IPv6 networking ironic adds additional address
      allocations to ensure network chain-booting will succeed.
      When address allocation is deferred on port create ironic
      cannot detect that IPv6 is used and does not add the
      required additional addresses.
      
      This change ensures the 'port' object is updated after the
      port update setting the port binding required for neutron
      to allocate the address. This allows ironic to correctly
      detect IPv6 is used, and it will add the required IP
      address allocations.
      
      Story: 2009773
      Task: 44254
      Change-Id: I863dd4ab9615a9ce3b3dcb8798af674ac9966bf2
      (cherry picked from commit 3404dc91)
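
The ordering problem described above can be illustrated with a minimal sketch. `FakeNeutron`, its methods, and the sample address are hypothetical stand-ins, not the neutron client API or ironic's code; the point is only that the object returned by the update, not the one returned by the create, carries the allocated address.

```python
import ipaddress


class FakeNeutron:
    """Hypothetical stand-in for a client with deferred IP allocation."""

    def __init__(self):
        self.ports = {}

    def create_port(self, port_id):
        # On a routed provider network no address is allocated yet.
        self.ports[port_id] = {"id": port_id, "fixed_ips": []}
        return dict(self.ports[port_id])

    def update_port(self, port_id, attrs):
        port = self.ports[port_id]
        port.update(attrs)
        if "binding:host_id" in attrs and not port["fixed_ips"]:
            # Setting the binding triggers the deferred allocation.
            port["fixed_ips"] = [{"ip_address": "2001:db8::10"}]
        return dict(port)


def uses_ipv6(port):
    """Return True if any fixed IP on the port is an IPv6 address."""
    return any(ipaddress.ip_address(ip["ip_address"]).version == 6
               for ip in port["fixed_ips"])


client = FakeNeutron()
port = client.create_port("p1")
stale_view = uses_ipv6(port)   # allocation is still deferred here
# Keep the object returned by the update so later checks see the address.
port = client.update_port("p1", {"binding:host_id": "node-1"})
fresh_view = uses_ipv6(port)
```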
  2. 09 Dec, 2021 1 commit
  3. 08 Nov, 2021 1 commit
    • Fix idrac-wsman deploy with existing non-BIOS jobs · 969cfefe
      Aija Jauntēva authored
      Setting the boot device through the WS-Man iDRAC API requires
      creating a BIOS job, and iDRAC allows only 1 open job per subsystem.
      For that reason there is validation that checks that the job queue is
      empty before continuing to set the boot device. This does not work
      well when using autoupdatescheduler, which creates a `Repository
      Update` job that stays Scheduled until executed and is then followed
      by a new Scheduled `Repository Update` job.
      
      This patch allows non-BIOS jobs to be present in the queue when setting
      the boot device. This will still fail when there are BIOS jobs
      present. In such cases, consider moving to idrac-redfish, which
      does not create a BIOS or any other job to set the boot device.
      
      Story: 2009251
      Task: 43437
      Change-Id: I91e9ba3024a85897aeead21cede57464294b409b
      (cherry picked from commit b1d08ae8)
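
A minimal sketch of the relaxed validation, assuming jobs are plain dictionaries. The `name_prefix` value, the status strings, and both function names are illustrative assumptions, not the driver's actual constants or API.

```python
def blocking_jobs(jobs, name_prefix="Configure: BIOS"):
    """Return unfinished jobs whose name starts with name_prefix.

    The prefix and the set of finished statuses are assumptions made
    for this sketch.
    """
    finished = {"Completed", "Failed"}
    return [job for job in jobs
            if job["name"].startswith(name_prefix)
            and job["status"] not in finished]


def can_set_boot_device(jobs):
    """Only unfinished BIOS jobs block setting the boot device."""
    return not blocking_jobs(jobs)


queue = [{"name": "Repository Update", "status": "Scheduled"}]
```

With only the `Repository Update` job queued, `can_set_boot_device(queue)` is true; adding an unfinished job whose name starts with the assumed BIOS prefix makes it false.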
  4. 14 Sep, 2021 1 commit
    • Fix idrac-wsman set_power_state to wait on HW · 0df43f75
      Aija Jauntēva authored
      set_power_state has returned to the caller immediately without
      confirming the system has reached the requested state. This fixes that
      by synchronously waiting until the target state has been read before
      returning.
      
      That bug can cause instance workload deployments to fail on Dell EMC
      PowerEdge server models on which IPA ramdisk soft power off fails and
      ironic employs its OOB fallback strategy. After an otherwise successful
      deployment, the node is active, but is powered off. No error is reported
      in last_error. If the subsequent instance workflow expects the system to
      be powered on into the operating system, it fails.
      
      Story: 2009204
      Task: 43261
      Change-Id: I3112a22149c07e5508f26c79f33d09aeb905c308
      (cherry picked from commit 2a0fd1d1)
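
The synchronous wait can be sketched as a polling loop. This is an illustration under assumptions, not the driver's code: `wait_for_power_state`, its parameters, and the state strings are hypothetical.

```python
import time


def wait_for_power_state(get_power_state, target_state, timeout=60.0,
                         interval=1.0, sleep=time.sleep,
                         clock=time.monotonic):
    """Poll get_power_state() until it returns target_state.

    Raises TimeoutError if the target state is not read before the
    deadline passes.
    """
    deadline = clock() + timeout
    while True:
        state = get_power_state()
        if state == target_state:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                "node did not reach %r, last state was %r"
                % (target_state, state))
        sleep(interval)
```

A caller would request the power change first and then call this helper before returning, so the node is not reported as done while the hardware is still changing state.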
  5. 03 Aug, 2021 1 commit
  6. 29 Jul, 2021 1 commit
  7. 27 Jul, 2021 1 commit
  8. 26 Jul, 2021 1 commit
  9. 05 Jul, 2021 1 commit
    • Cache AgentClient on Task, not globally · 0cb15a22
      Dmitry Tantsur authored
      
      
      In order to avoid potential cache coherency issues
      when using a globally cached AgentClient, e.g. with
      TLS certificates from the IPA, cache the AgentClient
      on a per task basis.
      
      Co-Authored-By: Arne Wiebalck <arne.wiebalck@cern.ch>
      
      Conflicts:
      	ironic/drivers/modules/agent.py
      	ironic/drivers/modules/agent_base.py
      	ironic/drivers/modules/ansible/deploy.py
      	ironic/drivers/modules/iscsi_deploy.py
      	ironic/tests/unit/drivers/modules/test_agent.py
      
      Story: #2009004
      Task: #42678
      
      Change-Id: I0c458c8d9ae673181beb6d85c2ee68235ccef239
      (cherry picked from commit fcb6a109)
  10. 29 Jun, 2021 1 commit
    • Update the clear job id's constant · 4ac6ad73
      Dhuldev Valekar authored
      Fixes an issue of powering off with the ``idrac-wsman`` management
      interface while the execution of a clear job queue cleaning step is
      proceeding.
      Prior to this fix, the clean step would fail when powering off a node.
      
      Story: 2008988
      Task: 42641
      
      Change-Id: Ib4ab755c806f028d97379b80a8c27d6ade63cba1
      (cherry picked from commit 741a4d8a)
  11. 21 Jun, 2021 1 commit
    • Fix node detail instance_uuid request handling · 755c75e2
      Julia Kreger authored
      The instance_uuid handling on the detailed node information
      endpoint of the API (/v1/nodes/detail?instance_uuid=<uuid>),
      which is used by services such as Nova for explicit node status
      lookups, previously had special conditional logic surrounding it
      which skipped including the API requestor's project-id in the
      database query.
      
      Ultimately, this allowed an authenticated user to obtain a partially
      redacted node entry where sensitive informational fields were scrubbed
      from the response payload.
      
      With this fix, queries for an explicit instance_uuid now follow the
      standard path inside the Ironic API to the database, which includes
      a requestor Project-ID if required by configured policy.
      
      Change-Id: I9bfa5a54e02c8a1e9c8cad6b9acdbad6ab62bef3
      Story: 2008976
      Task: 42620
      (cherry picked from commit be3c153d)
  12. 09 Jun, 2021 1 commit
    • Refactor iDRAC OEM extension manager calls · 0bc5265e
      Aija Jauntēva authored
      - Re-usable helper created to avoid duplication.
      - Although known iDRAC systems have only one manager per system,
      still iterate through the collection to allow for future changes.
      - Restructured exception raising and error logging for better feedback.
      - Removed some unit tests that duplicate coverage already provided by
      method-specific unit tests.
      
      Change-Id: I03fdb48e47c9557c207a20ee876eccf3f3459d9f
      (cherry picked from commit 39cd751a)
  13. 02 Jun, 2021 2 commits
  14. 31 May, 2021 1 commit
  15. 19 May, 2021 1 commit
    • Delete unavailable py2 package · 3258e49a
      LinPeiWen authored
      The OpenStack Ussuri and Victoria releases no longer support the
      CentOS 7 and python2 environment packages. Correct the resulting
      missing-package problems in the latest document.
      
      Change-Id: I60787243fdc6ed2741522355ec79970bdb912f41
      (cherry picked from commit 35dea078)
      (cherry picked from commit 77be4c6c)
  16. 10 May, 2021 1 commit
    • Point ipa-builder to stable/wallaby · 0df78f60
      Riccardo Pittau authored
      Since ironic-python-agent-builder has stable branches starting from
      wallaby, we need to point all the older branches to stable/wallaby,
      unless they're already pinned to an older version.
      
      Change-Id: I90a0d4d75fb4581805f11e79ca7185cfdb66f77a
  17. 07 May, 2021 1 commit
    • Fix deployment when executing a command fails after the command starts · 67871426
      Dmitry Tantsur authored
      If the agent accepts a command, but is unable to reply to Ironic (which
      sporadically happens because of the eventlet's TLS implementation), we
      currently retry the request and fail because the command is already
      executing. Ironic now detects this situation by checking the list of
      executing commands after receiving a connection error. If the requested
      command is the last one, we assume that the command request succeeded.
      
      Ideally, we should pass a request ID to IPA and then compare it. Such
      a change would affect the API contract between the agent and Ironic
      and thus would not be backportable.
      
      Change-Id: I2ea21c9ec440fa7ddf8578cf7b34d6d0ebbb5dc8
      (cherry picked from commit abfe383c)
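
The detection described above can be sketched as follows. The callables, the `command_name` key, and the function name are assumptions made for illustration; this is not the conductor's implementation.

```python
def execute_with_recovery(send_command, list_commands, name):
    """Send a command; on a connection error, check whether it started.

    send_command and list_commands are hypothetical callables standing
    in for the agent client. If the last command known to the agent has
    the requested name, the request is assumed to have succeeded.
    """
    try:
        return send_command(name)
    except ConnectionError:
        commands = list_commands()
        if commands and commands[-1]["command_name"] == name:
            return commands[-1]
        raise
```

Comparing names is a heuristic: two consecutive requests for the same command cannot be told apart, which is why the message above prefers a request ID as the long-term fix.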
  18. 05 May, 2021 1 commit
  19. 21 Apr, 2021 1 commit
  20. 15 Apr, 2021 2 commits
  21. 14 Apr, 2021 1 commit
  22. 09 Apr, 2021 2 commits
    • Fix ipmitool timing argument calculation · b205a32c
      Steve Baker authored
      Calculating the ipmitool `-N` and `-R` arguments from ironic.conf
      [ipmi] `command_retry_timeout` and `min_command_interval` now takes
      into account the 1 second interval increment that ipmitool adds on
      each retry event.
      
      Failure-path ipmitool run duration will now be just less than
      `command_retry_timeout` instead of much longer.
      
      Change-Id: Ia3d8d85497651290c62341ac121e2aa438b4ac50
      (cherry picked from commit 1de3db3b)
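
The arithmetic behind this can be sketched as below. The function name and signature are hypothetical, and the loop is one way to account for the 1 second increment described above, not necessarily the code in this change.

```python
def ipmitool_timing_args(command_retry_timeout, min_command_interval):
    """Compute -R/-N so the failure path stays within the timeout.

    Assumes, as described above, that ipmitool waits one second longer
    before each successive retry.
    """
    interval = min_command_interval
    duration = 0
    num_tries = 0
    # Add attempts while the next wait still fits inside the timeout.
    while duration + interval <= command_retry_timeout:
        duration += interval
        interval += 1
        num_tries += 1
    return ["-R", str(num_tries), "-N", str(min_command_interval)]
```

For a timeout of 60 seconds and an interval of 5 seconds the waits are 5, 6, 7, 8, 9, 10 and 11 seconds, 56 seconds in total, so the sketch returns `-R 7 -N 5`. Dividing 60 by 5 without the increment would give 12 retries and a much longer failure path.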
    • Fix idrac-wsman BIOS step async error handling · 6130dc15
      Aija Jauntēva authored
      Use error_handlers instead of process_event('fail');
      otherwise, on failure the node gets stuck and fails
      because of a timeout instead of failing earlier due to
      the step failure.
      
      Also improve coverage to test this error handling
      as well as the happy paths.
      
      Story: 2008307
      Task: 41197
      Change-Id: I1e957c2b526abc37920212b6431b11eedc9f89be
      (cherry picked from commit 83ce7c42)
  23. 01 Apr, 2021 1 commit
    • Restrict syncing of boot mode to Supermicro · 4fd09934
      Bob Fournier authored
      The fix for https://storyboard.openstack.org/#!/story/2008252 synced
      the boot mode after changing the boot device, because Supermicro nodes
      reset the boot mode if not included in the boot device set. However this
      can cause a problem on Dell nodes when changing the mode uefi->bios or
      bios->uefi. Restrict the syncing of the boot mode to Supermicro.
      
      Story: 2008712
      Task: 42046
      Change-Id: I9f305cb3f33766c1c93cf4347368b1ce025fc635
      (cherry picked from commit 8bd25a98)
  24. 24 Mar, 2021 1 commit
  25. 23 Mar, 2021 1 commit
  26. 21 Mar, 2021 1 commit
    • Allow unsupported redfish set_boot_mode · 13fc01fe
      Steve Baker authored
      Currently if the baremetal boot mode is unknown and the driver doesn't
      support setting the boot mode then the error is logged and deployment
      continues.
      
      However if the BMC doesn't support getting or setting the boot mode
      then setting the boot mode raises an error which results in the deploy
      failing. This is the case for HPE Gen9 baremetal, which doesn't have a
      'BootSourceOverrideMode' attribute in its system Boot field, and
      raises a 400 iLO.2.14.UnsupportedOperation in response to setting the
      boot mode.
      
      This is raised from set_boot_mode as a RedfishError. This change
      raises UnsupportedDriverExtension exception when the 'mode' attribute
      is missing from the 'boot' field, allowing the deployment to continue.
      
      Change-Id: I360ff8180be252de21f5fcd2208947087e332a39
      (cherry picked from commit 9f221a7d)
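
A minimal sketch of the behaviour described above, using a plain dictionary for the 'boot' field. The exception class is a local stand-in for the ironic exception of the same name, and `set_boot_mode` and `deploy` are simplified hypothetical functions, not the redfish driver code.

```python
class UnsupportedDriverExtension(Exception):
    """Local stand-in for the ironic exception of the same name."""


def set_boot_mode(boot_field, mode):
    """Set the boot mode, or signal that the BMC cannot do it."""
    if "mode" not in boot_field:
        # The BMC exposes no boot mode, so report the operation as
        # unsupported instead of letting a generic error surface.
        raise UnsupportedDriverExtension("boot mode is not supported")
    boot_field["mode"] = mode


def deploy(boot_field, wanted_mode, log):
    """Continue the deployment when the boot mode cannot be set."""
    try:
        set_boot_mode(boot_field, wanted_mode)
    except UnsupportedDriverExtension as exc:
        log.append("continuing without boot mode change: %s" % exc)
    return "deployed"
```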
  27. 15 Mar, 2021 1 commit
  28. 08 Mar, 2021 2 commits
    • Lazy-load node details from the DB · 4ed8ceef
      Arne Wiebalck authored
      In order to reduce the load on the database backend, only lazy-load
      a node's ports, portgroups, volume_connectors, and volume_targets.
      With the power-sync as the main user, this change should reduce the
      number of DB operations by roughly two thirds.
      
      Change-Id: Id9a9a53156f7fd866d93569347a81e27c6f0673c
      (cherry picked from commit 82cab603)
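
The idea of lazy loading can be sketched like this. `FakeDB` and `LazyNodeDetails` are hypothetical classes for illustration; the real change works on ironic's own task and object layers.

```python
class FakeDB:
    """Hypothetical database API that counts the queries it serves."""

    def __init__(self):
        self.queries = 0

    def get_node(self, node_id):
        self.queries += 1
        return {"id": node_id}

    def get_ports(self, node_id):
        self.queries += 1
        return [{"node_id": node_id, "address": "52:54:00:00:00:01"}]


class LazyNodeDetails:
    """Load the node eagerly, but its ports only on first access."""

    def __init__(self, db, node_id):
        self._db = db
        self.node = db.get_node(node_id)
        self._ports = None

    @property
    def ports(self):
        if self._ports is None:
            self._ports = self._db.get_ports(self.node["id"])
        return self._ports
```

A caller that only needs the node, such as a periodic power sync, pays for one query; the ports query happens once, and only if `ports` is read.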
    • [Trivial] Fix testing of volume connector exception · b2b862f5
      Arne Wiebalck authored
      Restore test symmetry.
      
      Change-Id: I54a9fed73e366a30545c3cd1982588d2f544d228
      (cherry picked from commit 61c5b3fd)
  29. 05 Mar, 2021 1 commit
  30. 03 Mar, 2021 1 commit
    • Always retry locking when performing task handoff · 25a05cf3
      Jason Anderson authored
      There are some Ironic execution workflows where there is not an easy way
      to retry, such as when attempting to hand off the processing of an async
      task to a conductor. Task handoff can require releasing a lock on the
      node, so the next entity processing the task can acquire the lock
      itself. However, this is vulnerable to race conditions, as there is no
      uniform retry mechanism built in to such handoffs. Consider the
      continue_node_deploy/clean logic, which does this:
      
        method = 'continue_node_%s' % operation
        # Need to release the lock to let the conductor take it
        task.release_resources()
        getattr(rpc, method)(task.context, uuid, topic=topic)
      
      If another process obtains a lock between the releasing of resources and
      the acquiring of the lock during the continue_node_* operation, and
      holds the lock longer than the max attempt * interval window (which
      defaults to 3 seconds), then the handoff will never complete. Beyond
      that, because there is no proper queue for processes waiting on the
      lock, there is no fairness, so it's also possible that instead of one
      long lock being held, the lock is obtained and held for a short window
      several times by other competing processes.
      
      This manifests as nodes occasionally getting stuck in the "DEPLOYING"
      state during a deploy. For example, a user may attempt to open or access
      the serial console before the deploy is complete--the serial console
      process obtains a lock and starves the conductor of the lock, so the
      conductor cannot finish the deploy. It's also possible a long heartbeat
      or badly-timed sequence of heartbeats could do the same.
      
      To fix this, this commit introduces the concept of a "patient" lock,
      which will retry indefinitely until it doesn't encounter the NodeLocked
      exception. This overrides any retry behavior.
      
        .. note::
           There may be other cases where such a lock is desired.
      
      Story: #2008323
      Change-Id: I9937fab18a50111ec56a3fd023cdb9d510a1e990
      (cherry picked from commit bfc2ad56)
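
A minimal sketch of the "patient" lock idea, assuming a `reserve` callable that raises `NodeLocked` while another process holds the lock. The function name, parameters, and the local exception class are illustrative, not the task manager's actual interface.

```python
import time


class NodeLocked(Exception):
    """Local stand-in for the exception raised when the node is locked."""


def acquire_lock(reserve, retry_attempts=3, retry_interval=1.0,
                 patient=False, sleep=time.sleep):
    """Call reserve() until it succeeds.

    A normal lock gives up after retry_attempts tries. A patient lock
    ignores that limit and keeps retrying until NodeLocked stops.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return reserve()
        except NodeLocked:
            if not patient and attempt >= retry_attempts:
                raise
            sleep(retry_interval)
```

The trade-off is that a patient caller can wait for a long time, so this mode suits handoffs that have no other way to retry.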
  31. 02 Mar, 2021 1 commit
    • Handle agent still doing the prior command · d1ffc6a5
      Julia Kreger authored
      The agent command exec model is based upon an incoming
      heartbeat, however heartbeats are independent and
      commands can take a long time. For example, software RAID
      setup in CI can encounter this.
      
      From an IPA log:
      
      [-] Picked root device /dev/md0 for node c6ca0af2-baec-40d6-879d-cbb5c751aafb
          based on root device hints {'name': '/dev/md0'}
      [-] Attempting to download image from http://199.204.45.248:3928/agent_images/
          c6ca0af2-baec-40d6-879d-cbb5c751aafb
      [-] Executing command: standby.get_partition_uuids with args: {} execute_command
          /usr/local/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py:255
      [-] Tried to execute standby.get_partition_uuids, agent is still executing Command name:
          execute_deploy_step, params: {'step': {'interface': 'deploy', 'step': 'write_image',
          'args': {'image_info': {'id': 'cb9e199a-af1b-4a6f-b00e-f284008b8046',
          'urls': ['http://199.204.45.248:3928/agent_images/c6ca0af2-baec-40d6-879d-cbb5c751aafb'],
          'disk_format': 'raw', 'container_format': 'bare', 'stream_raw_images': True, 'os_hash_algo':
          'sha512', 'os_hash_value':<trimed>
      
      This was with code built on master, using master images.
      Inside the conductor log, it notes that it is likely an out
      of date agent because only AgentAPIError is evaluated,
      however any API error is evaluated this way. In reality, we need
      to explicitly flag *when* we have an error that is because
      we've tried too soon, as something is already being worked upon.
      
      The result is to evaluate and return an exception indicating work
      is already in flight.
      
      Update - It looks like the original fix to prevent busy agent
      recognition did not fully detect all cases, as getting steps is a
      command which can get skipped by accident with a busy agent under
      certain circumstances.
      Change I5d86878b5ed6142ed2630adee78c0867c49b663f in ironic-python-agent
      also changed the string that was being checked for the previous
      handling, where we really should have just made the string we were
      checking lower case in ironic. Oh well! This should fix things
      right up.
      
      Story: 2008167
      Task: 41175
      Change-Id: Ia169640b7084d17d26f22e457c7af512db6d21d6
      (cherry picked from commit 545dc210)
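
The case-insensitive check mentioned above can be sketched as follows. The marker text is taken from the IPA log excerpt in this message and is an assumption; the exact string ironic compares against is not shown here.

```python
def is_agent_busy(fault_message, marker="still executing"):
    """Return True if the agent error text says a command is in flight.

    Both sides are lower-cased so a change in capitalization on the
    agent side does not break the detection.
    """
    return marker.lower() in fault_message.lower()
```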
  32. 28 Feb, 2021 1 commit
  33. 24 Feb, 2021 1 commit
  34. 23 Feb, 2021 1 commit
    • Fix broken configdrive_use_object_store · dea33cba
      Dmitry Tantsur authored
      When it is set to True, we try to write text data to a binary file,
      which is not possible in Python 3. The issue has been "helpfully"
      hidden by the fact that we use bytes in unit tests, as well as
      by lack of CI coverage.
      
      Change-Id: Ibbf90dcbcb36a5f7cf084a44a221c0c5c003b95a
      (cherry picked from commit 73bdebd1)
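
The failure mode is easy to reproduce outside ironic. In Python 3, writing a `str` to a file opened in binary mode raises `TypeError`, so text has to be encoded first. `write_configdrive` below is a hypothetical helper, not the function touched by this change.

```python
import tempfile


def write_configdrive(fileobj, configdrive):
    """Write a config drive to a binary file object.

    Text input is encoded first, because binary files accept only
    bytes-like objects in Python 3.
    """
    if isinstance(configdrive, str):
        configdrive = configdrive.encode("utf-8")
    fileobj.write(configdrive)
    fileobj.flush()


# TemporaryFile opens in binary mode ('w+b') by default.
with tempfile.TemporaryFile() as fileobj:
    try:
        fileobj.write("text payload")   # str into a binary file
        wrote_text = True
    except TypeError:
        wrote_text = False
    write_configdrive(fileobj, "text payload")
    fileobj.seek(0)
    stored = fileobj.read()
```

A unit test that passes `bytes` to such a helper exercises only the working path, which matches how the problem stayed hidden.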
  35. 18 Feb, 2021 2 commits