1. 12 Jul, 2022 13 commits
    • x86/spec-ctrl: Mitigate Branch Type Confusion when possible · 87d90d51
      Andrew Cooper authored

      Branch Type Confusion affects AMD/Hygon CPUs on Zen2 and earlier.  To
      mitigate, we require SMT safety (STIBP on Zen2, no-SMT on Zen1), and to issue
      an IBPB on each entry to Xen, to flush the BTB.
      
      Due to performance concerns, dom0 (which is trusted in most configurations) is
      excluded from protections by default.
      
      Therefore:
       * Use STIBP by default on Zen2 too, which now means we want it on by default
         on all hardware supporting STIBP.
       * Break the current IBPB logic out into a new function, extending it with
         IBPB-at-entry logic.
       * Change the existing IBPB-at-ctxt-switch boolean to be tristate, and disable
         it by default when IBPB-at-entry is providing sufficient safety.
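
      The tristate described above can be sketched as follows; the variable
      name and the exact resolution rule are illustrative assumptions, not
      Xen's code:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of a tristate boot option: -1 means "no explicit choice on the
 * command line", and the default resolves to off when IBPB-at-entry is
 * already providing sufficient safety. */
static int opt_ibpb_ctxt_switch = -1;   /* -1: default, 0: off, 1: on */

static bool resolve_ibpb_ctxt_switch(bool ibpb_entry_active)
{
    if ( opt_ibpb_ctxt_switch == -1 )
        return !ibpb_entry_active;      /* default off when entry IBPB suffices */

    return opt_ibpb_ctxt_switch;
}
```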
      
      If all PV guests on the system are trusted, then it is recommended to boot
      with `spec-ctrl=ibpb-entry=no-pv`, as this will provide an additional marginal
      perf improvement.
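
      For users wanting to follow the recommendation above, the option is
      appended to Xen's boot command line; the file and variable shown here
      are Debian-style conventions and may differ per distro:

```shell
# Illustrative only: on Debian-style systems the Xen command line lives in
# /etc/default/grub, followed by running update-grub.
GRUB_CMDLINE_XEN_DEFAULT="spec-ctrl=ibpb-entry=no-pv"
```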
      
      This is part of XSA-407 / CVE-2022-23825.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      (cherry picked from commit d8cb7e0f069e0f106d24941355b59b45a731eabe)
    • x86/spec-ctrl: Enable Zen2 chickenbit · 5bccfbb6
      Andrew Cooper authored

      ... as instructed in the Branch Type Confusion whitepaper.
      
      This is part of XSA-407.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      (cherry picked from commit 9deaf2d932f08c16c6b96a1c426e4b1142c0cdbe)
    • x86/cpuid: Enumeration for BTC_NO · 318d7bc3
      Andrew Cooper authored

      BTC_NO indicates that hardware is not susceptible to Branch Type Confusion.
      
      Zen3 CPUs don't suffer BTC.
      
      This is part of XSA-407.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      (cherry picked from commit 76cb04ad64f3ab9ae785988c40655a71dde9c319)
    • x86/spec-ctrl: Support IBPB-on-entry · 0a6561b2
      Andrew Cooper authored

      We are going to need this to mitigate Branch Type Confusion on AMD/Hygon CPUs,
      but as we've talked about using it in other cases too, arrange to support it
      generally.  However, this is also very expensive in some cases, so we're going
      to want per-domain controls.
      
      Introduce SCF_ist_ibpb and SCF_entry_ibpb controls, adding them to the IST and
      DOM masks as appropriate.  Also introduce X86_FEATURE_IBPB_ENTRY_{PV,HVM} to
      patch the code blocks.
      
      For SVM, the STGI is serialising enough to protect against Spectre-v1 attacks,
      so no "else lfence" is necessary.  VT-x will use the MSR host load list, so
      doesn't need any code in the VMExit path.
      
      For the IST path, we can't safely check CPL==0 to skip a flush, as we might
      have hit an entry path before its IBPB.  As IST hitting Xen is rare, flush
      irrespective of CPL.  A later path, SCF_ist_sc_msr, provides Spectre-v1
      safety.
      
      For the PV paths, we know we're interrupting CPL>0, while for the INTR paths,
      we can safely check CPL==0.  Only flush when interrupting guest context.
      
      An "else lfence" is needed for safety, but we want to be able to skip it on
      unaffected CPUs, so the block wants to be an alternative, which means the
      lfence has to be inline rather than UNLIKELY() (the replacement block doesn't
      have displacements fixed up for anything other than the first instruction).
      
      As with SPEC_CTRL_ENTRY_FROM_INTR_IST, %rdx is 0 on entry so rely on this to
      shrink the logic marginally.  Update the comments to specify this new
      dependency.
      
      This is part of XSA-407.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      (cherry picked from commit 53a570b285694947776d5190f591a0d5b9b18de7)
    • x86/spec-ctrl: Rework SPEC_CTRL_ENTRY_FROM_INTR_IST · d2f0cf78
      Andrew Cooper authored

      We are shortly going to add a conditional IBPB in this path.
      
      Therefore, we cannot hold spec_ctrl_flags in %eax, and rely on only clobbering
      it after we're done with its contents.  %rbx is available for use, and is
      the more normal register in which to hold preserved information.
      
      With %rax freed up, use it instead of %rdx for the RSB tmp register, and for
      the adjustment to spec_ctrl_flags.
      
      This leaves no use of %rdx, except as 0 for the upper half of WRMSR.  In
      practice, %rdx is 0 from SAVE_ALL on all paths and isn't likely to change in
      the foreseeable future, so update the macro entry requirements to state this
      dependency.  This marginal optimisation can be revisited if circumstances
      change.
      
      No practical change.
      
      This is part of XSA-407.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      (cherry picked from commit e9b8d31981f184c6539f91ec54bd9cae29cdae36)
    • x86/spec-ctrl: Rename opt_ibpb to opt_ibpb_ctxt_switch · 51e812af
      Andrew Cooper authored

      We are about to introduce the use of IBPB at different points in Xen, making
      opt_ibpb ambiguous.  Rename it to opt_ibpb_ctxt_switch.
      
      No functional change.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      (cherry picked from commit a8e5ef079d6f5c88c472e3e620db5a8d1402a50d)
    • x86/spec-ctrl: Rename SCF_ist_wrmsr to SCF_ist_sc_msr · 73465a7f
      Andrew Cooper authored

      We are about to introduce SCF_ist_ibpb, at which point SCF_ist_wrmsr becomes
      ambiguous.
      
      No functional change.
      
      This is part of XSA-407.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      (cherry picked from commit 76d6a36f645dfdbad8830559d4d52caf36efc75e)
    • x86/spec-ctrl: Rework spec_ctrl_flags context switching · b60c995d
      Andrew Cooper authored

      We are shortly going to need to context switch new bits in both the vcpu and
      S3 paths.  Introduce SCF_IST_MASK and SCF_DOM_MASK, and rework d->arch.verw
      into d->arch.spec_ctrl_flags to accommodate.
      
      No functional change.
      
      This is part of XSA-407.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      (cherry picked from commit 5796912f7279d9348a3166655588d30eae9f72cc)
    • x86/spec-ctrl: Add fine-grained cmdline suboptions for primitives · e5fd5081
      Andrew Cooper authored

      Support controlling the PV/HVM suboption of msr-sc/rsb/md-clear, which
      previously wasn't possible.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      (cherry picked from commit 27357c394ba6e1571a89105b840ce1c6f026485c)
    • xen/cmdline: Extend parse_boolean() to signal a name match · 2d316660
      Andrew Cooper authored

      This will help parsing a sub-option which has boolean and non-boolean options
      available.
      
      First, rework 'int val' into 'bool has_neg_prefix'.  This inverts its value,
      but the resulting logic is far easier to follow.
      
      Second, reject anything of the form 'no-$FOO=' which excludes ambiguous
      constructs such as 'no-$foo=yes' which have never been valid.
      
      This just leaves the case where everything is otherwise fine, but parse_bool()
      can't interpret the provided string.
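
      The behaviour described above can be sketched as a small parser; the
      function name, helper logic, and return values here are illustrative
      assumptions, not Xen's exact API:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical sketch of the reworked parse_boolean() behaviour:
 *   returns 1  for an affirmative boolean ("foo", "foo=yes"),
 *   returns 0  for a negative boolean ("no-foo", "foo=no"),
 *   returns -1 for a name match whose value isn't boolean (the new
 *              signal, letting the caller try sub-option parsing),
 *   returns -2 for no match, including the now-rejected "no-foo=...". */
static int parse_boolean_demo(const char *name, const char *s)
{
    bool has_neg_prefix = !strncmp(s, "no-", 3);

    if ( has_neg_prefix )
        s += 3;

    if ( strncmp(s, name, strlen(name)) )
        return -2;                      /* not this option at all */

    s += strlen(name);

    if ( *s == '\0' )
        return !has_neg_prefix;         /* plain "foo" / "no-foo" */

    if ( *s != '=' )
        return -2;

    if ( has_neg_prefix )
        return -2;                      /* "no-foo=..." was never valid */

    s++;
    if ( !strcmp(s, "yes") || !strcmp(s, "1") )
        return 1;
    if ( !strcmp(s, "no") || !strcmp(s, "0") )
        return 0;

    return -1;                          /* name matched, value non-boolean */
}
```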
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      (cherry picked from commit 382326cac528dd1eb0d04efd5c05363c453e29f4)
    • x86/spec-ctrl: Knobs for STIBP and PSFD, and follow hardware STIBP hint · f1786895
      Andrew Cooper authored

      STIBP and PSFD are slightly weird bits, because they're both implied by other
      bits in MSR_SPEC_CTRL.  Add fine grain controls for them, and take the
      implications into account when setting IBRS/SSBD.
      
      Rearrange the IBPB text/variables/logic to keep all the MSR_SPEC_CTRL bits
      together, for consistency.
      
      However, AMD have a hardware hint CPUID bit recommending that STIBP be set
      unilaterally.  This is advertised on Zen3, so follow the recommendation.
      Furthermore, in such cases, set STIBP behind the guest's back for now.  This
      has negligible overhead for the guest, but saves a WRMSR on vmentry.  This is
      the only default change.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
      (cherry picked from commit fef244b179c06fcdfa581f7d57fa6e578c49ff50)
    • x86/spec-ctrl: Only adjust MSR_SPEC_CTRL for idle with legacy IBRS · a556377d
      Andrew Cooper authored

      Back at the time of the original Spectre-v2 fixes, it was recommended to clear
      MSR_SPEC_CTRL when going idle.  This is because of the side effects on the
      sibling thread caused by the microcode IBRS and STIBP implementations which
      were retrofitted to existing CPUs.
      
      However, there are no relevant cross-thread impacts for the hardware
      IBRS/STIBP implementations, so this logic should not be used on Intel CPUs
      supporting eIBRS, or any AMD CPUs; doing so only adds unnecessary latency to
      the idle path.
      
      Furthermore, there's no point playing with MSR_SPEC_CTRL in the idle paths if
      SMT is disabled for other reasons.
      
      Fixes: 8d03080d2a33 ("x86/spec-ctrl: Cease using thunk=lfence on AMD")
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
      (cherry picked from commit ffc7694e0c99eea158c32aa164b7d1e1bb1dc46b)
    • x86/spec-ctrl: Honour spec-ctrl=0 for unpriv-mmio sub-option · 104dd461
      Andrew Cooper authored

      This was an oversight from when unpriv-mmio was introduced.
      
      Fixes: 8c24b70fedcb ("x86/spec-ctrl: Add spec-ctrl=unpriv-mmio")
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      (cherry picked from commit 4cdb519d797c19ebb8fadc5938cdb47479d5a21b)
  2. 16 Jun, 2022 3 commits
    • x86/spec-ctrl: Add spec-ctrl=unpriv-mmio · c5f774ea
      Andrew Cooper authored

      Per Xen's support statement, PCI passthrough should be to trusted domains
      because the overall system security depends on factors outside of Xen's
      control.
      
      As such, Xen, in a supported configuration, is not vulnerable to DRPW/SBDR.
      
      However, users who have risk assessed their configuration may be happy with
      the risk of DoS, but unhappy with the risk of cross-domain data leakage.  Such
      users should enable this option.
      
      On CPUs vulnerable to MDS, the existing mitigations are the best we can do to
      mitigate MMIO cross-domain data leakage.
      
      On CPUs fixed to MDS but vulnerable to MMIO stale data leakage, this option:
      
       * On CPUs susceptible to FBSDP, mitigates cross-domain fill buffer leakage
         using FB_CLEAR.
       * On CPUs susceptible to SBDR, mitigates RNG data recovery by engaging the
         srb-lock, previously used to mitigate SRBDS.
      
      Both mitigations require microcode from IPU 2022.1, May 2022.
      
      This is part of XSA-404.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
      (cherry picked from commit 8c24b70fedcb52633b2370f834d8a2be3f7fa38e)
    • x86/spec-ctrl: Enumeration for MMIO Stale Data controls · 9f078482
      Andrew Cooper authored

      The three *_NO bits indicate non-susceptibility to the SSDP, FBSDP and PSDP
      data movement primitives.
      
      FB_CLEAR indicates that the VERW instruction has re-gained its Fill Buffer
      flushing side effect.  This is only enumerated on parts where VERW had
      previously lost its flushing side effect due to the MDS/TAA vulnerabilities
      being fixed in hardware.
      
      FB_CLEAR_CTRL is available on a subset of FB_CLEAR parts where the Fill Buffer
      clearing side effect of VERW can be turned off for performance reasons.
      
      This is part of XSA-404.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
      (cherry picked from commit 2ebe8fe9b7e0d36e9ec3cfe4552b2b197ef0dcec)
    • x86/spec-ctrl: Make VERW flushing runtime conditional · 878e684e
      Andrew Cooper authored

      Currently, VERW flushing to mitigate MDS is boot time conditional per domain
      type.  However, to provide mitigations for DRPW (CVE-2022-21166), we need to
      conditionally use VERW based on the trustworthiness of the guest, and the
      devices passed through.
      
      Remove the PV/HVM alternatives and instead issue a VERW on the return-to-guest
      path depending on the SCF_verw bit in cpuinfo spec_ctrl_flags.
      
      Introduce spec_ctrl_init_domain() and d->arch.verw to calculate the VERW
      disposition at domain creation time, and context switch the SCF_verw bit.
      
      For now, VERW flushing is used and controlled exactly as before, but later
      patches will add per-domain cases too.
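
      The scheme above can be sketched as follows; the bit position, names,
      and helpers are illustrative assumptions standing in for Xen's real
      per-CPU state and assembly exit path:

```c
#include <stdbool.h>

/* Per-domain SCF_verw is computed once at domain creation, copied into
 * the (notionally per-CPU) spec_ctrl_flags on context switch, and tested
 * on the return-to-guest path. */
#define SCF_verw (1u << 3)                  /* bit position illustrative */

struct domain_demo { unsigned int spec_ctrl_flags; };

static unsigned int cpu_spec_ctrl_flags;    /* per-CPU in real code */
static unsigned int verw_issued;            /* stand-in for executing VERW */

static void spec_ctrl_init_domain_demo(struct domain_demo *d, bool needs_verw)
{
    d->spec_ctrl_flags = needs_verw ? SCF_verw : 0;
}

static void context_switch_demo(const struct domain_demo *next)
{
    cpu_spec_ctrl_flags =
        (cpu_spec_ctrl_flags & ~SCF_verw) | (next->spec_ctrl_flags & SCF_verw);
}

static void return_to_guest_demo(void)
{
    if ( cpu_spec_ctrl_flags & SCF_verw )
        verw_issued++;                      /* real code executes VERW here */
}
```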
      
      No change in behaviour.
      
      This is part of XSA-404.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
      (cherry picked from commit e06b95c1d44ab80da255219fc9f1e2fc423edcb6)
  3. 10 Jun, 2022 1 commit
  4. 09 Jun, 2022 7 commits
    • x86/pv: Track and flush non-coherent mappings of RAM · 82ba97ec
      Andrew Cooper authored

      There are legitimate uses of WC mappings of RAM, e.g. for DMA buffers with
      devices that make non-coherent writes.  The Linux sound subsystem makes
      extensive use of this technique.
      
      For such usecases, the guest's DMA buffer is mapped and consistently used as
      WC, and Xen doesn't interact with the buffer.
      
      However, a mischievous guest can use WC mappings to deliberately create
      non-coherency between the cache and RAM, and use this to trick Xen into
      validating a pagetable which isn't actually safe.
      
      Allocate a new PGT_non_coherent to track the non-coherency of mappings.  Set
      it whenever a non-coherent writeable mapping is created.  If the page is used
      as anything other than PGT_writable_page, force a cache flush before
      validation.  Also force a cache flush before the page is returned to the heap.
      
      This is CVE-2022-26364, part of XSA-402.
      
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: George Dunlap <george.dunlap@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      master commit: c1c9cae3a9633054b177c5de21ad7268162b2f2c
      master date: 2022-06-09 14:23:37 +0200
    • x86/amd: Work around CLFLUSH ordering on older parts · 25c7adee
      Andrew Cooper authored

      On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakly ordered with everything,
      including reads and writes to the address, and LFENCE/SFENCE instructions.
      
      This creates a multitude of problematic corner cases, laid out in the manual.
      Arrange to use MFENCE on both sides of the CLFLUSH to force proper ordering.
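
      The arrangement can be sketched as a helper like the following; this
      is an illustration of the fencing described above, not Xen's exact
      code:

```c
#include <assert.h>

/* On pre-CLFLUSHOPT AMD parts CLFLUSH is weakly ordered, so an MFENCE on
 * each side forces the flush to be ordered with surrounding reads/writes
 * (x86-only, GCC/Clang inline asm). */
static inline void clflush_ordered(const void *p)
{
    __asm__ __volatile__ ("mfence; clflush %0; mfence"
                          :: "m" (*(const char *)p) : "memory");
}
```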
      
      This is part of XSA-402.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      master commit: 062868a5a8b428b85db589fa9a6d6e43969ffeb9
      master date: 2022-06-09 14:23:07 +0200
    • x86: Split cache_flush() out of cache_writeback() · 204d4f16
      Andrew Cooper authored

      Subsequent changes will want a fully flushing version.
      
      Use the new helper rather than opencoding it in flush_area_local().  This
      resolves an outstanding issue where the conditional sfence is on the wrong
      side of the clflushopt loop.  clflushopt is ordered with respect to older
      stores, not to younger stores.
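
      The corrected shape looks roughly like this; it is an illustration of
      the fence placement described above, not Xen's exact helper, and plain
      clflush is used as a portable stand-in for clflushopt:

```c
#include <stddef.h>

/* The fence belongs AFTER the flush loop: clflushopt is only ordered
 * with respect to older stores, so the sfence is what orders the flushes
 * before anything younger (x86-only, GCC/Clang inline asm). */
static void cache_writeback_demo(const void *p, size_t size)
{
    const char *addr = p;
    const char *end = addr + size;

    for ( ; addr < end; addr += 64 )
        __asm__ __volatile__ ("clflush %0" :: "m" (*addr));

    /* Fence on the correct side of the loop. */
    __asm__ __volatile__ ("sfence" ::: "memory");
}
```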
      
      Rename gnttab_cache_flush()'s helper to avoid colliding in name.
      grant_table.c can see the prototype from cache.h so the build fails
      otherwise.
      
      This is part of XSA-402.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      master commit: 9a67ffee3371506e1cbfdfff5b90658d4828f6a2
      master date: 2022-06-09 14:22:38 +0200
    • x86: Don't change the cacheability of the directmap · 07fbed87
      Andrew Cooper authored

      Changeset 55f97f49 ("x86: Change cache attributes of Xen 1:1 page mappings
      in response to guest mapping requests") attempted to keep the cacheability
      consistent between different mappings of the same page.
      
      The reason wasn't described in the changelog, but it is understood to be in
      regards to a concern over machine check exceptions, owing to errata when using
      mixed cacheabilities.  It did this primarily by updating Xen's mapping of the
      page in the direct map when the guest mapped a page with reduced cacheability.
      
      Unfortunately, the logic didn't actually prevent mixed cacheability from
      occurring:
       * A guest could map a page normally, and then map the same page with
         different cacheability; nothing prevented this.
       * The cacheability of the directmap was always latest-takes-precedence in
         terms of guest requests.
       * Grant-mapped frames with lesser cacheability didn't adjust the page's
         cacheattr settings.
       * The map_domain_page() function still unconditionally created WB mappings,
         irrespective of the page's cacheattr settings.
      
      Additionally, update_xen_mappings() had a bug where the alias calculation was
      wrong for mfn's which were .init content, which should have been treated as
      fully guest pages, not Xen pages.
      
      Worse yet, the logic introduced a vulnerability whereby necessary
      pagetable/segdesc adjustments made by Xen in the validation logic could become
      non-coherent between the cache and main memory.  The CPU could subsequently
      operate on the stale value in the cache, rather than the safe value in main
      memory.
      
      The directmap contains primarily mappings of RAM.  PAT/MTRR conflict
      resolution is asymmetric, and generally for MTRR=WB ranges, PAT of lesser
      cacheability resolves to being coherent.  The special case is WC mappings,
      which are non-coherent against MTRR=WB regions (except for fully-coherent
      CPUs).
      
      Xen must not have any WC cacheability in the directmap, to prevent Xen's
      actions from creating non-coherency.  (Guest actions creating non-coherency is
      dealt with in subsequent patches.)  As all memory types for MTRR=WB ranges
      inter-operate coherently, leave Xen's directmap mappings as WB.
      
      Only PV guests with access to devices can use reduced-cacheability mappings to
      begin with, and they're trusted not to mount DoSs against the system anyway.
      
      Drop PGC_cacheattr_{base,mask} entirely, and the logic to manipulate them.
      Shift the later PGC_* constants up, to gain 3 extra bits in the main reference
      count.  Retain the check in get_page_from_l1e() for special_pages() because a
      guest has no business using reduced cacheability on these.
      
      This reverts changeset 55f97f49.

      This is CVE-2022-26363, part of XSA-402.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: George Dunlap <george.dunlap@citrix.com>
      master commit: ae09597da34aee6bc5b76475c5eea6994457e854
      master date: 2022-06-09 14:22:08 +0200
    • x86/page: Introduce _PAGE_* constants for memory types · a72146db
      Andrew Cooper authored

      ... rather than opencoding the PAT/PCD/PWT attributes in __PAGE_HYPERVISOR_*
      constants.  These are going to be needed by forthcoming logic.
      
      No functional change.
      
      This is part of XSA-402.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      master commit: 1be8707c75bf4ba68447c74e1618b521dd432499
      master date: 2022-06-09 14:21:38 +0200
    • x86/pv: Fix ABAC cmpxchg() race in _get_page_type() · 758f40d7
      Andrew Cooper authored

      _get_page_type() suffers from a race condition where it incorrectly assumes
      that because 'x' was read and a subsequent cmpxchg() succeeds, the type
      cannot have changed in-between.  Consider:
      
      CPU A:
        1. Creates an L2e referencing pg
           `-> _get_page_type(pg, PGT_l1_page_table), sees count 0, type PGT_writable_page
        2.     Issues flush_tlb_mask()
      CPU B:
        3. Creates a writeable mapping of pg
           `-> _get_page_type(pg, PGT_writable_page), count increases to 1
        4. Writes into new mapping, creating a TLB entry for pg
        5. Removes the writeable mapping of pg
           `-> _put_page_type(pg), count goes back down to 0
      CPU A:
        7.     Issues cmpxchg(), setting count 1, type PGT_l1_page_table
      
      CPU B now has a writeable mapping to pg, which Xen believes is a pagetable and
      suitably protected (i.e. read-only).  The TLB flush in step 2 must be deferred
      until after the guest is prohibited from creating new writeable mappings,
      which is after step 7.
      
      Defer all safety actions until after the cmpxchg() has successfully taken the
      intended typeref, because that is what prevents concurrent users from using
      the old type.
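
      The pattern can be sketched with a miniature type/refcount word; the
      field layout and names below are illustrative assumptions, not Xen's
      real page_info, with a counter standing in for the TLB flush:

```c
#include <stdbool.h>
#include <stdint.h>

struct pg { uint32_t type_count; };     /* high bits: type, low: count */

#define TYPE_MASK   0xffff0000u
#define COUNT_MASK  0x0000ffffu

static unsigned int flushes;            /* stand-in for flush_tlb_mask() */

/* Take the typeref with cmpxchg FIRST, then perform safety actions: once
 * the new value is visible, no concurrent user can reuse the old type. */
static bool get_page_type_demo(struct pg *p, uint32_t type)
{
    uint32_t x = __atomic_load_n(&p->type_count, __ATOMIC_SEQ_CST);

    for ( ;; )
    {
        uint32_t nx;

        if ( (x & COUNT_MASK) == 0 )
            nx = type | 1;              /* free: take type + first ref */
        else if ( (x & TYPE_MASK) == type )
            nx = x + 1;                 /* same type: additional ref */
        else
            return false;               /* in use with a different type */

        if ( __atomic_compare_exchange_n(&p->type_count, &x, nx, false,
                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) )
        {
            /* Safety actions only AFTER the typeref is taken. */
            if ( (x & COUNT_MASK) == 0 && (x & TYPE_MASK) != type )
                flushes++;
            return true;
        }
        /* x was reloaded by the failed cmpxchg; retry. */
    }
}
```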
      
      Also remove the early validation for writeable and shared pages.  This removes
      race conditions where one half of a parallel mapping attempt can return
      successfully before:
       * The IOMMU pagetables are in sync with the new page type
       * Writeable mappings to shared pages have been torn down
      
      This is part of XSA-401 / CVE-2022-26362.
      
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      Reviewed-by: George Dunlap <george.dunlap@citrix.com>
      master commit: 8cc5036bc385112a82f1faff27a0970e6440dfed
      master date: 2022-06-09 14:21:04 +0200
    • x86/pv: Clean up _get_page_type() · c70071eb
      Andrew Cooper authored

      Various fixes for clarity, ahead of making complicated changes.
      
       * Split the overflow check out of the if/else chain for type handling, as
         it's somewhat unrelated.
       * Comment the main if/else chain to explain what is going on.  Adjust one
         ASSERT() and state the bit layout for validate-locked and partial states.
       * Correct the comment about TLB flushing, as it's backwards.  The problem
         case is when writeable mappings are retained to a page becoming read-only,
         as it allows the guest to bypass Xen's safety checks for updates.
       * Reduce the scope of 'y'.  It is an artefact of the cmpxchg loop and not
         valid for use by subsequent logic.  Switch to using ACCESS_ONCE() to treat
         all reads as explicitly volatile.  The only thing preventing the validated
         wait-loop being infinite is the compiler barrier hidden in cpu_relax().
       * Replace one page_get_owner(page) with the already-calculated 'd' in
         scope.
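
      ACCESS_ONCE() as used above follows the common Xen/Linux shape, a
      volatile-qualified read forcing the compiler to re-load from memory
      on every use:

```c
/* Force exactly one volatile access per evaluation, so a wait-loop
 * re-reads memory each iteration instead of caching a stale register. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int flag;

static int read_flag_once(void)
{
    return ACCESS_ONCE(flag);
}
```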
      
      No functional change.
      
      This is part of XSA-401 / CVE-2022-26362.
      
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      Reviewed-by: George Dunlap <george.dunlap@citrix.com>
      master commit: 9186e96b199e4f7e52e033b238f9fe869afb69c7
      master date: 2022-06-09 14:20:36 +0200
  5. 12 Apr, 2022 1 commit
  6. 08 Apr, 2022 7 commits
    • livepatch: avoid relocations referencing ignored section symbols · eeaf24cc
      Roger Pau Monné authored

      Track whether symbols belong to ignored sections in order to avoid
      applying relocations referencing those symbols. The address of such
      symbols won't be resolved and thus the relocation will likely fail or
      write garbage to the destination.
      
      Return an error in that case, as leaving unresolved relocations would
      lead to malfunctioning payload code.
      
      Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
      Tested-by: Bjoern Doebel <doebel@amazon.de>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
      master commit: 9120b5737f517fe9d2a3936c38d3a2211630323b
      master date: 2022-04-08 10:27:11 +0200
    • livepatch: do not ignore sections with 0 size · 97258d88
      Roger Pau Monné authored

      A side effect of ignoring such sections is that symbols belonging to
      them won't be resolved, and that could make relocations belonging to
      other sections that reference those symbols fail.
      
      For example it's likely to have an empty .altinstr_replacement with
      symbols pointing to it, and marking the section as ignored will
      prevent the symbols from being resolved, which in turn will cause any
      relocations against them to fail.
      
      In order to solve this do not ignore sections with 0 size, only ignore
      sections that don't have the SHF_ALLOC flag set.
      
      Special case such empty sections in move_payload so they are not taken
      into account in order to decide whether a livepatch can be safely
      re-applied after a revert.
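
      The revised ignore rule can be sketched as follows; the struct and
      function names are illustrative stand-ins, with only SHF_ALLOC taken
      from the ELF specification:

```c
#include <stdbool.h>
#include <stdint.h>

#define SHF_ALLOC 0x2u                  /* per the ELF specification */

struct sec_demo { uint64_t sh_flags; uint64_t sh_size; };

/* Ignore a section only when it lacks SHF_ALLOC, regardless of sh_size,
 * so symbols in empty-but-allocated sections still get resolved. */
static bool ignore_section(const struct sec_demo *s)
{
    return !(s->sh_flags & SHF_ALLOC);
}
```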
      
      Fixes: 98b728a7 ('livepatch: Disallow applying after an revert')
      Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
      Tested-by: Bjoern Doebel <doebel@amazon.de>
      Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
      master commit: 0dc1f929e8fed681dec09ca3ea8de38202d5bf30
      master date: 2022-04-08 10:24:10 +0200
    • vPCI: replace %pp · 019e56a0
      Jan Beulich authored

      4.14 doesn't know of this format specifier extension yet.
      
      Fixes: 47188b2f ("vpci/msix: fix PBA accesses")
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
    • x86/cpuid: Clobber CPUID leaves 0x800000{1d..20} in policies · 9c4d3fbf
      Andrew Cooper authored

      c/s 1a914256 increased the AMD max leaf from 0x8000001c to 0x80000021, but
      did not adjust anything in the calculate_*_policy() chain.  As a result, on
      hardware supporting these leaves, we read the real hardware values into the
      raw policy, then copy into host, and all the way into the PV/HVM default
      policies.
      
      All 4 of these leaves have enable bits (first two by TopoExt, next by SEV,
      next by PQOS), so any software following the rules is fine and will leave them
      alone.  However, leaf 0x8000001d takes a subleaf input and at least two
      userspace utilities have been observed to loop indefinitely under Xen (clearly
      waiting for eax to report "no more cache levels").
      
      Such userspace is buggy, but Xen's behaviour isn't great either.
      
      In the short term, clobber all information in these leaves.  This is a giant
      bodge, but there are complexities with implementing all of these leaves
      properly.
      
      Fixes: 1a914256 ("x86/cpuid: support LFENCE always serialising CPUID bit")
      Link: https://github.com/QubesOS/qubes-issues/issues/7392

      Reported-by: fosslinux <fosslinux@aussies.space>
      Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
      Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      master commit: d4012d50082c2eae2f3cbe7770be13b9227fbc3f
      master date: 2022-04-07 11:36:45 +0100
    • VT-d: avoid infinite recursion on domain_context_mapping_one() error path · 140a95dd
      Jan Beulich authored

      Despite the comment there infinite recursion was still possible, by
      flip-flopping between two domains. This is because prev_dom is derived
      from the DID found in the context entry, which was already updated by
      the time error recovery is invoked. Simply introduce yet another mode
      flag to prevent rolling back an in-progress roll-back of a prior
      mapping attempt.
      
      Also drop the existing recursion prevention for having been dead anyway:
      Earlier in the function we already bail when prev_dom == domain.
      
      Fixes: 8f41e481b485 ("VT-d: re-assign devices directly")
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
      master commit: 99d829dba1390b98a3ca07b365713e62182ee7ca
      master date: 2022-04-07 12:31:16 +0200
    • VT-d: avoid NULL deref on domain_context_mapping_one() error paths · 78630ac4
      Jan Beulich authored

      First there's a printk() which actually wrongly uses pdev in the first
      place: We want to log the coordinates of the (perhaps fake) device
      acted upon, which may not be pdev.
      
      Then it was quite pointless for eb19326a328d ("VT-d: prepare for per-
      device quarantine page tables (part I)") to add a domid_t parameter to
      domain_context_unmap_one(): It's only used to pass back here via
      me_wifi_quirk() -> map_me_phantom_function(). Drop the parameter again.
      
      Finally there's the invocation of domain_context_mapping_one(), which
      needs to be passed the correct domain ID. Avoid taking that path when
      pdev is NULL and the quarantine state is what would need restoring to.
      This means we can't security-support non-PCI-Express devices with RMRRs
      (if such exist in practice) any longer; note that as of the 1st of the
      two commits referenced below assigning them to DomU-s is unsupported
      anyway.
      
      Fixes: 8f41e481b485 ("VT-d: re-assign devices directly")
      Fixes: 14dd241aad8a ("IOMMU/x86: use per-device page tables for quarantining")
      Coverity ID: 1503784
      Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
      master commit: 608394b906e71587f02e6662597bc985bad33a5a
      master date: 2022-04-07 12:30:19 +0200
    • VT-d: don't needlessly look up DID · d3568578
      Jan Beulich authored

      If get_iommu_domid() in domain_context_unmap_one() fails, we had better
      not clear the context entry in the first place, as we're then unable
      to issue the corresponding flush. However, we have no need to look up the
      DID in the first place: What needs flushing is very specifically the DID
      that was in the context entry before our clearing of it.
      
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
      master commit: 445ab9852d69d8957467f0036098ebec75fec092
      master date: 2022-04-07 12:29:03 +0200
  7. 07 Apr, 2022 8 commits