
Improve RPM header comparison by handling HEADERIMMUTABLE/HEADERSIGNATURES fields

Problem Description

When comparing RPM packages with diffoscope, the current python-rpm based implementation produces misleading output for two tags, HEADERIMMUTABLE (RPMTAG_HEADERIMMUTABLE, tag 63) and HEADERSIGNATURES (RPMTAG_HEADERSIGNATURES, tag 62):

  1. Excessive Data Expansion
    The underlying python-rpm library expands these tags into full immutable region data (including header index count, total byte length, index entries, and raw binary content) instead of showing the actual 16-byte field value. This results in binary outputs often spanning hundreds of kilobytes per field.

  2. Causes False Rebuild Differences

    • The expanded data contains volatile build-metadata fragments (timestamps, build IDs, file paths).
    • Minor metadata changes therefore show up as large binary differences, creating false positives during RPM rebuild comparisons.
    • The sheer volume of output obscures genuine package differences.

  3. Inconsistent with Standard RPM Representation
    Tools like rpm -q --qf and rpmfile correctly represent these fields as compact 16-byte values, while python-rpm's expansion behavior serves specialized verification purposes not needed for diffoscope's use case.
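
For context, here is a minimal sketch of what such a compact 16-byte value could look like once decoded. It assumes the value follows the usual RPM index-entry layout of four big-endian 32-bit integers (tag, type, offset, count). The function name `decode_region_trailer` and the sample bytes are hypothetical and not taken from python-rpm, rpmfile, or diffoscope.

```python
import struct

def decode_region_trailer(raw: bytes) -> dict:
    """Decode a 16-byte region value, assuming the index-entry layout."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    # Four signed big-endian 32-bit integers: tag, type, offset, count.
    tag, type_, offset, count = struct.unpack(">iiii", raw)
    return {"tag": tag, "type": type_, "offset": offset, "count": count}

# Hypothetical sample value: tag 63, type 7 (binary), offset -80, count 16.
sample = struct.pack(">iiii", 63, 7, -80, 16)
print(decode_region_trailer(sample))
```

A value of this size diffs as a single short line, which is the behavior the two options below aim for.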

Proposed Solutions

Option 1: Switch to rpmfile-based Parsing

Replace python-rpm with python-rpmfile for header extraction:

import rpmfile  

def get_rpm_header(path):  
    with rpmfile.open(path) as rpm:  
        return rpm.headers  

Benefits:

  • Directly provides the compact 16-byte values for the two region tags (62/63)
  • Eliminates the need for special-case handling
  • Maintains full header fidelity without expansion artifacts
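
As a rough illustration of how headers obtained this way might be compared, the sketch below diffs two plain dictionaries. It assumes the headers behave like ordinary Python mappings. The helper `diff_headers` and the sample values are hypothetical stand-ins, so the snippet runs without rpmfile installed.

```python
def diff_headers(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    changed = {}
    # Sort by string form so mixed key types cannot break the ordering.
    for key in sorted(set(old) | set(new), key=str):
        a, b = old.get(key), new.get(key)
        if a != b:
            changed[key] = (a, b)
    return changed

# Hypothetical header dicts standing in for two parsed packages.
first = {"name": b"hello", "version": b"1.0", "buildtime": 1700000000}
second = {"name": b"hello", "version": b"1.1", "buildtime": 1700000500}
print(diff_headers(first, second))
```

With compact region values, a changed HEADERIMMUTABLE entry would surface here as one small tuple rather than a multi-hundred-kilobyte blob.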

Option 2: Add Summary Mode for Problematic Tags

Modify get_rpm_header() to summarize instead of expanding tags 62/63:

def get_rpm_header(path, ts):
    ...
    for rpmtag in sorted(rpm.tagnames):
        # Summarize the two region tags (62/63) instead of dumping the expanded blob.
        if rpmtag in (rpm.RPMTAG_HEADERSIGNATURES, rpm.RPMTAG_HEADERIMMUTABLE):
            region = hdr[rpmtag]
            # Assumed layout: two big-endian 32-bit integers lead the expanded data.
            idx_count = int.from_bytes(region[:4], 'big')  # First 4 bytes = index count
            data_len = int.from_bytes(region[4:8], 'big')  # Next 4 bytes = data length
            s.write(f"{rpm.tagnames[rpmtag]}: [region: {idx_count} indexes, {data_len} bytes]\n")
            continue
        ... # Default processing

Benefits:

  • Reduces output size from hundreds of KB to <100 bytes per field
  • Preserves structural metadata about the immutable region
  • Maintains compatibility with existing python-rpm dependency
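
The summarizing step can be tried in isolation. The following is a self-contained sketch under the same assumption as the snippet above, namely that the expanded region starts with a 4-byte index count followed by a 4-byte data length, both big-endian. The helper name `summarize_region`, the short-input guard, and the sample blob are hypothetical additions for illustration.

```python
import struct

def summarize_region(name: str, region: bytes) -> str:
    """Return a one-line summary of an expanded region blob."""
    if len(region) < 8:
        # Too short to hold the two leading counters; report the size only.
        return f"{name}: [region: {len(region)} bytes, too short to parse]"
    # Assumed layout: index count, then data length, as unsigned big-endian ints.
    idx_count, data_len = struct.unpack(">II", region[:8])
    return f"{name}: [region: {idx_count} indexes, {data_len} bytes]"

# Hypothetical blob: 3 index entries (16 bytes each) plus 120 bytes of data.
blob = struct.pack(">II", 3, 120) + b"\x00" * (3 * 16 + 120)
print(summarize_region("Headerimmutable", blob))
```

Keeping the summary in a small pure function would also make it easy to unit test without real RPM files.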

Why This Matters

  • Reduces Noise: Eliminates false differences caused by volatile build metadata
  • Improves Performance: Avoids processing multi-hundred KB binary blobs during comparisons
  • Enhances Accuracy: Focuses comparison on semantically meaningful header data

Offer to Contribute

I've verified this issue using real RPM packages and analyzed both python-rpm and RPM C library behaviors. I'm prepared to implement either solution via PR and welcome maintainer guidance on the preferred approach:

  1. Complete migration to rpmfile (cleaner long-term solution)
  2. Summary mode for specific tags with python-rpm (minimal-impact fix)

Happy to provide sample outputs, test cases, or refine implementation based on feedback!

Edited by Daniel Duan