Improve RPM header comparison by handling HEADERIMMUTABLE/HEADERSIGNATURES fields

Problem Description

When comparing RPM packages using diffoscope, the current python-rpm based implementation generates misleading output for tags HEADERIMMUTABLE (RPMTAG_HEADERIMMUTABLE, tag 63) and HEADERSIGNATURES (RPMTAG_HEADERSIGNATURES, tag 62):

  1. Excessive Data Expansion
    The underlying python-rpm library expands these tags into full immutable region data (including header index count, total byte length, index entries, and raw binary content) instead of showing the actual 16-byte field value. This results in binary outputs often spanning hundreds of kilobytes per field.

  2. False Rebuild Differences

    • The expanded data contains volatile build-metadata fragments (timestamps, build IDs, file paths).
    • Minor metadata changes therefore appear as large binary differences, creating false positives during RPM rebuild comparisons.
    • The sheer volume of output obscures genuine package differences.

  3. Inconsistent with Standard RPM Representation
    Tools like rpm -q --qf and rpmfile correctly represent these fields as compact 16-byte values, while python-rpm's expansion behavior serves specialized verification purposes not needed for diffoscope's use case.
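
For reference, the compact 16-byte value has the same layout as a single RPM header index entry: four big-endian 32-bit integers (tag, type, offset, count). Below is a minimal sketch of decoding such a value. `parse_region_trailer` is a hypothetical helper name (it is not part of python-rpm or rpmfile), the sample bytes are synthetic, and the signed offset reflects my understanding that region trailers store a negative offset.

```python
import struct

RPMTAG_HEADERSIGNATURES = 62
RPMTAG_HEADERIMMUTABLE = 63


def parse_region_trailer(value: bytes) -> dict:
    """Decode a 16-byte region value as one index entry (tag, type, offset, count)."""
    if len(value) != 16:
        raise ValueError(f"expected 16 bytes, got {len(value)}")
    # Four big-endian 32-bit fields; the offset is unpacked as signed (assumption:
    # region trailers carry a negative offset).
    tag, type_, offset, count = struct.unpack(">IIiI", value)
    return {"tag": tag, "type": type_, "offset": offset, "count": count}


# Synthetic example value, not taken from a real package.
sample = struct.pack(">IIiI", RPMTAG_HEADERIMMUTABLE, 7, -160, 16)
print(parse_region_trailer(sample))
```

Printing the decoded fields instead of the expanded region keeps the output to one short line per tag.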

Proposed Solutions

Option 1: Switch to rpmfile-based Parsing

Replace python-rpm with python-rpmfile for header extraction:

import rpmfile

def get_rpm_header(path):
    # rpmfile exposes the parsed header fields on the opened file object
    with rpmfile.open(path) as rpm:
        return rpm.headers

Benefits:

  • Directly provides the compact 16-byte values for the HEADERSIGNATURES and HEADERIMMUTABLE fields
  • Eliminates special-case handling needs
  • Maintains full header fidelity without expansion artifacts
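
With Option 1, the header data would still need to be rendered as stable text before diffing. A rough sketch of that step follows, assuming `rpm.headers` is a plain dict mapping tag names to values (bytes, ints, or lists), which is how rpmfile appears to expose it. `format_headers` is a hypothetical helper and the sample dict is made up.

```python
def format_headers(headers: dict) -> str:
    """Render a header dict as sorted 'name: value' lines suitable for a text diff."""
    lines = []
    for name in sorted(headers, key=str):
        value = headers[name]
        if isinstance(value, bytes):
            # Decode text where possible; fall back to hex for binary payloads.
            try:
                value = value.decode("utf-8")
            except UnicodeDecodeError:
                value = value.hex()
        lines.append(f"{name}: {value}")
    return "\n".join(lines) + "\n"


# Made-up header values for illustration only.
sample = {"version": b"1.0", "name": b"hello", "sigmd5": b"\xff\x00"}
print(format_headers(sample), end="")
```

Sorting the keys keeps the output order deterministic, so two builds of the same package produce line-aligned text.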

Option 2: Add Summary Mode for Problematic Tags

Modify get_rpm_header() to summarize tags 62/63 instead of expanding them:

def get_rpm_header(path, ts):
    ...
    for rpmtag in sorted(rpm.tagnames):
        if rpmtag not in hdr:
            continue  # Skip tags absent from this header
        if rpmtag in (rpm.RPMTAG_HEADERSIGNATURES, rpm.RPMTAG_HEADERIMMUTABLE):
            region = hdr[rpmtag]
            idx_count = int.from_bytes(region[:4], 'big')  # First 4 bytes = index count
            data_len = int.from_bytes(region[4:8], 'big')  # Next 4 bytes = data length
            s.write(f"{rpm.tagnames[rpmtag]}: [region: {idx_count} indexes, {data_len} bytes]\n")
            continue
        ...  # Default processing

Benefits:

  • Reduces output size from hundreds of KB to <100 bytes per field
  • Preserves structural metadata about the immutable region
  • Maintains compatibility with existing python-rpm dependency
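
The summarizing step could also be pulled out into a small helper so it can be tested without python-rpm installed. This is only a sketch: it assumes the expanded value begins with two big-endian 32-bit integers (index count, then data length) as described in the problem description, `summarize_region` is a hypothetical name, and the fallback for short values is an extra safeguard rather than observed behavior.

```python
def summarize_region(name: str, region: bytes) -> str:
    """Return a one-line summary of an expanded region blob."""
    if len(region) < 8:
        # Not enough bytes for the two leading counters; report the size only.
        return f"{name}: [region: {len(region)} bytes, too short to parse]\n"
    idx_count = int.from_bytes(region[:4], "big")  # First 4 bytes = index count
    data_len = int.from_bytes(region[4:8], "big")  # Next 4 bytes = data length
    return f"{name}: [region: {idx_count} indexes, {data_len} bytes]\n"


# Synthetic blob: 10 index entries, 2048 data bytes declared, dummy payload.
blob = (10).to_bytes(4, "big") + (2048).to_bytes(4, "big") + b"\x00" * 32
print(summarize_region("HEADERIMMUTABLE", blob), end="")
```

Inside get_rpm_header(), the special-case branch would then reduce to a single s.write(summarize_region(...)) call.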

Why This Matters

  • Reduces Noise: Eliminates false differences caused by volatile build metadata
  • Improves Performance: Avoids processing multi-hundred KB binary blobs during comparisons
  • Enhances Accuracy: Focuses comparison on semantically meaningful header data

Offer to Contribute

I've verified this issue using real RPM packages and analyzed both python-rpm and RPM C library behaviors. I'm prepared to implement either solution via PR and welcome maintainer guidance on the preferred approach:

  1. Complete migration to rpmfile (cleaner long-term solution)
  2. Summary mode for specific tags with python-rpm (minimal-impact fix)

Happy to provide sample outputs, test cases, or refine implementation based on feedback!

Edited by Daniel Duan