Improve RPM header comparison by handling HEADERIMMUTABLE/HEADERSIGNATURES fields
Problem Description
When comparing RPM packages using diffoscope, the current python-rpm based implementation generates misleading output for the tags HEADERIMMUTABLE (RPMTAG_HEADERIMMUTABLE, tag 63) and HEADERSIGNATURES (RPMTAG_HEADERSIGNATURES, tag 62):
Excessive Data Expansion
The underlying python-rpm library expands these tags into full immutable region data (including header index count, total byte length, index entries, and raw binary content) instead of showing the actual 16-byte field value. This results in binary outputs often spanning hundreds of kilobytes per field.
Causes False Rebuild Differences
- The expanded data contains volatile build-metadata fragments (timestamps, build IDs, file paths)
- This creates false positives during RPM rebuild comparisons as minor metadata changes appear as large binary differences
- Obscures genuine package differences due to excessive output size
Inconsistent with Standard RPM Representation
Tools like rpm -q --qf and rpmfile correctly represent these fields as compact 16-byte values, while python-rpm's expansion behavior serves specialized verification purposes not needed for diffoscope's use case.
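To make the size contrast concrete, here is a minimal sketch of decoding such a compact 16-byte value. It assumes the value follows the standard RPM header index-entry layout of four big-endian 32-bit integers (tag, type, offset, count). The names `RegionEntry` and `decode_region_entry` are hypothetical, and the sample bytes are built by hand rather than taken from a real package.

```python
import struct
from typing import NamedTuple


class RegionEntry(NamedTuple):
    tag: int
    type: int
    offset: int
    count: int


def decode_region_entry(raw: bytes) -> RegionEntry:
    """Decode a 16-byte region entry (assumed layout: four big-endian int32)."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    return RegionEntry(*struct.unpack(">iiii", raw))


# Illustrative values only: tag 63, type 7 (binary), a negative offset, count 16.
sample = struct.pack(">iiii", 63, 7, -160, 16)
entry = decode_region_entry(sample)
print(entry)
```

A four-field summary like this is small enough to diff line by line, which is the representation the comparison output would ideally show.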
Proposed Solutions
Option 1: Switch to rpmfile-based Parsing
Replace python-rpm with python-rpmfile for header extraction:
import rpmfile

def get_rpm_header(path):
    # rpmfile exposes the parsed header as a mapping on the opened package.
    with rpmfile.open(path) as rpm:
        return rpm.headers
Benefits:
- Directly provides the canonical 16-byte values for the HEADERIMMUTABLE and HEADERSIGNATURES fields
- Eliminates special-case handling needs
- Maintains full header fidelity without expansion artifacts
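If Option 1 were adopted, the header mapping would still need to be rendered as stable text before diffing. The sketch below assumes the headers behave like a plain mapping from names to values (bytes, integers, strings, or lists). `format_headers` is a hypothetical helper and the sample dictionary is made up, so the real rpmfile value types should be checked before relying on it.

```python
def format_headers(headers):
    """Render a header mapping as sorted 'name: value' lines for diffing."""
    lines = []
    for name in sorted(headers, key=str):
        value = headers[name]
        if isinstance(value, (bytes, bytearray)):
            try:
                # Prefer readable text when the bytes are valid UTF-8.
                value = value.decode("utf-8")
            except UnicodeDecodeError:
                # Fall back to hex so binary values stay compact and textual.
                value = value.hex()
        lines.append(f"{name}: {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample data, not read from a real package.
sample = {"b": b"\xff\x00", "a": b"x", "c": 5}
print(format_headers(sample), end="")
```

Sorting the keys keeps the output order independent of how the mapping was built, so two rebuilds of the same package would produce lines in the same order.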
Option 2: Add Summary Mode for Problematic Tags
Modify get_rpm_header() to summarize instead of expanding tags 62/63:
def get_rpm_header(path, ts):
    ...
    for rpmtag in sorted(rpm.tagnames):
        if rpmtag in (rpm.RPMTAG_HEADERSIGNATURES, rpm.RPMTAG_HEADERIMMUTABLE):
            region = hdr[rpmtag]
            idx_count = int.from_bytes(region[:4], 'big')  # First 4 bytes = index count
            data_len = int.from_bytes(region[4:8], 'big')  # Next 4 bytes = data length
            s.write(f"{rpm.tagnames[rpmtag]}: [region: {idx_count} indexes, {data_len} bytes]\n")
            continue
        ...  # Default processing
Benefits:
- Reduces output size from hundreds of KB to <100 bytes per field
- Preserves structural metadata about the immutable region
- Maintains compatibility with existing python-rpm dependency
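The summarizing step can also be pulled out into a standalone function so it can be tested without python-rpm installed. This is a sketch only: `summarize_region` is a hypothetical name, the blob is synthetic, and it assumes (as the snippet above does) that the expanded data starts with two big-endian 32-bit integers, index count then data length.

```python
def summarize_region(name: str, region: bytes) -> str:
    """Return a one-line summary of an expanded region blob.

    Assumes the blob starts with two big-endian 32-bit integers:
    the index count followed by the data length.
    """
    if len(region) < 8:
        # Too short to hold both counters; report the raw size instead.
        return f"{name}: [region: {len(region)} bytes, unparsed]\n"
    idx_count = int.from_bytes(region[:4], "big")
    data_len = int.from_bytes(region[4:8], "big")
    return f"{name}: [region: {idx_count} indexes, {data_len} bytes]\n"


# Synthetic blob: 10 index entries, 2048 data bytes, then padding.
blob = (10).to_bytes(4, "big") + (2048).to_bytes(4, "big") + b"\x00" * 32
print(summarize_region("Headerimmutable", blob), end="")
```

The length check is a design choice for robustness: a truncated or unexpected value yields a short diagnostic line instead of an exception in the middle of a comparison.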
Why This Matters
- Reduces Noise: Eliminates false differences caused by volatile build metadata
- Improves Performance: Avoids processing multi-hundred KB binary blobs during comparisons
- Enhances Accuracy: Focuses comparison on semantically meaningful header data
Offer to Contribute
I've verified this issue using real RPM packages and analyzed both python-rpm and RPM C library behaviors. I'm prepared to implement either solution via PR and welcome maintainer guidance on the preferred approach:
- Complete migration to rpmfile (cleaner long-term solution)
- Summary mode for specific tags with python-rpm (minimal-impact fix)
Happy to provide sample outputs, test cases, or refine implementation based on feedback!