Skip to content
Commits on Source (8)
# VarMatch
robust matching of small variant datasets using flexible scoring schemes
# [New] VarMatch for multiple datasets
The GIAB branch of VarMatch can compare complex variants from multiple (>=3) datasets.
To download the GIAB branch, use command ```git clone -b giab https://github.com/medvedevgroup/varmatch.git```
# Authors
- Chen Sun (The Pennsylvania State University)
- Paul Medvedev (The Pennsylvania State University)
# Release Date
### TBA
Any questions about VarMatch, please email to chensun at cse dot psu dot edu.
If you identify a bug in VarMatch, please either reported on 'github Issues' of VarMatch, or email directly to chensun at cse dot psu dot edu.
# Prerequisite
- GCC 4.7 or later for c++11 support
- Python 2.7 or later
......@@ -23,7 +20,7 @@ If you identify a bug in VarMatch, please either reported on 'github Issues' of
> *matplotlib is not a prerequisite if either `-f`, `-G` or `-C` parameter is used
# Installation
**Quick Install Instruction:**
### Quick Install Instruction:
You can build VarMatch from source.
```
git clone https://github.com/medvedevgroup/varmatch.git
......@@ -31,13 +28,25 @@ cd varmatch
make all
```
### Uninstall
`cd` to the directory of VarMatch
```
make clean
```
# Test Data Set
- Links to a test data set (~15M) : [https://github.com/medvedevgroup/varmatch/blob/master/test_data.txt](https://github.com/medvedevgroup/varmatch/blob/master/test_data.txt)
- Links to the data used for bencharmking in the paper: [https://github.com/medvedevgroup/varmatch/blob/master/data.txt](https://github.com/medvedevgroup/varmatch/blob/master/data.txt)
# Usage
### Quick Usage:
*compare two vcf files to match variants*
```
./varmatch -b baseline.vcf -q query.vcf -g ref.fa -o out -f
./varmatch -b baseline.vcf -q query.vcf -g ref.fa -o output -f
```
- `-b` baseline vcf file
- `-q` query vcf file
......@@ -45,7 +54,11 @@ make all
- `-o` output file prefix, default value is `out`
- `-f` fast mode*, equivalent to use parameters `-u 0 -m 0 -s 0 -C`
>*fast mode is suggested for ordinary analysis
>*fast mode is suggested for ordinary analysis.
>VarMatch accept baseline and query in VCF file format (e.g. xx.vcf), it does not accept gz file (e.g. xx.vcf.gz) in current version.
>see [Results of VarMatch](#results-of-varmatch) section for intepretation of results in output directory.
### Detail Usage
......@@ -116,9 +129,11 @@ Where:
`-G`, `--no_graph`
disable graphic module
`-C`, `--disable_curves`
disable Precision-Recall curves, if use -G or --no_graph, then
automatically disable these curves
`-f`, `--fast_mode`
In this mode, automatically disable graphic module and precision-
recall curves, only performs one matching criterion.
......@@ -130,3 +145,80 @@ Where:
use `-h/--help` for detailed help message.
# Results of VarMatch
### varmatch (.match) file
You can find varmatch (.match) files in VarMatch output directory, filename is in the format of query`x`.`u`\_`m`\_`s`.match
- `x` is the id of queries.
- `u` is the value of parameter `-u`, `--score_unit`
- `m` is the value of parameter `-m`, `--match_mode`
- `s` is the value of parameter `-s`, `--score_scheme`
For instance, if you use one query VCF file and use `-f` parameter, there is query1.0_0_0.match in your output file.
varmatch file contains the information of matched VCF entries from baseline and query VCF file.
Lines in varmatch file started with `#` are comment lines. They contain general information of baseline and query VCF file, and also general information of varmatch file. The first 2 lines of .match file starts with `###`:
- `###VCF1` is the baseline VCF filename
- `###VCF2` is the query VCF filename
varmatch files contains at least 9 fields:
1. `CHROM` field represents chromosome ID
2. `POS` field represents genome position on reference genome sequence
3. `REF` field represents reference allele
4. `ALT` field represents alternative alleles, multiple alleles are separated by `/`
5. `VCF1` field represents variants from baseline. If it is a direct match, this column is `.`. If it is not a direct match, this column contains variants separated by `;`. Each variant contains three information: reference genome position, reference allele, alternative alleles.
6. `VCF2` field represents variants from query. If it is a direct match, this column is `.`.
7. `PHASE1` field represents phasing information of variants from baseline. If it is a direct match, this column is `.`.
8. `PHASE2` field represents phasing information of variants from query. If it is a direct match, this column is `.`.
9. `SCORE` field represents the total score of variants from baseline and query, calculated based on given score unit, and score scheme.
The meaning of each line in varmatch file:
> variants in `VCF1` column is equivalent to variants in `VCF2` column. If applying them separately on `REF` sequence, which is a substring of reference genome sequence starts at position `POS` of chromosome `CHROM`, can get the same donor sequences in `ALT` column.
The phasing information of variants in `VCF1` and `VCF2` are separately in `PHASE1` and `PHASE2`.
varmatch (.match) file format gives a standard representation of equivalent variants, especially for complex variants.
If you have any suggestions of improving .match file format, please [contact me](#contact).
### .stat file
It contains some statistical information.
# License
VarMatch is now under Apache2 license!
See [license.txt](https://github.com/medvedevgroup/varmatch/blob/master/license.txt)
# Contact
chensun@cse.psu.edu
You also can report bugs or suggest features using issue tracker at GitHub [https://github.com/medvedevgroup/varmatch](https://github.com/medvedevgroup/varmatch)
# Acknowledgements
If using VarMatch, please cite:
Sun, C., & Medvedev, P. (2016). VarMatch: robust matching of small variant datasets using flexible scoring schemes. Bioinformatics, 33(9), 1301-1308.
Corresponding BiBTex:
```
@article{sun2016varmatch,
title={VarMatch: robust matching of small variant datasets using flexible scoring schemes},
author={Sun, Chen and Medvedev, Paul},
journal={Bioinformatics},
volume={33},
number={9},
pages={1301--1308},
year={2016},
publisher={Oxford University Press}
}
```
This project has been supported in part by NSF awards DBI-1356529, CCF-1439057, IIS-1453527, and IIS-1421908.
This file contains raw data needed for benchmarking in the paper.
# Reference Genome Sequence GRCh37
> You can download GRCh37 reference genome sequence in either of the following ftp sites:
### ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/
### ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/BUILD.37.3/
> The reference genome sequence GRCh37 is stored in these ftp sites by chromosomes. If you want to perform whole genome analysis, a single integrated FASTA file is needed.
# CHM1 Data Set:
### Bowtie2-Freebayes
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20140312_chm1_alignment_heng/vcfs/CHM1.bt2.fb.vcf.gz
### Bowtie2-HC
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20140312_chm1_alignment_heng/vcfs/CHM1.bt2.hc.vcf.gz
# NA12878 Data Set:
### GIAB benchmark (Version 2.18)
ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/analysis/NIST_union_callsets_06172013/NISTIntegratedCalls_14datasets_131103_allcall_UGHapMerge_HetHomVarPASS_VQSRv2.18_all_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs.vcf.gz
### If previous link is disabled, go to the following directory for GIAB benchmark history version:
ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/analysis/NIST_union_callsets_06172013/
### Bowtie2-Freebayes
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20140312_chm1_alignment_heng/vcfs/NA12878.bt2.fb.vcf.gz
### Bowtie2-UG
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20140312_chm1_alignment_heng/vcfs/NA12878.bt2.ug.vcf.gz
### Bowtie2-SAMtools
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20140312_chm1_alignment_heng/vcfs/NA12878.bt2.st.vcf.gz
### Bowtie2-Platypus
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20140312_chm1_alignment_heng/vcfs/NA12878.bt2.pt.vcf.gz
### BWA_MEM-Freebayes
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20140312_chm1_alignment_heng/vcfs/NA12878.mem.fb.vcf.gz
\ No newline at end of file
varmatch (0+20160708+dfsg-1) UNRELEASED; urgency=medium
varmatch (0+git20180126.5c6fed5+dfsg-1) UNRELEASED; urgency=medium
* Initial release (Closes: #nnnn)
TODO: Some C++ syntax errors with g++-8
-- Afif Elghraoui <afif@debian.org> Tue, 12 Jul 2016 00:39:51 -0700
-- Afif Elghraoui <afif@debian.org> Tue, 30 Oct 2018 14:49:35 +0100
Source: varmatch
Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.org>
Section: science
Priority: optional
Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.org>
Build-Depends:
debhelper (>=9),
libtclap-dev,
Standards-Version: 3.9.8
Build-Depends: debhelper (>= 11~),
libtclap-dev
Standards-Version: 4.2.1
Vcs-Browser: https://salsa.debian.org/med-team/varmatch
Vcs-Git: https://salsa.debian.org/med-team/varmatch.git
Homepage: https://github.com/medvedevgroup/varmatch
Vcs-Git: https://anonscm.debian.org/git/debian-med/varmatch.git
Vcs-Browser: https://anonscm.debian.org/cgit/debian-med/varmatch.git
Package: varmatch
Architecture: any
Depends:
${shlibs:Depends},
Depends: ${shlibs:Depends},
${misc:Depends},
python,
python-matplotlib,
python-matplotlib
Description: robust matching of small genomic variant datasets
Small variant calling is an important component of many analyses, and, in
many instances, it is important to determine the set of variants which appear
......
......@@ -3,14 +3,14 @@ Author: Afif Elghraoui <afif@debian.org>
Forwarded: no
Last-Update: 2016-07-12
--- varmatch.orig/varmatch
+++ varmatch/varmatch
@@ -556,7 +556,7 @@
--- a/varmatch
+++ b/varmatch
@@ -565,7 +565,7 @@ def main():
check_compare_command = True
- script_path = sys.path[0]
+ script_path = "/usr/lib/varmatch"
compare_tool = script_path + '/vm-core'
compare_tool = script_path + '/src/vm-core'
output_dir = ''
temp_dir = ''
version=4
opts="mode=git,pretty=0+git%cd.%h,repacksuffix=+dfsg,dversionmangle=auto,repack,compression=xz" \
https://github.com/medvedevgroup/varmatch.git HEAD
# Issue asking for release tags:
# https://github.com/medvedevgroup/varmatch/issues/12
\ No newline at end of file
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
t = np.arange(0., 5., 0.2)
plt.plot(t,t,'r-o')
plt.plot(t,t/2, 'g-^')
plt.savefig('xx.png')
1 /home/varmatch/human/chr1.fa
2 /home/varmatch/human/chr2.fa
17 /home/varmatch/human/backup/chr17.fa
X /home/varmatch/human/chrxx.fa
Y /home/anotherpath/human/chrY/human.y.fa
\ No newline at end of file
#!/usr/bin/env python
from sys import argv
import argparse
import math
import scipy.stats as stats
citation = 'Please cite our paper'
parser = argparse.ArgumentParser(epilog=citation)
parser.add_argument('--qu', metavar='N', help='quality number(QUAL) threshold >= N (default: N=30)', default=30)
parser.add_argument('--ab', metavar='N', help='allele balance(AB) threshold <= N%% (default: N=20)', default=20)
parser.add_argument('--fs', metavar='N', help='Fisher strand P-vale <= N (default: N=0.001)', default=0.001)
parser.add_argument('--rd', metavar='N', default=65,
help="average read depth=N, maximum read depth(MD) threshold >= N+4*sqrt(N) (default: N=65),"
" use --rd 0 to disable MD filter")
parser.add_argument('-i', metavar='input.vcf', help='input VCF file')
parser.add_argument('-o', metavar='output.vcf', help='output VCF file name(default: output.vcf)', default='output.vcf')
parser.add_argument('--homo', action='store_true', help='filter out homozygous variants')
parser.add_argument('--nf', action='store_true', help="no filters used in Heng Li review")
parser.add_argument('--snp', action='store_true', help="only want SNPs")
parser.add_argument('--indel', action='store_true', help='only want INDELs')
args = parser.parse_args()
def main():
if len(argv) < 2:
parser.print_help()
exit()
filter_homo = args.homo
if not filter_homo:
print ('Warning: compulsively filter out homozygous variants :)')
filter_homo = True
md = 0 # maximum depth filter
if args.rd != 0:
md = args.rd + 4 * math.sqrt(args.rd)
else:
print ('Warning: maximum depth(MD) filter is disabled because read depth = 0')
output_file = open(args.o, 'w')
with open(args.i) as input_file:
for line in input_file.readlines():
qu_fail = False
ab_fail = False
fs_fail = False
md_fail = False
if line.startswith('#'):
output_file.write(line)
continue
columns = line.split('\t')
if len(columns) < 8:
print ('Warning: current variant does not contains enough info for filtering')
continue
ab_contain = False
ab_pass = True
two_alleles = False
pv = 1.0
rd = -1
srf = -1
srr = -1
saf_list = []
sar_list = []
alt = columns[4]
if ',' in alt:
two_alleles = True
ref = columns[3]
is_indel = False
for a in alt.split(','):
if len(ref) != len(a):
is_indel = True
if args.snp and is_indel:
continue
if args.indel and not is_indel:
continue
# Filter out homozygous
if filter_homo:
if len(columns) < 10:
print('Warning: variant does not contain enough info to filter homozygous variants')
format_col = columns[8].split(':')
gt_index = -1
for i in range(len(format_col)):
if format_col[i] == 'GT':
gt_index = i
if gt_index == -1:
print ('Warning: variant does not contain genotype info')
continue
val_col = columns[9].split(':')
gt_val = val_col[gt_index]
gt_col = []
if '/' in gt_val:
gt_col = gt_val.split('/')
elif '|' in gt_val:
gt_col = gt_val.split('|')
else:
print ('Warning: unrecognized genotype info')
continue
if gt_col[0] == gt_col[1]:
continue
if args.nf:
output_file.write(line)
continue
quality_num = float(columns[5])
# quality filter(QU)
if quality_num < args.qu:
qu_fail = True
if not qu_fail:
output_file.write(line)
continue
info_col = columns[7].split(';')
for info in info_col:
val_col = info.split('=')
info_name = val_col[0]
info_val = val_col[1]
if info_name == 'AB':
ab_contain = True
if two_alleles:
ab_col = info_val.split(',')
for ab in ab_col:
if float(ab) > args.ab * 0.01:
ab_pass = False
else:
if float(info_val) > args.ab * 0.01:
ab_pass = False
elif info_name == 'DP':
rd = int(info_val)
elif info_name == 'SRF':
srf = int(info_val)
elif info_name == 'SRR':
srr = int(info_val)
elif info_name == 'SAF':
if two_alleles:
temp_list = info_val.split(',')
saf_list = [int(temp_list[0]), int(temp_list[1])]
else:
saf_list = [int(info_val)]
elif info_name == 'SAR':
if two_alleles:
temp_list = info_val.split(',')
sar_list = [int(temp_list[0]), int(temp_list[1])]
else:
sar_list = [int(info_val)]
# AB filter
if not ab_contain or not ab_pass:
ab_fail = True
if not ab_fail:
output_file.write(line)
continue
# Maximum depth(MD) filter
if rd == -1:
print ('Warning: current variant does not contain read depth info')
continue
elif rd < md:
md_fail = True
if not md_fail:
output_file.write(line)
continue
# Fisher strand filter(FS)
oddsratio, pv = stats.fisher_exact([[srf, srr], [saf_list[0], sar_list[0]]])
if pv > args.fs:
fs_fail = True
if two_alleles:
oddsratio, pv = stats.fisher_exact([[srf, srr], [saf_list[1], sar_list[1]]])
if pv > args.fs:
fs_fail = True
if not fs_fail:
output_file.write(line)
continue
output_file.close()
if __name__ == "__main__":
main()
\ No newline at end of file
# Copyright 2015, Chen Sun
#
# Based on source code copyright by 2013, Michael H. Goldwasser
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
from lib.linked_binary_tree import LinkedBinaryTree
from lib.map_base import MapBase
import copy
class TreeMap(LinkedBinaryTree, MapBase):
"""Sorted map implementation using a binary search tree."""
#---------------------------- override Position class ----------------------------
class Position(LinkedBinaryTree.Position):
def key(self):
"""Return key of map's key-value pair."""
return self.element()._key
def value(self):
"""Return value of map's key-value pair."""
return self.element()._value
#------------------------------- nonpublic utilities -------------------------------
def _subtree_search(self, p, k):
"""Return Position of p's subtree having key k, or last node searched."""
#print(k)
if k == p.key(): # found match
return p
elif k < p.key(): # search left subtree
if self.left(p) is not None:
return self._subtree_search(self.left(p), k)
else: # search right subtree
if self.right(p) is not None:
return self._subtree_search(self.right(p), k)
return p # unsuccessful search
#create a subtree_search help function
def _search_trace(self, p, k):
"""Return all the Position that has been searched."""
yield p
while p is not None and k != p.key():
if k < p.key():
p = self.left(p)
yield p
else:
p = self.right(p)
yield p
def _subtree_first_position(self, p):
"""Return Position of first item in subtree rooted at p."""
walk = p
while self.left(walk) is not None: # keep walking left
walk = self.left(walk)
return walk
def _subtree_last_position(self, p):
"""Return Position of last item in subtree rooted at p."""
walk = p
while self.right(walk) is not None: # keep walking right
walk = self.right(walk)
return walk
#--------------------- public methods providing "positional" support ---------------------
def first(self):
"""Return the first Position in the tree (or None if empty)."""
return self._subtree_first_position(self.root()) if len(self) > 0 else None
def last(self):
"""Return the last Position in the tree (or None if empty)."""
return self._subtree_last_position(self.root()) if len(self) > 0 else None
def before(self, p):
"""Return the Position just before p in the natural order.
Return None if p is the first position.
"""
self._validate(p) # inherited from LinkedBinaryTree
if self.left(p):
return self._subtree_last_position(self.left(p))
else:
# walk upward
walk = p
above = self.parent(walk)
while above is not None and walk == self.left(above):
walk = above
above = self.parent(walk)
return above
def after(self, p):
"""Return the Position just after p in the natural order.
Return None if p is the last position.
"""
self._validate(p) # inherited from LinkedBinaryTree
if self.right(p):
return self._subtree_first_position(self.right(p))
else:
walk = p
above = self.parent(walk)
while above is not None and walk == self.right(above):
walk = above
above = self.parent(walk)
return above
def find_position(self, k):
"""Return position with key k, or else neighbor (or None if empty)."""
if self.is_empty():
return None
else:
p = self._subtree_search(self.root(), k)
self._rebalance_access(p) # hook for balanced tree subclasses
return p
def find_nearest(self, k):
"""Return position with key k, or else the nearest position k' (or None if empty)."""
if self.is_empty():
return None
else:
shortest_distance = 3000000000
nearest_p = None
for p in self._search_trace(self.root(), k):
if p is not None:
#print(p.key(), abs(p.key()-k), shortest_distance)
abs_distance = abs(p.key() - k)
if abs_distance < shortest_distance:
shortest_distance = abs_distance
nearest_p = p
self._rebalance_access(nearest_p) # hook for balanced tree subclasses
return nearest_p
def find_nearest_small(self, k):
"""Return position with key k, or else the nearest position with k' < k (or None if empty)."""
if self.is_empty():
return None
else:
shortest_distance = 3000000000
nearest_p = None
for p in self._search_trace(self.root(), k):
if p is not None:
distance = k - p.key()
if distance >= 0 and distance < shortest_distance:
shortest_distance = distance
nearest_p = p
self._rebalance_access(nearest_p) # hook for balanced tree subclasses
return nearest_p
def find_nearest_large(self, k):
"""Return position with key k, or else the nearest position with k' > k (or None if empty)."""
if self.is_empty():
return None
else:
shortest_distance = 3000000000
nearest_p = None
for p in self._search_trace(self.root(), k):
if p is not None:
distance = p.key()-k
if distance >= 0 and distance < shortest_distance:
shortest_distance = distance
nearest_p = p
self._rebalance_access(nearest_p) # hook for balanced tree subclasses
return nearest_p
def delete(self, p):
"""Remove the item at given Position."""
self._validate(p) # inherited from LinkedBinaryTree
if self.left(p) and self.right(p): # p has two children
replacement = self._subtree_last_position(self.left(p))
self._replace(p, replacement.element()) # from LinkedBinaryTree
p = replacement
# now p has at most one child
parent = self.parent(p)
self._delete(p) # inherited from LinkedBinaryTree
self._rebalance_delete(parent) # if root deleted, parent is None
def keys(self):
key_list = []
p = self.first()
while p is not None:
key_list.append(p.key())
p = self.after(p)
return key_list
#--------------------- public methods for (standard) map interface ---------------------
def __getitem__(self, k):
"""Return value associated with key k (raise KeyError if not found)."""
if self.is_empty():
raise KeyError('Key Error: ' + repr(k))
else:
p = self._subtree_search(self.root(), k)
self._rebalance_access(p) # hook for balanced tree subclasses
if k != p.key():
raise KeyError('Key Error: ' + repr(k))
return p.value()
def __setitem__(self, k, v):
"""Assign value v to key k, overwriting existing value if present."""
if self.is_empty():
leaf = self._add_root(self._Item(k,v)) # from LinkedBinaryTree
else:
p = self._subtree_search(self.root(), k)
if p.key() == k:
p.element()._value = v # replace existing item's value
self._rebalance_access(p) # hook for balanced tree subclasses
return
else:
item = self._Item(k,v)
if p.key() < k:
leaf = self._add_right(p, item) # inherited from LinkedBinaryTree
else:
leaf = self._add_left(p, item) # inherited from LinkedBinaryTree
self._rebalance_insert(leaf) # hook for balanced tree subclasses
def __delitem__(self, k):
"""Remove item associated with key k (raise KeyError if not found)."""
if not self.is_empty():
p = self._subtree_search(self.root(), k)
if k == p.key():
self.delete(p) # rely on positional version
return # successful deletion complete
self._rebalance_access(p) # hook for balanced tree subclasses
raise KeyError('Key Error: ' + repr(k))
def __iter__(self):
"""Generate an iteration of all keys in the map in order."""
p = self.first()
while p is not None:
yield p.key()
p = self.after(p)
#--------------------- public methods for sorted map interface ---------------------
def __reversed__(self):
"""Generate an iteration of all keys in the map in reverse order."""
p = self.last()
while p is not None:
yield p.key()
p = self.before(p)
def find_min(self):
"""Return (key,value) pair with minimum key (or None if empty)."""
if self.is_empty():
return None
else:
p = self.first()
return (p.key(), p.value())
def find_max(self):
"""Return (key,value) pair with maximum key (or None if empty)."""
if self.is_empty():
return None
else:
p = self.last()
return (p.key(), p.value())
def find_le(self, k):
"""Return (key,value) pair with greatest key less than or equal to k.
Return None if there does not exist such a key.
"""
if self.is_empty():
return None
else:
p = self.find_position(k)
if k < p.key():
p = self.before(p)
return (p.key(), p.value()) if p is not None else None
def find_lt(self, k):
"""Return (key,value) pair with greatest key strictly less than k.
Return None if there does not exist such a key.
"""
if self.is_empty():
return None
else:
p = self.find_position(k)
if not p.key() < k:
p = self.before(p)
return (p.key(), p.value()) if p is not None else None
def find_ge(self, k):
"""Return (key,value) pair with least key greater than or equal to k.
Return None if there does not exist such a key.
"""
if self.is_empty():
return None
else:
p = self.find_position(k) # may not find exact match
if p.key() < k: # p's key is too small
p = self.after(p)
return (p.key(), p.value()) if p is not None else None
def find_gt(self, k):
"""Return (key,value) pair with least key strictly greater than k.
Return None if there does not exist such a key.
"""
if self.is_empty():
return None
else:
p = self.find_position(k)
if not k < p.key():
p = self.after(p)
return (p.key(), p.value()) if p is not None else None
def linear_range_search(self, position, start, stop):
"""
Iterate all position such that start < position.key < stop
Mind: linear_search function only return Position, not key value pair.
If start is None, searching begins from self.first()
If start is None, iteration begins with minimum key of map.
If end is None, iteration continues through the maximum key of map.
"""
if not self.is_empty():
if position is not None:
p = position
else:
p = self.first()
while p is not None and (stop is None or p.key() < stop):
if p.key() >= start:
yield(p)
p = self.after(p)
def find_range(self, start, stop):
"""Iterate all (key,value) pairs such that start <= key < stop.
If start is None, iteration begins with minimum key of map.
If stop is None, iteration continues through the maximum key of map.
"""
if not self.is_empty():
if start is None:
p = self.first()
else:
# we initialize p with logic similar to find_ge
p = self.find_position(start)
if p.key() < start:
p = self.after(p)
while p is not None and (stop is None or p.key() < stop):
yield (p.key(), p.value())
p = self.after(p)
#--------------------- hooks used by subclasses to balance a tree ---------------------
def _rebalance_insert(self, p):
"""Call to indicate that position p is newly added."""
pass
def _rebalance_delete(self, p):
"""Call to indicate that a child of p has been removed."""
pass
def _rebalance_access(self, p):
"""Call to indicate that position p was recently accessed."""
pass
#--------------------- nonpublic methods to support tree balancing ---------------------
def _relink(self, parent, child, make_left_child):
"""Relink parent node with child node (we allow child to be None)."""
if make_left_child: # make it a left child
parent._left = child
else: # make it a right child
parent._right = child
if child is not None: # make child point to parent
child._parent = parent
def _rotate(self, p):
"""Rotate Position p above its parent.
Switches between these configurations, depending on whether p==a or p==b.
b a
/ \ / \
a t2 t0 b
/ \ / \
t0 t1 t1 t2
Caller should ensure that p is not the root.
"""
"""Rotate Position p above its parent."""
x = p._node
y = x._parent # we assume this exists
z = y._parent # grandparent (possibly None)
if z is None:
self._root = x # x becomes root
x._parent = None
else:
self._relink(z, x, y == z._left) # x becomes a direct child of z
# now rotate x and y, including transfer of middle subtree
if x == y._left:
self._relink(y, x._right, True) # x._right becomes left child of y
self._relink(x, y, False) # y becomes right child of x
else:
self._relink(y, x._left, False) # x._left becomes right child of y
self._relink(x, y, True) # y becomes left child of x
def _restructure(self, x):
"""Perform a trinode restructure among Position x, its parent, and its grandparent.
Return the Position that becomes root of the restructured subtree.
Assumes the nodes are in one of the following configurations:
z=a z=c z=a z=c
/ \ / \ / \ / \
t0 y=b y=b t3 t0 y=c y=a t3
/ \ / \ / \ / \
t1 x=c x=a t2 x=b t3 t0 x=b
/ \ / \ / \ / \
t2 t3 t0 t1 t1 t2 t1 t2
The subtree will be restructured so that the node with key b becomes its root.
b
/ \
a c
/ \ / \
t0 t1 t2 t3
Caller should ensure that x has a grandparent.
"""
"""Perform trinode restructure of Position x with parent/grandparent."""
y = self.parent(x)
z = self.parent(y)
if (x == self.right(y)) == (y == self.right(z)): # matching alignments
self._rotate(y) # single rotation (of y)
return y # y is new subtree root
else: # opposite alignments
self._rotate(x) # double rotation (of x)
self._rotate(x)
return x # x is new subtree root
# Copyright 2015, Chen Sun
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
from lib.tree import Tree
class BinaryTree(Tree):
"""Abstract base class representing a binary tree structure."""
# --------------------- additional abstract methods ---------------------
def left(self, p):
"""Return a Position representing p's left child.
Return None if p does not have a left child.
"""
raise NotImplementedError('must be implemented by subclass')
def right(self, p):
"""Return a Position representing p's right child.
Return None if p does not have a right child.
"""
raise NotImplementedError('must be implemented by subclass')
# ---------- concrete methods implemented in this class ----------
def sibling(self, p):
"""Return a Position representing p's sibling (or None if no sibling)."""
parent = self.parent(p)
if parent is None: # p must be the root
return None # root has no sibling
else:
if p == self.left(parent):
return self.right(parent) # possibly None
else:
return self.left(parent) # possibly None
def children(self, p):
"""Generate an iteration of Positions representing p's children."""
if self.left(p) is not None:
yield self.left(p)
if self.right(p) is not None:
yield self.right(p)
def inorder(self):
"""Generate an inorder iteration of positions in the tree."""
if not self.is_empty():
for p in self._subtree_inorder(self.root()):
yield p
def _subtree_inorder(self, p):
"""Generate an inorder iteration of positions in subtree rooted at p."""
if self.left(p) is not None: # if left child exists, traverse its subtree
for other in self._subtree_inorder(self.left(p)):
yield other
yield p # visit p between its subtrees
if self.right(p) is not None: # if right child exists, traverse its subtree
for other in self._subtree_inorder(self.right(p)):
yield other
# override inherited version to make inorder the default
def positions(self):
"""Generate an iteration of the tree's positions."""
return self.inorder() # make inorder the default
# Copyright 2015, Chen Sun
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
from lib.binary_tree import BinaryTree
#ch8
class LinkedBinaryTree(BinaryTree):
"""Linked representation of a binary tree structure."""
#-------------------------- nested _Node class --------------------------
class _Node:
"""Lightweight, nonpublic class for storing a node."""
__slots__ = '_element', '_parent', '_left', '_right' # streamline memory usage
def __init__(self, element, parent=None, left=None, right=None):
self._element = element
self._parent = parent
self._left = left
self._right = right
#-------------------------- nested Position class --------------------------
class Position(BinaryTree.Position):
"""An abstraction representing the location of a single element."""
def __init__(self, container, node):
"""Constructor should not be invoked by user."""
self._container = container
self._node = node
def element(self):
"""Return the element stored at this Position."""
return self._node._element
def __eq__(self, other):
"""Return True if other is a Position representing the same location."""
return type(other) is type(self) and other._node is self._node
#------------------------------- utility methods -------------------------------
def _validate(self, p):
"""Return associated node, if position is valid."""
if not isinstance(p, self.Position):
raise TypeError('p must be proper Position type')
if p._container is not self:
raise ValueError('p does not belong to this container')
if p._node._parent is p._node: # convention for deprecated nodes
raise ValueError('p is no longer valid')
return p._node
def _make_position(self, node):
"""Return Position instance for given node (or None if no node)."""
return self.Position(self, node) if node is not None else None
#-------------------------- binary tree constructor --------------------------
def __init__(self):
"""Create an initially empty binary tree."""
self._root = None
self._size = 0
#-------------------------- public accessors --------------------------
def __len__(self):
"""Return the total number of elements in the tree."""
return self._size
def root(self):
"""Return the root Position of the tree (or None if tree is empty)."""
return self._make_position(self._root)
def parent(self, p):
"""Return the Position of p's parent (or None if p is root)."""
node = self._validate(p)
return self._make_position(node._parent)
def left(self, p):
"""Return the Position of p's left child (or None if no left child)."""
node = self._validate(p)
return self._make_position(node._left)
def right(self, p):
"""Return the Position of p's right child (or None if no right child)."""
node = self._validate(p)
return self._make_position(node._right)
def num_children(self, p):
"""Return the number of children of Position p."""
node = self._validate(p)
count = 0
if node._left is not None: # left child exists
count += 1
if node._right is not None: # right child exists
count += 1
return count
#-------------------------- nonpublic mutators --------------------------
def _add_root(self, e):
"""Place element e at the root of an empty tree and return new Position.
Raise ValueError if tree nonempty.
"""
if self._root is not None:
raise ValueError('Root exists')
self._size = 1
self._root = self._Node(e)
return self._make_position(self._root)
def _add_left(self, p, e):
"""Create a new left child for Position p, storing element e.
Return the Position of new node.
Raise ValueError if Position p is invalid or p already has a left child.
"""
node = self._validate(p)
if node._left is not None:
raise ValueError('Left child exists')
self._size += 1
node._left = self._Node(e, node) # node is its parent
return self._make_position(node._left)
def _add_right(self, p, e):
"""Create a new right child for Position p, storing element e.
Return the Position of new node.
Raise ValueError if Position p is invalid or p already has a right child.
"""
node = self._validate(p)
if node._right is not None:
raise ValueError('Right child exists')
self._size += 1
node._right = self._Node(e, node) # node is its parent
return self._make_position(node._right)
def _replace(self, p, e):
"""Replace the element at position p with e, and return old element."""
node = self._validate(p)
old = node._element
node._element = e
return old
def _delete(self, p):
"""Delete the node at Position p, and replace it with its child, if any.
Return the element that had been stored at Position p.
Raise ValueError if Position p is invalid or p has two children.
"""
node = self._validate(p)
if self.num_children(p) == 2:
raise ValueError('Position has two children')
child = node._left if node._left else node._right # might be None
if child is not None:
child._parent = node._parent # child's grandparent becomes parent
if node is self._root:
self._root = child # child becomes root
else:
parent = node._parent
if node is parent._left:
parent._left = child
else:
parent._right = child
self._size -= 1
node._parent = node # convention for deprecated node
return node._element
def _attach(self, p, t1, t2):
"""Attach trees t1 and t2, respectively, as the left and right subtrees of the external Position p.
As a side effect, set t1 and t2 to empty.
Raise TypeError if trees t1 and t2 do not match type of this tree.
Raise ValueError if Position p is invalid or not external.
"""
node = self._validate(p)
if not self.is_leaf(p):
raise ValueError('position must be leaf')
if not type(self) is type(t1) is type(t2): # all 3 trees must be same type
raise TypeError('Tree types must match')
self._size += len(t1) + len(t2)
if not t1.is_empty(): # attached t1 as left subtree of node
t1._root._parent = node
node._left = t1._root
t1._root = None # set t1 instance to empty
t1._size = 0
if not t2.is_empty(): # attached t2 as right subtree of node
t2._root._parent = node
node._right = t2._root
t2._root = None # set t2 instance to empty
t2._size = 0
# Copyright 2015, Chen Sun
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#from ..exceptions import Empty
#ch7
class LinkedQueue:
"""FIFO queue implementation using a singly linked list for storage."""
#-------------------------- nested _Node class --------------------------
class _Node:
"""Lightweight, nonpublic class for storing a singly linked node."""
__slots__ = '_element', '_next' # streamline memory usage
def __init__(self, element, next):
self._element = element
self._next = next
#------------------------------- queue methods -------------------------------
def __init__(self):
"""Create an empty queue."""
self._head = None
self._tail = None
self._size = 0 # number of queue elements
def __len__(self):
"""Return the number of elements in the queue."""
return self._size
def is_empty(self):
"""Return True if the queue is empty."""
return self._size == 0
def first(self):
"""Return (but do not remove) the element at the front of the queue.
Raise Empty exception if the queue is empty.
"""
if self.is_empty():
raise Empty('Queue is empty')
return self._head._element # front aligned with head of list
def dequeue(self):
"""Remove and return the first element of the queue (i.e., FIFO).
Raise Empty exception if the queue is empty.
"""
if self.is_empty():
raise Empty('Queue is empty')
answer = self._head._element
self._head = self._head._next
self._size -= 1
if self.is_empty(): # special case as queue is empty
self._tail = None # removed head had been the tail
return answer
def enqueue(self, e):
"""Add an element to the back of queue."""
newest = self._Node(e, None) # node will be new tail node
if self.is_empty():
self._head = newest # special case: previously empty
else:
self._tail._next = newest
self._tail = newest # update reference to tail node
self._size += 1
# Copyright 2015, Chen Sun
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
from collections import MutableMapping
#ch10
class MapBase(MutableMapping):
"""Our own abstract base class that includes a nonpublic _Item class."""
#------------------------------- nested _Item class -------------------------------
class _Item:
"""Lightweight composite to store key-value pairs as map items."""
__slots__ = '_key', '_value'
def __init__(self, k, v):
self._key = k
self._value = v
def __eq__(self, other):
return self._key == other._key # compare items based on their keys
def __ne__(self, other):
return not (self == other) # opposite of __eq__
def __lt__(self, other):
return self._key < other._key # compare items based on their keys
# Copyright 2015, Chen Sun
#
# Based on source code copyright by 2013, Michael H. Goldwasser
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
from lib.binary_search_tree import TreeMap
"""
usage:
"""
class RedBlackTreeMap(TreeMap):
"""Sorted map implementation using a red-black tree."""
#-------------------------- nested _Node class --------------------------
class _Node(TreeMap._Node):
"""Node class for red-black tree maintains bit that denotes color."""
__slots__ = '_red' # add additional data member to the Node class
def __init__(self, element, parent=None, left=None, right=None):
TreeMap._Node.__init__(self, element, parent, left, right)
self._red = True # new node red by default
#------------------------- positional-based utility methods -------------------------
# we consider a nonexistent child to be trivially black
def _set_red(self, p): p._node._red = True
def _set_black(self, p): p._node._red = False
def _set_color(self, p, make_red): p._node._red = make_red
def _is_red(self, p): return p is not None and p._node._red
def _is_red_leaf(self, p): return self._is_red(p) and self.is_leaf(p)
def _get_red_child(self, p):
"""Return a red child of p (or None if no such child)."""
for child in (self.left(p), self.right(p)):
if self._is_red(child):
return child
return None
#------------------------- support for insertions -------------------------
def _rebalance_insert(self, p):
self._resolve_red(p) # new node is always red
def _resolve_red(self, p):
if self.is_root(p):
self._set_black(p) # make root black
else:
parent = self.parent(p)
if self._is_red(parent): # double red problem
uncle = self.sibling(parent)
if not self._is_red(uncle): # Case 1: misshapen 4-node
middle = self._restructure(p) # do trinode restructuring
self._set_black(middle) # and then fix colors
self._set_red(self.left(middle))
self._set_red(self.right(middle))
else: # Case 2: overfull 5-node
grand = self.parent(parent)
self._set_red(grand) # grandparent becomes red
self._set_black(self.left(grand)) # its children become black
self._set_black(self.right(grand))
self._resolve_red(grand) # recur at red grandparent
#------------------------- support for deletions -------------------------
def _rebalance_delete(self, p):
if len(self) == 1:
self._set_black(self.root()) # special case: ensure that root is black
elif p is not None:
n = self.num_children(p)
if n == 1: # deficit exists unless child is a red leaf
c = next(self.children(p))
if not self._is_red_leaf(c):
self._fix_deficit(p, c)
elif n == 2: # removed black node with red child
if self._is_red_leaf(self.left(p)):
self._set_black(self.left(p))
else:
self._set_black(self.right(p))
def _fix_deficit(self, z, y):
"""Resolve black deficit at z, where y is the root of z's heavier subtree."""
if not self._is_red(y): # y is black; will apply Case 1 or 2
x = self._get_red_child(y)
if x is not None: # Case 1: y is black and has red child x; do "transfer"
old_color = self._is_red(z)
middle = self._restructure(x)
self._set_color(middle, old_color) # middle gets old color of z
self._set_black(self.left(middle)) # children become black
self._set_black(self.right(middle))
else: # Case 2: y is black, but no red children; recolor as "fusion"
self._set_red(y)
if self._is_red(z):
self._set_black(z) # this resolves the problem
elif not self.is_root(z):
self._fix_deficit(self.parent(z), self.sibling(z)) # recur upward
else: # Case 3: y is red; rotate misaligned 3-node and repeat
self._rotate(y)
self._set_black(y)
self._set_red(z)
if z == self.right(y):
self._fix_deficit(z, self.left(z))
else:
self._fix_deficit(z, self.right(z))
# Copyright 2015, Chen Sun
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
from lib.linked_queue import LinkedQueue #LinkedQueue is only used for bfs
import collections
#ch8
class Tree:
"""Abstract base class representing a tree structure."""
#------------------------------- nested Position class -------------------------------
class Position:
"""An abstraction representing the location of a single element within a tree.
Note that two position instaces may represent the same inherent location in a tree.
Therefore, users should always rely on syntax 'p == q' rather than 'p is q' when testing
equivalence of positions.
we define a tree ADT using the concept of apositionas an abstraction for a node of a tree
"""
def element(self):
"""Return the element stored at this Position."""
raise NotImplementedError('must be implemented by subclass')
def __eq__(self, other):
"""Return True if other Position represents the same location."""
raise NotImplementedError('must be implemented by subclass')
def __ne__(self, other):
"""Return True if other does not represent the same location."""
return not (self == other) # opposite of __eq__
# ---------- abstract methods that concrete subclass must support ----------
def root(self):
"""Return Position representing the tree's root (or None if empty)."""
raise NotImplementedError('must be implemented by subclass')
def parent(self, p):
"""Return Position representing p's parent (or None if p is root)."""
raise NotImplementedError('must be implemented by subclass')
def num_children(self, p):
"""Return the number of children that Position p has."""
raise NotImplementedError('must be implemented by subclass')
def children(self, p):
"""Generate an iteration of Positions representing p's children."""
raise NotImplementedError('must be implemented by subclass')
def __len__(self):
"""Return the total number of elements in the tree."""
raise NotImplementedError('must be implemented by subclass')
# ---------- concrete methods implemented in this class ----------
def is_root(self, p):
"""Return True if Position p represents the root of the tree."""
return self.root() == p
def is_leaf(self, p):
"""Return True if Position p does not have any children."""
return self.num_children(p) == 0
def is_empty(self):
"""Return True if the tree is empty."""
return len(self) == 0
def depth(self, p):
"""Return the number of levels separating Position p from the root."""
if self.is_root(p):
return 0
else:
return 1 + self.depth(self.parent(p))
def _height1(self): # works, but O(n^2) worst-case time
"""Return the height of the tree."""
return max(self.depth(p) for p in self.positions() if self.is_leaf(p))
def _height2(self, p): # time is linear in size of subtree
"""Return the height of the subtree rooted at Position p."""
if self.is_leaf(p):
return 0
else:
return 1 + max(self._height2(c) for c in self.children(p))
def height(self, p=None):
"""Return the height of the subtree rooted at Position p.
If p is None, return the height of the entire tree.
"""
if p is None:
p = self.root()
return self._height2(p) # start _height2 recursion
def __iter__(self):
"""Generate an iteration of the tree's elements."""
for p in self.positions(): # use same order as positions()
yield p.element() # but yield each element
def positions(self):
"""Generate an iteration of the tree's positions."""
return self.preorder() # return entire preorder iteration
def preorder(self):
"""Generate a preorder iteration of positions in the tree."""
if not self.is_empty():
for p in self._subtree_preorder(self.root()): # start recursion
yield p
def _subtree_preorder(self, p):
"""Generate a preorder iteration of positions in subtree rooted at p."""
yield p # visit p before its subtrees
for c in self.children(p): # for each child c
for other in self._subtree_preorder(c): # do preorder of c's subtree
yield other # yielding each to our caller
def postorder(self):
"""Generate a postorder iteration of positions in the tree."""
if not self.is_empty():
for p in self._subtree_postorder(self.root()): # start recursion
yield p
def _subtree_postorder(self, p):
"""Generate a postorder iteration of positions in subtree rooted at p."""
for c in self.children(p): # for each child c
for other in self._subtree_postorder(c): # do postorder of c's subtree
yield other # yielding each to our caller
yield p # visit p after its subtrees
def breadthfirst(self):
"""Generate a breadth-first iteration of the positions of the tree."""
if not self.is_empty():
fringe = LinkedQueue() # known positions not yet yielded
fringe.enqueue(self.root()) # starting with the root
while not fringe.is_empty():
p = fringe.dequeue() # remove from front of the queue
yield p # report this position
for c in self.children(p):
fringe.enqueue(c) # add children to back of queue
This diff is collapsed.
......@@ -4,8 +4,6 @@ all: vm
vm:
$(MAKE) -C src all
chmod +x varmatch
chmod +x purify
chmod +x filter
clean:
$(MAKE) -C src clean
......