Commits on Source (10)
......@@ -4,6 +4,12 @@
Software package for signal-level analysis of Oxford Nanopore sequencing data. Nanopolish can calculate an improved consensus sequence for a draft genome assembly, detect base modifications, call SNPs and indels with respect to a reference genome and more (see Nanopolish modules, below).
## Release notes
* 0.10.2: added new program `nanopolish polya` to estimate the length of poly-A tails on direct RNA reads (by @paultsw)
* 0.10.1: `nanopolish variants --consensus` now only outputs a VCF file instead of a fasta sequence. The VCF file describes the changes that need to be made to turn the draft sequence into the polished assembly. A new program, `nanopolish vcf2fasta`, is provided to generate the polished genome (this replaces `nanopolish_merge.py`, see usage instructions below). This change is to avoid issues when merging segments that end on repeat boundaries (reported by Michael Wykes and Chris Wright).
## Dependencies
A compiler that supports C++11 is needed to build nanopolish. Development of the code is performed using [gcc-4.8](https://gcc.gnu.org/gcc-4.8/).
......@@ -36,7 +42,7 @@ When major features have been added or bugs fixed, we will tag and release a new
```
git clone --recursive https://github.com/jts/nanopolish.git
cd nanopolish
git checkout v0.7.1
git checkout v0.9.2
make
```
......@@ -45,7 +51,6 @@ make
The main subprograms of nanopolish are:
```
nanopolish extract: extract reads in FASTA or FASTQ format from a directory of FAST5 files
nanopolish call-methylation: predict genomic bases that may be methylated
nanopolish variants: detect SNPs and indels with respect to a reference genome
nanopolish variants --consensus: calculate an improved consensus sequence for a draft genome assembly
......@@ -82,7 +87,7 @@ Now, we use nanopolish to compute the consensus sequence (the genome is polished
```
python nanopolish_makerange.py draft.fa | parallel --results nanopolish.results -P 8 \
nanopolish variants --consensus polished.{1}.fa -w {1} -r reads.fa -b reads.sorted.bam -g draft.fa -t 4 --min-candidate-frequency 0.1
nanopolish variants --consensus -o polished.{1}.vcf -w {1} -r reads.fa -b reads.sorted.bam -g draft.fa -t 4 --min-candidate-frequency 0.1
```
This command will run the consensus algorithm on eight 50kbp segments of the genome at a time, using 4 threads each. Change the ```-P``` and ```--threads``` options as appropriate for the machines you have available.
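For intuition, the windowing that `nanopolish_makerange.py` performs can be sketched in a few lines of Python (illustrative only; the bundled script is the supported tool, its exact overlap and end-of-contig handling may differ, and the overlap value below is an assumed default):
```
def make_windows(contig_name, contig_length, segment_length=50000, overlap_length=200):
    """Yield window strings like 'contig:start-end' covering the contig with overlap."""
    # segment_length matches the script's default; overlap_length here is an assumed value
    start = 0
    while start < contig_length:
        end = min(start + segment_length + overlap_length, contig_length)
        yield "%s:%d-%d" % (contig_name, start, end)
        start += segment_length

for window in make_windows("tig00000001", 120000):
    print(window)  # tig00000001:0-50200, tig00000001:50000-100200, tig00000001:100000-120000
```
Each window string can be passed to `nanopolish variants --consensus` via `-w`, which is exactly what the `parallel` pipeline above does.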
......@@ -90,7 +95,7 @@ This command will run the consensus algorithm on eight 50kbp segments of the gen
After all polishing jobs are complete, you can merge the individual 50kb segments together back into the final assembly:
```
python nanopolish_merge.py polished.*.fa > polished_genome.fa
nanopolish vcf2fasta -g draft.fa polished.*.vcf > polished_genome.fa
```
## Calling Methylation
......
nanopolish (0.10.2-1) unstable; urgency=medium
* Team upload.
* New upstream version
* debhelper 11
* Point Vcs fields to salsa.debian.org
* Standards-Version: 4.2.1
* Fix path for Perl interpreter
* Enhance long description
-- Andreas Tille <tille@debian.org> Thu, 13 Sep 2018 11:37:24 +0200
nanopolish (0.9.0-1) unstable; urgency=medium
* New upstream version
......
Source: nanopolish
Section: science
Priority: optional
Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.org>
Uploaders: Afif Elghraoui <afif@debian.org>
Build-Depends:
debhelper (>= 10),
zlib1g-dev,
libfast5-dev (>= 0.6.5),
libhts-dev,
libeigen3-dev,
Standards-Version: 4.1.3
Section: science
Priority: optional
Build-Depends: debhelper (>= 11~),
zlib1g-dev,
libfast5-dev (>= 0.6.5),
libhts-dev,
libeigen3-dev
Standards-Version: 4.2.1
Vcs-Browser: https://salsa.debian.org/med-team/nanopolish
Vcs-Git: https://salsa.debian.org/med-team/nanopolish.git
Homepage: https://github.com/jts/nanopolish
Vcs-Git: https://anonscm.debian.org/git/debian-med/nanopolish.git
Vcs-Browser: https://anonscm.debian.org/cgit/debian-med/nanopolish.git
Package: nanopolish
Architecture: any-amd64 any-i386
Depends:
${shlibs:Depends},
${misc:Depends},
Suggests:
python,
perl,
make,
Depends: ${shlibs:Depends},
${misc:Depends}
Suggests: python,
perl,
make
Description: consensus caller for nanopore sequencing data
Nanopolish uses a signal-level hidden Markov model for consensus
calling of nanopore genome sequencing data.
Nanopolish uses a signal-level hidden Markov model for consensus calling
of nanopore genome sequencing data. It can perform signal-level analysis
of Oxford Nanopore sequencing data. Nanopolish can calculate an improved
consensus sequence for a draft genome assembly, detect base
modifications, call SNPs and indels with respect to a reference genome
and more.
......@@ -7,13 +7,11 @@ Subject: [PATCH] check for existence of group before opening it (#335)
src/common/nanopolish_fast5_io.cpp | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/src/common/nanopolish_fast5_io.cpp b/src/common/nanopolish_fast5_io.cpp
index 0144062..7f38719 100644
--- a/src/common/nanopolish_fast5_io.cpp
+++ b/src/common/nanopolish_fast5_io.cpp
@@ -208,6 +208,13 @@ std::string fast5_get_fixed_string_attribute(hid_t hdf5_file, const std::string&
int ret;
std::string out;
@@ -215,6 +215,13 @@ std::string fast5_get_fixed_string_attri
return "";
}
+ // according to http://hdf-forum.184993.n3.nabble.com/check-if-dataset-exists-td194725.html
+ // we should use H5Lexists to check for the existence of a group/dataset using an arbitrary path
......
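For illustration only (this is not part of the patch), the same guard can be expressed in Python with h5py, checking each link along the path before the group is opened; the file name and group path below are hypothetical:
```
import h5py

fast5_path = "example_read.fast5"             # hypothetical FAST5 file
group_path = "UniqueGlobalKey/context_tags"   # hypothetical group path

with h5py.File(fast5_path, "r") as f:
    node = f
    exists = True
    # walk the path one component at a time, mirroring an H5Lexists-style check
    for part in group_path.split("/"):
        if part in node:
            node = node[part]
        else:
            exists = False
            break
    attrs = dict(node.attrs) if exists else {}
print(attrs)
```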
......@@ -28,3 +28,9 @@ override_dh_auto_clean:
$(MAKE) clean
$(RM) .depend
mv Makefile~ Makefile
override_dh_install:
dh_install
for pl in `find debian -name "*.pl"` ; do \
sed -i '1s?^#!/usr/bin/env.*perl?#!/usr/bin/perl?' $${pl} ; \
done
......@@ -10,7 +10,7 @@ Modules available: ::
nanopolish variants: detect SNPs and indels with respect to a reference genome
nanopolish variants --consensus: calculate an improved consensus sequence for a draft genome assembly
nanopolish eventalign: align signal-level events to k-mers of a reference genome
nanopolish phase-reads: phase reads using heterozygous SNVs with respect to a reference genome
|
extract
......@@ -167,7 +167,7 @@ Usage example
* - ``-r``, ``--reads=FILE``
- Y
- NA
- the 2D ONT reads are in fasta FILE
- the ONT reads are in fasta FILE
* - ``-b``, ``--bam=FILE``
- Y
......@@ -253,7 +253,7 @@ Usage example
* - ``-r``, ``--reads=FILE``
- Y
- NA
- the 2D ONT reads are in fasta FILE
- the ONT reads are in fasta FILE
* - ``-b``, ``--bam=FILE``
- Y
......@@ -365,7 +365,7 @@ Usage example
* - ``-r, --reads=FILE``
- Y
- NA
- the 2D ONT reads are in fasta FILE
- the ONT reads are in fasta FILE
* - ``-b, --bam=FILE``
- Y
......@@ -411,3 +411,78 @@ Usage example
- N
- NA
- read alternative k-mer models from FILE
phase-reads - (experimental)
------------------------------
Overview
"""""""""""""""""""""""
Phase reads using heterozygous SNVs with respect to a reference genome
Input
"""""""""""""""""""""""
* basecalled reads
* alignment information
* assembled genome
* variants (from nanopolish variants or from other sources, e.g. an Illumina VCF)
Usage example
"""""""""""""""""""""""
::
nanopolish phase-reads [OPTIONS] --reads reads.fa --bam alignments.bam --genome genome.fa variants.vcf
.. list-table::
:widths: 20 10 20 50
:header-rows: 1
* - Argument name(s)
- Required
- Default value
- Description
* - ``-v``
- N
- NA
- write verbose output
* - ``-w, --window=STR``
- N
- NA
- Only phase reads in the window STR (format: ctg:start_id-end_id)
* - ``-r, --reads=FILE``
- Y
- NA
- the ONT reads are in fasta FILE
* - ``-b, --bam=FILE``
- Y
- NA
- the reads aligned to the genome assembly are in bam FILE
* - ``-g, --genome=FILE``
- Y
- NA
- the genome we are computing a consensus for is in FILE
* - ``variants.vcf``
- Y
- NA
- the variants (from nanopolish variants or Illumina in VCF format) to be phased are in FILE
* - ``-t, --threads=NUM``
- N
- 1
- use NUM threads
* - ``--progress``
- N
- NA
- print out a progress message
.. _quickstart_consensus:
Quickstart - how to polish the consensus assembly
Quickstart - how to polish a genome assembly
===================================================
The original purpose for nanopolish was to improve the consensus assembly accuracy for Oxford Nanopore Technology sequencing reads. Here we provide a step-by-step tutorial to help you get started with our tool.
The original purpose of nanopolish was to improve the consensus accuracy of an assembly of Oxford Nanopore Technology sequencing reads. Here we provide a step-by-step tutorial to help you get started.
**Requirements**:
* `nanopolish v0.8.4 <installation.html>`_
* `samtools v1.2 <https://htslib.org>`_
* `bwa v0.7.12 <https://github.com/lh3/bwa>`_
* `nanopolish <installation.html>`_
* `samtools <https://htslib.org>`_
* `minimap2 <https://github.com/lh3/minimap2>`_
* `MUMmer <https://github.com/mummer4/mummer>`_
Download example dataset
......@@ -29,8 +29,7 @@ You can download the example dataset we will use here: ::
* Region: "tig00000001:200000-202000"
* Note: Ligation-mediated PCR amplification performed
This is a subset of reads that aligned to a 2kb region in the E. coli draft assembly. To see how we generated these files please refer to the tutorial :ref:`creati
ng_example_dataset <here>`.
This is a subset of reads that aligned to a 2kb region in the E. coli draft assembly. To see how we generated these files please refer to the tutorial :ref:`creating_example_dataset <here>`.
You should find the following files:
......@@ -42,7 +41,7 @@ You should find the following files:
For the evaluation step you will need the reference genome: ::
wget -O ref.fa ftp://ftp.ncbi.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid225/U00096.ffn
curl -o ref.fa https://ftp.ncbi.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid225/U00096.ffn
Analysis workflow
-------------------------------
......@@ -50,7 +49,6 @@ Analysis workflow
The pipeline below describes the recommended analysis workflow for larger datasets. In this tutorial, we will run through the basic steps of the pipeline for this smaller (2kb) dataset.
.. figure:: _static/nanopolish-workflow.png
:scale: 90%
:alt: nanopolish-tutorial-workflow
Data preprocessing
......@@ -66,24 +64,20 @@ Compute the draft genome assembly using canu
-----------------------------------------------
As computing the draft genome assembly takes a few hours we have included the pre-assembled data for you (``draft.fa``).
We used the following parameters with `canu <canu.readthedocs.io>`_: ::
We used the following parameters with `canu <http://canu.readthedocs.io/en/latest/>`_: ::
canu \
-p ecoli -d outdir genomeSize=4.6m \
-nanopore-raw albacore-2.0.1-merged.fastq \
-nanopore-raw albacore-2.0.1-merged.fastq
Compute a new consensus sequence for a draft assembly
------------------------------------------------------------------------
Now that we have ``reads.fasta`` indexed with ``nanopolish index``, and have a draft genome assembly ``draft.fa``, we can begin to improve the assembly with nanopolish. Let us get started!
Now that we have ``reads.fasta`` indexed with ``nanopolish index``, and have a draft genome assembly ``draft.fa``, we can begin to improve the assembly with nanopolish. Let us get started!
First step, is to index the draft genome assembly. We can do that with the following command: ::
First, we align the original reads (``reads.fasta``) to the draft assembly (``draft.fa``) and sort alignments: ::
bwa index draft.fa
Next, we align the original reads (``reads.fasta``) to the draft assembly (``draft.fa``) and sort alignments: ::
bwa mem -x ont2d -t 8 draft.fa reads.fasta | samtools sort -o reads.sorted.bam -T reads.tmp
minimap2 -ax map-ont -t 8 draft.fa reads.fasta | samtools sort -o reads.sorted.bam -T reads.tmp
samtools index reads.sorted.bam
**Checkpoint**: we can do a quick check to see if this step worked. The bam file should not be empty. ::
......@@ -93,13 +87,24 @@ Next, we align the original reads (``reads.fasta``) to the draft assembly (``dra
Then we run the consensus algorithm. For larger datasets we use ``nanopolish_makerange.py`` to split the draft genome assembly into 50kb segments, so that we can run the consensus algorithm on each segment in parallel. The output would be the polished segments in ``fasta`` format.
Since our dataset is only covering a 2kb region, we skip this step and use the following command: ::
nanopolish variants --consensus polished.fa \
nanopolish variants --consensus -o polished.vcf \
-w "tig00000001:200000-202000" \
-r reads.fasta \
-b reads.sorted.bam \
-g draft.fa
We are left with our desired output: ``polished.fa``.
We are left with: ``polished.vcf``.
**Note**: As of v0.10.1, ``nanopolish variants --consensus`` only outputs a VCF file instead of a fasta sequence.
To generate the polished genome in fasta format: ::
nanopolish vcf2fasta --skip-checks -g draft.fa polished.vcf > polished_genome.fa
We only polished a 2kb region, so let's pull that out: ::
samtools faidx polished_genome.fa
samtools faidx polished_genome.fa "tig00000001:200000-202000" > polished.fa
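As an optional sanity check, we can count how many changes the consensus step proposed; a minimal Python sketch that reads the ``polished.vcf`` produced above (data lines in a VCF are those not starting with ``#``): ::

    num_variants = 0
    with open("polished.vcf") as vcf:
        for line in vcf:
            if not line.startswith("#"):
                num_variants += 1
    print("variant records proposed:", num_variants)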
Evaluate the assembly
---------------------------------
......
......@@ -12,9 +12,9 @@ The eventalign module in nanopolish is used to align events or "squiggles" to a
**Requirements**:
* `nanopolish v0.8.4 <installation.html>`_
* `samtools v1.2 <http://samtools.sourceforge.net/>`_
* `bwa v0.7.12 <https://github.com/lh3/bwa>`_
* `nanopolish <installation.html>`_
* `samtools <http://samtools.sourceforge.net/>`_
* `minimap2 <https://github.com/lh3/minimap2>`_
Download example dataset
------------------------------------
......@@ -42,18 +42,12 @@ You should find the following files:
You will need the E. coli reference genome: ::
wget -O ref.fa ftp://ftp.ncbi.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid225/U00096.ffn
curl -o ref.fa https://ftp.ncbi.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid225/U00096.ffn
Align the reads with bwa
Align the reads with minimap2
--------------------------------
In order to run bwa we first need to index the reference genome: ::
bwa index ref.fa
**Output files**: ``ref.fa.sa``, ``ref.fa.amb``, ``ref.fa.ann``, ``ref.fa.pac``, and ``ref.fa.bwt``.
We will need to index the reads as well: ::
We first need to index the reads: ::
nanopolish index -d fast5_files/ reads.fasta
......@@ -61,7 +55,7 @@ We will need to index the reads as well: ::
Then we can align the reads to the reference: ::
bwa mem -x ont2d ref.fa reads.fasta | samtools sort -o reads-ref.sorted.bam -T reads.tmp
minimap2 -ax map-ont -t 8 ref.fa reads.fasta | samtools sort -o reads-ref.sorted.bam -T reads.tmp
samtools index reads-ref.sorted.bam
**Output files**: ``reads-ref.sorted.bam`` and ``reads-ref.sorted.bam.bai``.
......
......@@ -48,7 +48,7 @@ csv_reader = csv.DictReader(in_fh, delimiter='\t')
for record in csv_reader:
num_sites = int(record['num_cpgs'])
num_sites = int(record['num_motifs'])
llr = float(record['log_lik_ratio'])
# Skip ambiguous call
......@@ -77,7 +77,7 @@ for record in csv_reader:
update_call_stats(key, num_sites, is_methylated, sequence)
# header
print("\t".join(["chromosome", "start", "end", "num_cpgs_in_group", "called_sites", "called_sites_methylated", "methylated_frequency", "group_sequence"]))
print("\t".join(["chromosome", "start", "end", "num_motifs_in_group", "called_sites", "called_sites_methylated", "methylated_frequency", "group_sequence"]))
sorted_keys = sorted(sites.keys(), key = lambda x: split_key(x))
......
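For downstream consumers of this script's output, the renamed column means reading `num_motifs_in_group` instead of `num_cpgs_in_group`; a small sketch, assuming the table printed above was saved to `methylation_frequency.tsv` (a hypothetical path):
```
import csv

with open("methylation_frequency.tsv") as fh:          # hypothetical output path
    for row in csv.DictReader(fh, delimiter="\t"):
        if int(row["num_motifs_in_group"]) >= 2:        # groups with at least two motifs
            print(row["chromosome"], row["start"], row["methylated_frequency"])
```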
from __future__ import print_function
import sys
import os.path
import os
......@@ -19,12 +21,12 @@ for model_file in sys.stdin:
functions.append(function_name)
outfiles.append(outfile)
print "// Autogenerated from convert_all_models.py"
print("// Autogenerated from convert_all_models.py")
for f in outfiles:
print "#include \"%s\"" % (f)
print("#include \"%s\"" % f)
print "\n// Autogenerated from convert_all_models.py"
print "const static std::vector<PoreModel> builtin_models({"
print "\t" + ",\n\t".join([f + "()" for f in functions])
print "});"
print("\n// Autogenerated from convert_all_models.py")
print("const static std::vector<PoreModel> builtin_models({")
print("\t" + ",\n\t".join([f + "()" for f in functions]))
print("});")
#! /usr/bin/env python
# Convert a .model file into a .inl file that can be directly compiled into nanopolish
from __future__ import print_function
import argparse
def assign(name, value):
assert(len(name) > 0)
assert(value is not None)
print "\ttmp.%s = %s;" % (name, value)
print("\ttmp.%s = %s;" % (name, value))
def quote(value):
assert(len(value) > 0)
......@@ -44,43 +45,43 @@ for line in f:
bases[b] = 1
# Preamble
print "// Autogenerated by convert_model_to_header.py"
print "#ifndef NANOPOLISH_%s_INL" % (args.function_name.upper())
print "#define NANOPOLISH_%s_INL" % (args.function_name.upper())
print("// Autogenerated by convert_model_to_header.py")
print("#ifndef NANOPOLISH_%s_INL" % args.function_name.upper())
print("#define NANOPOLISH_%s_INL" % args.function_name.upper())
data_name = "%s_data" % (args.function_name)
print "static std::vector<double> %s = {" % (data_name)
data_name = "%s_data" % args.function_name
print("static std::vector<double> %s = {" % data_name)
for ki, t in enumerate(model):
is_last = ki == len(model) - 1
sep = ',' if not is_last else ''
print "\t\t%.5f, %.5f, %.5f, %.5f%s // %s" % (float(t[1]), float(t[2]), float(t[3]), float(t[4]), sep, t[0])
print "};"
print("\t\t%.5f, %.5f, %.5f, %.5f%s // %s" % (float(t[1]), float(t[2]), float(t[3]), float(t[4]), sep, t[0]))
print("};")
print "PoreModel %s()\n{" % (args.function_name)
print "\tPoreModel tmp;"
print("PoreModel %s()\n{" % args.function_name)
print("\tPoreModel tmp;")
# Output metadata
assign("model_filename", quote(args.input))
assign("k", K)
num_states = len(model)
print "\ttmp.states.resize(%d);" % (num_states)
print "\tfor(size_t i = 0; i < %d; ++i) {" % (num_states)
print "\t\ttmp.states[i].level_mean = %s[4*i + 0];" % (data_name)
print "\t\ttmp.states[i].level_stdv = %s[4*i + 1];" % (data_name)
print "\t\ttmp.states[i].sd_mean = %s[4*i + 2];" % (data_name)
print "\t\ttmp.states[i].sd_stdv = %s[4*i + 3];" % (data_name)
print "\t\ttmp.states[i].update_sd_lambda();"
print "\t\ttmp.states[i].update_logs();"
print "\t}"
print("\ttmp.states.resize(%d);" % num_states)
print("\tfor(size_t i = 0; i < %d; ++i) {" % num_states)
print("\t\ttmp.states[i].level_mean = %s[4*i + 0];" % data_name)
print("\t\ttmp.states[i].level_stdv = %s[4*i + 1];" % data_name)
print("\t\ttmp.states[i].sd_mean = %s[4*i + 2];" % data_name)
print("\t\ttmp.states[i].sd_stdv = %s[4*i + 3];" % data_name)
print("\t\ttmp.states[i].update_sd_lambda();")
print("\t\ttmp.states[i].update_logs();")
print("\t}")
if "alphabet" in header_kv:
print "\ttmp.pmalphabet = get_alphabet_by_name(%s);" % (quote(header_kv["alphabet"]))
print("\ttmp.pmalphabet = get_alphabet_by_name(%s);" % (quote(header_kv["alphabet"])))
else:
print "\ttmp.pmalphabet = best_alphabet(%s);" % (quote("".join(bases)))
print("\ttmp.pmalphabet = best_alphabet(%s);" % (quote("".join(bases))))
print "\ttmp.set_metadata(%s, %s);" % (quote(header_kv["kit"]), quote(header_kv["strand"]))
print "\treturn tmp;\n}"
print "#endif"
print("\ttmp.set_metadata(%s, %s);" % (quote(header_kv["kit"]), quote(header_kv["strand"])))
print("\treturn tmp;\n}")
print("#endif")
#!/usr/bin/env python
'''
"""
========================================================
Extract info on reads that align to a given region
Extract info on reads that align to a given region
in draft genome assembly.
========================================================
'''
"""
from __future__ import print_function
try:
from Bio import SeqIO
from Bio.SeqIO.FastaIO import SimpleFastaParser
from Bio.SeqIO.QualityIO import FastqGeneralIterator
import pysam
import argparse
import subprocess
......@@ -144,27 +147,27 @@ def main():
if ".gz" in file_type:
with gzip.open(o_fa, "rt") as fin:
if "fasta.gz" in file_type:
for record in SeqIO.parse(fin, "fasta"):
for title, seq in SimpleFastaParser(fin):
name = title.split(None, 1)[0]
if record.id in region_read_ids:
fout.write(">" + record.id + "\n")
fout.write(str(record.seq) + "\n")
fout.write(">%s\n%s\n" % (name, seq))
elif "fastq.gz" in file_type:
for record in SeqIO.parse(fin, "fastq"):
for title, seq, qual in FastqGeneralIterator(fin):
name = title.split(None, 1)[0]
if record.id in region_read_ids:
fout.write(">" + record.id + "\n")
fout.write(str(record.seq) + "\n")
fout.write(">%s\n%s\n" % (name, seq))
else:
with open(o_fa, "rt") as fin:
if "fasta" in file_type:
for record in SeqIO.parse(fin, "fasta"):
for title, seq in SimpleFastaParser(fin):
name = title.split(None, 1)[0]
if record.id in region_read_ids:
fout.write(">" + record.id + "\n")
fout.write(str(record.seq) + "\n")
fout.write(">%s\n%s\n" % (name, seq))
elif "fastq" in file_type:
for record in SeqIO.parse(fin, "fastq"):
for title, seq, qual in FastqGeneralIterator(fin):
name = title.split(None, 1)[0]
if record.id in region_read_ids:
fout.write(">" + record.id + "\n")
fout.write(str(record.seq) + "\n")
fout.write(">%s\n%s\n" % (name, seq))
# --------------------------------------------------------
# PART 7: Let's get to tarring
......@@ -178,6 +181,11 @@ def main():
archive = tarfile.open(tar_filename, "w:gz")
custom_print( "[ Creating a tar.gz file ] \t" + tar_filename )
custom_print( "[+] FAST5 files: " + op + "/fast5_files/<FAST5 file(s)>" )
# track missing fast5 files
bad_f5_found = False # true if missing fast5 file
bad_read_id = ""
bad_f5_path = ""
num_bad_cases = 0
for r in region_fast5_files.keys():
read_id = r
f5 = region_fast5_files[r]
......@@ -185,7 +193,23 @@ def main():
# get basename of fast5 file
f5_basename = extract_basename(f5)
an = op + "/fast5_files/" + f5_basename
archive.add(f5, arcname=an)
try:
archive.add(f5, arcname=an)
except:
bad_f5_found = True
bad_read_id = read_id
bad_f5_path = f5
num_bad_cases += 1
# handle missing fast5 files
if bad_f5_found:
print("\nERROR: For read " + read_id + ", could not add " + str(f5) + ".")
print("This path is inferred from the readdb file.")
print("Please check that this is the correct path in readdb file for this read.")
if num_bad_cases > 1:
print("There are " + str(num_bad_cases) + " other reads with this problem (out of " + str(len(region_fast5_files)) + ").")
print("\n")
sys.exit(1)
# --------------------------------------------------------
# PART 8: Add new files to tar
......@@ -289,7 +313,7 @@ def detect_fa_filetype(fa_filename):
for ext in ['fastq.gz', 'fasta.gz', 'fastq', 'fasta']:
if path.endswith(ext):
return ext
print( "Must be either fasta, fastq, fasta.gz, fastq.gz" )
print("Must be either fasta, fastq, fasta.gz, fastq.gz")
sys.exit(1)
def custom_print(s):
......@@ -302,7 +326,7 @@ def custom_print(s):
global verbose
global log
if verbose:
print s
print(s)
log.append(s)
if __name__ == "__main__":
......
from __future__ import print_function
import sys
import argparse
from Bio import SeqIO
from Bio.SeqIO.FastaIO import SimpleFastaParser
parser = argparse.ArgumentParser(description='Partition a genome into a set of overlapping segments')
parser.add_argument('--segment-length', type=int, default=50000)
......@@ -10,7 +12,9 @@ if len(extra) != 1:
sys.stderr.write("Error: a genome file is expected\n")
filename = extra[0]
recs = [ (rec.name, len(rec.seq)) for rec in SeqIO.parse(open(filename), "fasta")]
with open(filename) as handle:
recs = [(title.split(None, 1)[0], len(seq))
for title, seq in SimpleFastaParser(handle)]
SEGMENT_LENGTH = args.segment_length
OVERLAP_LENGTH = args.overlap_length
......
from __future__ import print_function
import sys
import glob
from Bio import SeqIO
from Bio.SeqIO.FastaIO import SimpleFastaParser
from Bio import pairwise2
def merge_into_consensus(consensus, incoming, overlap_length):
......@@ -67,14 +69,15 @@ segments_by_name = dict()
# Load the polished segments into a dictionary keyed by the start coordinate
for fn in sys.argv[1:]:
for rec in SeqIO.parse(open(fn), "fasta"):
(contig, segment_range) = rec.name.split(":")
with open(fn) as handle:
for title, seq in SimpleFastaParser(handle):
(contig, segment_range) = title.split(None, 1)[0].split(":")
if contig not in segments_by_name:
segments_by_name[contig] = dict()
if contig not in segments_by_name:
segments_by_name[contig] = dict()
segment_start, segment_end = segment_range.split("-")
segments_by_name[contig][int(segment_start)] = str(rec.seq)
segment_start, segment_end = segment_range.split("-")
segments_by_name[contig][int(segment_start)] = seq
# Assemble while making sure every segment is present
for contig_name in sorted(segments_by_name.keys()):
......
"""
reestimate_polya_emissions.py: given two `polya-samples` TSV files based on different
underlying kmer models (with the newer TSV giving failing poly(A) segmentations),
infer the best new parameters for the HMM emissions.
Usage:
$ python reestimate_polya_emissions.py samples.old.tsv seg.old.tsv samples.new.tsv
where:
* `samples.old.tsv` is the output of `nanopolish polya -vv [...] | grep 'polya-samples'`,
generated by the **old** kmer models;
* `seg.old.tsv` is the output of `nanopolish polya -v [...] | grep 'polya-segmentation'`,
generated by the **old** kmer models;
* `samples.new.tsv` is the output of `nanopolish polya -vv [...] | grep 'polya-samples'`,
generated by the **new** kmer models.
Dependencies:
* numpy >= 1.11.2
* scipy >= 0.18.1
* sklearn >= 0.18.1
"""
import csv
import numpy as np
import argparse
import os
from scipy.stats import norm
from sklearn.mixture import GaussianMixture
log_inv_sqrt_2pi = np.log(0.3989422804014327)
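# log(1/sqrt(2*pi)) ~= -0.9189; log_normal_pdf below therefore returns the elementwise
# Gaussian log-density: log(1/sqrt(2*pi)) - log(sigma) - 0.5 * ((x - mu) / sigma) ** 2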
def log_normal_pdf(xs, mu, sigma):
"""Compute the log-normal PDF of a given sample(s) against a mu and sigma."""
alpha = (xs - mu) * np.reciprocal(sigma)
return ( log_inv_sqrt_2pi - np.log(sigma) + (-0.5 * alpha * alpha) )
def fit_gaussian(samples):
"""Given a numpy array of floating point samples, fit a gaussian distribution."""
mu, sigma = norm.fit(samples)
return (mu,sigma)
def fit_gmm(samples, ncomponents=2):
"""Given a numpy array of floating point samples, fit a gaussian mixture model."""
# assume samples is of shape (NSAMPLES,); unsqueeze to (NSAMPLES,1) and train a GMM:
gmm = GaussianMixture(n_components=ncomponents)
gmm.fit(samples.reshape(-1,1))
# return params of GMM in [(coeff, mu, sigma)] format:
params = [(gmm.weights_[c], gmm.means_[c][0], gmm.covariances_[c][0][0]) for c in range(ncomponents)]
return params
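# Illustrative usage only (synthetic data, not part of this script): fit_gmm returns a list
# of (weight, mean, variance) tuples, one per mixture component, e.g.
#   rng = np.random.RandomState(0)
#   samples = np.concatenate([rng.normal(80.0, 4.0, 300), rng.normal(105.0, 6.0, 200)])
#   for weight, mean, variance in fit_gmm(samples, ncomponents=2):
#       print("pi=%.2f mu=%.2f sigma=%.2f" % (weight, mean, np.sqrt(variance)))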
def old_tsv_to_numpy(tsv_path):
"""
Read a TSV containing raw samples and return a dictionary consisting
of the following numpy datasets:
* S_loglkhd: the log-likelihoods of the samples belonging to the START segment.
* L_loglkhd: the log-likelihoods of the samples belonging to the LEADER segment.
* A_loglkhd: the log-likelihoods of the samples belonging to the ADAPTER segment.
* P_loglkhd: the log-likelihoods of the samples belonging to the POLYA segment.
* T_loglkhd: the log-likelihoods of the samples belonging to the TRANSCRIPT segment.
"""
# instantiate arrays to hold values:
S_loglkhd = []
L_loglkhd = []
A_loglkhd = []
P_loglkhd = []
T_loglkhd = []
# loop over TSV file and append data to arrays:
str2int = { 'START': 0, 'LEADER': 1, 'ADAPTER': 2, 'POLYA': 3, 'TRANSCRIPT': 5 }
with open(tsv_path, 'r') as f:
headers = ['tag','read_id', 'chr', 'idx', 'sample', 'scaled_sample',
's_llh', 'l_llh','a_llh','p_llh', 'c_llh', 't_llh','region']
rdr = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE, fieldnames=headers)
for row in rdr:
# parse row fields:
s_llh = float(row['s_llh'])
l_llh = float(row['l_llh'])
a_llh = float(row['a_llh'])
p_llh = float(row['p_llh'])
t_llh = float(row['t_llh'])
region = row['region']
# append log-likelihoods to appropriate arrays:
if region == 'START':
S_loglkhd.append(s_llh)
if region == 'LEADER':
L_loglkhd.append(l_llh)
if region == 'ADAPTER':
A_loglkhd.append(a_llh)
if region == 'POLYA':
P_loglkhd.append(p_llh)
if region == 'TRANSCRIPT':
T_loglkhd.append(t_llh)
return { "S_loglkhd": np.array(S_loglkhd, dtype=float),
"L_loglkhd": np.array(L_loglkhd, dtype=float),
"A_loglkhd": np.array(A_loglkhd, dtype=float),
"P_loglkhd": np.array(P_loglkhd, dtype=float),
"T_loglkhd": np.array(T_loglkhd, dtype=float) }
def make_segmentation_dict(segmentations_tsv_path):
"""
Load a segmentations TSV file. Rows of `segmentations_tsv_path` look like this:
tag read_id: pos: L_0 A_0: P_0: P_1: RR: P(A)L: AL:
polya-segmentation fc06... 161684804 47.0 1851.0 8354.0 11424.0 73.76 75.18 35.23
Note that this function only takes the first available segmentation for each read, i.e.
if a read id appears more than once in the TSV, only the first segmentation is kept, and
later occurrences of the read id in the TSV are ignored.
"""
segments = {}
# loop thru TSV and update the list of segmentations:
with open(segmentations_tsv_path, 'r') as f:
headers = ['tag', 'read_id', 'pos', 'L_start', 'A_start', 'P_start', 'P_end', 'rate', 'plen', 'alen']
rdr = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE, fieldnames=headers)
for row in rdr:
if row['read_id'] not in segments.keys():
segments[row['read_id']] = { 'L_start': int(float(row['L_start'])),
'A_start': int(float(row['A_start'])),
'P_start': int(float(row['P_start'])),
'P_end': int(float(row['P_end'])) }
return segments
def region_search(read_id, sample_ix, segmentations):
"""
Given a dictionary of ("gold-standard") segmentations, look up the region that a
given read and sample index belongs to.
Returns an integer label out of 0, 1, 2, 3, 5, 6 where:
0 => START, 1 => LEADER, 2 => ADAPTER, 3 => POLYA, 5 => TRANSCRIPT, 6 => UNKNOWN
(We skip label '4' because it represents CLIFFs, which we don't track here --- they have
a uniform distribution.)
"""
# find read ID in segmentations:
read_key = None
for long_read_id in segmentations.keys():
if long_read_id[0:len(read_id)] == read_id:
read_key = long_read_id
# return UNK if read not found:
if read_key == None:
return 6
# find region that `sample_ix` belongs to:
l_start = segmentations[read_key]['L_start']
a_start = segmentations[read_key]['A_start']
p_start = segmentations[read_key]['P_start']
p_end = segmentations[read_key]['P_end']
if (sample_ix < l_start):
return 0
if (sample_ix < a_start):
return 1
if (sample_ix < p_start):
return 2
if (sample_ix <= p_end):
return 3
if (sample_ix > p_end):
return 5
return 6
def new_tsv_to_numpy(tsv_path, segmentations):
"""
Read a TSV of new, miscalled samples and a dictionary of correct segmentations (coming from
an older, correct TSV) and return a dict of numpy arrays.
Args:
* tsv_path: path to a TSV generated by `nanopolish polya -vv [...]`.
* segmentations: a dictionary of segmentation intervals, given in numpy format.
Returns: a dictionary of numpy arrays.
"""
# instantiate arrays to hold values:
S_samples = []
L_samples = []
A_samples = []
P_samples = []
T_samples = []
# loop over TSV file and append data to arrays:
with open(tsv_path, 'r') as f:
headers = ['tag','read_id', 'chr', 'idx', 'sample', 'scaled_sample',
's_llh', 'l_llh','a_llh','p_llh', 'c_llh', 't_llh','region']
rdr = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE, fieldnames=headers)
for row in rdr:
scaled_sample = float(row['scaled_sample'])
read = row['read_id']
contig = row['chr']
index = int(row['idx'])
region = region_search(read, index, segmentations)
if region == 0:
S_samples.append(scaled_sample)
if region == 1:
L_samples.append(scaled_sample)
if region == 2:
A_samples.append(scaled_sample)
if region == 3:
P_samples.append(scaled_sample)
if region == 5:
T_samples.append(scaled_sample)
return { "S_samples": np.array(S_samples, dtype=float),
"L_samples": np.array(L_samples, dtype=float),
"A_samples": np.array(A_samples, dtype=float),
"P_samples": np.array(P_samples, dtype=float),
"T_samples": np.array(T_samples, dtype=float) }
def main(old_samples_tsv, old_segmentations_tsv, new_samples_tsv, benchmark=True):
"""
Infer and print the new values for mu and sigma (for each of S, L, A, P, C, T) to STDOUT.
Args:
* old_samples_tsv: path to TSV file containing polya-samples data from an older kmer model.
* old_segmentations_tsv: path to TSV file containing polya-segmentation data from an older kmer model.
* new_samples_tsv: path to TSV file containing polya-samples data from the newer kmer model.
Returns: N/A, prints outputs to STDOUT.
"""
### read all samples into numpy arrays:
print("Loading data from TSV...")
old_data = old_tsv_to_numpy(old_samples_tsv)
segmentations = make_segmentation_dict(old_segmentations_tsv)
new_data = new_tsv_to_numpy(new_samples_tsv, segmentations)
print("... Datasets loaded.")
### infer best possible new mu,sigma for each of S, L, A, P, T:
print("Fitting gaussians to new scaled samples (this may take a while)...")
new_mu_S, new_sigma_S = fit_gaussian(new_data['S_samples'])
new_mu_L, new_sigma_L = fit_gaussian(new_data['L_samples'])
(new_pi0_A, new_mu0_A, new_sig0_A), (new_pi1_A, new_mu1_A, new_sig1_A) = fit_gmm(new_data['A_samples'], ncomponents=2)
new_mu_P, new_sigma_P = fit_gaussian(new_data['P_samples'])
(new_pi0_T, new_mu0_T, new_sig0_T), (new_pi1_T, new_mu1_T, new_sig1_T) = fit_gmm(new_data['T_samples'], ncomponents=2)
### print to stdout:
print("New params for START: mu = {0}, var = {1}, stdv = {2}".format(new_mu_S, new_sigma_S, np.sqrt(new_sigma_S)))
print("New params for LEADER: mu = {0}, var = {1}, stdv = {2}".format(new_mu_L, new_sigma_L, np.sqrt(new_sigma_L)))
print("New params for ADAPTER0: pi = {0}, mu = {1}, var = {2}, stdv = {3}".format(new_pi0_A, new_mu0_A, new_sig0_A, np.sqrt(new_sig0_A)))
print("New params for ADAPTER1: pi = {0}, mu = {1}, var = {2}, stdv = {3}".format(new_pi1_A, new_mu1_A, new_sig1_A, np.sqrt(new_sig1_A)))
print("New params for POLYA: mu = {0}, var = {1}, stdv = {2}".format(new_mu_P, new_sigma_P, np.sqrt(new_sigma_P)))
print("New params for TRANSCR0: pi = {0}, mu = {1}, var = {2}, stdv = {3}".format(new_pi0_T, new_mu0_T, new_sig0_T, np.sqrt(new_sig0_T)))
print("New params for TRANSCR1: pi = {0}, mu = {1}, var = {2}, stdv = {3}".format(new_pi1_T, new_mu1_T, new_sig1_T, np.sqrt(new_sig1_T)))
### optionally, benchmark:
if not benchmark:
return
print("===== Emission Log-Likelihood Benchmarks =====")
old_S_llh = np.mean(old_data['S_loglkhd'])
new_S_llh = np.mean(norm.logpdf(new_data['S_samples'], loc=new_mu_S, scale=np.sqrt(new_sigma_S)))
print("> Average START log-probs:")
print("> Old avg. log-likelihood: {0} | New avg. log-likelihood: {1}".format(old_S_llh, new_S_llh))
old_L_llh = np.mean(old_data['L_loglkhd'])
new_L_llh = np.mean(norm.logpdf(new_data['L_samples'], loc=new_mu_L, scale=np.sqrt(new_sigma_L)))
print("> Average LEADER log-probs:")
print("> Old avg. log-likelihood: {0} | New avg. log-likelihood: {1}".format(old_L_llh, new_L_llh))
old_A_llh = np.mean(old_data['A_loglkhd'])
new_A_llh0 = new_pi0_A * norm.pdf(new_data['A_samples'], loc=new_mu0_A, scale=np.sqrt(new_sig0_A))
new_A_llh1 = new_pi1_A * norm.pdf(new_data['A_samples'], loc=new_mu1_A, scale=np.sqrt(new_sig1_A))
new_A_llh = np.mean(np.log(new_A_llh0 + new_A_llh1))
print("> Average ADAPTER log-probs:")
print("> Old avg. log-likelihood: {0} | New avg. log-likelihood: {1}".format(old_A_llh, new_A_llh))
old_P_llh = np.mean(old_data['P_loglkhd'])
new_P_llh = np.mean(norm.logpdf(new_data['P_samples'], loc=new_mu_P, scale=np.sqrt(new_sigma_P)))
print("> Average POLYA log-probs:")
print("> Old avg. log-likelihood: {0} | New avg. log-likelihood: {1}".format(old_P_llh, new_P_llh))
old_T_llh = np.mean(old_data['T_loglkhd'])
new_T_llh0 = new_pi0_T * norm.pdf(new_data['T_samples'], loc=new_mu0_T, scale=np.sqrt(new_sig0_T))
new_T_llh1 = new_pi1_T * norm.pdf(new_data['T_samples'], loc=new_mu1_T, scale=np.sqrt(new_sig1_T))
new_T_llh = np.mean(np.log(new_T_llh0 + new_T_llh1))
print("> Average TRANSCRIPT log-probs:")
print("> Old avg. log-likelihood: {0} | New avg. log-likelihood: {1}".format(old_T_llh, new_T_llh))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Infer new Poly(A) emission parameters.")
parser.add_argument("old_samples_tsv", help="Path to TSV file of old samples.")
parser.add_argument("segmentation_tsv", help="Path to segmentations for reads.")
parser.add_argument("new_samples_tsv", help="Path to TSV file of new samples.")
parser.add_argument("--benchmark", default=True, type=bool, dest="benchmark",
help="If `--benchmark=False`, don't the new estimated HMM parameters.")
args = parser.parse_args()
# sanity checks:
assert os.path.exists(args.old_samples_tsv)
assert os.path.exists(args.segmentation_tsv)
assert os.path.exists(args.new_samples_tsv)
# run inference and (optional) benchmarking of new parameters:
main(args.old_samples_tsv, args.segmentation_tsv, args.new_samples_tsv, benchmark=args.benchmark)
......@@ -63,11 +63,13 @@ static const char *EVENTALIGN_USAGE_MESSAGE =
" -b, --bam=FILE the reads aligned to the genome assembly are in bam FILE\n"
" -g, --genome=FILE the genome we are computing a consensus for is in FILE\n"
" -t, --threads=NUM use NUM threads (default: 1)\n"
" -q, --min-mapping-quality=NUM only use reads with mapping quality at least NUM (default: 0)\n"
" --scale-events scale events to the model, rather than vice-versa\n"
" --progress print out a progress message\n"
" -n, --print-read-names print read names instead of indexes\n"
" --summary=FILE summarize the alignment of each read/strand in FILE\n"
" --samples write the raw samples for the event to the tsv output\n"
" --signal-index write the raw signal start and end index values for the event to the tsv output\n"
" --models-fofn=FILE read alternative k-mer models from FILE\n"
"\nReport bugs to " PACKAGE_BUGREPORT "\n\n";
......@@ -84,32 +86,36 @@ namespace opt
static int progress = 0;
static int num_threads = 1;
static int scale_events = 0;
static int batch_size = 128;
static int batch_size = 512;
static int min_mapping_quality = 0;
static bool print_read_names;
static bool full_output;
static bool write_samples = false;
static bool write_signal_index = false;
}
static const char* shortopts = "r:b:g:t:w:vn";
static const char* shortopts = "r:b:g:t:w:q:vn";
enum { OPT_HELP = 1, OPT_VERSION, OPT_PROGRESS, OPT_SAM, OPT_SUMMARY, OPT_SCALE_EVENTS, OPT_MODELS_FOFN, OPT_SAMPLES };
enum { OPT_HELP = 1, OPT_VERSION, OPT_PROGRESS, OPT_SAM, OPT_SUMMARY, OPT_SCALE_EVENTS, OPT_MODELS_FOFN, OPT_SAMPLES, OPT_SIGNAL_INDEX };
static const struct option longopts[] = {
{ "verbose", no_argument, NULL, 'v' },
{ "reads", required_argument, NULL, 'r' },
{ "bam", required_argument, NULL, 'b' },
{ "genome", required_argument, NULL, 'g' },
{ "window", required_argument, NULL, 'w' },
{ "threads", required_argument, NULL, 't' },
{ "summary", required_argument, NULL, OPT_SUMMARY },
{ "models-fofn", required_argument, NULL, OPT_MODELS_FOFN },
{ "print-read-names", no_argument, NULL, 'n' },
{ "samples", no_argument, NULL, OPT_SAMPLES },
{ "scale-events", no_argument, NULL, OPT_SCALE_EVENTS },
{ "sam", no_argument, NULL, OPT_SAM },
{ "progress", no_argument, NULL, OPT_PROGRESS },
{ "help", no_argument, NULL, OPT_HELP },
{ "version", no_argument, NULL, OPT_VERSION },
{ "verbose", no_argument, NULL, 'v' },
{ "reads", required_argument, NULL, 'r' },
{ "bam", required_argument, NULL, 'b' },
{ "genome", required_argument, NULL, 'g' },
{ "window", required_argument, NULL, 'w' },
{ "threads", required_argument, NULL, 't' },
{ "min-mapping-quality", required_argument, NULL, 'q' },
{ "summary", required_argument, NULL, OPT_SUMMARY },
{ "models-fofn", required_argument, NULL, OPT_MODELS_FOFN },
{ "print-read-names", no_argument, NULL, 'n' },
{ "samples", no_argument, NULL, OPT_SAMPLES },
{ "signal-index", no_argument, NULL, OPT_SIGNAL_INDEX },
{ "scale-events", no_argument, NULL, OPT_SCALE_EVENTS },
{ "sam", no_argument, NULL, OPT_SAM },
{ "progress", no_argument, NULL, OPT_PROGRESS },
{ "help", no_argument, NULL, OPT_HELP },
{ "version", no_argument, NULL, OPT_VERSION },
{ NULL, 0, NULL, 0 }
};
......@@ -147,7 +153,7 @@ struct EventalignSummary
};
//
const PoreModel* EventAlignmentParameters::get_model() const
const PoreModel* EventAlignmentParameters::get_model() const
{
if(this->alphabet == "") {
return this->sr->get_base_model(this->strand_idx);
......@@ -176,12 +182,12 @@ void trim_aligned_pairs_to_ref_region(std::vector<AlignedPair>& aligned_pairs, i
{
std::vector<AlignedPair> trimmed;
for(size_t i = 0; i < aligned_pairs.size(); ++i) {
if(aligned_pairs[i].ref_pos >= ref_start &&
if(aligned_pairs[i].ref_pos >= ref_start &&
aligned_pairs[i].ref_pos <= ref_end) {
trimmed.push_back(aligned_pairs[i]);
}
}
aligned_pairs.swap(trimmed);
}
......@@ -194,7 +200,7 @@ int get_end_pair(const std::vector<AlignedPair>& aligned_pairs, int ref_pos_max,
return pair_idx - 1;
pair_idx += 1;
}
return aligned_pairs.size() - 1;
}
......@@ -206,7 +212,7 @@ std::string get_reference_region_ts(const faidx_t* fai, const char* ref_name, in
char* cref_seq;
#pragma omp critical
cref_seq = faidx_fetch_seq(fai, ref_name, start, end, fetched_len);
assert(cref_seq != NULL);
std::string out(cref_seq);
......@@ -225,6 +231,10 @@ void emit_tsv_header(FILE* fp)
fprintf(fp, "%s\t%s\t%s\t%s\t", "event_index", "event_level_mean", "event_stdv", "event_length");
fprintf(fp, "%s\t%s\t%s\t%s", "model_kmer", "model_mean", "model_stdv", "standardized_level");
if(opt::write_signal_index) {
fprintf(fp, "\t%s\t%s", "start_idx", "end_idx");
}
if(opt::write_samples) {
fprintf(fp, "\t%s", "samples");
}
......@@ -284,7 +294,7 @@ std::vector<uint32_t> event_alignment_to_cigar(const std::vector<EventAlignment>
incoming = (r_step - 1) << BAM_CIGAR_SHIFT;
incoming |= BAM_CDEL;
out.push_back(incoming);
incoming = 1 << BAM_CIGAR_SHIFT;
incoming |= BAM_CMATCH;
} else {
......@@ -296,7 +306,7 @@ std::vector<uint32_t> event_alignment_to_cigar(const std::vector<EventAlignment>
// If the operation matches the previous, extend the length
// otherwise append a new op
if(bam_cigar_op(out.back()) == bam_cigar_op(incoming)) {
uint32_t sum = bam_cigar_oplen(out.back()) +
uint32_t sum = bam_cigar_oplen(out.back()) +
bam_cigar_oplen(incoming);
out.back() = sum << BAM_CIGAR_SHIFT | bam_cigar_op(incoming);
} else {
......@@ -313,13 +323,13 @@ std::vector<uint32_t> event_alignment_to_cigar(const std::vector<EventAlignment>
void emit_event_alignment_sam(htsFile* fp,
const SquiggleRead& sr,
const bam_hdr_t* base_hdr,
const bam1_t* base_record,
const bam1_t* base_record,
const std::vector<EventAlignment>& alignments)
{
if(alignments.empty())
return;
bam1_t* event_record = bam_init1();
// Variable-length data
std::string qname = sr.read_name + (alignments.front().strand_idx == 0 ? ".template" : ".complement");
......@@ -332,7 +342,7 @@ void emit_event_alignment_sam(htsFile* fp,
event_record->core.flag = alignments.front().rc ? 16 : 0;
event_record->core.l_qseq = 0;
event_record->core.mtid = -1;
event_record->core.mpos = -1;
event_record->core.isize = 0;
......@@ -345,23 +355,23 @@ void emit_event_alignment_sam(htsFile* fp,
event_record->core.n_cigar * 4 + // 4 bytes per cigar op
event_record->core.l_qseq + // query seq
event_record->core.l_qseq; // query quality
// nothing copied yet
event_record->l_data = 0;
// allocate data
event_record->data = (uint8_t*)malloc(event_record->m_data);
// copy q name
assert(event_record->core.l_qname <= event_record->m_data);
strncpy(bam_get_qname(event_record),
strncpy(bam_get_qname(event_record),
qname.c_str(),
event_record->core.l_qname);
event_record->l_data += event_record->core.l_qname;
// cigar
assert(event_record->l_data + event_record->core.n_cigar * 4 <= event_record->m_data);
memcpy(bam_get_cigar(event_record),
memcpy(bam_get_cigar(event_record),
&cigar[0],
event_record->core.n_cigar * 4);
event_record->l_data += event_record->core.n_cigar * 4;
......@@ -445,6 +455,11 @@ void emit_event_alignment_tsv(FILE* fp,
model_stdv,
standard_level);
if(opt::write_signal_index) {
std::pair<size_t, size_t> signal_idx = sr.get_event_sample_idx(ea.strand_idx, ea.event_idx);
fprintf(fp, "\t%zu\t%zu", signal_idx.first, signal_idx.second);
}
if(opt::write_samples) {
std::vector<float> samples = sr.get_scaled_samples_for_event(ea.strand_idx, ea.event_idx);
std::stringstream sample_ss;
......@@ -525,7 +540,13 @@ void realign_read(const ReadDB& read_db,
std::string read_name = bam_get_qname(record);
// load read
SquiggleRead sr(read_name, read_db, opt::write_samples ? SRF_LOAD_RAW_SAMPLES : 0);
int sr_flag;
if(( opt::write_samples) || (opt::write_signal_index)) {
sr_flag = SRF_LOAD_RAW_SAMPLES;
} else {
sr_flag = 0;
}
SquiggleRead sr(read_name, read_db, sr_flag);
if(opt::verbose > 1) {
fprintf(stderr, "Realigning %s [%zu %zu]\n",
......@@ -545,7 +566,7 @@ void realign_read(const ReadDB& read_db,
params.hdr = hdr;
params.record = record;
params.strand_idx = strand_idx;
params.read_idx = read_idx;
params.region_start = region_start;
params.region_end = region_end;
......@@ -596,12 +617,12 @@ std::vector<EventAlignment> align_read_to_ref(const EventAlignmentParameters& pa
int fetched_len = 0;
int ref_offset = params.record->core.pos;
std::string ref_name(params.hdr->target_name[params.record->core.tid]);
std::string ref_seq = get_reference_region_ts(params.fai, ref_name.c_str(), ref_offset,
std::string ref_seq = get_reference_region_ts(params.fai, ref_name.c_str(), ref_offset,
bam_endpos(params.record), &fetched_len);
// convert to upper case
std::transform(ref_seq.begin(), ref_seq.end(), ref_seq.begin(), ::toupper);
// k from read pore model
const uint32_t k = params.sr->get_model_k(params.strand_idx);
......@@ -641,12 +662,12 @@ std::vector<EventAlignment> align_read_to_ref(const EventAlignmentParameters& pa
// get the event range of the read to re-align
int read_kidx_start = aligned_pairs.front().read_pos;
int read_kidx_end = aligned_pairs.back().read_pos;
if(do_base_rc) {
read_kidx_start = params.sr->flip_k_strand(read_kidx_start, k);
read_kidx_end = params.sr->flip_k_strand(read_kidx_end, k);
}
assert(read_kidx_start >= 0);
assert(read_kidx_end >= 0);
......@@ -663,7 +684,7 @@ std::vector<EventAlignment> align_read_to_ref(const EventAlignmentParameters& pa
// Get the index of the aligned pair approximately align_stride away
int end_pair_idx = get_end_pair(aligned_pairs, curr_start_ref + align_stride, curr_pair_idx);
int curr_end_ref = aligned_pairs[end_pair_idx].ref_pos;
int curr_end_read = aligned_pairs[end_pair_idx].read_pos;
......@@ -680,7 +701,7 @@ std::vector<EventAlignment> align_read_to_ref(const EventAlignmentParameters& pa
assert(fwd_subseq.length() == rc_subseq.length());
HMMInputSequence hmm_sequence(fwd_subseq, rc_subseq, pore_model->pmalphabet);
// Require a minimum amount of sequence to align to
if(hmm_sequence.length() < 2 * k) {
break;
......@@ -708,7 +729,7 @@ std::vector<EventAlignment> align_read_to_ref(const EventAlignmentParameters& pa
input.rc = rc_flags[params.strand_idx];
std::vector<HMMAlignmentState> event_alignment = profile_hmm_align(hmm_sequence, input);
// Output alignment
size_t num_output = 0;
size_t event_align_idx = 0;
......@@ -736,14 +757,14 @@ std::vector<EventAlignment> align_read_to_ref(const EventAlignmentParameters& pa
int last_event_output = 0;
int last_ref_kmer_output = 0;
for(; event_align_idx < event_alignment.size() &&
for(; event_align_idx < event_alignment.size() &&
(num_output < output_stride || last_section); event_align_idx++) {
HMMAlignmentState& as = event_alignment[event_align_idx];
if(as.state != 'K' && (int)as.event_idx != curr_start_event) {
EventAlignment ea;
// ref
ea.ref_name = ref_name;
ea.ref_position = curr_start_ref + as.kmer_idx;
......@@ -807,8 +828,10 @@ void parse_eventalign_options(int argc, char** argv)
case 'b': arg >> opt::bam_file; break;
case '?': die = true; break;
case 't': arg >> opt::num_threads; break;
case 'q': arg >> opt::min_mapping_quality; break;
case 'n': opt::print_read_names = true; break;
case 'f': opt::full_output = true; break;
case OPT_SIGNAL_INDEX: opt::write_signal_index = true; break;
case OPT_SAMPLES: opt::write_samples = true; break;
case 'v': opt::verbose++; break;
case OPT_MODELS_FOFN: arg >> opt::models_fofn; break;
......@@ -843,7 +866,7 @@ void parse_eventalign_options(int argc, char** argv)
std::cerr << SUBPROGRAM ": a --reads file must be provided\n";
die = true;
}
if(opt::genome_file.empty()) {
std::cerr << SUBPROGRAM ": a --genome file must be provided\n";
die = true;
......@@ -859,7 +882,7 @@ void parse_eventalign_options(int argc, char** argv)
PoreModelSet::initialize(opt::models_fofn);
}
if (die)
if (die)
{
std::cout << "\n" << EVENTALIGN_USAGE_MESSAGE;
exit(EXIT_FAILURE);
......@@ -900,6 +923,7 @@ int eventalign_main(int argc, char** argv)
// bind the other parameters the worker function needs here
auto f = std::bind(realign_read, std::ref(read_db), std::ref(fai), std::ref(writer), _1, _2, _3, _4, _5);
BamProcessor processor(opt::bam_file, opt::region, opt::num_threads);
processor.set_min_mapping_quality(opt::min_mapping_quality);
// Copy the bam header to std
if(opt::output_sam) {
......
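To illustrate the new `--signal-index` columns, a small Python sketch; it assumes eventalign was run with `--signal-index` and its TSV output redirected to `eventalign.tsv` (a hypothetical path), and uses only column names that appear in `emit_tsv_header` above:
```
import csv

with open("eventalign.tsv") as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        # start_idx/end_idx delimit the raw signal samples spanned by this event
        n_raw = int(row["end_idx"]) - int(row["start_idx"])
        print(row["model_kmer"], row["event_level_mean"], row["model_mean"], n_raw)
```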
......@@ -239,6 +239,18 @@ class Alphabet
// does this alphabet contain all of the nucleotides in bases?
virtual bool contains_all(const char *bases) const = 0;
// check if the motif matches a recognition site
bool is_motif_match(const std::string& str, const size_t i) const
{
RecognitionMatch match;
for(size_t j = 0; j < num_recognition_sites(); ++j){
match = match_to_site(str, i, get_recognition_site(j), recognition_length());
if(match.length == recognition_length())
return true;
}
return false;
}
};
#define BASIC_MEMBER_BOILERPLATE \
......
......@@ -16,10 +16,12 @@
BamProcessor::BamProcessor(const std::string& bam_file,
const std::string& region,
const int num_threads) :
const int num_threads,
const int batch_size) :
m_bam_file(bam_file),
m_region(region),
m_num_threads(num_threads)
m_num_threads(num_threads),
m_batch_size(batch_size)
{
// load bam file
......@@ -98,11 +100,11 @@ void BamProcessor::parallel_run( std::function<void(const bam_hdr_t* hdr,
// realign if we've hit the max buffer size or reached the end of file
if(num_records_buffered == records.size() || result < 0 || (num_records_buffered + num_reads_realigned == m_max_reads)) {
#pragma omp parallel for
#pragma omp parallel for schedule(dynamic)
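// note: dynamic scheduling hands reads to threads as they become free, which balances
// the uneven per-read alignment cost better than the default static schedule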
for(size_t i = 0; i < num_records_buffered; ++i) {
bam1_t* record = records[i];
size_t read_idx = num_reads_realigned + i;
if( (record->core.flag & BAM_FUNMAP) == 0) {
if( (record->core.flag & BAM_FUNMAP) == 0 && record->core.qual >= m_min_mapping_quality) {
func(m_hdr, record, read_idx, clip_start, clip_end);
}
}
......