Commits on Source (10)
......@@ -4,6 +4,12 @@
Software package for signal-level analysis of Oxford Nanopore sequencing data. Nanopolish can calculate an improved consensus sequence for a draft genome assembly, detect base modifications, call SNPs and indels with respect to a reference genome and more (see Nanopolish modules, below).
## Release notes
* 0.10.2: added new program `nanopolish polya` to estimate the length of poly-A tails on direct RNA reads (by @paultsw)
* 0.10.1: `nanopolish variants --consensus` now only outputs a VCF file instead of a fasta sequence. The VCF file describes the changes that need to be made to turn the draft sequence into the polished assembly. A new program, `nanopolish vcf2fasta`, is provided to generate the polished genome (this replaces `nanopolish_merge.py`, see usage instructions below). This change is to avoid issues when merging segments that end on repeat boundaries (reported by Michael Wykes and Chris Wright).
## Dependencies
A compiler that supports C++11 is needed to build nanopolish. Development of the code is performed using [gcc-4.8](https://gcc.gnu.org/gcc-4.8/).
......@@ -36,7 +42,7 @@ When major features have been added or bugs fixed, we will tag and release a new
```
git clone --recursive https://github.com/jts/nanopolish.git
cd nanopolish
git checkout v0.7.1
git checkout v0.9.2
make
```
......@@ -45,7 +51,6 @@ make
The main subprograms of nanopolish are:
```
nanopolish extract: extract reads in FASTA or FASTQ format from a directory of FAST5 files
nanopolish call-methylation: predict genomic bases that may be methylated
nanopolish variants: detect SNPs and indels with respect to a reference genome
nanopolish variants --consensus: calculate an improved consensus sequence for a draft genome assembly
......@@ -82,7 +87,7 @@ Now, we use nanopolish to compute the consensus sequence (the genome is polished
```
python nanopolish_makerange.py draft.fa | parallel --results nanopolish.results -P 8 \
nanopolish variants --consensus polished.{1}.fa -w {1} -r reads.fa -b reads.sorted.bam -g draft.fa -t 4 --min-candidate-frequency 0.1
nanopolish variants --consensus -o polished.{1}.vcf -w {1} -r reads.fa -b reads.sorted.bam -g draft.fa -t 4 --min-candidate-frequency 0.1
```
This command will run the consensus algorithm on eight 50kbp segments of the genome at a time, using 4 threads each. Change the ```-P``` and ```--threads``` options as appropriate for the machines you have available.
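For intuition, the windowing that `nanopolish_makerange.py` performs can be sketched in a few lines of Python (illustrative only; the bundled script is the supported tool, its exact overlap and end-of-contig handling may differ, and the overlap value below is an assumed default):
```
def make_windows(contig_name, contig_length, segment_length=50000, overlap_length=200):
    """Yield window strings like 'contig:start-end' covering the contig with overlap."""
    # segment_length matches the script's default; overlap_length here is an assumed value
    start = 0
    while start < contig_length:
        end = min(start + segment_length + overlap_length, contig_length)
        yield "%s:%d-%d" % (contig_name, start, end)
        start += segment_length

for window in make_windows("tig00000001", 120000):
    print(window)  # tig00000001:0-50200, tig00000001:50000-100200, tig00000001:100000-120000
```
Each window string can be passed to `nanopolish variants --consensus` via `-w`, which is exactly what the `parallel` pipeline above does.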
......@@ -90,7 +95,7 @@ This command will run the consensus algorithm on eight 50kbp segments of the gen
After all polishing jobs are complete, you can merge the individual 50kb segments together back into the final assembly:
```
python nanopolish_merge.py polished.*.fa > polished_genome.fa
nanopolish vcf2fasta -g draft.fa polished.*.vcf > polished_genome.fa
```
## Calling Methylation
......
nanopolish (0.10.2-1) unstable; urgency=medium
* Team upload.
* New upstream version
* debhelper 11
* Point Vcs fields to salsa.debian.org
* Standards-Version: 4.2.1
* Fix path for Perl interpreter
* Enhance long description
-- Andreas Tille <tille@debian.org> Thu, 13 Sep 2018 11:37:24 +0200
nanopolish (0.9.0-1) unstable; urgency=medium
* New upstream version
......
Source: nanopolish
Section: science
Priority: optional
Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.org>
Uploaders: Afif Elghraoui <afif@debian.org>
Build-Depends:
debhelper (>= 10),
zlib1g-dev,
libfast5-dev (>= 0.6.5),
libhts-dev,
libeigen3-dev,
Standards-Version: 4.1.3
Section: science
Priority: optional
Build-Depends: debhelper (>= 11~),
zlib1g-dev,
libfast5-dev (>= 0.6.5),
libhts-dev,
libeigen3-dev
Standards-Version: 4.2.1
Vcs-Browser: https://salsa.debian.org/med-team/nanopolish
Vcs-Git: https://salsa.debian.org/med-team/nanopolish.git
Homepage: https://github.com/jts/nanopolish
Vcs-Git: https://anonscm.debian.org/git/debian-med/nanopolish.git
Vcs-Browser: https://anonscm.debian.org/cgit/debian-med/nanopolish.git
Package: nanopolish
Architecture: any-amd64 any-i386
Depends:
${shlibs:Depends},
${misc:Depends},
Suggests:
python,
perl,
make,
Depends: ${shlibs:Depends},
${misc:Depends}
Suggests: python,
perl,
make
Description: consensus caller for nanopore sequencing data
Nanopolish uses a signal-level hidden Markov model for consensus
calling of nanopore genome sequencing data.
Nanopolish uses a signal-level hidden Markov model for consensus calling
of nanopore genome sequencing data. It can perform signal-level analysis
of Oxford Nanopore sequencing data. Nanopolish can calculate an improved
consensus sequence for a draft genome assembly, detect base
modifications, call SNPs and indels with respect to a reference genome
and more.
......@@ -7,13 +7,11 @@ Subject: [PATCH] check for existence of group before opening it (#335)
src/common/nanopolish_fast5_io.cpp | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/src/common/nanopolish_fast5_io.cpp b/src/common/nanopolish_fast5_io.cpp
index 0144062..7f38719 100644
--- a/src/common/nanopolish_fast5_io.cpp
+++ b/src/common/nanopolish_fast5_io.cpp
@@ -208,6 +208,13 @@ std::string fast5_get_fixed_string_attribute(hid_t hdf5_file, const std::string&
int ret;
std::string out;
@@ -215,6 +215,13 @@ std::string fast5_get_fixed_string_attri
return "";
}
+ // according to http://hdf-forum.184993.n3.nabble.com/check-if-dataset-exists-td194725.html
+ // we should use H5Lexists to check for the existence of a group/dataset using an arbitrary path
......
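For illustration only (this is not part of the patch), the same guard can be expressed in Python with h5py, checking each link along the path before the group is opened; the file name and group path below are hypothetical:
```
import h5py

fast5_path = "example_read.fast5"             # hypothetical FAST5 file
group_path = "UniqueGlobalKey/context_tags"   # hypothetical group path

with h5py.File(fast5_path, "r") as f:
    node = f
    exists = True
    # walk the path one component at a time, mirroring an H5Lexists-style check
    for part in group_path.split("/"):
        if part in node:
            node = node[part]
        else:
            exists = False
            break
    attrs = dict(node.attrs) if exists else {}
print(attrs)
```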
......@@ -28,3 +28,9 @@ override_dh_auto_clean:
$(MAKE) clean
$(RM) .depend
mv Makefile~ Makefile
override_dh_install:
dh_install
for pl in `find debian -name "*.pl"` ; do \
sed -i '1s?^#!/usr/bin/env.*perl?#!/usr/bin/perl?' $${pl} ; \
done
......@@ -10,7 +10,7 @@ Modules available: ::
nanopolish variants: detect SNPs and indels with respect to a reference genome
nanopolish variants --consensus: calculate an improved consensus sequence for a draft genome assembly
nanopolish eventalign: align signal-level events to k-mers of a reference genome
nanopolish phase-reads: phase reads using heterozygous SNVs with respect to a reference genome
|
extract
......@@ -167,7 +167,7 @@ Usage example
* - ``-r``, ``--reads=FILE``
- Y
- NA
- the 2D ONT reads are in fasta FILE
- the ONT reads are in fasta FILE
* - ``-b``, ``--bam=FILE``
- Y
......@@ -253,7 +253,7 @@ Usage example
* - ``-r``, ``--reads=FILE``
- Y
- NA
- the 2D ONT reads are in fasta FILE
- the ONT reads are in fasta FILE
* - ``-b``, ``--bam=FILE``
- Y
......@@ -365,7 +365,7 @@ Usage example
* - ``-r, --reads=FILE``
- Y
- NA
- the 2D ONT reads are in fasta FILE
- the ONT reads are in fasta FILE
* - ``-b, --bam=FILE``
- Y
......@@ -411,3 +411,78 @@ Usage example
- N
- NA
- read alternative k-mer models from FILE
phase-reads - (experimental)
------------------------------
Overview
"""""""""""""""""""""""
Phase reads using heterozygous SNVs with respect to a reference genome
Input
"""""""""""""""""""""""
* basecalled reads
* alignment information
* assembled genome
* variants (from nanopolish variants or from other sources, e.g. an Illumina VCF)
Usage example
"""""""""""""""""""""""
::
nanopolish phase-reads [OPTIONS] --reads reads.fa --bam alignments.bam --genome genome.fa variants.vcf
.. list-table::
:widths: 20 10 20 50
:header-rows: 1
* - Argument name(s)
- Required
- Default value
- Description
* - ``-v``
- N
- NA
- write verbose output
* - ``-w, --window=STR``
- N
- NA
- Only phase reads in the window STR (format: ctg:start_id-end_id)
* - ``-r, --reads=FILE``
- Y
- NA
- the ONT reads are in fasta FILE
* - ``-b, --bam=FILE``
- Y
- NA
- the reads aligned to the genome assembly are in bam FILE
* - ``-g, --genome=FILE``
- Y
- NA
- the genome we are computing a consensus for is in FILE
* - ``variants.vcf``
- Y
- NA
- the variants (from nanopolish variants or Illumina in VCF format) to be phased are in FILE
* - ``-t, --threads=NUM``
- N
- 1
- use NUM threads
* - ``--progress``
- N
- NA
- print out a progress message
.. _quickstart_consensus:
Quickstart - how to polish the consensus assembly
Quickstart - how to polish a genome assembly
===================================================
The original purpose for nanopolish was to improve the consensus assembly accuracy for Oxford Nanopore Technology sequencing reads. Here we provide a step-by-step tutorial to help you get started with our tool.
The original purpose of nanopolish was to improve the consensus accuracy of an assembly of Oxford Nanopore Technology sequencing reads. Here we provide a step-by-step tutorial to help you get started.
**Requirements**:
* `nanopolish v0.8.4 <installation.html>`_
* `samtools v1.2 <https://htslib.org>`_
* `bwa v0.7.12 <https://github.com/lh3/bwa>`_
* `nanopolish <installation.html>`_
* `samtools <https://htslib.org>`_
* `minimap2 <https://github.com/lh3/minimap2>`_
* `MUMmer <https://github.com/mummer4/mummer>`_
Download example dataset
......@@ -29,8 +29,7 @@ You can download the example dataset we will use here: ::
* Region: "tig00000001:200000-202000"
* Note: Ligation-mediated PCR amplification performed
This is a subset of reads that aligned to a 2kb region in the E. coli draft assembly. To see how we generated these files please refer to the tutorial :ref:`creati
ng_example_dataset <here>`.
This is a subset of reads that aligned to a 2kb region in the E. coli draft assembly. To see how we generated these files please refer to the tutorial :ref:`creating_example_dataset <here>`.
You should find the following files:
......@@ -42,7 +41,7 @@ You should find the following files:
For the evaluation step you will need the reference genome: ::
wget -O ref.fa ftp://ftp.ncbi.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid225/U00096.ffn
curl -o ref.fa https://ftp.ncbi.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid225/U00096.ffn
Analysis workflow
-------------------------------
......@@ -50,7 +49,6 @@ Analysis workflow
The pipeline below describes the recommended analysis workflow for larger datasets. In this tutorial, we will run through the basic steps of the pipeline for this smaller (2kb) dataset.
.. figure:: _static/nanopolish-workflow.png
:scale: 90%
:alt: nanopolish-tutorial-workflow
Data preprocessing
......@@ -66,24 +64,20 @@ Compute the draft genome assembly using canu
-----------------------------------------------
As computing the draft genome assembly takes a few hours we have included the pre-assembled data for you (``draft.fa``).
We used the following parameters with `canu <canu.readthedocs.io>`_: ::
We used the following parameters with `canu <http://canu.readthedocs.io/en/latest/>`_: ::
canu \
-p ecoli -d outdir genomeSize=4.6m \
-nanopore-raw albacore-2.0.1-merged.fastq \
-nanopore-raw albacore-2.0.1-merged.fastq
Compute a new consensus sequence for a draft assembly
------------------------------------------------------------------------
Now that we have ``reads.fasta`` indexed with ``nanopolish index``, and have a draft genome assembly ``draft.fa``, we can begin to improve the assembly with nanopolish. Let us get started!
Now that we have ``reads.fasta`` indexed with ``nanopolish index``, and have a draft genome assembly ``draft.fa``, we can begin to improve the assembly with nanopolish. Let us get started!
First step, is to index the draft genome assembly. We can do that with the following command: ::
First, we align the original reads (``reads.fasta``) to the draft assembly (``draft.fa``) and sort alignments: ::
bwa index draft.fa
Next, we align the original reads (``reads.fasta``) to the draft assembly (``draft.fa``) and sort alignments: ::
bwa mem -x ont2d -t 8 draft.fa reads.fasta | samtools sort -o reads.sorted.bam -T reads.tmp
minimap2 -ax map-ont -t 8 draft.fa reads.fasta | samtools sort -o reads.sorted.bam -T reads.tmp
samtools index reads.sorted.bam
**Checkpoint**: we can do a quick check to see if this step worked. The bam file should not be empty. ::
......@@ -93,13 +87,24 @@ Next, we align the original reads (``reads.fasta``) to the draft assembly (``dra
Then we run the consensus algorithm. For larger datasets we use ``nanopolish_makerange.py`` to split the draft genome assembly into 50kb segments, so that we can run the consensus algorithm on each segment in parallel. The output would be the polished segments in ``fasta`` format.
Since our dataset is only covering a 2kb region, we skip this step and use the following command: ::
nanopolish variants --consensus polished.fa \
nanopolish variants --consensus -o polished.vcf \
-w "tig00000001:200000-202000" \
-r reads.fasta \
-b reads.sorted.bam \
-g draft.fa
We are left with our desired output: ``polished.fa``.
We are left with: ``polished.vcf``.
**Note**: As of v0.10.1, ``nanopolish variants --consensus`` only outputs a VCF file instead of a fasta sequence.
To generate the polished genome in fasta format: ::
nanopolish vcf2fasta --skip-checks -g draft.fa polished.vcf > polished_genome.fa
We only polished a 2kb region, so let's pull that out: ::
samtools faidx polished_genome.fa
samtools faidx polished_genome.fa "tig00000001:200000-202000" > polished.fa
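As an optional sanity check, we can count how many changes the consensus step proposed; a minimal Python sketch that reads the ``polished.vcf`` produced above (data lines in a VCF are those not starting with ``#``): ::

    num_variants = 0
    with open("polished.vcf") as vcf:
        for line in vcf:
            if not line.startswith("#"):
                num_variants += 1
    print("variant records proposed:", num_variants)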
Evaluate the assembly
---------------------------------
......
......@@ -12,9 +12,9 @@ The eventalign module in nanopolish is used to align events or "squiggles" to a
**Requirements**:
* `nanopolish v0.8.4 <installation.html>`_
* `samtools v1.2 <http://samtools.sourceforge.net/>`_
* `bwa v0.7.12 <https://github.com/lh3/bwa>`_
* `nanopolish <installation.html>`_
* `samtools <http://samtools.sourceforge.net/>`_
* `minimap2 <https://github.com/lh3/minimap2>`_
Download example dataset
------------------------------------
......@@ -42,18 +42,12 @@ You should find the following files:
You will need the E. coli reference genome: ::
wget -O ref.fa ftp://ftp.ncbi.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid225/U00096.ffn
curl -o ref.fa https://ftp.ncbi.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid225/U00096.ffn
Align the reads with bwa
Align the reads with minimap2
--------------------------------
In order to run bwa we first need to index the reference genome: ::
bwa index ref.fa
**Output files**: ``ref.fa.sa``, ``ref.fa.amb``, ``ref.fa.ann``, ``ref.fa.pac``, and ``ref.fa.bwt``.
We will need to index the reads as well: ::
We first need to index the reads: ::
nanopolish index -d fast5_files/ reads.fasta
......@@ -61,7 +55,7 @@ We will need to index the reads as well: ::
Then we can align the reads to the reference: ::
bwa mem -x ont2d ref.fa reads.fasta | samtools sort -o reads-ref.sorted.bam -T reads.tmp
minimap2 -ax map-ont -t 8 ref.fa reads.fasta | samtools sort -o reads-ref.sorted.bam -T reads.tmp
samtools index reads-ref.sorted.bam
**Output files**: ``reads-ref.sorted.bam`` and ``reads-ref.sorted.bam.bai``.
......
......@@ -48,7 +48,7 @@ csv_reader = csv.DictReader(in_fh, delimiter='\t')
for record in csv_reader:
num_sites = int(record['num_cpgs'])
num_sites = int(record['num_motifs'])
llr = float(record['log_lik_ratio'])
# Skip ambiguous call
......@@ -77,7 +77,7 @@ for record in csv_reader:
update_call_stats(key, num_sites, is_methylated, sequence)
# header
print("\t".join(["chromosome", "start", "end", "num_cpgs_in_group", "called_sites", "called_sites_methylated", "methylated_frequency", "group_sequence"]))
print("\t".join(["chromosome", "start", "end", "num_motifs_in_group", "called_sites", "called_sites_methylated", "methylated_frequency", "group_sequence"]))
sorted_keys = sorted(sites.keys(), key = lambda x: split_key(x))
......
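For downstream consumers of this script's output, the renamed column means reading `num_motifs_in_group` instead of `num_cpgs_in_group`; a small sketch, assuming the table printed above was saved to `methylation_frequency.tsv` (a hypothetical path):
```
import csv

with open("methylation_frequency.tsv") as fh:          # hypothetical output path
    for row in csv.DictReader(fh, delimiter="\t"):
        if int(row["num_motifs_in_group"]) >= 2:        # groups with at least two motifs
            print(row["chromosome"], row["start"], row["methylated_frequency"])
```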
from __future__ import print_function
import sys
import os.path
import os
......@@ -19,12 +21,12 @@ for model_file in sys.stdin:
functions.append(function_name)
outfiles.append(outfile)
print "// Autogenerated from convert_all_models.py"
print("// Autogenerated from convert_all_models.py")
for f in outfiles:
print "#include \"%s\"" % (f)
print("#include \"%s\"" % f)
print "\n// Autogenerated from convert_all_models.py"
print "const static std::vector<PoreModel> builtin_models({"
print "\t" + ",\n\t".join([f + "()" for f in functions])
print "});"
print("\n// Autogenerated from convert_all_models.py")
print("const static std::vector<PoreModel> builtin_models({")
print("\t" + ",\n\t".join([f + "()" for f in functions]))
print("});")
#! /usr/bin/env python
# Convert a .model file into a .inl file that can be directly compiled into nanopolish
from __future__ import print_function
import argparse
def assign(name, value):
assert(len(name) > 0)
assert(value is not None)
print "\ttmp.%s = %s;" % (name, value)
print("\ttmp.%s = %s;" % (name, value))
def quote(value):
assert(len(value) > 0)
......@@ -44,43 +45,43 @@ for line in f:
bases[b] = 1
# Preamble
print "// Autogenerated by convert_model_to_header.py"
print "#ifndef NANOPOLISH_%s_INL" % (args.function_name.upper())
print "#define NANOPOLISH_%s_INL" % (args.function_name.upper())
print("// Autogenerated by convert_model_to_header.py")
print("#ifndef NANOPOLISH_%s_INL" % args.function_name.upper())
print("#define NANOPOLISH_%s_INL" % args.function_name.upper())
data_name = "%s_data" % (args.function_name)
print "static std::vector<double> %s = {" % (data_name)
data_name = "%s_data" % args.function_name
print("static std::vector<double> %s = {" % data_name)
for ki, t in enumerate(model):
is_last = ki == len(model) - 1
sep = ',' if not is_last else ''
print "\t\t%.5f, %.5f, %.5f, %.5f%s // %s" % (float(t[1]), float(t[2]), float(t[3]), float(t[4]), sep, t[0])
print "};"
print("\t\t%.5f, %.5f, %.5f, %.5f%s // %s" % (float(t[1]), float(t[2]), float(t[3]), float(t[4]), sep, t[0]))
print("};")
print "PoreModel %s()\n{" % (args.function_name)
print "\tPoreModel tmp;"
print("PoreModel %s()\n{" % args.function_name)
print("\tPoreModel tmp;")
# Output metadata
assign("model_filename", quote(args.input))
assign("k", K)
num_states = len(model)
print "\ttmp.states.resize(%d);" % (num_states)
print "\tfor(size_t i = 0; i < %d; ++i) {" % (num_states)
print "\t\ttmp.states[i].level_mean = %s[4*i + 0];" % (data_name)
print "\t\ttmp.states[i].level_stdv = %s[4*i + 1];" % (data_name)
print "\t\ttmp.states[i].sd_mean = %s[4*i + 2];" % (data_name)
print "\t\ttmp.states[i].sd_stdv = %s[4*i + 3];" % (data_name)
print "\t\ttmp.states[i].update_sd_lambda();"
print "\t\ttmp.states[i].update_logs();"
print "\t}"
print("\ttmp.states.resize(%d);" % num_states)
print("\tfor(size_t i = 0; i < %d; ++i) {" % num_states)
print("\t\ttmp.states[i].level_mean = %s[4*i + 0];" % data_name)
print("\t\ttmp.states[i].level_stdv = %s[4*i + 1];" % data_name)
print("\t\ttmp.states[i].sd_mean = %s[4*i + 2];" % data_name)
print("\t\ttmp.states[i].sd_stdv = %s[4*i + 3];" % data_name)
print("\t\ttmp.states[i].update_sd_lambda();")
print("\t\ttmp.states[i].update_logs();")
print("\t}")
if "alphabet" in header_kv:
print "\ttmp.pmalphabet = get_alphabet_by_name(%s);" % (quote(header_kv["alphabet"]))
print("\ttmp.pmalphabet = get_alphabet_by_name(%s);" % (quote(header_kv["alphabet"])))
else:
print "\ttmp.pmalphabet = best_alphabet(%s);" % (quote("".join(bases)))
print("\ttmp.pmalphabet = best_alphabet(%s);" % (quote("".join(bases))))
print "\ttmp.set_metadata(%s, %s);" % (quote(header_kv["kit"]), quote(header_kv["strand"]))
print "\treturn tmp;\n}"
print "#endif"
print("\ttmp.set_metadata(%s, %s);" % (quote(header_kv["kit"]), quote(header_kv["strand"])))
print("\treturn tmp;\n}")
print("#endif")
#!/usr/bin/env python
'''
"""
========================================================
Extract info on reads that align to a given region
Extract info on reads that align to a given region
in draft genome assembly.
========================================================
'''
"""
from __future__ import print_function
try:
from Bio import SeqIO
from Bio.SeqIO.FastaIO import SimpleFastaParser
from Bio.SeqIO.QualityIO import FastqGeneralIterator
import pysam
import argparse
import subprocess
......@@ -144,27 +147,27 @@ def main():
if ".gz" in file_type:
with gzip.open(o_fa, "rt") as fin:
if "fasta.gz" in file_type:
for record in SeqIO.parse(fin, "fasta"):
for title, seq in SimpleFastaParser(fin):
name = title.split(None, 1)[0]
if record.id in region_read_ids:
fout.write(">" + record.id + "\n")
fout.write(str(record.seq) + "\n")
fout.write(">%s\n%s\n" % (name, seq))
elif "fastq.gz" in file_type:
for record in SeqIO.parse(fin, "fastq"):
for title, seq, qual in FastqGeneralIterator(fin):
name = title.split(None, 1)[0]
if record.id in region_read_ids:
fout.write(">" + record.id + "\n")
fout.write(str(record.seq) + "\n")
fout.write(">%s\n%s\n" % (name, seq))
else:
with open(o_fa, "rt") as fin:
if "fasta" in file_type:
for record in SeqIO.parse(fin, "fasta"):
for title, seq in SimpleFastaParser(fin):
name = title.split(None, 1)[0]
if record.id in region_read_ids:
fout.write(">" + record.id + "\n")
fout.write(str(record.seq) + "\n")
fout.write(">%s\n%s\n" % (name, seq))
elif "fastq" in file_type:
for record in SeqIO.parse(fin, "fastq"):
for title, seq, qual in FastqGeneralIterator(fin):
name = title.split(None, 1)[0]
if record.id in region_read_ids:
fout.write(">" + record.id + "\n")
fout.write(str(record.seq) + "\n")
fout.write(">%s\n%s\n" % (name, seq))
# --------------------------------------------------------
# PART 7: Let's get to tarring
......@@ -178,6 +181,11 @@ def main():
archive = tarfile.open(tar_filename, "w:gz")
custom_print( "[ Creating a tar.gz file ] \t" + tar_filename )
custom_print( "[+] FAST5 files: " + op + "/fast5_files/<FAST5 file(s)>" )
# track missing fast5 files
bad_f5_found = False # true if missing fast5 file
bad_read_id = ""
bad_f5_path = ""
num_bad_cases = 0
for r in region_fast5_files.keys():
read_id = r
f5 = region_fast5_files[r]
......@@ -185,7 +193,23 @@ def main():
# get basename of fast5 file
f5_basename = extract_basename(f5)
an = op + "/fast5_files/" + f5_basename
archive.add(f5, arcname=an)
try:
archive.add(f5, arcname=an)
except:
bad_f5_found = True
bad_read_id = read_id
bad_f5_path = f5
num_bad_cases += 1
# handle missing fast5 files
if bad_f5_found:
print("\nERROR: For read " + read_id + ", could not add " + str(f5) + ".")
print("This path is inferred from the readdb file.")
print("Please check that this is the correct path in readdb file for this read.")
if num_bad_cases > 1:
print("There are " + str(num_bad_cases) + " other reads with this problem (out of " + str(len(region_fast5_files)) + ").")
print("\n")
sys.exit(1)
# --------------------------------------------------------
# PART 8: Add new files to tar
......@@ -289,7 +313,7 @@ def detect_fa_filetype(fa_filename):
for ext in ['fastq.gz', 'fasta.gz', 'fastq', 'fasta']:
if path.endswith(ext):
return ext
print( "Must be either fasta, fastq, fasta.gz, fastq.gz" )
print("Must be either fasta, fastq, fasta.gz, fastq.gz")
sys.exit(1)
def custom_print(s):
......@@ -302,7 +326,7 @@ def custom_print(s):
global verbose
global log
if verbose:
print s
print(s)
log.append(s)
if __name__ == "__main__":
......
from __future__ import print_function
import sys
import argparse
from Bio import SeqIO
from Bio.SeqIO.FastaIO import SimpleFastaParser
parser = argparse.ArgumentParser(description='Partition a genome into a set of overlapping segments')
parser.add_argument('--segment-length', type=int, default=50000)
......@@ -10,7 +12,9 @@ if len(extra) != 1:
sys.stderr.write("Error: a genome file is expected\n")
filename = extra[0]
recs = [ (rec.name, len(rec.seq)) for rec in SeqIO.parse(open(filename), "fasta")]
with open(filename) as handle:
recs = [(title.split(None, 1)[0], len(seq))
for title, seq in SimpleFastaParser(handle)]
SEGMENT_LENGTH = args.segment_length
OVERLAP_LENGTH = args.overlap_length
......
from __future__ import print_function
import sys
import glob
from Bio import SeqIO
from Bio.SeqIO.FastaIO import SimpleFastaParser
from Bio import pairwise2
def merge_into_consensus(consensus, incoming, overlap_length):
......@@ -67,14 +69,15 @@ segments_by_name = dict()
# Load the polished segments into a dictionary keyed by the start coordinate
for fn in sys.argv[1:]:
for rec in SeqIO.parse(open(fn), "fasta"):
(contig, segment_range) = rec.name.split(":")
with open(fn) as handle:
for title, seq in SimpleFastaParser(handle):
(contig, segment_range) = title.split(None, 1)[0].split(":")
if contig not in segments_by_name:
segments_by_name[contig] = dict()
if contig not in segments_by_name:
segments_by_name[contig] = dict()
segment_start, segment_end = segment_range.split("-")
segments_by_name[contig][int(segment_start)] = str(rec.seq)
segment_start, segment_end = segment_range.split("-")
segments_by_name[contig][int(segment_start)] = seq
# Assemble while making sure every segment is present
for contig_name in sorted(segments_by_name.keys()):
......
"""
reestimate_polya_emissions.py: given two `polya-samples` TSV files based on different
underlying kmer models (with the newer TSV giving failing poly(A) segmentations),
infer the best new parameters for the HMM emissions.
Usage:
$ python reestimate_polya_emissions.py samples.old.tsv seg.old.tsv samples.new.tsv
where:
* `samples.old.tsv` is the output of `nanopolish polya -vv [...] | grep 'polya-samples'`,
generated by the **old** kmer models;
* `seg.old.tsv` is the output of `nanopolish polya -v [...] | grep 'polya-segmentation'`,
generated by the **old** kmer models;
* `samples.new.tsv` is the output of `nanopolish polya -vv [...] | grep 'polya-samples'`,
generated by the **new** kmer models.
Dependencies:
* numpy >= 1.11.2
* scipy >= 0.18.1
* sklearn >= 0.18.1
"""
import csv
import numpy as np
import argparse
import os
from scipy.stats import norm
from sklearn.mixture import GaussianMixture
log_inv_sqrt_2pi = np.log(0.3989422804014327)
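# log(1/sqrt(2*pi)) ~= -0.9189; log_normal_pdf below therefore returns the elementwise
# Gaussian log-density: log(1/sqrt(2*pi)) - log(sigma) - 0.5 * ((x - mu) / sigma) ** 2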
def log_normal_pdf(xs, mu, sigma):
"""Compute the log-normal PDF of a given sample(s) against a mu and sigma."""
alpha = (xs - mu) * np.reciprocal(sigma)
return ( log_inv_sqrt_2pi - np.log(sigma) + (-0.5 * alpha * alpha) )
def fit_gaussian(samples):
"""Given a numpy array of floating point samples, fit a gaussian distribution."""
mu, sigma = norm.fit(samples)
return (mu,sigma)
def fit_gmm(samples, ncomponents=2):
"""Given a numpy array of floating point samples, fit a gaussian mixture model."""
# assume samples is of shape (NSAMPLES,); unsqueeze to (NSAMPLES,1) and train a GMM:
gmm = GaussianMixture(n_components=ncomponents)
gmm.fit(samples.reshape(-1,1))
# return params of GMM in [(coeff, mu, sigma)] format:
params = [(gmm.weights_[c], gmm.means_[c][0], gmm.covariances_[c][0][0]) for c in range(ncomponents)]
return params
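# Illustrative usage only (synthetic data, not part of this script): fit_gmm returns a list
# of (weight, mean, variance) tuples, one per mixture component, e.g.
#   rng = np.random.RandomState(0)
#   samples = np.concatenate([rng.normal(80.0, 4.0, 300), rng.normal(105.0, 6.0, 200)])
#   for weight, mean, variance in fit_gmm(samples, ncomponents=2):
#       print("pi=%.2f mu=%.2f sigma=%.2f" % (weight, mean, np.sqrt(variance)))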
def old_tsv_to_numpy(tsv_path):
"""
Read a TSV containing raw samples and return a dictionary consisting
of the following numpy datasets:
* S_loglkhd: the log-likelihoods of the samples belonging to the START segment.
* L_loglkhd: the log-likelihoods of the samples belonging to the LEADER segment.
* A_loglkhd: the log-likelihoods of the samples belonging to the ADAPTER segment.
* P_loglkhd: the log-likelihoods of the samples belonging to the POLYA segment.
* T_loglkhd: the log-likelihoods of the samples belonging to the TRANSCRIPT segment.
"""
# instantiate arrays to hold values:
S_loglkhd = []
L_loglkhd = []
A_loglkhd = []
P_loglkhd = []
T_loglkhd = []
# loop over TSV file and append data to arrays:
str2int = { 'START': 0, 'LEADER': 1, 'ADAPTER': 2, 'POLYA': 3, 'TRANSCRIPT': 5 }
with open(tsv_path, 'r') as f:
headers = ['tag','read_id', 'chr', 'idx', 'sample', 'scaled_sample',
's_llh', 'l_llh','a_llh','p_llh', 'c_llh', 't_llh','region']
rdr = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE, fieldnames=headers)
for row in rdr:
# parse row fields:
s_llh = float(row['s_llh'])
l_llh = float(row['l_llh'])
a_llh = float(row['a_llh'])
p_llh = float(row['p_llh'])
t_llh = float(row['t_llh'])
region = row['region']
# append log-likelihoods to appropriate arrays:
if region == 'START':
S_loglkhd.append(s_llh)
if region == 'LEADER':
L_loglkhd.append(l_llh)
if region == 'ADAPTER':
A_loglkhd.append(a_llh)
if region == 'POLYA':
P_loglkhd.append(p_llh)
if region == 'TRANSCRIPT':
T_loglkhd.append(t_llh)
return { "S_loglkhd": np.array(S_loglkhd, dtype=float),
"L_loglkhd": np.array(L_loglkhd, dtype=float),
"A_loglkhd": np.array(A_loglkhd, dtype=float),
"P_loglkhd": np.array(P_loglkhd, dtype=float),
"T_loglkhd": np.array(T_loglkhd, dtype=float) }
def make_segmentation_dict(segmentations_tsv_path):
"""
Load a segmentations TSV file. Rows of `segmentations_tsv_path` look like this:
tag read_id: pos: L_0 A_0: P_0: P_1: RR: P(A)L: AL:
polya-segmentation fc06... 161684804 47.0 1851.0 8354.0 11424.0 73.76 75.18 35.23
Note that this function only takes the first available segmentation for each read, i.e.
if a read id appears more than once in the TSV, only the first segmentation is kept, and
later occurrences of the read id in the TSV are ignored.
"""
segments = {}
# loop thru TSV and update the list of segmentations:
with open(segmentations_tsv_path, 'r') as f:
headers = ['tag', 'read_id', 'pos', 'L_start', 'A_start', 'P_start', 'P_end', 'rate', 'plen', 'alen']
rdr = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE, fieldnames=headers)
for row in rdr:
if row['read_id'] not in segments.keys():
segments[row['read_id']] = { 'L_start': int(float(row['L_start'])),
'A_start': int(float(row['A_start'])),
'P_start': int(float(row['P_start'])),
'P_end': int(float(row['P_end'])) }
return segments
def region_search(read_id, sample_ix, segmentations):
"""
Given a dictionary of ("gold-standard") segmentations, look up the region that a
given read and sample index belongs to.
Returns an integer label out of 0, 1, 2, 3, 5, 6 where:
0 => START, 1 => LEADER, 2 => ADAPTER, 3 => POLYA, 5 => TRANSCRIPT, 6 => UNKNOWN
(We skip label '4' because it represents CLIFFs, which we don't track here --- they have
a uniform distribution.)
"""
# find read ID in segmentations:
read_key = None
for long_read_id in segmentations.keys():
if long_read_id[0:len(read_id)] == read_id:
read_key = long_read_id
# return UNK if read not found:
if read_key == None:
return 6
# find region that `sample_ix` belongs to:
l_start = segmentations[read_key]['L_start']
a_start = segmentations[read_key]['A_start']
p_start = segmentations[read_key]['P_start']
p_end = segmentations[read_key]['P_end']
if (sample_ix < l_start):
return 0
if (sample_ix < a_start):
return 1
if (sample_ix < p_start):
return 2
if (sample_ix <= p_end):
return 3
if (sample_ix > p_end):
return 5
return 6
def new_tsv_to_numpy(tsv_path, segmentations):
"""
Read a TSV of new, miscalled samples and a dictionary of correct segmentations (coming from
an older, correct TSV) and return a dict of numpy arrays.
Args:
* tsv_path: path to a TSV generated by `nanopolish polya -vv [...]`.
* segmentations: a dictionary of segmentation intervals, given in numpy format.
Returns: a dictionary of numpy arrays.
"""
# instantiate arrays to hold values:
S_samples = []
L_samples = []
A_samples = []
P_samples = []
T_samples = []
# loop over TSV file and append data to arrays:
with open(tsv_path, 'r') as f:
headers = ['tag','read_id', 'chr', 'idx', 'sample', 'scaled_sample',
's_llh', 'l_llh','a_llh','p_llh', 'c_llh', 't_llh','region']
rdr = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE, fieldnames=headers)
for row in rdr:
scaled_sample = float(row['scaled_sample'])
read = row['read_id']
contig = row['chr']
index = int(row['idx'])
region = region_search(read, index, segmentations)
if region == 0:
S_samples.append(scaled_sample)
if region == 1:
L_samples.append(scaled_sample)
if region == 2:
A_samples.append(scaled_sample)
if region == 3:
P_samples.append(scaled_sample)
if region == 5:
T_samples.append(scaled_sample)
return { "S_samples": np.array(S_samples, dtype=float),
"L_samples": np.array(L_samples, dtype=float),
"A_samples": np.array(A_samples, dtype=float),
"P_samples": np.array(P_samples, dtype=float),
"T_samples": np.array(T_samples, dtype=float) }
def main(old_samples_tsv, old_segmentations_tsv, new_samples_tsv, benchmark=True):
"""
Infer and print the new values for mu and sigma (for each of S, L, A, P, C, T) to STDOUT.
Args:
* old_samples_tsv: path to TSV file containing polya-samples data from an older kmer model.
* old_segmentations_tsv: path to TSV file containing polya-segmentation data from an older kmer model.
* new_samples_tsv: path to TSV file containing polya-samples data from the newer kmer model.
Returns: N/A, prints outputs to STDOUT.
"""
### read all samples into numpy arrays:
print("Loading data from TSV...")
old_data = old_tsv_to_numpy(old_samples_tsv)
segmentations = make_segmentation_dict(old_segmentations_tsv)
new_data = new_tsv_to_numpy(new_samples_tsv, segmentations)
print("... Datasets loaded.")
### infer best possible new mu,sigma for each of S, L, A, P, T:
print("Fitting gaussians to new scaled samples (this may take a while)...")
new_mu_S, new_sigma_S = fit_gaussian(new_data['S_samples'])
new_mu_L, new_sigma_L = fit_gaussian(new_data['L_samples'])
(new_pi0_A, new_mu0_A, new_sig0_A), (new_pi1_A, new_mu1_A, new_sig1_A) = fit_gmm(new_data['A_samples'], ncomponents=2)
new_mu_P, new_sigma_P = fit_gaussian(new_data['P_samples'])
(new_pi0_T, new_mu0_T, new_sig0_T), (new_pi1_T, new_mu1_T, new_sig1_T) = fit_gmm(new_data['T_samples'], ncomponents=2)
### print to stdout:
print("New params for START: mu = {0}, var = {1}, stdv = {2}".format(new_mu_S, new_sigma_S, np.sqrt(new_sigma_S)))
print("New params for LEADER: mu = {0}, var = {1}, stdv = {2}".format(new_mu_L, new_sigma_L, np.sqrt(new_sigma_L)))
print("New params for ADAPTER0: pi = {0}, mu = {1}, var = {2}, stdv = {3}".format(new_pi0_A, new_mu0_A, new_sig0_A, np.sqrt(new_sig0_A)))
print("New params for ADAPTER1: pi = {0}, mu = {1}, var = {2}, stdv = {3}".format(new_pi1_A, new_mu1_A, new_sig1_A, np.sqrt(new_sig1_A)))
print("New params for POLYA: mu = {0}, var = {1}, stdv = {2}".format(new_mu_P, new_sigma_P, np.sqrt(new_sigma_P)))
print("New params for TRANSCR0: pi = {0}, mu = {1}, var = {2}, stdv = {3}".format(new_pi0_T, new_mu0_T, new_sig0_T, np.sqrt(new_sig0_T)))
print("New params for TRANSCR1: pi = {0}, mu = {1}, var = {2}, stdv = {3}".format(new_pi1_T, new_mu1_T, new_sig1_T, np.sqrt(new_sig1_T)))
### optionally, benchmark:
if not benchmark:
return
print("===== Emission Log-Likelihood Benchmarks =====")
old_S_llh = np.mean(old_data['S_loglkhd'])
new_S_llh = np.mean(norm.logpdf(new_data['S_samples'], loc=new_mu_S, scale=np.sqrt(new_sigma_S)))
print("> Average START log-probs:")
print("> Old avg. log-likelihood: {0} | New avg. log-likelihood: {1}".format(old_S_llh, new_S_llh))
old_L_llh = np.mean(old_data['L_loglkhd'])
new_L_llh = np.mean(norm.logpdf(new_data['L_samples'], loc=new_mu_L, scale=np.sqrt(new_sigma_L)))
print("> Average LEADER log-probs:")
print("> Old avg. log-likelihood: {0} | New avg. log-likelihood: {1}".format(old_L_llh, new_L_llh))
old_A_llh = np.mean(old_data['A_loglkhd'])
new_A_llh0 = new_pi0_A * norm.pdf(new_data['A_samples'], loc=new_mu0_A, scale=np.sqrt(new_sig0_A))
new_A_llh1 = new_pi1_A * norm.pdf(new_data['A_samples'], loc=new_mu1_A, scale=np.sqrt(new_sig1_A))
new_A_llh = np.mean(np.log(new_A_llh0 + new_A_llh1))
print("> Average ADAPTER log-probs:")
print("> Old avg. log-likelihood: {0} | New avg. log-likelihood: {1}".format(old_A_llh, new_A_llh))
old_P_llh = np.mean(old_data['P_loglkhd'])
new_P_llh = np.mean(norm.logpdf(new_data['P_samples'], loc=new_mu_P, scale=np.sqrt(new_sigma_P)))
print("> Average POLYA log-probs:")
print("> Old avg. log-likelihood: {0} | New avg. log-likelihood: {1}".format(old_P_llh, new_P_llh))
old_T_llh = np.mean(old_data['T_loglkhd'])
new_T_llh0 = new_pi0_T * norm.pdf(new_data['T_samples'], loc=new_mu0_T, scale=np.sqrt(new_sig0_T))
new_T_llh1 = new_pi1_T * norm.pdf(new_data['T_samples'], loc=new_mu1_T, scale=np.sqrt(new_sig1_T))
new_T_llh = np.mean(np.log(new_T_llh0 + new_T_llh1))
print("> Average TRANSCRIPT log-probs:")
print("> Old avg. log-likelihood: {0} | New avg. log-likelihood: {1}".format(old_T_llh, new_T_llh))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Infer new Poly(A) emission parameters.")
parser.add_argument("old_samples_tsv", help="Path to TSV file of old samples.")
parser.add_argument("segmentation_tsv", help="Path to segmentations for reads.")
parser.add_argument("new_samples_tsv", help="Path to TSV file of new samples.")
parser.add_argument("--benchmark", default=True, type=bool, dest="benchmark",
help="If `--benchmark=False`, don't the new estimated HMM parameters.")
args = parser.parse_args()
# sanity checks:
assert os.path.exists(args.old_samples_tsv)
assert os.path.exists(args.segmentation_tsv)
assert os.path.exists(args.new_samples_tsv)
# run inference and (optional) benchmarking of new parameters:
main(args.old_samples_tsv, args.segmentation_tsv, args.new_samples_tsv, benchmark=args.benchmark)
......@@ -63,11 +63,13 @@ static const char *EVENTALIGN_USAGE_MESSAGE =
" -b, --bam=FILE the reads aligned to the genome assembly are in bam FILE\n"
" -g, --genome=FILE the genome we are computing a consensus for is in FILE\n"
" -t, --threads=NUM use NUM threads (default: 1)\n"
" -q, --min-mapping-quality=NUM only use reads with mapping quality at least NUM (default: 0)\n"
" --scale-events scale events to the model, rather than vice-versa\n"
" --progress print out a progress message\n"
" -n, --print-read-names print read names instead of indexes\n"
" --summary=FILE summarize the alignment of each read/strand in FILE\n"
" --samples write the raw samples for the event to the tsv output\n"
" --signal-index write the raw signal start and end index values for the event to the tsv output\n"
" --models-fofn=FILE read alternative k-mer models from FILE\n"
"\nReport bugs to " PACKAGE_BUGREPORT "\n\n";
......@@ -84,32 +86,36 @@ namespace opt
static int progress = 0;
static int num_threads = 1;
static int scale_events = 0;
static int batch_size = 128;
static int batch_size = 512;
static int min_mapping_quality = 0;
static bool print_read_names;
static bool full_output;
static bool write_samples = false;
static bool write_signal_index = false;
}
static const char* shortopts = "r:b:g:t:w:vn";
static const char* shortopts = "r:b:g:t:w:q:vn";
enum { OPT_HELP = 1, OPT_VERSION, OPT_PROGRESS, OPT_SAM, OPT_SUMMARY, OPT_SCALE_EVENTS, OPT_MODELS_FOFN, OPT_SAMPLES };
enum { OPT_HELP = 1, OPT_VERSION, OPT_PROGRESS, OPT_SAM, OPT_SUMMARY, OPT_SCALE_EVENTS, OPT_MODELS_FOFN, OPT_SAMPLES, OPT_SIGNAL_INDEX };
static const struct option longopts[] = {
{ "verbose", no_argument, NULL, 'v' },
{ "reads", required_argument, NULL, 'r' },
{ "bam", required_argument, NULL, 'b' },
{ "genome", required_argument, NULL, 'g' },
{ "window", required_argument, NULL, 'w' },
{ "threads", required_argument, NULL, 't' },
{ "summary", required_argument, NULL, OPT_SUMMARY },
{ "models-fofn", required_argument, NULL, OPT_MODELS_FOFN },
{ "print-read-names", no_argument, NULL, 'n' },
{ "samples", no_argument, NULL, OPT_SAMPLES },
{ "scale-events", no_argument, NULL, OPT_SCALE_EVENTS },
{ "sam", no_argument, NULL, OPT_SAM },
{ "progress", no_argument, NULL, OPT_PROGRESS },
{ "help", no_argument, NULL, OPT_HELP },
{ "version", no_argument, NULL, OPT_VERSION },
{ "verbose", no_argument, NULL, 'v' },
{ "reads", required_argument, NULL, 'r' },
{ "bam", required_argument, NULL, 'b' },
{ "genome", required_argument, NULL, 'g' },
{ "window", required_argument, NULL, 'w' },
{ "threads", required_argument, NULL, 't' },
{ "min-mapping-quality", required_argument, NULL, 'q' },
{ "summary", required_argument, NULL, OPT_SUMMARY },
{ "models-fofn", required_argument, NULL, OPT_MODELS_FOFN },
{ "print-read-names", no_argument, NULL, 'n' },
{ "samples", no_argument, NULL, OPT_SAMPLES },
{ "signal-index", no_argument, NULL, OPT_SIGNAL_INDEX },
{ "scale-events", no_argument, NULL, OPT_SCALE_EVENTS },
{ "sam", no_argument, NULL, OPT_SAM },
{ "progress", no_argument, NULL, OPT_PROGRESS },
{ "help", no_argument, NULL, OPT_HELP },
{ "version", no_argument, NULL, OPT_VERSION },
{ NULL, 0, NULL, 0 }
};
......@@ -147,7 +153,7 @@ struct EventalignSummary
};
//
const PoreModel* EventAlignmentParameters::get_model() const
const PoreModel* EventAlignmentParameters::get_model() const
{
if(this->alphabet == "") {
return this->sr->get_base_model(this->strand_idx);
......@@ -176,12 +182,12 @@ void trim_aligned_pairs_to_ref_region(std::vector<AlignedPair>& aligned_pairs, i
{
std::vector<AlignedPair> trimmed;
for(size_t i = 0; i < aligned_pairs.size(); ++i) {
if(aligned_pairs[i].ref_pos >= ref_start &&
if(aligned_pairs[i].ref_pos >= ref_start &&
aligned_pairs[i].ref_pos <= ref_end) {
trimmed.push_back(aligned_pairs[i]);
}
}
aligned_pairs.swap(trimmed);
}
......@@ -194,7 +200,7 @@ int get_end_pair(const std::vector<AlignedPair>& aligned_pairs, int ref_pos_max,
return pair_idx - 1;
pair_idx += 1;
}
return aligned_pairs.size() - 1;
}
......@@ -206,7 +212,7 @@ std::string get_reference_region_ts(const faidx_t* fai, const char* ref_name, in
char* cref_seq;
#pragma omp critical
cref_seq = faidx_fetch_seq(fai, ref_name, start, end, fetched_len);
assert(cref_seq != NULL);
std::string out(cref_seq);
......@@ -225,6 +231,10 @@ void emit_tsv_header(FILE* fp)
fprintf(fp, "%s\t%s\t%s\t%s\t", "event_index", "event_level_mean", "event_stdv", "event_length");
fprintf(fp, "%s\t%s\t%s\t%s", "model_kmer", "model_mean", "model_stdv", "standardized_level");
if(opt::write_signal_index) {
fprintf(fp, "\t%s\t%s", "start_idx", "end_idx");
}
if(opt::write_samples) {
fprintf(fp, "\t%s", "samples");
}
......@@ -284,7 +294,7 @@ std::vector<uint32_t> event_alignment_to_cigar(const std::vector<EventAlignment>
incoming = (r_step - 1) << BAM_CIGAR_SHIFT;
incoming |= BAM_CDEL;
out.push_back(incoming);
incoming = 1 << BAM_CIGAR_SHIFT;
incoming |= BAM_CMATCH;
} else {
......@@ -296,7 +306,7 @@ std::vector<uint32_t> event_alignment_to_cigar(const std::vector<EventAlignment>
// If the operation matches the previous, extend the length
// otherwise append a new op
if(bam_cigar_op(out.back()) == bam_cigar_op(incoming)) {
uint32_t sum = bam_cigar_oplen(out.back()) +
uint32_t sum = bam_cigar_oplen(out.back()) +
bam_cigar_oplen(incoming);
out.back() = sum << BAM_CIGAR_SHIFT | bam_cigar_op(incoming);
} else {
......@@ -313,13 +323,13 @@ std::vector<uint32_t> event_alignment_to_cigar(const std::vector<EventAlignment>
void emit_event_alignment_sam(htsFile* fp,
const SquiggleRead& sr,
const bam_hdr_t* base_hdr,
const bam1_t* base_record,
const bam1_t* base_record,
const std::vector<EventAlignment>& alignments)
{
if(alignments.empty())
return;
bam1_t* event_record = bam_init1();
// Variable-length data
std::string qname = sr.read_name + (alignments.front().strand_idx == 0 ? ".template" : ".complement");
......@@ -332,7 +342,7 @@ void emit_event_alignment_sam(htsFile* fp,
event_record->core.flag = alignments.front().rc ? 16 : 0;
event_record->core.l_qseq = 0;
event_record->core.mtid = -1;
event_record->core.mpos = -1;
event_record->core.isize = 0;
......@@ -345,23 +355,23 @@ void emit_event_alignment_sam(htsFile* fp,
event_record->core.n_cigar * 4 + // 4 bytes per cigar op
event_record->core.l_qseq + // query seq
event_record->core.l_qseq; // query quality
// nothing copied yet
event_record->l_data = 0;
// allocate data
event_record->data = (uint8_t*)malloc(event_record->m_data);
// copy q name
assert(event_record->core.l_qname <= event_record->m_data);
strncpy(bam_get_qname(event_record),
strncpy(bam_get_qname(event_record),
qname.c_str(),
event_record->core.l_qname);
event_record->l_data += event_record->core.l_qname;
// cigar
assert(event_record->l_data + event_record->core.n_cigar * 4 <= event_record->m_data);
memcpy(bam_get_cigar(event_record),
memcpy(bam_get_cigar(event_record),
&cigar[0],
event_record->core.n_cigar * 4);
event_record->l_data += event_record->core.n_cigar * 4;
......@@ -445,6 +455,11 @@ void emit_event_alignment_tsv(FILE* fp,
model_stdv,
standard_level);
if(opt::write_signal_index) {
std::pair<size_t, size_t> signal_idx = sr.get_event_sample_idx(ea.strand_idx, ea.event_idx);
fprintf(fp, "\t%zu\t%zu", signal_idx.first, signal_idx.second);
}
if(opt::write_samples) {
std::vector<float> samples = sr.get_scaled_samples_for_event(ea.strand_idx, ea.event_idx);
std::stringstream sample_ss;
......@@ -525,7 +540,13 @@ void realign_read(const ReadDB& read_db,
std::string read_name = bam_get_qname(record);
// load read
SquiggleRead sr(read_name, read_db, opt::write_samples ? SRF_LOAD_RAW_SAMPLES : 0);
int sr_flag;
if(( opt::write_samples) || (opt::write_signal_index)) {
sr_flag = SRF_LOAD_RAW_SAMPLES;
} else {
sr_flag = 0;
}
SquiggleRead sr(read_name, read_db, sr_flag);
if(opt::verbose > 1) {
fprintf(stderr, "Realigning %s [%zu %zu]\n",
......@@ -545,7 +566,7 @@ void realign_read(const ReadDB& read_db,
params.hdr = hdr;
params.record = record;
params.strand_idx = strand_idx;
params.read_idx = read_idx;
params.region_start = region_start;
params.region_end = region_end;
......@@ -596,12 +617,12 @@ std::vector<EventAlignment> align_read_to_ref(const EventAlignmentParameters& pa
int fetched_len = 0;
int ref_offset = params.record->core.pos;
std::string ref_name(params.hdr->target_name[params.record->core.tid]);
std::string ref_seq = get_reference_region_ts(params.fai, ref_name.c_str(), ref_offset,
std::string ref_seq = get_reference_region_ts(params.fai, ref_name.c_str(), ref_offset,
bam_endpos(params.record), &fetched_len);
// convert to upper case
std::transform(ref_seq.begin(), ref_seq.end(), ref_seq.begin(), ::toupper);
// k from read pore model
const uint32_t k = params.sr->get_model_k(params.strand_idx);
......@@ -641,12 +662,12 @@ std::vector<EventAlignment> align_read_to_ref(const EventAlignmentParameters& pa
// get the event range of the read to re-align
int read_kidx_start = aligned_pairs.front().read_pos;
int read_kidx_end = aligned_pairs.back().read_pos;
if(do_base_rc) {
read_kidx_start = params.sr->flip_k_strand(read_kidx_start, k);
read_kidx_end = params.sr->flip_k_strand(read_kidx_end, k);
}
assert(read_kidx_start >= 0);
assert(read_kidx_end >= 0);
......@@ -663,7 +684,7 @@ std::vector<EventAlignment> align_read_to_ref(const EventAlignmentParameters& pa
// Get the index of the aligned pair approximately align_stride away
int end_pair_idx = get_end_pair(aligned_pairs, curr_start_ref + align_stride, curr_pair_idx);
int curr_end_ref = aligned_pairs[end_pair_idx].ref_pos;
int curr_end_read = aligned_pairs[end_pair_idx].read_pos;
......@@ -680,7 +701,7 @@ std::vector<EventAlignment> align_read_to_ref(const EventAlignmentParameters& pa
assert(fwd_subseq.length() == rc_subseq.length());
HMMInputSequence hmm_sequence(fwd_subseq, rc_subseq, pore_model->pmalphabet);
// Require a minimum amount of sequence to align to
if(hmm_sequence.length() < 2 * k) {
break;
......@@ -708,7 +729,7 @@ std::vector<EventAlignment> align_read_to_ref(const EventAlignmentParameters& pa
input.rc = rc_flags[params.strand_idx];
std::vector<HMMAlignmentState> event_alignment = profile_hmm_align(hmm_sequence, input);
// Output alignment
size_t num_output = 0;
size_t event_align_idx = 0;
......@@ -736,14 +757,14 @@ std::vector<EventAlignment> align_read_to_ref(const EventAlignmentParameters& pa
int last_event_output = 0;
int last_ref_kmer_output = 0;
for(; event_align_idx < event_alignment.size() &&
for(; event_align_idx < event_alignment.size() &&
(num_output < output_stride || last_section); event_align_idx++) {
HMMAlignmentState& as = event_alignment[event_align_idx];
if(as.state != 'K' && (int)as.event_idx != curr_start_event) {
EventAlignment ea;
// ref
ea.ref_name = ref_name;
ea.ref_position = curr_start_ref + as.kmer_idx;
......@@ -807,8 +828,10 @@ void parse_eventalign_options(int argc, char** argv)
case 'b': arg >> opt::bam_file; break;
case '?': die = true; break;
case 't': arg >> opt::num_threads; break;
case 'q': arg >> opt::min_mapping_quality; break;
case 'n': opt::print_read_names = true; break;
case 'f': opt::full_output = true; break;
case OPT_SIGNAL_INDEX: opt::write_signal_index = true; break;
case OPT_SAMPLES: opt::write_samples = true; break;
case 'v': opt::verbose++; break;
case OPT_MODELS_FOFN: arg >> opt::models_fofn; break;
......@@ -843,7 +866,7 @@ void parse_eventalign_options(int argc, char** argv)
std::cerr << SUBPROGRAM ": a --reads file must be provided\n";
die = true;
}
if(opt::genome_file.empty()) {
std::cerr << SUBPROGRAM ": a --genome file must be provided\n";
die = true;
......@@ -859,7 +882,7 @@ void parse_eventalign_options(int argc, char** argv)
PoreModelSet::initialize(opt::models_fofn);
}
if (die)
if (die)
{
std::cout << "\n" << EVENTALIGN_USAGE_MESSAGE;
exit(EXIT_FAILURE);
......@@ -900,6 +923,7 @@ int eventalign_main(int argc, char** argv)
// bind the other parameters the worker function needs here
auto f = std::bind(realign_read, std::ref(read_db), std::ref(fai), std::ref(writer), _1, _2, _3, _4, _5);
BamProcessor processor(opt::bam_file, opt::region, opt::num_threads);
processor.set_min_mapping_quality(opt::min_mapping_quality);
// Copy the bam header to std
if(opt::output_sam) {
......
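To illustrate the new `--signal-index` columns, a small Python sketch; it assumes eventalign was run with `--signal-index` and its TSV output redirected to `eventalign.tsv` (a hypothetical path), and uses only column names that appear in `emit_tsv_header` above:
```
import csv

with open("eventalign.tsv") as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        # start_idx/end_idx delimit the raw signal samples spanned by this event
        n_raw = int(row["end_idx"]) - int(row["start_idx"])
        print(row["model_kmer"], row["event_level_mean"], row["model_mean"], n_raw)
```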
......@@ -239,6 +239,18 @@ class Alphabet
// does this alphabet contain all of the nucleotides in bases?
virtual bool contains_all(const char *bases) const = 0;
// check if the motif matches a recognition site
bool is_motif_match(const std::string& str, const size_t i) const
{
RecognitionMatch match;
for(size_t j = 0; j < num_recognition_sites(); ++j){
match = match_to_site(str, i, get_recognition_site(j), recognition_length());
if(match.length == recognition_length())
return true;
}
return false;
}
};
#define BASIC_MEMBER_BOILERPLATE \
......
......@@ -16,10 +16,12 @@
BamProcessor::BamProcessor(const std::string& bam_file,
const std::string& region,
const int num_threads) :
const int num_threads,
const int batch_size) :
m_bam_file(bam_file),
m_region(region),
m_num_threads(num_threads)
m_num_threads(num_threads),
m_batch_size(batch_size)
{
// load bam file
......@@ -98,11 +100,11 @@ void BamProcessor::parallel_run( std::function<void(const bam_hdr_t* hdr,
// realign if we've hit the max buffer size or reached the end of file
if(num_records_buffered == records.size() || result < 0 || (num_records_buffered + num_reads_realigned == m_max_reads)) {
#pragma omp parallel for
#pragma omp parallel for schedule(dynamic)
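// note: dynamic scheduling hands reads to threads as they become free, which balances
// the uneven per-read alignment cost better than the default static schedule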
for(size_t i = 0; i < num_records_buffered; ++i) {
bam1_t* record = records[i];
size_t read_idx = num_reads_realigned + i;
if( (record->core.flag & BAM_FUNMAP) == 0) {
if( (record->core.flag & BAM_FUNMAP) == 0 && record->core.qual >= m_min_mapping_quality) {
func(m_hdr, record, read_idx, clip_start, clip_end);
}
}
......