Commits on Source (5)
......@@ -52,7 +52,9 @@ ifeq ($(HTS), install)
HTS_INCLUDE=-I./htslib
else
# Use system-wide htslib
-HTS_LIB=-lhts
+HTS_LIB=
+HTS_INCLUDE=
+LIBS += -lhts
endif
# Include the header-only fast5 library
......
......@@ -59,22 +59,15 @@ nanopolish eventalign: align signal-level events to k-mers of a reference genome
Nanopolish needs access to the signal-level data measured by the nanopore sequencer. The first step of any nanopolish workflow is to prepare the input data by telling nanopolish where to find the signal files. If you ran Albacore 2.0 on your data you should run `nanopolish index` on your input reads (-d can be specified more than once if using multiple runs):
```
-# Only run this if you used Albacore 2.0 or later
-nanopolish index -d /path/to/raw_fast5s/ albacore_output.fastq
+# Index the output of the albacore basecaller
+nanopolish index -d /path/to/raw_fast5s/ -s sequencing_summary.txt albacore_output.fastq
```
-If you basecalled your reads with Albacore 1.2 or earlier, you should run `nanopolish extract` on your input reads instead:
-```
-# Only run this if you used Albacore 1.2 or earlier
-nanopolish extract --type template directory/pass/ > reads.fa
-```
-Note these two commands are mutually exclusive - you only need to run one of them. You need to decide which command to run depending on the version of Albacore that you used. In the following sections we assume you have preprocessed the data by following the instructions above and that your reads are in a file named `reads.fa`.
+The `-s` option tells nanopolish to read the `sequencing_summary.txt` file from Albacore to speed up indexing. Without this option `nanopolish index` is extremely slow, as it needs to read every fast5 file individually. If you basecalled your run in parallel and therefore have multiple `sequencing_summary.txt` files, you can use the `-f` option to pass in a file containing the paths to the sequencing summary files (one per line).
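For context, the sequencing summary is a tab-separated table whose `filename` and `read_id` columns let nanopolish map each fast5 file to its read without opening it; this is what the `parse_sequencing_summary` function added to `nanopolish_index.cpp` further down in this changeset does. A minimal standalone sketch of that mapping (the input path is hypothetical):
```cpp
// Sketch: build the fast5-filename -> read-id map that -s provides.
// Mirrors parse_sequencing_summary below; the summary path is hypothetical.
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

static std::vector<std::string> split_tabs(const std::string& line)
{
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string f;
    while(std::getline(ss, f, '\t')) {
        fields.push_back(f);
    }
    return fields;
}

int main()
{
    std::ifstream in("sequencing_summary.txt"); // hypothetical path
    std::string header;
    std::getline(in, header);

    // locate the filename and read_id columns from the header
    std::vector<std::string> fields = split_tabs(header);
    size_t filename_idx = fields.size(), read_id_idx = fields.size();
    for(size_t i = 0; i < fields.size(); ++i) {
        if(fields[i] == "filename") filename_idx = i;
        if(fields[i] == "read_id") read_id_idx = i;
    }
    if(filename_idx == fields.size() || read_id_idx == fields.size()) {
        std::cerr << "missing filename/read_id column\n";
        return 1;
    }

    // map each fast5 basename to the read it contains
    std::map<std::string, std::string> fast5_to_read_name;
    std::string line;
    while(std::getline(in, line)) {
        std::vector<std::string> row = split_tabs(line);
        fast5_to_read_name[row[filename_idx]] = row[read_id_idx];
    }
    std::cout << "indexed " << fast5_to_read_name.size() << " reads\n";
    return 0;
}
```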
### Computing a new consensus sequence for a draft assembly
-The original purpose of nanopolish was to compute an improved consensus sequence for a draft genome assembly produced by a long-read assembler like [canu](https://github.com/marbl/canu). This section describes how to do this, starting with your draft assembly which should have megabase-sized contigs.
+The original purpose of nanopolish was to compute an improved consensus sequence for a draft genome assembly produced by a long-read assembler like [canu](https://github.com/marbl/canu). This section describes how to do this, starting with your draft assembly, which should have megabase-sized contigs. We've also posted a tutorial including example data [here](http://nanopolish.readthedocs.io/en/latest/quickstart_consensus.html).
```
# Index the draft genome
......@@ -102,24 +95,7 @@ python nanopolish_merge.py polished.*.fa > polished_genome.fa
## Calling Methylation
-nanopolish can use the signal-level information measured by the sequencer to detect 5-mC as described [here](http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.4184.html). Here's how you run it:
-```
-# Align the basecalled reads to a reference genome
-bwa mem -x ont2d -t 8 reference.fa reads.fa | samtools sort -o reads.sorted.bam -T reads.tmp -
-samtools index reads.sorted.bam
-# Run the methylation caller
-nanopolish call-methylation -t 8 -r reads.fa -g reference.fa -b reads.sorted.bam > methylation.tsv
-```
-The output of call-methylation is a tab-separated file containing per-read log-likelihood ratios (positive values indicate more evidence for 5-mC, negative values indicate more evidence for C). Each line contains the name of the read that covered the CpG site, which allows methylation calls to be phased. We have provided a script to calculate per-site methylation frequencies using call-methylation's output:
-```
-python /path/to/nanopolish/scripts/calculate_methylation_frequency.py -c 2.5 -i methylation.tsv > frequencies.tsv
-```
-The output of this script is a tab-separated file containing the genomic position of the CpG site, the number of reads that covered the site, and the percentage of those reads that were predicted to be methylated. The `-c 2.5` option requires the absolute value of the log-likelihood ratio to be at least 2.5 to make a call; otherwise the read will be ignored. This helps reduce calling errors, as only sites with sufficient evidence are included in the calculation.
+nanopolish can use the signal-level information measured by the sequencer to detect 5-mC as described [here](http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.4184.html). We've posted a tutorial on how to call methylation [here](http://nanopolish.readthedocs.io/en/latest/quickstart_call_methylation.html).
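The filtering rule behind the frequency script described above is worth keeping in mind when interpreting its output: a read's call at a site is counted only when the absolute log-likelihood ratio clears the `-c` threshold, and it counts as methylated when the ratio is positive. A minimal C++ sketch of that aggregation (the in-memory record layout is hypothetical; the real implementation is `scripts/calculate_methylation_frequency.py`):
```cpp
// Sketch of per-site methylation frequency aggregation with a -c style
// threshold. SiteCall is a hypothetical record, not the script's TSV layout.
#include <cmath>
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct SiteCall {
    std::string chrom; // reference contig of the CpG site
    int position;      // reference coordinate of the CpG site
    double llr;        // per-read log-likelihood ratio from call-methylation
};

int main()
{
    const double call_threshold = 2.5; // the -c option
    std::vector<SiteCall> calls = { { "chr20", 5000, 4.1 },
                                    { "chr20", 5000, -3.2 },
                                    { "chr20", 5000, 0.4 } }; // |llr| < 2.5: ignored

    // (contig, position) -> (reads called, reads called methylated)
    std::map<std::pair<std::string, int>, std::pair<int, int> > site_counts;
    for(const SiteCall& c : calls) {
        if(std::fabs(c.llr) < call_threshold) {
            continue; // insufficient evidence; skip this read at this site
        }
        std::pair<int, int>& counts = site_counts[std::make_pair(c.chrom, c.position)];
        counts.first += 1;
        counts.second += c.llr > 0 ? 1 : 0;
    }

    // report site, coverage by confident calls, and methylated fraction
    for(const auto& it : site_counts) {
        printf("%s\t%d\t%d\t%.3f\n", it.first.first.c_str(), it.first.second,
               it.second.first, (double)it.second.second / it.second.first);
    }
    return 0;
}
```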
## To run using docker
......
nanopolish (0.9.0-1) UNRELEASED; urgency=medium
* New upstream version 0.9.0
* Refresh patches
* Bump upstream copyright year
-- Afif Elghraoui <afif@debian.org> Sat, 17 Feb 2018 18:06:03 -0500
nanopolish (0.8.5-2) unstable; urgency=low
* Remove dependency on nonexistent package sse-support [i386]
......
......@@ -4,7 +4,7 @@ Upstream-Contact: Jared Simpson <jared.simpson@oicr.on.ca>
Source: https://github.com/jts/nanopolish
Files: *
-Copyright: 2015-2017 Ontario Institute for Cancer Research
+Copyright: 2015-2018 Ontario Institute for Cancer Research
License: MIT
Files: src/test/catch.hpp
......
......@@ -4,7 +4,7 @@ Description: make build reproducible
Author: Sascha Steinbiss <satta@debian.org>
--- nanopolish.orig/Makefile
+++ nanopolish/Makefile
-@@ -99,8 +99,8 @@
+@@ -101,8 +101,8 @@
#
# Find the source files by searching subdirectories
......
......@@ -12,7 +12,7 @@ Oxford Nanopore sequencers are sensitive to base modifications. Here we provide
**Requirements**:
* `nanopolish v0.8.4 <installation.html>`_
-* `samtools v1.2 <http://samtools.sourceforge.net/>`_
+* `samtools v1.2 <https://htslib.org>`_
* `minimap2 <https://github.com/lh3/minimap2>`_
Download example dataset
......
......@@ -8,7 +8,7 @@ The original purpose for nanopolish was to improve the consensus assembly accura
**Requirements**:
* `nanopolish v0.8.4 <installation.html>`_
-* `samtools v1.2 <http://samtools.sourceforge.net/>`_
+* `samtools v1.2 <https://htslib.org>`_
* `bwa v0.7.12 <https://github.com/lh3/bwa>`_
* `MUMmer <https://github.com/mummer4/mummer>`_
......
......@@ -85,7 +85,7 @@ for contig_name in sorted(segments_by_name.keys()):
# Ensure the segments overlap
if not( prev_segment is None or prev_segment + SEGMENT_LENGTH + OVERLAP_LENGTH > segment_start ):
sys.stderr.write("error: segment starting at %s:%d is missing\n" % (contig, prev_segment + SEGMENT_LENGTH + 40))
sys.stderr.write("error: segment starting at %s:%d is missing\n" % (contig_name, prev_segment + SEGMENT_LENGTH + 40))
all_segments_found = False
sequence = segments_by_name[contig_name][segment_start]
......@@ -98,5 +98,5 @@ for contig_name in sorted(segments_by_name.keys()):
if all_segments_found:
print(">%s\n%s" % (contig_name, assembly))
else:
sys.stderr.write("error: some segments are missing, could not merge contig %s\n" % (contig))
sys.stderr.write("error: some segments are missing, could not merge contig %s\n" % (contig_name))
sys.exit(1)
......@@ -73,12 +73,21 @@ EventAlignmentRecord::EventAlignmentRecord(SquiggleRead* sr,
size_t kmer_pos_ref_strand = seq_record.aligned_bases[i].read_pos;
size_t kmer_pos_read_strand = seq_record.rc ? this->sr->flip_k_strand(kmer_pos_ref_strand, k) : kmer_pos_ref_strand;
size_t event_idx = this->sr->get_closest_event_to(kmer_pos_read_strand, strand_idx);
assert(event_idx != -1);
this->aligned_events.push_back( { seq_record.aligned_bases[i].ref_pos, (int)event_idx });
}
this->rc = strand_idx == 0 ? seq_record.rc : !seq_record.rc;
this->strand = strand_idx;
if(!this->aligned_events.empty()) {
this->stride = this->aligned_events.front().read_pos < this->aligned_events.back().read_pos ? 1 : -1;
// check for a degenerate alignment and discard the events if so
if(this->aligned_events.front().read_pos == this->aligned_events.back().read_pos) {
this->aligned_events.clear();
}
} else {
this->stride = 1;
}
}
//
......@@ -385,6 +394,9 @@ void AlignmentDB::load_region(const std::string& contig,
char* ref_segment = faidx_fetch_seq(fai, m_region_contig.c_str(), m_region_start, m_region_end, &fetched_len);
m_region_ref_sequence = ref_segment;
// ensure reference sequence is upper case
std::transform(m_region_ref_sequence.begin(), m_region_ref_sequence.end(), m_region_ref_sequence.begin(), ::toupper);
// load base-space alignments
m_sequence_records = _load_sequence_by_region(m_sequence_bam);
......@@ -548,11 +560,14 @@ std::vector<EventAlignmentRecord> AlignmentDB::_load_events_by_region_from_bam(c
std::vector<EventAlignmentRecord> AlignmentDB::_load_events_by_region_from_read(const std::vector<SequenceAlignmentRecord>& sequence_records)
{
std::vector<EventAlignmentRecord> records;
#pragma omp parallel for
for(size_t i = 0; i < sequence_records.size(); ++i) {
const SequenceAlignmentRecord& seq_record = sequence_records[i];
// conditionally load the squiggle read if it hasn't been loaded already
_load_squiggle_read(seq_record.read_name);
for(size_t si = 0; si < NUM_STRANDS; ++si) {
// skip complement
......@@ -566,6 +581,7 @@ std::vector<EventAlignmentRecord> AlignmentDB::_load_events_by_region_from_read(
continue;
}
#pragma omp critical
records.emplace_back(sr, si, seq_record);
}
}
......@@ -610,7 +626,11 @@ void AlignmentDB::_load_squiggle_read(const std::string& read_name)
{
// Do we need to load this fast5 file?
if(m_squiggle_read_map.find(read_name) == m_squiggle_read_map.end()) {
// Allow the load to happen in parallel but lock access to adding it into the map
SquiggleRead* sr = new SquiggleRead(read_name, m_read_db);
#pragma omp critical
m_squiggle_read_map[read_name] = sr;
}
}
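The function above does the expensive SquiggleRead construction outside the critical section and serializes only the map insertion. Below is a self-contained sketch of that load-then-publish idiom; `Resource`, `load_resource` and `g_cache` are illustrative names, and the sketch adds a re-check inside the critical section (which the code above omits) so that two threads that miss the lookup at the same time do not both publish an object:
```cpp
// Sketch of the load-then-publish idiom used above. Resource, load_resource
// and g_cache are illustrative names, not part of nanopolish.
#include <cstdio>
#include <map>
#include <string>

struct Resource { std::string data; };

static std::map<std::string, Resource*> g_cache;

static Resource* load_resource(const std::string& key)
{
    return new Resource{ "payload for " + key }; // stands in for the SquiggleRead load
}

Resource* get_or_load(const std::string& key)
{
    Resource* result = NULL;
    bool present = false;
    #pragma omp critical(cache)
    {
        std::map<std::string, Resource*>::iterator it = g_cache.find(key);
        if(it != g_cache.end()) {
            present = true;
            result = it->second;
        }
    }
    if(!present) {
        Resource* r = load_resource(key); // expensive work, runs in parallel
        #pragma omp critical(cache)
        {
            std::map<std::string, Resource*>::iterator it = g_cache.find(key);
            if(it == g_cache.end()) {
                g_cache[key] = r; // publish under the lock
                result = r;
            } else {
                result = it->second; // another thread won the race
                delete r;
            }
        }
    }
    return result;
}

int main()
{
    #pragma omp parallel for
    for(int i = 0; i < 8; ++i) {
        get_or_load("read_" + std::to_string(i % 3));
    }
    printf("cached %zu resources\n", g_cache.size());
    return 0;
}
```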
......
......@@ -599,6 +599,9 @@ std::vector<EventAlignment> align_read_to_ref(const EventAlignmentParameters& pa
std::string ref_seq = get_reference_region_ts(params.fai, ref_name.c_str(), ref_offset,
bam_endpos(params.record), &fetched_len);
// convert to upper case
std::transform(ref_seq.begin(), ref_seq.end(), ref_seq.begin(), ::toupper);
// k from read pore model
const uint32_t k = params.sr->get_model_k(params.strand_idx);
......
......@@ -18,7 +18,7 @@
#include "logsum.h"
#define PACKAGE_NAME "nanopolish"
#define PACKAGE_VERSION "0.8.5"
#define PACKAGE_VERSION "0.9.0"
#define PACKAGE_BUGREPORT "https://github.com/jts/nanopolish/issues"
//
......
//---------------------------------------------------------
// Copyright 2018 Ontario Institute for Cancer Research
// Written by Jared Simpson (jared.simpson@oicr.on.ca)
//---------------------------------------------------------
//
// nanopolish_fast5_io -- lightweight functions
// to read specific data from fast5 files
//
#include <string.h>
#include <math.h>
#include "nanopolish_fast5_io.h"
//#define DEBUG_FAST5_IO 1
#define RAW_ROOT "/Raw/Reads/"
int verbose = 0;
//
hid_t fast5_open(const std::string& filename)
{
hid_t hdf5file = H5Fopen(filename.c_str(), H5F_ACC_RDONLY, H5P_DEFAULT);
return hdf5file;
}
//
void fast5_close(hid_t hdf5_file)
{
H5Fclose(hdf5_file);
}
//
std::string fast5_get_raw_read_group(hid_t hdf5_file)
{
std::string read_name = fast5_get_raw_read_name(hdf5_file);
if(read_name != "") {
return std::string(RAW_ROOT) + read_name;
} else {
return "";
}
}
//
std::string fast5_get_raw_read_name(hid_t hdf5_file)
{
// This code is From scrappie's fast5_interface
// retrieve the size of the read name
ssize_t size =
H5Lget_name_by_idx(hdf5_file, RAW_ROOT, H5_INDEX_NAME, H5_ITER_INC, 0, NULL,
0, H5P_DEFAULT);
if (size < 0) {
return "";
}
// copy the read name out of the fast5
char* name = (char*)calloc(1 + size, sizeof(char));
H5Lget_name_by_idx(hdf5_file, RAW_ROOT, H5_INDEX_NAME, H5_ITER_INC, 0, name,
1 + size, H5P_DEFAULT);
// cleanup
std::string out(name);
free(name);
return out;
}
//
std::string fast5_get_read_id(hid_t hdf5_file)
{
std::string out = "";
// Get the path to the raw read group
std::string raw_read_group = fast5_get_raw_read_group(hdf5_file);
if(raw_read_group == "") {
return out;
}
return fast5_get_fixed_string_attribute(hdf5_file, raw_read_group, "read_id");
}
//
raw_table fast5_get_raw_samples(hid_t hdf5_file, fast5_raw_scaling scaling)
{
float* rawptr = NULL;
hid_t space;
hsize_t nsample;
herr_t status;
float raw_unit;
raw_table rawtbl = { 0, 0, 0, NULL };
// mostly from scrappie
std::string raw_read_group = fast5_get_raw_read_group(hdf5_file);
// Create data set name
std::string signal_path = raw_read_group + "/Signal";
hid_t dset = H5Dopen(hdf5_file, signal_path.c_str(), H5P_DEFAULT);
if (dset < 0) {
#ifdef DEBUG_FAST5_IO
fprintf(stderr, "Failed to open dataset '%s' to read raw signal from.\n", signal_path.c_str());
#endif
goto cleanup2;
}
space = H5Dget_space(dset);
if (space < 0) {
fprintf(stderr, "Failed to create copy of dataspace for raw signal %s.\n", signal_path.c_str());
goto cleanup3;
}
H5Sget_simple_extent_dims(space, &nsample, NULL);
rawptr = (float*)calloc(nsample, sizeof(float));
status = H5Dread(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, rawptr);
if (status < 0) {
free(rawptr);
#ifdef DEBUG_FAST5_IO
fprintf(stderr, "Failed to read raw data from dataset %s.\n", signal_path.c_str());
#endif
goto cleanup4;
}
// convert to pA
rawtbl = (raw_table) { nsample, 0, nsample, rawptr };
raw_unit = scaling.range / scaling.digitisation;
for (size_t i = 0; i < nsample; i++) {
rawptr[i] = (rawptr[i] + scaling.offset) * raw_unit;
}
cleanup4:
H5Sclose(space);
cleanup3:
H5Dclose(dset);
cleanup2:
return rawtbl;
}
//
std::string fast5_get_experiment_type(hid_t hdf5_file)
{
return fast5_get_fixed_string_attribute(hdf5_file, "/UniqueGlobalKey/context_tags", "experiment_type");
}
// from scrappie
float fast5_read_float_attribute(hid_t group, const char *attribute) {
float val = NAN;
if (group < 0) {
#ifdef DEBUG_FAST5_IO
fprintf(stderr, "Invalid group passed to %s:%d.", __FILE__, __LINE__);
#endif
return val;
}
hid_t attr = H5Aopen(group, attribute, H5P_DEFAULT);
if (attr < 0) {
#ifdef DEBUG_FAST5_IO
fprintf(stderr, "Failed to open attribute '%s' for reading.", attribute);
#endif
return val;
}
H5Aread(attr, H5T_NATIVE_FLOAT, &val);
H5Aclose(attr);
return val;
}
//
fast5_raw_scaling fast5_get_channel_params(hid_t hdf5_file)
{
// from scrappie
fast5_raw_scaling scaling = { NAN, NAN, NAN, NAN };
const char *scaling_path = "/UniqueGlobalKey/channel_id";
hid_t scaling_group = H5Gopen(hdf5_file, scaling_path, H5P_DEFAULT);
if (scaling_group < 0) {
#ifdef DEBUG_FAST5_IO
fprintf(stderr, "Failed to group %s.", scaling_path);
#endif
return scaling;
}
scaling.digitisation = fast5_read_float_attribute(scaling_group, "digitisation");
scaling.offset = fast5_read_float_attribute(scaling_group, "offset");
scaling.range = fast5_read_float_attribute(scaling_group, "range");
scaling.sample_rate = fast5_read_float_attribute(scaling_group, "sampling_rate");
H5Gclose(scaling_group);
return scaling;
}
//
// Internal functions
//
//
std::string fast5_get_fixed_string_attribute(hid_t hdf5_file, const std::string& group_name, const std::string& attribute_name)
{
size_t storage_size;
char* buffer;
hid_t group, attribute, attribute_type;
int ret;
std::string out;
// Open the group /Raw/Reads/Read_nnn
group = H5Gopen(hdf5_file, group_name.c_str(), H5P_DEFAULT);
if(group < 0) {
#ifdef DEBUG_FAST5_IO
fprintf(stderr, "could not open group %s\n", group_name.c_str());
#endif
goto close_group;
}
// Ensure attribute exists
ret = H5Aexists(group, attribute_name.c_str());
if(ret <= 0) {
goto close_group;
}
// Open the attribute
attribute = H5Aopen(group, attribute_name.c_str(), H5P_DEFAULT);
if(attribute < 0) {
#ifdef DEBUG_FAST5_IO
fprintf(stderr, "could not open attribute: %s\n", attribute_name.c_str());
#endif
goto close_attr;
}
// Get data type and check it is a fixed-length string
attribute_type = H5Aget_type(attribute);
if(H5Tis_variable_str(attribute_type)) {
#ifdef DEBUG_FAST5_IO
fprintf(stderr, "variable length string detected -- ignoring attribute\n");
#endif
goto close_type;
}
// Get the storage size and allocate
storage_size = H5Aget_storage_size(attribute);
buffer = (char*)calloc(storage_size + 1, sizeof(char));
// finally read the attribute
ret = H5Aread(attribute, attribute_type, buffer);
if(ret >= 0) {
out = buffer;
}
// clean up
free(buffer);
close_type:
H5Tclose(attribute_type);
close_attr:
H5Aclose(attribute);
close_group:
H5Gclose(group);
return out;
}
//---------------------------------------------------------
// Copyright 2018 Ontario Institute for Cancer Research
// Written by Jared Simpson (jared.simpson@oicr.on.ca)
//---------------------------------------------------------
//
// nanopolish_fast5_io -- lightweight functions
// to read specific data from fast5 files. Some functions
// ported from scrappie's fast5_interface.c
//
#ifndef NANOPOLISH_FAST5_IO_H
#define NANOPOLISH_FAST5_IO_H
#include <string>
#include <vector>
#include <hdf5.h>
extern "C" {
#include "event_detection.h"
#include "scrappie_common.h"
}
// From scrappie
typedef struct {
// Information for scaling raw data from ADC values to pA
float digitisation;
float offset;
float range;
float sample_rate;
} fast5_raw_scaling;
//
// External API
//
// open the file and return the hdf ID
hid_t fast5_open(const std::string& filename);
// close the file
void fast5_close(hid_t hdf5_file);
// get the raw samples from this file
raw_table fast5_get_raw_samples(hid_t hdf5_file, fast5_raw_scaling scaling);
// get the name of the raw read in the file (eg Read_1234)
std::string fast5_get_raw_read_name(hid_t hdf5_file);
// get the name of the raw read group (eg /Raw/Read/Read_1234)
std::string fast5_get_raw_read_group(hid_t hdf5_file);
// Get the identifier of a read from the hdf5 file
std::string fast5_get_read_id(hid_t hdf5_file);
// Get the experiment type attribute
std::string fast5_get_experiment_type(hid_t hdf5_file);
// Get sample rate, and ADC-to-pA scalings
fast5_raw_scaling fast5_get_channel_params(hid_t hdf5_file);
//
// Internal utility functions
//
std::string fast5_get_fixed_string_attribute(hid_t hdf5_file, const std::string& group_name, const std::string& attribute_name);
#endif
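For reference, here is a minimal sketch of how the external API above fits together; the file name is hypothetical and error handling stops at the open check. As implemented in nanopolish_fast5_io.cpp above, `fast5_get_raw_samples` converts ADC values to picoamps as pA = (raw + offset) * range / digitisation, and `raw_table` (from the scrappie headers) returns the calloc'd sample buffer in its `raw` field:
```cpp
// Sketch: read one fast5 file with the API declared above (hypothetical path).
#include <cstdio>
#include <cstdlib>
#include "nanopolish_fast5_io.h"

int main()
{
    hid_t f = fast5_open("example_read.fast5");
    if(f < 0) {
        fprintf(stderr, "could not open fast5 file\n");
        return 1;
    }

    std::string read_id = fast5_get_read_id(f);
    fast5_raw_scaling scaling = fast5_get_channel_params(f);
    raw_table rt = fast5_get_raw_samples(f, scaling); // samples already scaled to pA

    printf("read %s: %zu samples at %.0f Hz\n", read_id.c_str(), rt.n, scaling.sample_rate);
    if(rt.raw != NULL) {
        printf("first sample: %.2f pA\n", rt.raw[rt.start]);
        free(rt.raw); // buffer is calloc'd inside fast5_get_raw_samples
    }

    fast5_close(f);
    return 0;
}
```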
......@@ -32,8 +32,8 @@ float profile_hmm_score(const HMMInputSequence& sequence, const HMMInputData& da
float profile_hmm_score_set(const std::vector<HMMInputSequence>& sequences, const HMMInputData& data, const uint32_t flags)
{
assert(!sequences.empty());
-assert(sequences[0].get_alphabet()->get_name() == "nucleotide");
-assert(data.pore_model->pmalphabet->get_name() == "nucleotide");
+assert(std::string(sequences[0].get_alphabet()->get_name()) == "nucleotide");
+assert(std::string(data.pore_model->pmalphabet->get_name()) == "nucleotide");
HMMInputData alt_data = data;
size_t num_models = sequences.size();
......
......@@ -63,7 +63,7 @@ int print_version(int, char **)
int main(int argc, char** argv)
{
// Turn off HDF's exception printing, which is generally unhelpful for users
-hdf5::H5Eset_auto(0, NULL, NULL);
+H5Eset_auto(0, NULL, NULL);
int ret = 0;
if(argc <= 1) {
......
......@@ -95,7 +95,7 @@ static const char *CALL_METHYLATION_USAGE_MESSAGE =
" -v, --verbose display verbose output\n"
" --version display version\n"
" --help display this help and exit\n"
" -r, --reads=FILE the 2D ONT reads are in fasta FILE\n"
" -r, --reads=FILE the ONT reads are in fasta FILE\n"
" -b, --bam=FILE the reads aligned to the genome assembly are in bam FILE\n"
" -g, --genome=FILE the genome we are computing a consensus for is in FILE\n"
" -t, --threads=NUM use NUM threads (default: 1)\n"
......
......@@ -414,9 +414,14 @@ std::vector<Variant> expand_variants(const AlignmentDB& alignments,
// deletion
Variant v = candidate_variants[vi];
-if(alignments.are_coordinates_valid(v.ref_name, v.ref_position, v.ref_position + v.ref_seq.size())) {
-v.ref_seq = alignments.get_reference_substring(v.ref_name, v.ref_position, v.ref_position + v.ref_seq.size());
+// Do not allow deletions to extend within opt::min_flanking_sequence of the end of the haplotype
+int deletion_end = v.ref_position + v.ref_seq.size();
+if(alignments.are_coordinates_valid(v.ref_name, v.ref_position, deletion_end) &&
+alignments.get_region_end() - deletion_end > opt::min_flanking_sequence ) {
+v.ref_seq = alignments.get_reference_substring(v.ref_name, v.ref_position, deletion_end);
assert(v.ref_seq != candidate_variants[vi].ref_seq);
assert(v.ref_seq.length() == candidate_variants[vi].ref_seq.length() + 1);
assert(v.ref_seq.substr(0, candidate_variants[vi].ref_seq.size()) == candidate_variants[vi].ref_seq);
out_variants.push_back(v);
}
......@@ -783,6 +788,7 @@ Haplotype call_haplotype_from_candidates(const AlignmentDB& alignments,
int calling_end = candidate_variants[end_variant_idx - 1].ref_position +
candidate_variants[end_variant_idx - 1].ref_seq.length() +
opt::min_flanking_sequence;
int calling_size = calling_end - calling_start;
if(opt::verbose > 2) {
......@@ -1097,6 +1103,11 @@ void parse_call_variants_options(int argc, char** argv)
}
}
void print_invalid_window_error(int start_base, int end_base)
{
fprintf(stderr, "[error] Invalid polishing window: [%d %d] - please adjust -w parameter.\n", start_base, end_base);
}
int call_variants_main(int argc, char** argv)
{
parse_call_variants_options(argc, argv);
......@@ -1121,9 +1132,16 @@ int call_variants_main(int argc, char** argv)
end_base = contig_length - 1;
}
// Verify window coordinates are correct
if(start_base > end_base) {
print_invalid_window_error(start_base, end_base);
fprintf(stderr, "The starting coordinate of the polishing window must be less than or equal to the end coordinate\n");
exit(EXIT_FAILURE);
}
int MIN_DISTANCE_TO_END = 40;
if(contig_length - start_base < MIN_DISTANCE_TO_END) {
fprintf(stderr, "Invalid polishing window: [%d %d] - please adjust -w parameter.\n", start_base, end_base);
print_invalid_window_error(start_base, end_base);
fprintf(stderr, "The starting coordinate of the polishing window must be at least %dbp from the contig end\n", MIN_DISTANCE_TO_END);
exit(EXIT_FAILURE);
}
......
......@@ -14,6 +14,7 @@
#define SUBPROGRAM "index"
#include <iostream>
#include <fstream>
#include <sstream>
#include <getopt.h>
......@@ -24,6 +25,7 @@
#include "fs_support.hpp"
#include "logger.hpp"
#include "profiler.h"
#include "nanopolish_fast5_io.h"
static const char *INDEX_VERSION_MESSAGE =
SUBPROGRAM " Version " PACKAGE_VERSION "\n"
......@@ -39,39 +41,59 @@ static const char *INDEX_USAGE_MESSAGE =
" --version display version\n"
" -v, --verbose display verbose output\n"
" -d, --directory path to the directory containing the raw ONT signal files. This option can be given multiple times.\n"
" -f, --fast5-fofn file containing the paths to each fast5 file for the run\n"
" -s, --sequencing-summary the sequencing summary file from albacore, providing this option will make indexing much faster\n"
" -f, --summary-fofn file containing the paths to the sequencing summary files (one per line)\n"
"\nReport bugs to " PACKAGE_BUGREPORT "\n\n";
namespace opt
{
static unsigned int verbose = 0;
static std::vector<std::string> raw_file_directories;
-static std::string fast5_fofn;
static std::string reads_file;
+static std::vector<std::string> sequencing_summary_files;
+static std::string sequencing_summary_fofn;
}
static std::ostream* os_p;
-void index_file(ReadDB& read_db, const std::string& fn)
+void index_file_from_map(ReadDB& read_db, const std::string& fn, const std::map<std::string, std::string>& fast5_to_read_name_map)
{
PROFILE_FUNC("index_file")
fast5::File* fp = NULL;
try {
fp = new fast5::File(fn);
if(fp->is_open()) {
fast5::Raw_Samples_Params params = fp->get_raw_samples_params();
std::string read_id = params.read_id;
read_db.add_signal_path(read_id, fn);
PROFILE_FUNC("index_file_from_map")
// Check if this fast5 file is in the map
size_t last_dir_pos = fn.find_last_of("/");
std::string fast5_basename = last_dir_pos == std::string::npos ? fn : fn.substr(last_dir_pos + 1);
const auto& iter = fast5_to_read_name_map.find(fast5_basename);
if(iter != fast5_to_read_name_map.end()) {
if(read_db.has_read(iter->second)) {
read_db.add_signal_path(iter->second.c_str(), fn);
}
} else {
if(opt::verbose > 0) {
fprintf(stderr, "Could not find read %s in sequencing summary file\n", fn.c_str());
}
-} catch(hdf5_tools::Exception e) {
-fprintf(stderr, "skipping invalid fast5 file: %s\n", fn.c_str());
}
-delete fp;
} // process_file
-void index_path(ReadDB& read_db, const std::string& path)
+void index_file_from_fast5(ReadDB& read_db, const std::string& fn, const std::map<std::string, std::string>& fast5_to_read_name_map)
{
fprintf(stderr, "Indexing %s\n", path.c_str());
PROFILE_FUNC("index_file_from_fast5")
hid_t hdf5_file = fast5_open(fn);
if(hdf5_file < 0) {
fprintf(stderr, "could not open fast5 file: %s\n", fn.c_str());
}
std::string read_id = fast5_get_read_id(hdf5_file);
if(read_id != "") {
read_db.add_signal_path(read_id, fn);
}
fast5_close(hdf5_file);
} // process_file
+void index_path(ReadDB& read_db, const std::string& path, const std::map<std::string, std::string>& fast5_to_read_name_map)
{
fprintf(stderr, "[readdb] indexing %s\n", path.c_str());
if (is_directory(path)) {
auto dir_list = list_directory(path);
for (const auto& fn : dir_list) {
......@@ -82,36 +104,94 @@ void index_path(ReadDB& read_db, const std::string& path)
std::string full_fn = path + "/" + fn;
if(is_directory(full_fn)) {
// recurse
-index_path(read_db, full_fn);
-read_db.print_stats();
+index_path(read_db, full_fn, fast5_to_read_name_map);
} else if (full_fn.find(".fast5") != -1) {
-index_file(read_db, full_fn);
+if(!fast5_to_read_name_map.empty()) {
+index_file_from_map(read_db, full_fn, fast5_to_read_name_map);
+} else {
+index_file_from_fast5(read_db, full_fn, fast5_to_read_name_map);
+}
}
}
}
} // process_path
-void process_fast5_fofn(ReadDB& read_db, const std::string& fast5_fofn)
+// read sequencing summary files from the fofn and add them to the list
+void process_summary_fofn()
{
-//
-std::ifstream in_file(fast5_fofn.c_str());
+if(opt::sequencing_summary_fofn.empty()) {
+return;
+}
+// open
+std::ifstream in_file(opt::sequencing_summary_fofn.c_str());
if(in_file.bad()) {
fprintf(stderr, "error: could not file %s\n", fast5_fofn.c_str());
fprintf(stderr, "error: could not file %s\n", opt::sequencing_summary_fofn.c_str());
exit(EXIT_FAILURE);
}
// read
-size_t count = 0;
std::string filename;
while(getline(in_file, filename)) {
-if(count++ % 1000 == 0) {
-fprintf(stderr, "Loaded %zu from fofn\n", count);
+opt::sequencing_summary_files.push_back(filename);
}
-index_file(read_db, filename);
}
void exit_bad_header(const std::string& str, const std::string& filename)
{
fprintf(stderr, "Could not find %s column in the header of %s\n", str.c_str(), filename.c_str());
exit(EXIT_FAILURE);
}
static const char* shortopts = "vd:f:";
//
void parse_sequencing_summary(const std::string& filename, std::map<std::string, std::string>& out_map)
{
// open
std::ifstream in_file(filename.c_str());
if(in_file.bad()) {
fprintf(stderr, "error: could not file %s\n", filename.c_str());
exit(EXIT_FAILURE);
}
// read header to get the column index of the read and file name
std::string header;
getline(in_file, header);
std::vector<std::string> fields = split(header, '\t');
const std::string READ_NAME_STR = "read_id";
const std::string FILENAME_STR = "filename";
size_t filename_idx = -1;
size_t read_name_idx = -1;
for(size_t i = 0; i < fields.size(); ++i) {
if(fields[i] == READ_NAME_STR) {
read_name_idx = i;
}
if(fields[i] == FILENAME_STR) {
filename_idx = i;
}
}
if(filename_idx == -1) {
exit_bad_header(FILENAME_STR, filename);
}
if(read_name_idx == -1) {
exit_bad_header(READ_NAME_STR, filename);
}
// read records and add to map
std::string line;
while(getline(in_file, line)) {
fields = split(line, '\t');
std::string fast5_filename = fields[filename_idx];
std::string read_name = fields[read_name_idx];
out_map[fast5_filename] = read_name;
}
}
static const char* shortopts = "vd:f:s:";
enum {
OPT_HELP = 1,
......@@ -125,7 +205,8 @@ static const struct option longopts[] = {
{ "log-level", required_argument, NULL, OPT_LOG_LEVEL },
{ "verbose", no_argument, NULL, 'v' },
{ "directory", required_argument, NULL, 'd' },
{ "fast5-fofn", required_argument, NULL, 'f' },
{ "sequencing-summary-file", required_argument, NULL, 's' },
{ "summary-fofn", required_argument, NULL, 'f' },
{ NULL, 0, NULL, 0 }
};
......@@ -146,8 +227,9 @@ void parse_index_options(int argc, char** argv)
log_level.push_back(arg.str());
break;
case 'v': opt::verbose++; break;
+case 's': opt::sequencing_summary_files.push_back(arg.str()); break;
case 'd': opt::raw_file_directories.push_back(arg.str()); break;
-case 'f': arg >> opt::fast5_fofn; break;
+case 'f': arg >> opt::sequencing_summary_fofn; break;
}
}
......@@ -179,10 +261,20 @@ int index_main(int argc, char** argv)
{
parse_index_options(argc, argv);
// Read a map from fast5 files to read name from the sequencing summary files (if any)
process_summary_fofn();
std::map<std::string, std::string> fast5_to_read_name_map;
for(const auto& ss_filename : opt::sequencing_summary_files) {
if(opt::verbose > 2) {
fprintf(stderr, "summary: %s\n", ss_filename.c_str());
}
parse_sequencing_summary(ss_filename, fast5_to_read_name_map);
}
// import read names, and possibly fast5 paths, from the fasta/fastq file
ReadDB read_db;
read_db.build(opt::reads_file);
bool all_reads_have_paths = read_db.check_signal_paths();
// if the input fastq did not contain a complete set of paths
......@@ -190,15 +282,17 @@ int index_main(int argc, char** argv)
if(!all_reads_have_paths) {
for(const auto& dir_name : opt::raw_file_directories) {
-index_path(read_db, dir_name);
-}
-if(!opt::fast5_fofn.empty()) {
-process_fast5_fofn(read_db, opt::fast5_fofn);
+index_path(read_db, dir_name, fast5_to_read_name_map);
}
}
size_t num_with_path = read_db.get_num_reads_with_path();
if(num_with_path == 0) {
fprintf(stderr, "Error: no fast5 files found\n");
exit(EXIT_FAILURE);
} else {
read_db.print_stats();
read_db.save();
}
return 0;
}
......@@ -209,7 +209,9 @@ void phase_single_read(const ReadDB& read_db,
upper_search.ref_position = alignment_end_pos;
auto upper_iter = std::upper_bound(variants.begin(), variants.end(), upper_search, sortByPosition);
fprintf(stderr, "%s %s:%u-%u %zu\n", read_name.c_str(), ref_name.c_str(), alignment_start_pos, alignment_end_pos, upper_iter - lower_iter);
if(opt::verbose >= 1) {
fprintf(stderr, "Phasing read %s %s:%u-%u %zu\n", read_name.c_str(), ref_name.c_str(), alignment_start_pos, alignment_end_pos, upper_iter - lower_iter);
}
// no variants to phase?
if(lower_iter == variants.end()) {
......@@ -223,6 +225,9 @@ void phase_single_read(const ReadDB& read_db,
alignment_end_pos,
&fetched_len);
// convert to upper case to avoid calling c>C as variants
std::transform(reference_seq.begin(), reference_seq.end(), reference_seq.begin(), ::toupper);
std::string read_outseq = reference_seq;
std::string read_outqual(reference_seq.length(), MAX_Q_SCORE + BAM_Q_OFFSET);
......@@ -259,6 +264,7 @@ void phase_single_read(const ReadDB& read_db,
data.strand = event_align_record.strand;
data.rc = event_align_record.rc;
data.event_stride = event_align_record.stride;
data.pore_model = data.read->get_base_model(data.strand);
int e1,e2;
bool bounded = AlignmentDB::_find_by_ref_bounds(event_align_record.aligned_events,
......@@ -301,7 +307,10 @@ void phase_single_read(const ReadDB& read_db,
//fprintf(stderr, "\t%s score: %.2lf %.2lf %c p_wrong: %.3lf Q: %d QC: %c\n", v.key().c_str(), ref_score, alt_score, call, log_p_wrong, (int)q_score, q_char);
int out_position = v.ref_position - alignment_start_pos;
-assert(read_outseq[out_position] == v.ref_seq[0]);
+if(read_outseq[out_position] != v.ref_seq[0]) {
+fprintf(stderr, "warning: reference base at position %d does not match variant record (%c != %c)\n",
+v.ref_position, v.ref_seq[0], read_outseq[out_position]);
+}
read_outseq[out_position] = call;
read_outqual[out_position] = q_char;
}
......@@ -314,7 +323,7 @@ void phase_single_read(const ReadDB& read_db,
out_record->core.tid = record->core.tid;
out_record->core.pos = alignment_start_pos;
out_record->core.qual = record->core.qual;
-out_record->core.flag = record->core.flag & BAM_FSUPPLEMENTARY;
+out_record->core.flag = record->core.flag;
// no read pairs
out_record->core.mtid = -1;
......@@ -345,6 +354,9 @@ int phase_reads_main(int argc, char** argv)
// load reference fai file
faidx_t *fai = fai_load(opt::genome_file.c_str());
if(fai == NULL) {
exit(EXIT_FAILURE);
}
std::vector<Variant> variants;
if(!opt::region.empty()) {
......@@ -375,8 +387,10 @@ int phase_reads_main(int argc, char** argv)
BamProcessor processor(opt::bam_file, opt::region, opt::num_threads);
// Copy the bam header to the output
-sam_hdr_write(sam_out, processor.get_bam_header());
+int ret = sam_hdr_write(sam_out, processor.get_bam_header());
+if(ret != 0) {
+fprintf(stderr, "[warning] sam_hdr_write returned %d\n", ret);
+}
processor.parallel_run(f);
fai_destroy(fai);
......
......@@ -190,6 +190,12 @@ void ReadDB::add_signal_path(const std::string& read_id, const std::string& path
m_data[read_id].signal_data_path = path;
}
bool ReadDB::has_read(const std::string& read_id) const
{
const auto& iter = m_data.find(read_id);
return iter != m_data.end();
}
//
std::string ReadDB::get_signal_path(const std::string& read_id) const
{
......@@ -238,14 +244,21 @@ void ReadDB::save() const
//
-bool ReadDB::check_signal_paths() const
+size_t ReadDB::get_num_reads_with_path() const
{
+size_t num_reads_with_path = 0;
for(const auto& iter : m_data) {
if(iter.second.signal_data_path == "") {
return false;
if(iter.second.signal_data_path != "") {
num_reads_with_path += 1;
}
}
+return num_reads_with_path;
+}
-return true;
+bool ReadDB::check_signal_paths() const
+{
+size_t num_reads_with_path = get_num_reads_with_path();
+return num_reads_with_path == m_data.size();
}
//
......@@ -255,5 +268,5 @@ void ReadDB::print_stats() const
for(const auto& iter : m_data) {
num_reads_with_path += iter.second.signal_data_path != "";
}
fprintf(stderr, "[readdb] num reads: %zu, num reads with path: %zu\n", m_data.size(), num_reads_with_path);
fprintf(stderr, "[readdb] num reads: %zu, num reads with path to fast5: %zu\n", m_data.size(), num_reads_with_path);
}