Skip to content
Commits on Source (5)
......@@ -53,6 +53,7 @@ And read about how we use it to complete bacterial genomes here:
* [Known contamination](#known-contamination)
* [Manual multiplicity](#manual-multiplicity)
* [Manual completion](#manual-completion)
* [Using an external long-read assembly](#using-an-external-long-read-assembly)
* [Other included tools](#other-included-tools)
* [Paper](#paper)
* [Acknowledgements](#acknowledgements)
......@@ -374,7 +375,8 @@ Help:
Input:
-1 SHORT1, --short1 SHORT1 FASTQ file of first short reads in each pair (required)
-2 SHORT2, --short2 SHORT2 FASTQ file of second short reads in each pair (required)
-s UNPAIRED, --unpaired UNPAIRED FASTQ file of unpaired short reads (optional)
-s UNPAIRED, --unpaired UNPAIRED
FASTQ file of unpaired short reads (optional)
-l LONG, --long LONG FASTQ or FASTA file of long reads (optional)
Output:
......@@ -382,8 +384,9 @@ Output:
--verbosity VERBOSITY Level of stdout and log file information (default: 1)
0 = no stdout, 1 = basic progress indicators, 2 = extra info,
3 = debugging info
--min_fasta_length MIN_FASTA_LENGTH Exclude contigs from the FASTA file which are shorter than this length
(default: 100)
--min_fasta_length MIN_FASTA_LENGTH
Exclude contigs from the FASTA file which are shorter than this
length (default: 100)
--keep KEEP Level of file retention (default: 1)
0 = only keep final files: assembly (FASTA, GFA and log),
1 = also save graphs at main checkpoints,
......@@ -394,13 +397,13 @@ Output:
Other:
-t THREADS, --threads THREADS Number of threads used (default: 8)
--mode {conservative,normal,bold} Bridging mode (default: normal)
--mode {conservative,normal,bold}
Bridging mode (default: normal)
conservative = smaller contigs, lowest misassembly rate
normal = moderate contig size and misassembly rate
bold = longest contigs, higher misassembly rate
--linear_seqs LINEAR_SEQS The expected number of linear (i.e. non-circular) sequences in the
underlying sequence (default: 0)
--linear_seqs LINEAR_SEQS The expected number of linear (i.e. non-circular) sequences in
the underlying sequence (default: 0)
```
### Advanced options
......@@ -409,53 +412,72 @@ Run `unicycler --help_all` to see a complete list of the program's options. Thes
```
SPAdes assembly:
These options control the short-read SPAdes assembly at the beginning of the Unicycler pipeline.
These options control the short-read SPAdes assembly at the beginning of the Unicycler
pipeline.
--spades_path SPADES_PATH Path to the SPAdes executable (default: spades.py)
--no_correct Skip SPAdes error correction step (default: conduct SPAdes error
correction)
--min_kmer_frac MIN_KMER_FRAC Lowest k-mer size for SPAdes assembly, expressed as a fraction of the read
length (default: 0.2)
--max_kmer_frac MAX_KMER_FRAC Highest k-mer size for SPAdes assembly, expressed as a fraction of the read
length (default: 0.95)
--min_kmer_frac MIN_KMER_FRAC Lowest k-mer size for SPAdes assembly, expressed as a fraction of
the read length (default: 0.2)
--max_kmer_frac MAX_KMER_FRAC Highest k-mer size for SPAdes assembly, expressed as a fraction
of the read length (default: 0.95)
--kmers KMERS Exact k-mers to use for SPAdes assembly, comma-separated
(example: 22,33,44, default: automatic)
--kmer_count KMER_COUNT Number of k-mer steps to use in SPAdes assembly (default: 10)
--depth_filter DEPTH_FILTER Filter out contigs lower than this fraction of the chromosomal depth, if
doing so does not result in graph dead ends (default: 0.25)
--depth_filter DEPTH_FILTER Filter out contigs lower than this fraction of the chromosomal
depth, if doing so does not result in graph dead ends (default:
0.25)
--spades_tmp_dir SPADES_TMP_DIR
Specify SPAdes temporary directory using the SPAdes --tmp-dir
option (default: make a temporary directory in the output
directory)
miniasm+Racon assembly:
These options control the use of miniasm and Racon to produce long-read bridges.
--no_miniasm Skip miniasm+Racon bridging (default: use miniasm and Racon to produce
long-read bridges)
--no_miniasm Skip miniasm+Racon bridging (default: use miniasm and Racon to
produce long-read bridges)
--racon_path RACON_PATH Path to the Racon executable (default: racon)
--existing_long_read_assembly EXISTING_LONG_READ_ASSEMBLY
A pre-prepared long read assembly for the sample in GFA format.
If this option is used, Unicycler will skip the miniasm/Racon
steps and instead use the given assembly (default: perform long
read assembly using miniasm/Racon)
Assembly rotation:
These options control the rotation of completed circular sequence near the end of the Unicycler pipeline.
--no_rotate Do not rotate completed replicons to start at a standard gene (default:
completed replicons are rotated)
--start_genes START_GENES FASTA file of genes for start point of rotated replicons (default:
start_genes.fasta)
--start_gene_id START_GENE_ID The minimum required BLAST percent identity for a start gene search
(default: 90.0)
--start_gene_cov START_GENE_COV The minimum required BLAST percent coverage for a start gene search
(default: 95.0)
These options control the rotation of completed circular sequence near the end of the
Unicycler pipeline.
--no_rotate Do not rotate completed replicons to start at a standard gene
(default: completed replicons are rotated)
--start_genes START_GENES FASTA file of genes for start point of rotated replicons
(default: start_genes.fasta)
--start_gene_id START_GENE_ID The minimum required BLAST percent identity for a start gene
search (default: 90.0)
--start_gene_cov START_GENE_COV
The minimum required BLAST percent coverage for a start gene
search (default: 95.0)
--makeblastdb_path MAKEBLASTDB_PATH
Path to the makeblastdb executable (default: makeblastdb)
--tblastn_path TBLASTN_PATH Path to the tblastn executable (default: tblastn)
Pilon polishing:
These options control the final assembly polish using Pilon at the end of the Unicycler pipeline.
These options control the final assembly polish using Pilon at the end of the Unicycler
pipeline.
--no_pilon Do not use Pilon to polish the final assembly (default: Pilon is used)
--no_pilon Do not use Pilon to polish the final assembly (default: Pilon is
used)
--bowtie2_path BOWTIE2_PATH Path to the bowtie2 executable (default: bowtie2)
--bowtie2_build_path BOWTIE2_BUILD_PATH
Path to the bowtie2_build executable (default: bowtie2-build)
--samtools_path SAMTOOLS_PATH Path to the samtools executable (default: samtools)
--pilon_path PILON_PATH Path to a Pilon executable or the Pilon Java archive file (default: pilon)
--pilon_path PILON_PATH Path to a Pilon executable or the Pilon Java archive file
(default: pilon)
--java_path JAVA_PATH Path to the java executable (default: java)
--min_polish_size MIN_POLISH_SIZE Contigs shorter than this value (bp) will not be polished using Pilon
(default: 10000)
--min_polish_size MIN_POLISH_SIZE
Contigs shorter than this value (bp) will not be polished using
Pilon (default: 10000)
VCF:
These options control the production of the VCF of the final assembly.
......@@ -466,20 +488,20 @@ Graph cleaning:
These options control the removal of small leftover sequences after bridging is complete.
--min_component_size MIN_COMPONENT_SIZE
Graph components smaller than this size (bp) will be removed from the final
graph (default: 1000)
Graph components smaller than this size (bp) will be removed from
the final graph (default: 1000)
--min_dead_end_size MIN_DEAD_END_SIZE
Graph dead ends smaller than this size (bp) will be removed from the final
graph (default: 1000)
Graph dead ends smaller than this size (bp) will be removed from
the final graph (default: 1000)
Long read alignment:
These options control the alignment of long reads to the assembly graph.
--contamination CONTAMINATION FASTA file of known contamination in long reads
--scores SCORES Comma-delimited string of alignment scores: match, mismatch, gap open, gap
extend (default: 3,-6,-5,-2)
--low_score LOW_SCORE Score threshold - alignments below this are considered poor (default: set
threshold automatically)
--scores SCORES Comma-delimited string of alignment scores: match, mismatch, gap
open, gap extend (default: 3,-6,-5,-2)
--low_score LOW_SCORE Score threshold - alignments below this are considered poor
(default: set threshold automatically)
```
......@@ -620,6 +642,12 @@ If you believe this has happened in your assembly, you can manually assign multi
If Unicycler doesn't complete your bacterial genome assembly on its own, you may be able to complete it manually with a bit of bioinformatics detective work. There's no single, straight-forward procedure for doing so, but I've put together [a few examples on the Unicycler wiki](https://github.com/rrwick/Unicycler/wiki/Tips-for-finishing-genomes) which may be helpful.
### Using an external long-read assembly
If you have a long-read assembly that you've prepared outside Unicycler and trust (e.g. with Canu), you can give it to Unicycler with `--existing_long_read_assembly`. Unicycler will then skip its miniasm/Racon step and use this assembly instead.
# Other included tools
......
unicycler (0.4.6+dfsg-1) UNRELEASED; urgency=medium
unicycler (0.4.7+dfsg-1) UNRELEASED; urgency=medium
* Initial release (Closes: #<bug>)
Needs: racon (#890187)
pilon (#903911)
-- Andreas Tille <tille@debian.org> Fri, 25 May 2018 10:59:36 +0200
-- Andreas Tille <tille@debian.org> Thu, 06 Sep 2018 08:32:49 +0200
......@@ -58,7 +58,7 @@ Description: SPAdes is in Debian at /usr/bin/spades
spades_cmd += ['-1', reads_1, '-2', reads_2]
--- a/test/test_misc.py
+++ b/test/test_misc.py
@@ -384,14 +384,14 @@ class TestMiscFunctions(unittest.TestCas
@@ -391,14 +391,14 @@ class TestMiscFunctions(unittest.TestCas
def test_spades_version_parsing_3(self):
spades_version_output = 'option -v not recognized\nSPAdes genome assembler v.3.5.0\n\n' \
......@@ -77,7 +77,7 @@ Description: SPAdes is in Debian at /usr/bin/spades
self.assertEqual(version, '2.4.0')
--- a/README.md
+++ b/README.md
@@ -93,7 +93,7 @@ Reasons to __not__ use Unicycler:
@@ -94,7 +94,7 @@ Reasons to __not__ use Unicycler:
* [ICC](https://software.intel.com/en-us/c-compilers) also works (though I don't know the minimum required version number)
* [setuptools](https://packaging.python.org/installing/#install-pip-setuptools-and-wheel) (only required for installation of Unicycler)
* For short-read or hybrid assembly:
......@@ -86,15 +86,15 @@ Description: SPAdes is in Debian at /usr/bin/spades
* For long-read or hybrid assembly:
* [Racon](https://github.com/isovic/racon) (`racon`)
* For polishing
@@ -411,7 +411,7 @@ Run `unicycler --help_all` to see a comp
SPAdes assembly:
These options control the short-read SPAdes assembly at the beginning of the Unicycler pipeline.
@@ -415,7 +415,7 @@ SPAdes assembly:
These options control the short-read SPAdes assembly at the beginning of the Unicycler
pipeline.
- --spades_path SPADES_PATH Path to the SPAdes executable (default: spades.py)
+ --spades_path SPADES_PATH Path to the SPAdes executable (default: spades)
--no_correct Skip SPAdes error correction step (default: conduct SPAdes error
correction)
--min_kmer_frac MIN_KMER_FRAC Lowest k-mer size for SPAdes assembly, expressed as a fraction of the read
--min_kmer_frac MIN_KMER_FRAC Lowest k-mer size for SPAdes assembly, expressed as a fraction of
--- a/setup.py
+++ b/setup.py
@@ -63,7 +63,7 @@ def missing_tool(tool_name):
......@@ -137,7 +137,7 @@ Description: SPAdes is in Debian at /usr/bin/spades
help='Path to the SPAdes executable'
if show_all_args else argparse.SUPPRESS)
spades_group.add_argument('--no_correct', action='store_true',
@@ -753,7 +753,7 @@ def check_dependencies(args, short_reads
@@ -756,7 +756,7 @@ def check_dependencies(args, short_reads
spades_path, spades_version, spades_status = '', '', 'not used'
else:
spades_path, spades_version, spades_status = spades_path_and_version(args.spades_path)
......@@ -146,7 +146,7 @@ Description: SPAdes is in Debian at /usr/bin/spades
if args.verbosity > 1:
spades_row.append(spades_path)
program_table.append(spades_row)
@@ -861,7 +861,7 @@ def quit_if_dependency_problem(spades_st
@@ -864,7 +864,7 @@ def quit_if_dependency_problem(spades_st
quit_with_error('SPAdes cannot run due to an incompatible Python version')
if spades_status == 'bad':
quit_with_error('SPAdes was found but does not produce output (make sure to use '
......
......@@ -9,6 +9,8 @@ export DH_BUILD_MAINT_OPTIONS=nocheck
export DEB_BUILD_MAINT_OPTIONS=hardening=+all
include /usr/share/dpkg/default.mk
%:
dh $@ --with python3 --buildsystem=pybuild
......@@ -28,10 +30,12 @@ override_dh_auto_build:
override_dh_auto_test:
BUILDPATH=$(shell pybuild --print build_dir --interpreter python3) ; \
ln -s $(CURDIR)/sample_data $${BUILDPATH} ; \
ln -s $(CURDIR)/unicycler-runner.py $${BUILDPATH} ; \
cp -a unicycler/*.so $${BUILDPATH}/$(PYBUILD_NAME) ; \
PYTHONPATH=$${BUILDPATH}/$(PYBUILD_NAME) dh_auto_test || true
for bp in .pybuild/cpython*$(DEB_SOURCE)/build ; do \
ln -s $(CURDIR)/sample_data $${bp} ; \
ln -s $(CURDIR)/unicycler-runner.py $${bp} ; \
cp -a unicycler/*.so $${bp}/$(PYBUILD_NAME) ; \
done ; \
PYTHONPATH=$${BUILDPATH}/$(PYBUILD_NAME) dh_auto_test # || true
override_dh_link:
dh_link
......
......@@ -85,6 +85,7 @@ Generic long reads:
Polishing settings:
Various settings for polishing behaviour (defaults should work well in most cases)
--no_fix_local do not fix local misassemblies (default: False)
--min_insert MIN_INSERT minimum valid short read insert size (default: auto)
--max_insert MAX_INSERT maximum valid short read insert size (default: auto)
--min_align_length MIN_ALIGN_LENGTH Minimum long read alignment length (default: 1000)
......
......@@ -372,6 +372,13 @@ class TestMiscFunctions(unittest.TestCase):
version = unicycler.misc.java_version_from_java_output(java_version_output)
self.assertEqual(version, '')
def test_java_version_parsing_6(self):
java_version_output = 'openjdk version "9-Ubuntu"\n' \
'OpenJDK Runtime Environment (build 9-Ubuntu+0-9b181-4)\n' \
'OpenJDK 64-Bit Server VM (build 9-Ubuntu+0-9b181-4, mixed mode)'
version = unicycler.misc.java_version_from_java_output(java_version_output)
self.assertEqual(version, '9-Ubuntu')
def test_spades_version_parsing_1(self):
spades_version_output = 'SPAdes v3.10.1'
version = unicycler.misc.spades_version_from_spades_output(spades_version_output)
......
......@@ -36,6 +36,7 @@ def find_start_gene(sequence, start_genes_fasta, identity_threshold, coverage_th
# gene overlaps from the end to the start, we have to duplicate some of the replicon sequence
# for the BLAST database.
seq_len = len(sequence)
start_genes_fasta = os.path.abspath(start_genes_fasta)
queries = load_fasta(start_genes_fasta)
if not queries:
raise CannotFindStart
......
......@@ -1070,6 +1070,7 @@ def java_path_and_version(java_path):
# Make sure Java is 1.7+
try:
if '.' in version:
major_version = int(version.split('.')[0])
if major_version < 1:
status = 'too old'
......@@ -1081,8 +1082,17 @@ def java_path_and_version(java_path):
status = 'too old'
else:
status = 'good'
elif '-' in version:
# E.g. '9-Ubuntu'
major_version = int(version.split('-')[0])
if major_version < 7:
status = 'too old'
else:
status = 'good'
else:
status = 'bad'
except (ValueError, IndexError):
version, status = '?', 'too old'
status = 'bad'
return found_java_path, version, status
......
......@@ -336,7 +336,7 @@ def polish_with_pilon(graph, args, polish_dir, insert_size_1st, insert_size_99th
for f in list_of_files:
try:
os.remove(f)
except FileNotFoundError:
except (FileNotFoundError, OSError):
pass
return total_count
......
......@@ -73,7 +73,7 @@ def get_best_spades_graph(short1, short2, short_unpaired, out_dir, read_depth_fi
kmer_range = kmers
else:
kmer_range = get_kmer_range(short1, short2, short_unpaired, spades_dir, kmer_count,
min_k_frac, max_k_frac)
min_k_frac, max_k_frac, spades_path)
assem_dir = os.path.join(spades_dir, 'assembly')
log.log_section_header('SPAdes assemblies')
......@@ -427,8 +427,27 @@ def spades_assembly(read_files, out_dir, kmers, threads, spades_path, spades_tmp
return graph_files, insert_size_mean, insert_size_deviation
def get_max_spades_kmer(spades_path):
"""
SPAdes usually has a maximum k-mer size of 127, but this can be changed when compiling SPAdes,
so this function checks the help text to see what it is.
https://github.com/ablab/spades/issues/40
"""
try:
process = subprocess.Popen([spades_path, '--help'], stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
out, err = process.communicate()
all_output = out.decode() + err.decode()
all_output = all_output.replace('\n', ' ')
all_output = ' '.join(all_output.split())
max_kmer = all_output.split('must be odd and less than ')[1].split(')')[0]
return int(max_kmer) - 1
except (IndexError, ValueError):
return 127
def get_kmer_range(reads_1_filename, reads_2_filename, unpaired_reads_filename, spades_dir,
kmer_count, min_kmer_frac, max_kmer_frac):
kmer_count, min_kmer_frac, max_kmer_frac, spades_path):
"""
Uses the read lengths to determine the k-mer range to be used in the SPAdes assembly.
"""
......@@ -453,6 +472,11 @@ def get_kmer_range(reads_1_filename, reads_2_filename, unpaired_reads_filename,
except ValueError:
pass
max_spades_kmer = get_max_spades_kmer(spades_path)
log.log('SPAdes maximum k-mer: {}'.format(max_spades_kmer))
if max_spades_kmer != 127:
log.log(' (unusual value, probably indicates custom SPAdes compilation)')
# If the code got here, then the k-mer range doesn't already exist and we'll create one by
# examining the read lengths.
read_lengths = get_read_lengths(reads_1_filename) + get_read_lengths(reads_2_filename) + \
......@@ -460,18 +484,25 @@ def get_kmer_range(reads_1_filename, reads_2_filename, unpaired_reads_filename,
read_lengths = sorted(read_lengths)
median_read_length = read_lengths[len(read_lengths) // 2 - 1]
max_kmer = round_to_nearest_odd(max_kmer_frac * median_read_length)
if max_kmer > 127:
max_kmer = 127
if max_kmer > max_spades_kmer:
max_kmer = max_spades_kmer
starting_kmer = round_to_nearest_odd(min_kmer_frac * max_kmer / max_kmer_frac)
if starting_kmer < 11:
starting_kmer = 11
# Create the k-mer range from a non-linear function that spaces out the early k-mers more and
# makes the later k-mers (which are most likely to be the good, used ones) closer together.
if kmer_count == 1:
kmer_range = [max_kmer]
elif kmer_count == 2:
kmer_range = [starting_kmer, max_kmer]
else:
# Create the k-mer range from a non-linear function that spaces out the early k-mers more
# and makes the later k-mers (which are most likely to be the good, used ones) closer
# together.
kmer_range = []
for x in [x / (kmer_count - 1) for x in range(kmer_count)]:
kmer_range.append((max_kmer - starting_kmer) * (2 - 2 / (x + 1)) + starting_kmer)
kmer_range = sorted(list(set([round_to_nearest_odd(x) for x in kmer_range])))
kmer_range_str = ', '.join([str(x) for x in kmer_range])
log.log('Median read length: ' + str(median_read_length))
......
......@@ -493,6 +493,9 @@ def get_arguments():
if args.threads <= 0:
quit_with_error('--threads must be at least 1')
if args.kmer_count < 1:
quit_with_error('--kmer_count must be at least 1')
if args.kmers is not None:
args.kmers = args.kmers.split(',')
try:
......
......@@ -106,6 +106,8 @@ def get_arguments():
settings_group = parser.add_argument_group('Polishing settings',
'Various settings for polishing behaviour '
'(defaults should work well in most cases)')
settings_group.add_argument('--no_fix_local', action='store_true',
help='do not fix local misassemblies')
settings_group.add_argument('--min_insert', type=int,
help='minimum valid short read insert size (default: auto)')
settings_group.add_argument('--max_insert', type=int,
......@@ -410,6 +412,11 @@ def pilon_small_changes(fasta, round_num, args, all_ale_scores):
def pilon_large_changes(fasta, round_num, args, all_ale_scores):
if args.no_fix_local:
if args.verbosity > 0:
print('Not fixing local misassemblies', flush=True)
return fasta, round_num, []
current, round_num, applied_variant = ale_assessed_changes(fasta, round_num, args, True, False,
all_ale_scores, 'local',
'Pilon polish, large variants, '
......
......@@ -13,4 +13,4 @@ details. You should have received a copy of the GNU General Public License along
not, see <http://www.gnu.org/licenses/>.
"""
__version__ = '0.4.6'
__version__ = '0.4.7'