Skip to content
Commits on Source (7)
## Quickstart
a5_pipeline.pl read_1.fastq.gz read_2.fastq.gz mygenome
This will assemble the paired reads in the two files `read_1.fastq.gz` and `read_2.fastq.gz`. These example files are gzip compressed just like standard MiSeq reporter output, but need not be. The final scaffolded assembly will be saved in `mygenome.final.scaffolds.fasta`.
## Copyright and license
The A5-miseq pipeline is (c) 2011-2014 Andrew Tritt and Aaron Darling. A5-miseq is free, open source software licensed under the GPLv3, please see the file LICENSE for details. The A5-miseq pipeline includes open-source components developed and copyright by 3rd parties: `bwa`, `samtools`, `SGA`, `bowtie`, `IDBA-UD`, `SSPACE`, and `Trimmomatic`. Source code for these components is available from the following locations:
`bwa`: https://sourceforge.net/projects/bio-bwa/
`samtools`: https://sourceforge.net/projects/samtools/
`SGA`: https://github.com/jts/sga
`bowtie`: https://sourceforge.net/projects/bowtie-bio/
`Trimmomatic`: http://www.usadellab.org/cms/?page=trimmomatic
Please see their license agreements for further details.
The following two components have been modified from their original versions and the corresponding GPL licensed source code is available in the A5-miseq repository:
`IDBA-UD`: https://sourceforge.net/p/ngopt/code/HEAD/tree/trunk/idba-1.1.1/
`SSPACE`: https://sourceforge.net/p/ngopt/code/HEAD/tree/trunk/SSPACE/
## What is A5-miseq?
_A5-miseq_ is a pipeline for assembling DNA sequence data generated on the Illumina sequencing platform. This README will take you through the steps necessary for running _A5-miseq_.
## What A5-miseq can't do
There are many situations where A5-miseq is not the right tool for the job. In order to produce accurate results, A5-miseq requires Illumina data with certain characteristics. A5-miseq will likely not work well with Illumina reads shorter than around 80nt, or reads where the base qualities are low in all or most reads before 60nt. A5-miseq assumes it is assembling homozygous haploid genomes. Use a different assembler for metagenomes and heterozygous diploid or polyploid organisms. Use a different assembler if a tool like FastQC reports your data quality is dubious. You have been warned!
Datasets consisting solely of unpaired reads are not currently supported.
## Requirements
A5-miseq requires 64-bit Linux (kernel 2.6.15 or later) or Mac OS X 10.6 or later. A Java Runtime Environment is also required. Mac OS X includes Java. On Linux, check with your distribution provider for details on installing Java.
## Installation
Once you have downloaded and extracted the pipeline, the `a5_pipeline.pl` script can be run in place. Optionally, the pipeline's bin/ directory can be added to the `$PATH` environment variable so that specifying the full path to `a5_pipeline.pl` is not necessary:
export PATH=$PATH:/path/to/ngopt/bin
Please change /path/to/ngopt/bin appropriately, and put this command in ~/.bashrc or ~/.profile or another script that gets run at login time. The pipeline does not need to be copied to a specific location in order to be installed. You do not need root or superuser or administrator access to install and use the pipeline with this approach.
## Usage details
Usage: a5_pipeline.pl [--begin=1-5] [--end=1-5] [--preprocessed] <lib_file> <out_base>
Or: a5_pipeline.pl <Read 1 FastQ> <Read 2 FastQ> <out_base>
Or: a5_pipeline.pl <Read 1,2 Interleaved FastQ> <out_base>
<out_base> is the base file name for all output files. When assembling from
a single library, the fastq files may be given directly on the command line.
If using more than one library, a library file must be given as <lib_file>.
The library file must contain the filenames of all read files.
If --preprocessed is used, <lib_file> is expected to be the library file
created during step 2 of the pipeline, named <out_base>.preproc.libs. Note
that this flag only applies if beginning pipeline after step 2.
#### Example usage 1
a5_pipeline.pl my_libs assembly.out
#### Example usage 2
a5_pipeline.pl my_reads.1.fastq my_reads.2.fastq assembly.out
All output will be contained within the directory &lt;output base&gt;. The above example will produce the following directory and files:
#### Output files
Running the pipeline using either of the two examples will produce the following output:
assembly.out.ec.fastq.gz // Error corrected reads
assembly.out.contigs.fasta // Contigs
assembly.out.crude.scaffolds.fasta // Crude scaffolds: Not checked for misassemblies
assembly.out.broken.scaffolds.fasta // Broken scaffolds: Checked for misassemblies, but not rescaffolded
assembly.out.final.scaffolds.fasta // Final scaffolds: Checked for misassemblies, and rescaffolded
assembly.out.final.scaffolds.fastq // Final scaffolds in FastQ format with base call qualities
assembly.out.final.scaffolds.qvl // Quality values for the final scaffolds in QVL format for submission to NCBI
assembly.out.final.scaffolds.agp // AGP file describing scaffolding for submission to NCBI
assembly.out.assembly_stats.csv // A tab-separated file with assembly summary statistics
### Compute requirements
A5-miseq can assemble microbial genomes on modern laptop hardware, but will require substantial disk space for temporary files in the process. Most bacterial genomes can be assembled with less than 4GB RAM in a few hours. Several steps in the A5-miseq pipeline are parallelized and the extent of parallelism can be controlled with the `--threads` option to A5-miseq and the OMP_NUM_THREADS environment variable.
### If you are using a library file, please read the following
#### Making a library file
The library file must contain file paths to all libraries being assembled. Separate libraries are delimited in the library file by `[LIB]`. For a paired-end or mate-pair library, reads can be passed as two files, one for each end, or as a single shuffled (aka interleaved) file. Additionally, unpaired reads can be passed as a single file. For a single-end library, reads must be passed as a single file. Reads must be in FASTQ format. In addition to file locations, you can also give an insert size for paired reads. Please note that this is NOT necessary. The given number will only be used when _A5-miseq_ is not able to reliably calculate the insert size on its own.
Example library file:
[LIB]
p1=library1_p1.fastq
p2=library1_p2.fastq
up=library1_up.fastq
ins=350
[LIB]
shuf=library2.fastq
up=library2.unpaired.fq
ins=6500
[LIB]
up=unpaired_library.fastq
### Adapter sequences
A5-miseq screens input sequence reads for all standard Illumina genomic DNA library adapter sequences available as of April 2014. This includes screening for Illumina PE v1.0 adapters, TruSeq adapters, and Nextera (XT) adapters. If your libraries were prepared using alternative adapter sequences and these have not been trimmed from the data prior to assembly, it will be necessary to copy the adapter.fasta file provided with A5-miseq and edit it to include the appropriate adapter set. The path to the new adapter file can be provided with the `--adapter=` command line option when running `a5_pipeline.pl`.
### The assembly stats file
A5-miseq produces summary statistics on the assembly in a tab delimited file with the suffix `.assembly_stats.csv`. This file can be opened in a spreadsheet program such as Excel, Numbers, or LibreOffice Calc. If you have assembled many genomes in a particular directory or directories, the summary statistics from all of them can be combined into a single table with the following command:
`cat *.assembly_stats.csv | sort | uniq > all_assembly_stats.csv`
The resulting table will be in a file called `all_assembly_stats.csv`.
The columns of the assembly stats file are as follows:
Contigs // Number of contigs generated
Scaffolds // Number of scaffolds generated
Assembly size // Sum of lengths of all scaffolds in assembly
Longest Scaffold // Length of the longest scaffold
N50 // N50 value for the scaffolds (N50 length is defined as the length for which
// the collection of all scaffolds of that length or longer contains at
// least half of the total of the lengths of the scaffolds)
Raw reads // Number of raw reads used as input
EC reads // Number of reads used for assembly, after quality control and error correction
% reads passing EC // Percentage of the raw reads that the EC reads represent
Raw nt // Number of nucleotides used as input
EC nt // Number of nucleotides after quality control and error correction
% nt passing EC // Percentage of the raw nucleotides that the EC nucleotides represent
Raw cov // Average depth of sequence coverage provided by the raw data
Median cov // Median actual depth of coverage in the assembly
10th percentile cov // 10th percentile depth of coverage -- 90% of sites have greater coverage
bases >= Q40 // Number of bases that have a PHRED-scale quality greater or equal to Q40
Assembler version // Version used to run the assembly.
### Other notes
Various stages of the A5-miseq pipeline estimate assembly parameters from subsets of the input reads. This parameter estimation can vary from run to run, and can be dependent on the order of the input reads. Therefore it is possible that two runs of A5-miseq on the same data will not generate the exact same assembly.
>Prefix_NexteraV2/1
AGATGTGTATAAGAGACAG
>Prefix_NexteraV2/2
AGATGTGTATAAGAGACAG
>Trans1_NexteraV2
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
>Trans1_NexteraV2_rc
CTGTCTCTTATACACATCTGACGCTGCCGACGA
>Trans2_NexteraV2
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG
>Trans2_NexteraV2_rc
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
>NexteraV1_1
GCCTTGCCAGCCCGCTCAGAGATGTGTATAAGAGACAG
>NexteraV1_1_rc
CTGTCTCTTATACACATCTCTGAGCGGGCTGGCAAGGC
>NexteraV1_2
GCCTCCCTCGCGCCATCAGAGATGTGTATAAGAGACAG
>NexteraV1_2_rc
CTGTCTCTTATACACATCTCTGATGGCGCGAGGGAGGC
>PrefixPE_TruSeq3/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE_TruSeq3/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>PE1_TruSeq3
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PE1_TruSeq3_rc
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
>PE2_TruSeq3
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>PE2_TruSeq3_rc
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>PrefixPE_TruSeq2/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE_TruSeq2/2
TCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
>PCR_Primer2_TruSeq2
TCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
>PCR_Primer2_TruSeq2_rc
AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGA
>FlowCell1
TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACAC
>FlowCell2
TTTTTTTTTTCAAGCAGAAGACGGCATACGA
This diff is collapsed.
###############################################################
#Marten Boetzer 23-10-2010 #
#SSPACE perl subscript ExtendOrFormatContigs.pl #
#This script, based on the the -x parameter; #
# -Formats the contigs to appropriate format (-x 0) #
# -Extends the contigs with available unmapped reads (-x 1) #
###############################################################
use strict;
use File::Basename;
my ($MAX, $MAX_TOP, $TRACK_COUNT) = (0, 100, 1);
my $seplines = ("-" x 60)."\n";
my $contig = $ARGV[0];
my $base_name = $ARGV[1];
my $extending = $ARGV[2];
my $filecontig = $ARGV[3];
my $MIN_READ_LENGTH = $ARGV[4];
my $base_overlap = $ARGV[5];
my $min_overlap = $ARGV[6];
my $min_base_ratio = $ARGV[7];
my $max_trim = $ARGV[8];
my $verbose = $ARGV[9];
my $Bin = $ARGV[10];
my $outdir = $ARGV[11];
my $seed;
my $log = $outdir."/".$base_name . ".logfile.txt";
my $summaryfile = $outdir."/".$base_name.".summaryfile.txt";
open (SUMFILE, ">>$summaryfile") || die "Can't open $summaryfile -- fatal\n";
open (LOG, ">>$log") || die "Can't write to logfile$log -- fatal\n";
my $filenameOutExt = $base_name . ".singlereads.fasta";
&ExtendContigs($base_name, $filecontig, $filenameOutExt, $Bin) if($extending == 1);
&FormatContigs() if($extending == 0);
close SUMFILE;
close LOG;
#--------------------------------------------------
###EXTEND CONTIGS WITH UNMAPPED READS
sub ExtendContigs{
my ($base_name, $filecontig, $filenameOutExt, $Bin) = @_;
my ($set,$bin,$seq);
my $encoded = &encodeBases();
#-------------------------------------------------NOW MAP SINGLE READS TO INITIAL CONTIGS FILE.
my $outfile = $outdir . "/reads/" . $filenameOutExt;
my $lib = "start";
system("perl $Bin/bin/mapWithBowtie.pl $base_name $filecontig $outfile $lib $Bin $outdir");
#-------------------------------------------------READ UNMAPPED READS FOR EXTENSION
($set,$bin) = &readUnmappedReads($set,$bin,$base_name, $encoded);
#-------------------------------------------------CONTIG EXTENSION USING UNMAPPED PAIRS STORED IN $SET
&printMessage("\n=>".getDate().": Contig extension initiated\n");
open (TIG, ">$contig") || die "Can't write to $contig -- fatal\n";
#--------------------------------------------ASSEMBLY START
ASSEMBLY:
open(IN,$filecontig) || die "Can't open $filecontig -- fatal\n";
my ($exttig_count, $counter, $NCount, $orig_mer, $prevhead) = (0,0,0,0,'');
while(<IN>){
s/\r\n/\n/;
chomp;
$seq.= uc($_) if(eof(IN));
if (/\>(\S+)/ || eof(IN)){
my $head=$1;
$orig_mer = length($seq);
if($seq ne ''){
$NCount++ if($seq=~/([NX])/i);
my $start_sequence = uc($seq);
my $reads_needed = 1; #tracks coverage
my $total_bases = $orig_mer * $reads_needed;
($seq, $set, $bin, $reads_needed, $total_bases) = doExtension("3", $orig_mer, $seq, $set, $bin, $reads_needed, $total_bases, $min_overlap, $base_overlap, $min_base_ratio, $verbose, $counter, $max_trim, $encoded) if($orig_mer >= $MIN_READ_LENGTH && $orig_mer >= $min_overlap);
my $seqrc = reverseComplement($seq);
($seqrc, $set, $bin, $reads_needed, $total_bases) = doExtension("5", $orig_mer, $seqrc, $set, $bin, $reads_needed, $total_bases, $min_overlap, $base_overlap, $min_base_ratio, $verbose, $counter, $max_trim, $encoded) if($orig_mer >= $MIN_READ_LENGTH && $orig_mer >= $min_overlap);
my $leng = length($seqrc);
my $reversetig = reverseComplement($seqrc); ### return to sequence, as inputted
my $trimmed_length = length($start_sequence) - 2*($max_trim);
if($leng > $trimmed_length){ ### commented out: && $start_sequence ne $seqrc && $start_sequence ne $reversetig
my $cov = $total_bases / $leng;
printf TIG ">extcontig%i|size%i|read%i|cov%.2f|seed:$prevhead\n%s\n", ($counter,$leng,$reads_needed,$cov,$reversetig); #print contigs to file
$exttig_count++;
}else{
my $cov = $reads_needed = 0;
my $singlet_leng = length($start_sequence);
printf TIG ">contig%i|size%i|read%i|cov%.2f|seed:$prevhead\n%s\n", ($counter,$leng,$reads_needed,$cov,$start_sequence); #print singlets to file
}
}
CounterPrint(++$counter);
$prevhead = $head;
$seq='';
}else{
$seq .= uc($_);
}
}
CounterPrint(" ");
print SUMFILE "\tNumber of contig sequences =".($counter-1). "\n";
print SUMFILE "\t\tNumber of contigs containing N's (may prevent proper contig extension) = $NCount\n";
print SUMFILE "\tNumber of contigs extended = $exttig_count\n".$seplines;
close IN;
if($@){
my $message = $@;
&printMessage("\nSomething went wrong running $0 ".getDate()."\n$message\n");
}
close TIG;
}
###STORE CONTIGS TO APPROPRIATE FORMAT WHEN CONTIGS WILL NOT BE EXTENDED
sub FormatContigs{
&printMessage("\n=>".getDate().": Storing contigs to format for scaffolding\n");
open (TIG, ">$contig") || die "Can't write to $contig -- fatal\n";
open(IN,$filecontig) || die "Can't open $filecontig -- fatal\n";
my ($counter, $seq, $prevhead) = (0,'','');
while(<IN>){
s/\r\n/\n/;
chomp;
$seq.= uc($_) if(eof(IN));
if (/\>(\S+)/ || eof(IN)){
my $head=$1;
my $length_seq = length($seq);
printf TIG ">contig%i|size%i|read%i|cov%.2f|seed:$prevhead\n%s\n", ($counter,$length_seq,0,0.00,$seq) if($seq ne '');
CounterPrint(++$counter);
$prevhead = $head;
$seq = '';
}else{
$seq .= uc($_);
}
}
CounterPrint(" ");
print SUMFILE "CONTIG SUMMARY:\n".$seplines;
print SUMFILE "\tNumber of unique contig sequences =".($counter-1). "\n".$seplines;
close IN;
close TIG;
}
###EXTEND CONTIGS
sub doExtension{
my ($direction, $orig_mer, $seq, $set, $bin, $reads_needed, $total_bases, $min_overlap, $base_overlap, $min_base_ratio, $verbose, $tig_count, $max_trim, $e) = @_;
my $previous = $seq;
my ($extended, $trim_ct) = (1,0);
if($orig_mer > $MAX){$orig_mer=$MAX;} ### Deals with special cases where the seed sequences are different from the read set (and possibly very large) - goal here is not to increase sequence coverage of seed, but rather to extend it.
TRIM:
while($trim_ct <= $max_trim){
while($extended){
my ($pos,$current_reads,$current_bases,$span) = (0,0,0,"");
### Added 19March08
if(length($seq) >= $MAX){ # $seq is length of contig being extended -- if larger than largest read, make sure the largest read could align and all subsequent rds.
$span = $MAX - $TRACK_COUNT;
}else{
$span = length($seq) - $TRACK_COUNT;
}
my $startspan = $span;
my $overhang = {};
my @overlapping_reads = ();
for (my $x=1;$x <= ($orig_mer * 2);$x++){
($overhang->{$x}{'A'},$overhang->{$x}{'C'},$overhang->{$x}{'G'},$overhang->{$x}{'T'}) = (0,0,0,0);
}
### COLLECT SEQUENCES
while ($span >= $min_overlap){ # will slide the subseq, until the user-defined min overlap size
$pos = length($seq) - $span;
print "MAX:$MAX, SPAN:$span, POS:$pos" if ($verbose);
my $subseq = substr($seq, $pos, $span); #make a sub-sequence of length l-(1..i) for searching
my $sub = substr($subseq, 0, 10); #grab first 10 nucleotides and get all reads having this subset stored in $bin
my $subset = $bin->{$sub}; #Will grab everything even the reverse complement ones
print "####$direction' SEARCH Position:$pos Span:$span - Subseq:$subseq Previous:$previous\n" if ($verbose);
### SEARCH -- this cycles through limited k-mer space
foreach my $pass (sort {$subset->{$b} <=> $subset->{$a}} keys %$subset){
if($pass =~ /^$subseq([ACGT]*)/){
#can we align perfectly that subseq to another rd start?
my $dangle = $1;
print "\n", "=" x 80, "\n$direction'- FOUND sequence: $pass -> subset: $subseq -> overhang: $dangle\n", "=" x 80, "\n\n" if ($verbose);
# Collect all overhangs
push @overlapping_reads, $pass; ### all overlapping reads
my @over = split(//,$dangle);
my $ct_oh = 0;
foreach my $bz(@over){
$ct_oh++; ### tracks overhang position passed the seed
if(defined $set->{$pass}){
$overhang->{$ct_oh}{$bz} += $set->{$pass}; ### reflects read coverage (often real duplicates)
}else{
my $pass_rc = reverseComplement($pass);
$overhang->{$ct_oh}{$bz} += $set->{$pass_rc};
}
print "$ct_oh - $bz = $overhang->{$ct_oh}{$bz}\n" if($verbose);
}
}
}
$span--;
}#while overlap >= user-defined -m minimum
my $consensus = "";
print "Finished Collecting Overlapping Reads - BUILDING CONSENSUS...\n" if ($verbose);
print Dumper(@overlapping_reads) if ($verbose);
### Build consensus
CONSENSUS:
foreach my $ohpos (sort {$a<=>$b} keys %$overhang){
if($ohpos){
my $coverage = $overhang->{$ohpos}{'A'}+$overhang->{$ohpos}{'C'}+$overhang->{$ohpos}{'G'}+$overhang->{$ohpos}{'T'};
print "pos:$ohpos cov:$coverage A:$overhang->{$ohpos}{'A'} C:$overhang->{$ohpos}{'C'} G:$overhang->{$ohpos}{'G'} T:$overhang->{$ohpos}{'T'}\n" if($verbose);
if ($coverage < $base_overlap){
print "COVERAGE BELOW THRESHOLD: $coverage < -o $base_overlap @ $ohpos :: will extend by: $consensus\n" if ($verbose);
last CONSENSUS;
}
my $baselist = $overhang->{$ohpos};
my ($ct_dna, $previous_bz) = (0, "");
BASE:
foreach my $bz (sort {$baselist->{$b}<=>$baselist->{$a}} keys %$baselist){
if($ct_dna){## the two most abundant bases at that position
if($previous_bz ne "" && ($baselist->{$previous_bz} / $coverage) >= $min_base_ratio && $baselist->{$previous_bz} > $baselist->{$bz}){### a simple consensus btw top 2
$consensus .= $previous_bz; ### build consensus
print "Added base $previous_bz (cov = $baselist->{$previous_bz}) to $consensus **\n" if ($verbose);
last BASE;
}else{
print "ISSUES EXTENDING: best base = $previous_bz (cov=$baselist->{$previous_bz}) at $ohpos. Second-Best: $bz (cov=$baselist->{$bz}) (ratio best=$baselist->{$previous_bz} / total=$coverage) >= $min_base_ratio (-r) -- will terminate with $consensus\n" if($verbose);
last CONSENSUS;
}
}
$previous_bz = $bz;
$ct_dna++;
}
}
}
### deal with sequence reads making up the consensus/newly formed contig
if($consensus ne ""){
print "Will extend $seq\nwith: $consensus\n\n" if($verbose);
my $temp_sequence = $seq . $consensus; ## this is the contig extension
my $integral = 0;
my $position = length($temp_sequence) - ($startspan + length($consensus));
my $temp_sequence_end = substr($temp_sequence, $position);
foreach my $ro (@overlapping_reads){
my $result = index($temp_sequence_end, $ro);
if($temp_sequence_end =~ /$ro/){
my $complement_ro = reverseComplement($ro);
$integral=1;
if(defined $set->{$ro}){
$current_reads = $set->{$ro};
$current_bases = length($ro) * $current_reads;
$reads_needed += $current_reads;
$total_bases += $current_bases;
($bin,$set,$seed) = deleteData($bin,$set,$seed,$ro,$e)
}
if(defined $set->{$complement_ro}){
$current_reads = $set->{$complement_ro};
$current_bases = length($complement_ro) * $current_reads;
$reads_needed += $current_reads;
$total_bases += $current_bases;
($bin,$set,$seed) = deleteData($bin,$set,$seed,$complement_ro,$e);
}
}
}
if(! $integral){### no reads are found overlapping with the consensus might be indicative of low complexity regions -- Stop the extension
print "No overlapping reads agree with the consensus sequence. Stopping extension" if ($verbose);
$extended = 0;
}else{
$seq = $temp_sequence;
$temp_sequence = "";
print "New Contig is: $seq\n" if ($verbose);
$extended = 1;
}
$previous = $seq;
}else{### no consensus built, will stop the extension
$extended = 0;
}
}###while get the OK for extension
$trim_ct++;
if ($trim_ct <= $max_trim){
last TRIM if (length($seq) <= $MIN_READ_LENGTH); #terminate assembly if trimming becomes too agressive
$seq = substr($seq, 0, -1);
$extended = 1;
print "\n$direction prime EXTENSION ROUND $trim_ct COMPLETE UNTIL $max_trim nt TRIMMED OFF => TRIMMED SEQUENCE:$seq\n\n" if ($verbose);
}
}### while trimming within bounds
print "\n*** NOTHING ELSE TO BE DONE IN $direction prime- PERHAPS YOU COULD DECREASE THE MINIMUM OVERLAP -m (currently set to -m $min_overlap) ***\n\n" if ($verbose);
return $seq, $set, $bin, $reads_needed, $total_bases;
}
###DELETE READ DATA IF IT HAS BEEN USED FOR EXTENDING A CONTIG
sub deleteData {
my ($bin,$set,$seed,$sequence,$e) = @_;
my $subnor = substr($sequence, 0, 10);
my $comp_seq = reverseComplement($sequence);
my $subrv = substr($comp_seq, 0, 10);
#remove k-mer from hash table and prefix tree
delete $bin->{$subrv}{$comp_seq};
delete $bin->{$subnor}{$sequence};
delete $set->{$sequence};
return $bin, $set, $seed;
}
sub readUnmappedReads{
my ($set,$bin,$base_name, $encoded) = @_;
&printMessage("\n=>".getDate().": Reading unmapped reads to memory \n");
my $outfileExt = $base_name . ".start.unmapped";
my $file = $outdir."/"."bowtieoutput/$outfileExt";
my ($counter, $step) = (0, 100000);
my $e = $encoded;
if(-e $file){
open(INUNMAPPED,$file);
while(<INUNMAPPED>){
s/\r\n/\n/, chomp;
if(!(/\>(\S+)/)){
++$counter;
if($counter == $step){
CounterPrint($counter);
$step = $step + 100000;
}
my $orig = $_;
my $orig_mer = length($orig);
if ($orig ne '' && $orig_mer >= $MIN_READ_LENGTH && $orig_mer >= $min_overlap){
$_ = $orig;
tr/ACTG/TGAC/;
my $rc=reverse();
$MAX=$orig_mer if ($orig_mer > $MAX);
my $subrv = substr($rc, 0, 10);
my $subnor = substr($orig, 0, 10);
$set->{$orig}++;
$bin->{$subnor}{$orig}++;
$bin->{$subrv}{$rc}++;
}
$MAX = $MAX_TOP if ($MAX > $MAX_TOP);
}
}
}
else{
print "WARNING: NO SEQUENCES USED FOR EXTENSION\n\n";
}
close INUNMAPPED;
CounterPrint(" ");
print SUMFILE "CONTIG EXTENSION:\n".$seplines;
print SUMFILE "\tNumber of unmapped reads used for contig extension = $counter\n";
&FlushFiles();
return ($set,$bin);
}
#CONVERT EACH THREE NUCLEOTIDE CODON TO COMPACT FORMAT
sub encodeBases{
my $encoded;
my @pos1= ('A','C','G','T');
my @pos2 = @pos1;
my @pos3 = @pos1;
my $ascii_num = 33;
foreach my $p1 (@pos1){
foreach my $p2 (@pos2){
foreach my $p3 (@pos3){
my $codon = $p1.$p2.$p3;
my $char = chr($ascii_num);
$encoded->{$codon}=$char;
$ascii_num++;
}
}
}
return $encoded;
}
###FUNCTION TO REVERSE COMPLEMENT A SEQUENCE
sub reverseComplement{
$_ = shift;
tr/ATGC/TACG/;
return (reverse());
}
###PRINTS A COUNTER ON THE SCREEN AND OVERWRITES PREVIOUS LINE
sub CounterPrint{
my $countingMessager = shift;
print "\r$countingMessager";
$|++;
}
###FUNCTION TO PRINT MESSAGES TO THE SCREEN AND TO THE LOG FILE
sub printMessage{
my $message = shift;
print $message;
print LOG $message;
}
###FUNCTION TO GET THE CURRENT DATE
sub getDate{
my $date = scalar(localtime);
return $date;
}
###FLUSHES THE SUMMARY AND LOG FILE
sub FlushFiles{
select((select(SUMFILE), $| = 1)[0]);
select((select(LOG), $| = 1)[0]);
$|++;
}
#########END ExtendOrFormatContigs.pl
This diff is collapsed.
####################################################################
#Marten Boetzer 23-10-2010 #
#SSPACE perl subscript mapWithBowtie.pl #
#This script; #
# -Calls the external program Bowtie to map the reads to contigs #
####################################################################
use strict;
use File::Basename;
my $base_name = $ARGV[0];
my $contigFile = $ARGV[1];
my $singlereads = $ARGV[2];
my $library = $ARGV[3];
my $Bin = $ARGV[4];
my $outdir = $ARGV[5];
my $log = "$outdir/$base_name.logfile.txt";
open (LOG, ">>$log") || die "Can't write to $log -- fatal\n";
&MapReadsToContigs($base_name,$contigFile, $singlereads, $library);
close LOG;
#--------------------------------------------------
###MAP SINGLE READS TO CONTIGS WITH BOWTIE
sub MapReadsToContigs{
my ($base_name, $contigFile, $singlereads, $library) = @_;
#my $bowtieout = $outdir."/".$base_name . ".$library.bowtieIndex";
# Andrew Tritt 8/10/2011: bowtieout was constructed incorrectly
my $bowtieout = $base_name . ".$library.bowtieIndex";
my $bowbuildpath = "$Bin"."/bowtie/bowtie-build";
my $bowtiepath = "$Bin"."/bowtie/bowtie";
$bowtiepath =~ s/ /\\ /g;
$bowbuildpath =~ s/ /\\ /g;
#my $outfileExt = $outdir."/".$base_name . ".$library.unmapped";
#my $outfileNotExt = $outdir."/".$base_name . ".$library.mapped";
# Andrew Tritt 8/10/2011: outfileExt and outfileNotExt were constructed incorrectly. Same as bowtieout
my $outfileExt = $base_name . ".$library.unmapped";
my $outfileNotExt = $base_name . ".$library.mapped";
die "Contig file ($contigFile) not found. Exiting...\n" if(!(-e $contigFile));
&printMessage("\n=>".getDate().": Building Bowtie index for contigs ($contigFile)\n");
system("$bowbuildpath $contigFile $outdir/bowtieoutput/$bowtieout --quiet --noref") == 0 || die "\nBowtie-build error; $?"; # returns exit status values
die "Single read file ($singlereads) not found. Exiting...\n" if(!(-e $singlereads));
&printMessage("\n=>".getDate().": Mapping reads ($singlereads) to Bowtie index\n");
system("$bowtiepath -v 0 $outdir/bowtieoutput/$bowtieout --suppress 2,3,4,6,7 -f $singlereads $outdir/bowtieoutput/$outfileNotExt --un $outdir/bowtieoutput/$outfileExt --quiet --refidx") == 0 || die "\nBowtie error; $?" if($library eq "start"); # returns exit status values
system("$bowtiepath -v 0 -m 1 $outdir/bowtieoutput/$bowtieout --suppress 6,7 -f $singlereads $outdir/bowtieoutput/$outfileNotExt --quiet --refidx") == 0 || die "\nBowtie error; $?" if($library ne "start"); # returns exit status values
&FlushFiles();
}
###FLUSHES THE SUMMARY AND LOG FILE
sub FlushFiles{
select((select(LOG), $| = 1)[0]);
$|++;
}
###FUNCTION TO PRINT MESSAGES TO THE SCREEN AND TO THE LOG FILE
sub printMessage{
my $message = shift;
print $message;
print LOG $message;
}
###FUNCTION TO GET THE CURRENT DATE
sub getDate{
my $date = scalar(localtime);
return $date;
}
#########END MapWithBowtie.pl
#############################################################
#Marten Boetzer 23-11-2010 #
#SSPACE perl subscript readLibFiles.pl #
#This script; #
# -reads, converts and filters original input sequences #
#############################################################
use strict;
use Storable;
use File::Path;
use File::Basename;
my $seplines = ("-" x 60)."\n";
my $libraryfile = $ARGV[0];
my $base_name = $ARGV[1];
my $extending = $ARGV[2];
my $unpaired_file = $ARGV[3];
my $outdir = $ARGV[4];
my $log = $outdir."/".$base_name . ".logfile.txt";
my $summaryfile = $outdir."/".$base_name.".summaryfile.txt";
open (SUMFILE, ">>$summaryfile") || die "Can't open $summaryfile -- fatal\n";
open (LOG, ">>$log") || die "Can't write to $log -- fatal\n";
my $filenameOutFilt = "filtered.readpairs.fasta";
my $filenameOutExt = $base_name . ".singlereads.fasta";
open OUTFILEExt, "> $outdir/reads/$filenameOutExt" if($extending == 1);
#-------------------------------------------------READ UNPAIRED FILE CONTAINING SINGLE READS
&readUnpairedFile($unpaired_file) if ($unpaired_file);
#-------------------------------------------------LOOP THROUGH EACH LIBRARY IN LIBRARYFILE AND STORE AND FILTER READS
open(FILELIB, "< $libraryfile");
my ($library, $fileA, $fileB, $insert_size, $insert_stdev, $reverse);
my ($totcounter, $totNcount, $prevlibrary) = (0,0, "");
my ($filesA, $filesB, $inserts, $stdevs, $reverses);
while(<FILELIB>){
chomp;
($library, $fileA, $fileB, $insert_size, $insert_stdev, $reverse) = split(/\s+/, $_);
next if($library eq "");
my ($fileBaseName1, $dirName1, $fileExtension1) = fileparse($fileA);
my ($fileBaseName2, $dirName2, $fileExtension2) = fileparse($fileB);
if($library ne $prevlibrary){
if($prevlibrary ne ""){
my $filt = $totcounter-$totNcount;
print SUMFILE "READING READS $prevlibrary:\n";
print SUMFILE "$seplines\tTotal inserted pairs = $totcounter \n";
print SUMFILE "\tNumber of pairs containing N's = $totNcount \n\tRemaining pairs = $filt\n$seplines\n";
&printMessage("\n=>".getDate().": Reading, filtering and converting input sequences of library '$library' initiated\n");
($filesA, $filesB, $inserts, $stdevs, $reverses) = ("", "","","","");
&FlushFiles();
($totNcount, $totcounter) = (0,0);
close OUTSINGLEFILE;
}elsif($prevlibrary eq ""){
&printMessage("\n=>".getDate().": Reading, filtering and converting input sequences of library '$library' initiated\n");
}
open (OUTSINGLEFILE, ">$outdir/reads/$base_name.$library.filtered.readpairs.singles.fasta") || die "Can't write to single file -- fatal\n";
CounterPrint(" ");
}
print "Reading: $fileBaseName1 and $fileBaseName2...\n";
$filesA .= "$fileA | ";
$filesB .= "$fileB | ";
$inserts .= "$insert_size | ";
$stdevs .= "$insert_stdev | ";
$reverses .= "$reverse | ";
my ($counter2, $Ncount2) = &generateInputFiles($library, $fileA, $fileB, $extending, $reverse);
$totcounter += $counter2;
$totNcount += $Ncount2;
$prevlibrary = $library;
}
if($prevlibrary ne ""){
my $filt = $totcounter-$totNcount;
print SUMFILE "READING READS $prevlibrary:\n";
print SUMFILE "$seplines\tTotal inserted pairs = $totcounter \n";
print SUMFILE "\tNumber of pairs containing N's = $totNcount \n\tRemaining pairs = $filt\n$seplines\n";
}
&printMessage("\n$seplines");
close OUTSINGLEFILE;
close FILELIB;
close OUTFILEExt if($extending == 1);
close SUMFILE;
close LOG;
#--------------------------------------------------
###CONVERT INPUT SEQUENCES BY REMOVING PAIRED READS HAVING AN 'N' OR DUPLICATE PAIRS
sub generateInputFiles{
my ($lib, $fileA, $fileB, $extension, $reverse) = @_;
my ($combined, $name,$seq1,$seq2);
my ($counterext, $Ncount, $counter, $countsinglet, $fastq, $step) = (0,0,0,0,0,100000);
open(TEST, "< $fileA");
$name = <TEST>;
close TEST;
$fastq = 1 if ($name =~ /^[@]/);
open(FILEA, "< $fileA");
open(FILEB, "< $fileB");
while(<FILEA>) {
<FILEB>;
$seq1 = uc(<FILEA>), $seq1 =~ s/\r\n/\n/, chomp $seq1;
$seq2 = uc(<FILEB>), $seq2 =~ s/\r\n/\n/, chomp $seq2;
#FASTQ FORMAT
<FILEA>,<FILEA>,<FILEB>,<FILEB> if ($fastq);
# ELSE FASTA FORMAT
$seq1 = reverseComplement($seq1), $seq2 = reverseComplement($seq2) if($reverse);
if ($extension == 1){
if($seq1 =~ /^([ACGT]*)$/i){
print OUTFILEExt ">$counterext\n$seq1\n";
$counterext++;
}
if ($seq2 =~ /^([ACGT]*)$/i){
print OUTFILEExt ">$counterext\n$seq2\n";
$counterext++;
}
}
$combined = "$seq1:$seq2";
if(!($combined =~ /[N]/)){
++$countsinglet;
print OUTSINGLEFILE ">read$countsinglet\n$seq1\n";
print OUTSINGLEFILE ">read$countsinglet\n$seq2\n";
}
else{
$Ncount++;
}
++$counter;
if($counter == $step){
CounterPrint($counter);
$step = $step + 100000;
}
}
CounterPrint(" ");
close FILEA;
close FILEB;
return $counter, $Ncount;
}
#------------------READ UNPAIRED SINGLE READS FILE WHEN -u IS SET
sub readUnpairedFile{
my ($file) = @_;
open(INUNPAIRED, "< $file") || die "Can't open $file -- fatal\n";
&printMessage("\n=>Reading, filtering and converting unpaired input sequences initiated ".getDate()."\n");
my ($seq1, $name);
my ($counterext, $counter, $step) = (0,0, 100000);
while(<INUNPAIRED>) {
chomp;
$name = $_;
$seq1 = uc(<INUNPAIRED>); $seq1 =~ s/\r\n/\n/; chomp $seq1;
#FASTQ FORMAT
if ($name =~ /^[@]/){
<INUNPAIRED>; <INUNPAIRED>;
}
# ELSE FASTA FORMAT
if ($seq1 =~ /^([ACGT]*)$/i){
print OUTFILEExt ">$counterext\n$seq1\n";
$counterext++;
}
++$counter;
if($counter == $step){
CounterPrint($counter);
$step = $step + 100000;
}
}
CounterPrint(" ");
print SUMFILE "READING UNPAIRED READS:\n";
print SUMFILE "$seplines\tTotal inserted reads = $counter \n";
print SUMFILE "\tNumber of reads containing N's = ".($counter-$counterext)."\n\tRemaining reads = $counterext\n";
close INUNPAIRED;
}
###FUNCTION TO REVERSE COMPLEMENT A SEQUENCE
sub reverseComplement{
$_ = shift;
tr/ATGC/TACG/;
return (reverse());
}
###PRINTS A COUNTER ON THE SCREEN AND OVERWRITES PREVIOUS LINE
sub CounterPrint{
my $countingMessager = shift;
print "\r$countingMessager";
$|++;
}
###FUNCTION TO PRINT MESSAGES TO THE SCREEN AND TO THE LOG FILE
sub printMessage{
my $message = shift;
print $message;
print LOG $message;
}
###FUNCTION TO GET THE CURRENT DATE
sub getDate{
my $date = scalar(localtime);
return $date;
}
###FLUSHES THE SUMMARY AND LOG FILE
sub FlushFiles{
select((select(SUMFILE), $| = 1)[0]);
select((select(LOG), $| = 1)[0]);
$|++;
}
#########END readLibFiles.pl
# $Id: DotLib.pm,v 1.3 2003/02/24 17:33:00 mpop Exp $
#
# DotLib.pm - set of procedures for generating .dot files
#
# Copyright @ 2002, 2003, The Institute for Genomic Research (TIGR). All
# rights reserved.
=head1 Name
DotLib - library of routines for generating .dot files
=head1 Synopsis
use DotLib;
=head1 Description
A set of procedures used to create various .dot objects such as
file headers, file tails, components, nodes, edges, etc.
=cut
package DotLib;
use strict;
BEGIN {
use Exporter ();
use vars qw(@EXPORT @EXPORT_OK @ISA %EXPORT_TAGS);
@ISA = qw(Exporter);
@EXPORT = qw(&printHeader
&printFooter
&printNode
&printEdge
&startCluster
&endCluster
);
%EXPORT_TAGS = ();
@EXPORT_OK = ();
}
our $VERSION = '1.0';
our $REVISION = '$Revision: 1.3 $ ';
our $VERSION_STRING = "$VERSION ($REVISION)";
use vars @EXPORT;
use vars @EXPORT_OK;
=over 4
=item B<my $ret = printHeader($file, $type);>
Prints a .dot header for the type of output specified in the $type variable.
Allowable types are "printer", "plotter". If $type is undefined or not
passed, it generates a default header. Returns 1 upon successful
completion and 'undef' otherwise.
Example:
my $err = printHeader(\*STDOUT, "plotter");
=cut
sub printHeader
{
my $file = shift;
my $type = shift;
print $file "digraph ROOT {\n";
print $file " rankdir = LR\n";
print $file " orientation = landscape\n";
print $file " ranksep = 0.3\n";
print $file " nodesep = 0.3\n";
print $file " fontsize = 8\n";
print $file " margin = \".2,.2\"\n";
if ($type eq "printer"){
print $file " ratio = auto\n";
print $file " page = \"8.5,11\"\n";
} elsif ($type eq "plotter"){
print $file " ratio = auto\n";
print $file " page = \"36,48\"\n";
}
print $file "\n";
return 1;
} # printHeader
=item B<my $ret = printFooter($file);>
Prints a .dot footer (currently just a closed brace). Returns 1 upon
successful completion and 'undef' otherwise.
Example:
my $err = printFooter(\*STDOUT);
=cut
sub printFooter
{
my $file = shift;
print $file "}\n";
return 1;
} # printFooter
=item B<my $ret = printNode($file, $id, $label, $ori);>
Prints a "contig" node with the specified id, label, and orientation.
If orientation is 1 then the node is a forward facing arrow, otherwise
it is a backward facing arror. Returns 1 upon successful completion
and 'undef' otherwise.
Example:
my $err = printNode(\*STDOUT, $node_id, "$node_id ($node_len)", 1);
=cut
sub printNode
{
my $file = shift;
my $id = shift;
my $label = shift;
my $ori = shift;
my $angle;
$id =~ s/(\W)/_/g;
if ($ori == 1){
$angle = -90;
} else {
$angle = 90;
}
print $file " $id [ label = \"$label\" height = 0.2, fontsize = 8, shape = \"house\", orientation = $angle ]\n";
return 1;
} # printNode
=item B<my $ret = printEdge($file, $nodeA, $nodeB, $label, $style);>
Prints an edge between two nodes with the specified label. The style can
be any of the GraphViz acceptable styles ("dotted", "solid", "dashed",
"invis") or undefined in which case the default is used. Returns 1 upon
successful completion and 'undef' otherwise.
Example:
my $err = printEdge(\*STDOUT, $nodeA, $nodeB, "A to B", "invis");
=cut
sub printEdge
{
my $file = shift;
my $nodeA = shift;
my $nodeB = shift;
my $label = shift;
my $instyle = shift;
my $style;
$nodeA =~ s/(\W)/_/g;
$nodeB =~ s/(\W)/_/g;
if (defined $instyle){
$style = "style = \"" . $instyle . "\"";
if ($instyle eq "invis"){
$style .= " color = \"white\" ";
}
}
print $file " $nodeA -> $nodeB [ label =\"$label\" fontsize = 8 $style ]\n";
return 1;
} # printEdge
=item B<my $err = startCluster($file, $id, $label);>
Starts a cluster in the .dot output file with the given label and id.
Returns 1 upon successful completion and 'undef' otherwise.
Example:
my $err = startCluster(\*STDOUT, $clust_id, "first cluster");
=cut
sub startCluster
{
my $file = shift;
my $id = shift;
my $label = shift;
$id =~ s/(\W)/_/g;
print $file " subgraph cluster_$id {\n";
print $file " label = \"$label\"\n";
return 1;
} # startCluster
=item B<my $err = endCluster($file);>
Ends a cluster in the .dot output. Returns 1 upon successful
completion and 'undef' otherwise.
Example:
my $err = endCluster(\*STDOUT);
=cut
sub endCluster
{
my $file = shift;
print $file " }\n";
return 1;
} # endCluster
1;
;# getopts.pl - a better getopt.pl
#
# This library is no longer being maintained, and is included for backward
# compatibility with Perl 4 programs which may require it.
#
# In particular, this should not be used as an example of modern Perl
# programming techniques.
#
# Suggested alternatives: Getopt::Long or Getopt::Std
;# Usage:
;# do Getopts('a:bc'); # -a takes arg. -b & -c not. Sets opt_* as a
;# # side effect.
sub Getopts {
local($argumentative) = @_;
local(@args,$_,$first,$rest);
local($errs) = 0;
@args = split( / */, $argumentative );
while(@ARGV && ($_ = $ARGV[0]) =~ /^-(.)(.*)/) {
($first,$rest) = ($1,$2);
$pos = index($argumentative,$first);
if($pos >= 0) {
if($args[$pos+1] eq ':') {
shift(@ARGV);
if($rest eq '') {
++$errs unless(@ARGV);
$rest = shift(@ARGV);
}
eval "
push(\@opt_$first, \$rest);
if (!defined \$opt_$first or \$opt_$first eq '') {
\$opt_$first = \$rest;
}
else {
\$opt_$first .= ' ' . \$rest;
}
";
}
else {
eval "\$opt_$first = 1";
if($rest eq '') {
shift(@ARGV);
}
else {
$ARGV[0] = "-$rest";
}
}
}
else {
print STDERR "Unknown option: $first\n";
++$errs;
if($rest ne '') {
$ARGV[0] = "-$rest";
}
else {
shift(@ARGV);
}
}
}
$errs == 0;
}
1;
eliminate false alarm
--This line, and those below, will be ignored--
M getopts.pl
This diff is collapsed.
File added
#!/usr/bin/perl
### david.studholme@tsl.ac.uk
### modified for input from A5 assemblies and updated NCBI requirements for contigs: aaron.darling@uts.edu.au
### Bio::SeqIO requirement removed by Guillaume Jospin: gjospin@ucdavis.edu
### Generates contigs (in FastA) and scaffolding information (in AGP) from Velvet 'contigs.fa' supercontigs file
### Use entirely at you own risk!! There may be bugs!
use strict;
use warnings;
my $sequence_file = shift or die "Usage: $0 <sequence file>\n" ;
### Output file for contigs in Fasta format
my $fasta_outfile = "$sequence_file.contigs.fsa";
open (FILE, ">$fasta_outfile") and
warn "Will write contigs to file '$fasta_outfile' and AGP to STDOUT\n" or
die "Failed to write to file '$fasta_outfile'\n";
print "# Generated from assembly file $sequence_file using script $0\n";
my $i = 0;# a counter, used for generating unique contig names
my %sequences = ();
my %desc = ();
open(INSEQ, $sequence_file);
my $seq = "";
my $seqID = "";
my $seqDesc = "";
while (<INSEQ>) {
chomp($_);
if($_ =~ m/^>(\S+)\s?(.*)$/){
if ($seq ne ""){
$sequences{$seqID} = $seq;
$desc{$seqID}= $seqDesc;
}
$seqID = $1;
$seqDesc = $2;
$seq = "";
}
else{
$seq .= $_;
}
}
#adding the last sequence read.
$sequences{$seqID}=$seq;
$desc{$seqID}=$seqDesc;
close(INSEQ);
#print "Found : ".scalar(keys(%sequences))."\n";
foreach my $sID(keys %sequences){
my $supercontig_id = $sID ;
my $supercontig_seq = $sequences{$sID} ;
my $supercontig_desc = $desc{$sID} ;
my $supercontig_length = length($supercontig_seq);
### NCBI do not allow coverage and length information in the FastA identifier
### e.g. NODE_1160_length_397673_cov_14.469489 is an illegal FastA ID
### So we will replace these with simple numbers
if ($supercontig_id =~ m/NODE_(\d+)_length_\d+_cov_\d+/ or
$supercontig_id =~ m/^(\d+)$/) {
$supercontig_id = "scf_$1";
}
$supercontig_id =~ s/\|size\d+//g;
my $j = 1;# another counter, to generate part IDs
my $start_pos = 1; # keep track of whereabouts in this supercontig we are
my %substring_sequences;
## contigs < 200nt must be eliminated as NCBI does not accept these. Replace with N and merge to neighboring contig
my $subseqs_join;
foreach my $substring_sequence ( split /(N{10,})/i, $supercontig_seq ) {
if ( $substring_sequence =~ m/^N+$/i ) {
$subseqs_join .= $substring_sequence;
}elsif(length($substring_sequence) >= 200){
$subseqs_join .= $substring_sequence;
}else{
$subseqs_join .= "N" x length($substring_sequence);
}
}
# can't include this if the total amount of resolved sequence is < 200nt
my $nfree_seq = $subseqs_join;
$nfree_seq =~ s/N//g;
next if length($nfree_seq) < 200;
$subseqs_join =~ s/^N+//g;
$subseqs_join =~ s/N+$//g;
foreach my $substring_sequence ( split /(N{10,})/i, $subseqs_join ) {
### NB that NCBI do not allow gaps of fewer than 10 nucleotides between contigs.
### Gaps of fewer than 10 nucleotides are treated as ambiguities rather than gaps.
### So this split is a bit of a fudge.
#warn "\n$substring_sequence\n" if $supercontig_id eq '1160'; for #debugging only
### Define the AGP column contents
my $object1 = $supercontig_id;
my $object_beg2 = $start_pos;
my $object_end3 = $start_pos + length($substring_sequence) - 1;
my $part_number4 = $j;
my $component_type5;
my $component_id6a;
my $gap_length6b;
my $component_beg7a;
my $gap_type7b;
my $component_end8a;
my $linkage8b;
my $orientation9a;
my $filler9b;
$j++;
if ( $substring_sequence =~ m/^N+$/i ) {
### This is poly-N gap between contigs
$component_type5 = 'N';
$gap_length6b = length($substring_sequence);
$gap_type7b = 'scaffold';
$linkage8b = 'yes';
$filler9b = 'paired-ends';
} elsif ( $substring_sequence =~ m/^[ACGTRYSWKMBDHVN]+$/i ) {
### This is a contig
$i++; # a counter, used for generating unique contig names
$component_type5 = 'W';
$component_id6a = "contig_$i";
$component_beg7a = 1;
$component_end8a = length($substring_sequence);
$orientation9a = '+';
### Print FastA formatted contig
print FILE ">$component_id6a\n$substring_sequence\n";
} else {
die "Illegal characters in sequence $sID\n";#$substring_sequence\n";
}
$start_pos += length ($substring_sequence);
if ($component_type5 eq 'N') {
### print AGP line for gap
print "$object1\t$object_beg2\t$object_end3\t$part_number4\t$component_type5\t$gap_length6b\t$gap_type7b\t$linkage8b\t$filler9b\n";
} else {
### print AGP line for contig
print "$object1\t$object_beg2\t$object_end3\t$part_number4\t$component_type5\t$component_id6a\t$component_beg7a\t$component_end8a\t$orientation9a\n";
}
}
}