Compare revisions

Étienne Mollier · Étienne Mollier · Étienne Mollier · Étienne Mollier · Étienne Mollier · Étienne Mollier
--- a/README.md
+++ b/README.md
 Parsnp is a command-line-tool for efficient microbial core genome alignment and SNP detection. Parsnp was designed to work in tandem with Gingr, a flexible platform for visualizing genome alignments and phylogenetic trees; both Parsnp and Gingr form part of the Harvest suite :

- [Harvest project page](http://harvest.readthedocs.org)
-  -  url: http://harvest.readthedocs.org
-


 # Installation
@@ -12,7 +9,44 @@ Parsnp is available on the [Bioconda](https://bioconda.github.io/user/install.ht
 conda install parsnp
 ```

-## From source
+Instructions for building Parsnp from source are available towards the end of this README.
+
+# Running Parsnp
+Parsnp can be run multiple ways, but the most common is with a set of genomes and a reference. 
+```
+parsnp -g <reference_genbank> -d <genomes> 
+```
+```
+parsnp -r <reference_fasta> -d <genomes> 
+```
+For example, 
+```
+parsnp -r examples/mers_virus/ref/England1.fna -d examples/mers_virus/genomes/*.fna -o examples-out
+```
+
+## Partition mode
+Parsnp 2 will group query genomes up into random partitions of at least `--min-partition-size` genomes each (50 by default). Parsnp is then run independently on each group, and the resulting alignment of each group is merged into a single alignment of all input genomes. This limits the input size for an individual "core" Parsnp step, leading to significantly less memory and CPU usage. We've also shown, on simulated and empirical data, that this partitioning step often leads to increased core-genome size and better phylogenetic signal.
+
+The `--no-partition` flag allows users to run all query genomes at once.
+
+## Output files
+* `parsnp.xmfa` is the core-genome alignment.
+* `parsnp.ggr` is the compressed representation of the alignment generated by the harvest-toolkit. This file can be used to visualize alignments with Gingr.
+* `parsnp.snps.mblocks` is the core-SNP signature of each sequence in fasta format. This is the file which is used to generate `parsnp.tree`
+* `parsnp.tree` is the resulting phylogeny.
+* If run in partition mode, Parsnp will produce a `partition` folder in the output directory, which contains the output of each of the partitioned runs. 
+
+
+### XMFA format
+The output XMFA file contains a header section mapping contig names to indices. Following the header section, the LCBs/clusters are reported in the XMFA format, where the ID for each record in an LCB is formatted as:
+
+```
+[fileidx]:[concat_start]-[concat_end] [strand] cluster[x] s[contig_idx]:p[contig_pos]
+```
+
+The `concat_start` and `concat_end` values are internal to parsnp. The sequence for this record can be found in the file at index `fileidx` (these are declared at the top of the xmfa) on the `contig_idx`th contig starting at position `contig_pos`. 
+
+## Building from source 

 To build Parsnp from source, users must have automake 1.15, autoconf, and libtool installed. Parsnp also requires RaxML (or FastTree), Harvest-tools, and numpy. Some additional features require  pySPOA, Mash, FastANI, and Phipack. All of these packages are available via Conda (many on the Bioconda channel).

@@ -48,39 +82,14 @@ Note that the `parsnp` executable in `bin/` is not the same as the one in the ro
 ## OSX Users (Catalina)
 Recent OSX have a Gatekeeper, that's designed to ensure that only softwre from known developers runs on  tour Mac. Please refer to this link to enable the binaries shipped with Parsnp to run: https://support.apple.com/en-us/HT202491

-# Running Parsnp
-Parsnp can be run multiple ways, but the most common is with a set of genomes and a reference. 
-```
-parsnp -g <reference_genbank> -d <genomes> 
-```
-```
-parsnp -r <reference_fasta> -d <genomes> 
-```
-For example, 
-```
-parsnp -r examples/mers_virus/ref/England1.fna -d examples/mers_virus/genomes/*.fna -o examples-out
-```
-
-## Partition mode
-Parsnp 2 includes a new mode which can be activated with `--partition`. This mode randomly splits the input genomes up into groups of *p* genomes each, where *p* defaults to 50 and can be changed with `--partition-size=p`. Parsnp is then run independently on each group, and the resulting alignment of each group is merged into a single alignment of all input genomes. This mode is intended for large datasets, as it reduces the computational requirements. 
-
-```
-parsnp -r examples/mers_virus/ref/England1.fna -d examples/mers_virus/genomes/*.fna --partition --partition-size 10 -o examples-out-partitioned
-```
+## Misc

-More examples can be found in the [readthedocs tutorial](https://harvest.readthedocs.io/en/latest/content/parsnp/tutorial.html)
+CITATION provides details on how to cite Parsnp.

-## Output files
-* `parsnp.xmfa` is the core-genome alignment.
-* `parsnp.ggr` is the compressed representation of the alignment generated by the harvest-toolkit. This file can be used to visualize alignments with Gingr.
-* `parsnp.snps.mblocks` is the core-SNP signature of each sequence in fasta format. This is the file which is used to generate `parsnp.tree`
-* `parsnp.tree` is the resulting phylogeny.
-* If run in partition mode, Parsnp will produce a `partition` folder in the output directory, which contains the output of each of the partitioned runs. 
+LICENSE provides licensing information.


+- The [Harvest project page](http://harvest.readthedocs.org) provides examples of how to use Gingr and HarvestTools with Parsnp, however the Parsnp examples are not up-to-date with the current interface.

-## Misc

-CITATION provides details on how to cite Parsnp.

-LICENSE provides licensing information.
--- a/debian/changelog
+++ b/debian/changelog
+parsnp (2.0.5+dfsg-1) unstable; urgency=medium
+
+  * New upstream version 2.0.5+dfsg
+  * d/control: declare compliance to standards version 4.7.0.
+  * proper_calls_to_tools.patch: unfuzz.
+  * py3-parsnp-libs.patch: unfuzz.
+
+ -- Étienne Mollier <emollier@debian.org>  Wed, 24 Apr 2024 12:37:21 +0200
+
 parsnp (2.0.4+dfsg-1) unstable; urgency=medium

  * New upstream version

--- a/debian/control
+++ b/debian/control
@@ -10,7 +10,7 @@ Build-Depends: debhelper-compat (= 13),
               python3-setuptools,
               cython3,
               libmuscle-dev
-Standards-Version: 4.6.2
+Standards-Version: 4.7.0
 Vcs-Browser: https://salsa.debian.org/med-team/parsnp
 Vcs-Git: https://salsa.debian.org/med-team/parsnp.git
 Homepage: https://harvest.readthedocs.org/en/latest/content/parsnp.html

--- a/debian/patches/proper_calls_to_tools.patch
+++ b/debian/patches/proper_calls_to_tools.patch
@@ -5,7 +5,7 @@ Forwarded: not-needed

 --- parsnp.orig/parsnp
 +++ parsnp/parsnp
-@@ -146,7 +146,7 @@
+@@ -194,7 +194,7 @@
 def run_phipack(query,seqlen,workingdir):
     currdir = os.getcwd()
     os.chdir(workingdir)
@@ -14,7 +14,7 @@ Forwarded: not-needed
     run_command(command, 1)
     os.chdir(currdir)
 
-@@ -644,7 +644,7 @@
+@@ -696,7 +696,7 @@
         missing = True
         logger.critical("{} not in system path!".format(exe))
     if use_phipack:
@@ -23,7 +23,7 @@ Forwarded: not-needed
         if shutil.which(exe) is None:
             missing = True
             logger.critical("{} not in system path!".format(exe))
-@@ -659,7 +659,7 @@
+@@ -711,7 +711,7 @@
             logger.critical("No fasttree executable found in system path!".format(exe))
         missing = missing or (not has_fasttree)
     else:
@@ -32,7 +32,7 @@ Forwarded: not-needed
         if shutil.which(exe) is None:
             missing = True
             logger.critical("{} not in system path!".format(exe))
-@@ -976,7 +976,7 @@
+@@ -1029,7 +1029,7 @@
     logger.debug("Writing .ini file")
     if xtrafast or 1:
         args.extend = False
@@ -41,7 +41,7 @@ Forwarded: not-needed
     inifiled = inifiled.replace("$REF", ref)
     inifiled = inifiled.replace("$EXTEND", "%d" % (args.extend))
     inifiled = inifiled.replace("$ANCHORS", str(args.min_anchor_length))
-@@ -1075,7 +1075,7 @@
+@@ -1128,7 +1128,7 @@
     if not os.path.exists(inifile):
         logger.error("ini file %s does not exist!\n"%(inifile))
         sys.exit(1)
@@ -50,7 +50,7 @@ Forwarded: not-needed
     # with open(f"{outputDir}/parsnpAligner.out", 'w') as stdout_f, open(f"{outputDir}/parsnpAligner.err", 'w') as stderr_f:
         # rc = run_command(command, ignorerc=1, stdout=stdout_f, stderr=stderr_f, prepend_time=True)
     rc = run_logged_command(command=command, ignorerc=1, label="parsnp-aligner", outputDir=outputDir)
-@@ -1301,10 +1301,10 @@
+@@ -1352,10 +1352,10 @@
         logger.info("Recruiting genomes...")
         if use_parsnp_mumi:
             if not inifile_exists:
@@ -63,7 +63,7 @@ Forwarded: not-needed
             run_logged_command(command=command, outputDir=outputDir, label="parsnp-mumi")
             # Takes eeach sequence and computes its mumi distance to the reference
             try:
-@@ -1740,7 +1740,7 @@
+@@ -1798,7 +1798,7 @@
                 break
         if not use_fasttree:
             with TemporaryDirectory() as raxml_output_dir:

--- a/debian/patches/py3-parsnp-libs.patch
+++ b/debian/patches/py3-parsnp-libs.patch
@@ -36,4 +36,4 @@ This patch header follows DEP-3: http://dep.debian.net/deps/dep3/
 +import parsnp.extend as ext
 from tqdm import tqdm
 
- __version__ = "2.0.4"
+ __version__ = "2.0.5"
--- a/logger.py
+++ b/logger.py
@@ -15,7 +15,7 @@ RESET_SEQ = "\033[0m"
 COLOR_SEQ = "\033[1;%dm"
 BOLD_SEQ = "\033[1m"

-MIN_TQDM_INTERVAL=30
+MIN_TQDM_INTERVAL=5


 # Logging redirect copied from https://stackoverflow.com/questions/14897756/python-progress-bar-through-logging-module

--- a/parsnp
+++ b/parsnp
@@ -6,7 +6,7 @@
 '''

 import os, sys, string, random, subprocess, time, operator, math, datetime, numpy #pysam
-from collections import defaultdict
+from collections import defaultdict, namedtuple
 import shutil
 import multiprocessing
 import shlex
@@ -17,7 +17,7 @@ from logger import logger, TqdmToLogger, MIN_TQDM_INTERVAL
 import argparse
 import signal
 from multiprocessing import Pool
-from Bio import SeqIO
+from Bio import SeqIO, AlignIO
 from glob import glob
 from pathlib import Path

@@ -25,7 +25,7 @@ from pathlib import Path
 import extend as ext
 from tqdm import tqdm

-__version__ = "2.0.4"
+__version__ = "2.0.5"
 reroot_tree = True #use --midpoint-reroot
 random_seeded = random.Random(42)

@@ -141,6 +141,54 @@ def handler(signum, frame):

 signal.signal(signal.SIGINT, handler)

+def xmfa_to_maf(xmfa_path, maf_path, all_input_paths):
+    sample_delim = '#'
+    SeqInfo = namedtuple("SeqInfo", "cid seq_length")
+    hdr_block_pattern = re.compile("##SequenceIndex (\d+)\n##SequenceFile (.+)\n##SequenceHeader >\s*(\S+).*\n##SequenceLength (\d+)bp")
+    idx_to_fname = {}
+    ref_fname = ""
+    with open(xmfa_path) as xmfa_in:
+        next(xmfa_in) # skip version
+        line = next(xmfa_in)
+        seq_count = int(re.match("#SequenceCount (\d+)\n", line).groups()[0])
+        for i in range(seq_count):
+            info_block = ""
+            for _ in range(4):
+                info_block += next(xmfa_in)
+            parsed_info = hdr_block_pattern.match(info_block).groups()
+            fname = parsed_info[1]
+            if fname.endswith(".ref") and ref_fname == "":
+                fname = fname[:-4]
+                ref_fname = fname
+            idx_to_fname[parsed_info[0]] = fname
+
+    fname_to_info = defaultdict(list)
+    for f in all_input_paths:
+        fname = Path(f).name
+        if fname.endswith(".ref"):
+            fname = fname[:-4]
+        fname_to_info[fname] = [SeqInfo(rec.id.split(" ")[0], len(rec.seq)) for rec in SeqIO.parse(f, "fasta")]
+
+    seqid_pattern = re.compile(r'^cluster(\d+) s(\d+):p(\d+)')
+    with open(maf_path, 'w') as maf_out:
+        for aln in AlignIO.parse(xmfa_path, "mauve"):
+            for rec_idx, rec in enumerate(aln):
+                aln_len = rec.annotations["end"] - rec.annotations["start"]
+                cluster_idx, contig_idx, startpos = [int(x) for x in seqid_pattern.match(rec.id).groups()]
+                fname = idx_to_fname[rec.name]
+                if rec_idx == 0:
+                    maf_out.write(f"a cluster={cluster_idx}\n")
+                elif fname == ref_fname:
+                    continue
+                rec_info = fname_to_info[fname][contig_idx-1]
+                maf_out.write(f"s\t{'.'.join(fname.split('.')[:-1])}{sample_delim}{rec_info.cid}")
+                maf_out.write(f"\t{startpos}")
+                maf_out.write(f"\t{aln_len}")
+                maf_out.write(f"\t{'+' if rec.annotations['strand'] == 1 else '-'}")
+                maf_out.write(f"\t{rec_info.seq_length}")
+                maf_out.write(f"\t{rec.seq}\n")
+            maf_out.write("\n")
+    return

 #TODO Merge run fns
 def run_phipack(query,seqlen,workingdir):
@@ -430,14 +478,14 @@ def parse_args():

    partition_args = parser.add_argument_group("Sequence Partitioning")
    partition_args.add_argument(
-        "--partition",
+        "--no-partition",
        action='store_true',
-        help="Evenly split input sequences across separate runs of parsnp, then merge results")
+        help="Run all query genomes in single parsnp alignment, no partitioning.")
    partition_args.add_argument(
-        "--partition-size",
+        "--min-partition-size",
        type=int,
        default=50,
-        help="Number of sequences in a partitioned run of parsnp")
+        help="Minimum size of a partition. Input genomes will be split evenly across partitions at least this large.")

    extend_args = parser.add_argument_group("LCB Extension")
    extend_args.add_argument(
@@ -487,6 +535,10 @@ def parse_args():
        "--vcf",
        action = "store_true",
        help = "Generate VCF file.")
+    misc_args.add_argument(
+        "--no-maf",
+        action = "store_true",
+        help = "Do not generage MAF file (XMFA only)")
    misc_args.add_argument(
        "-p",
        "--threads",
@@ -893,6 +945,7 @@ def parse_input_files(input_files, curated, validate_input):
    '''
    # global ff, hdr, seq
    fnafiles = []
+    fname_to_info = defaultdict(list)
    for input_file in input_files[:]:
        # If biopython cannot parse the input as a fasta, kick it out of the list and move on:
        if validate_input:
@@ -1242,9 +1295,7 @@ SETTINGS:
        # fnafiles.remove(ref)

    #sort reference by largest replicon to smallest
-    if ref in fnafiles:
-        fnafiles = [f for f in fnafiles if f != ref]
-    elif sortem and os.path.exists(ref) and not autopick_ref:
+    if sortem and os.path.exists(ref) and not autopick_ref:
        sequences = SeqIO.parse(ref, "fasta")
        new_ref = os.path.join(outputDir, os.path.basename(ref)+".ref")
        SeqIO.write(sequences, new_ref, "fasta")
@@ -1453,13 +1504,15 @@ SETTINGS:

    finalfiles = sorted(finalfiles)
    random_seeded.shuffle(finalfiles)
-    totseqs = len(finalfiles)
+    totseqs = len(finalfiles) + 1 # Add one to include reference

    #initiate parallelPhiPack tasks
    run_recomb_filter = 0

    #3)run parsnp (cores, grid?)
-    if not args.partition:
+    if args.no_partition or len(finalfiles) < 2*args.min_partition_size:
+        if len(finalfiles) < 2*args.min_partition_size:
+            logger.info(f"Too few genomes to run partitions of size >{args.min_partition_size}. Running all genomes at once.")
        # Editing the ini file to be used for parnsp-aligner (which is different from parsnip as a mumi finder)
        if not inifile_exists:
            write_inifile_2(inifiled, outputDir, unaligned, args, auto_ref, query, finalfiles, ref, args.threads)
@@ -1620,16 +1673,15 @@ SETTINGS:

    else:
        import partition 
-
-        if len(finalfiles) % args.partition_size == 1:
-            logger.warning("Incrementing partition size by 1 to avoid having a remainder partition of size 1")
-            args.partition_size += 1
+        full_partitions = len(finalfiles) // args.min_partition_size
+        effective_partition_size = len(finalfiles) // full_partitions
+        logger.info(f"Setting the partition size to {effective_partition_size}")
        partition_output_dir = f"{outputDir}/partition"
        partition_list_dir = f"{partition_output_dir}/input-lists"
        os.makedirs(partition_list_dir, exist_ok=True)
-        for partition_idx in range(math.ceil(len(finalfiles) / args.partition_size)):
+        for partition_idx in range(math.ceil(len(finalfiles) / effective_partition_size)):
            with open(f"{partition_list_dir}/{partition.CHUNK_PREFIX}-{partition_idx:010}.txt", 'w') as part_out:
-                for qf in finalfiles[partition_idx*args.partition_size : (partition_idx+1)*args.partition_size]:
+                for qf in finalfiles[partition_idx*effective_partition_size : (partition_idx+1)*effective_partition_size]:
                    part_out.write(f"{qf}\n")

        chunk_label_parser = re.compile(f'{partition.CHUNK_PREFIX}-(.*).txt')
@@ -1704,6 +1756,12 @@ SETTINGS:
            for lcb in new_lcbs:
                partition.write_aln_to_xmfa(lcb, out_f)

+
+    # Generate MAF file
+    if not args.no_maf:
+        logger.debug(f"Writing MAF file to {Path(parsnp_output).with_suffix('.maf')}")
+        xmfa_to_maf(parsnp_output, Path(parsnp_output).with_suffix(".maf"), finalfiles + [ref])
+
    #add genbank here, if present
    if len(genbank_ref) != 0:
        rnc = f"harvesttools -q -o {outputDir}/parsnp.ggr -x {parsnp_output}"

--- a/src/parsnp.cpp
+++ b/src/parsnp.cpp
@@ -1957,7 +1957,7 @@ void Aligner::setMumi(TRegion r1, vector<TMum>& mums, bool anchors = true, bool
        
        time ( &start);
        if(anchors)
-            cerr << "        Calculting pairwise MUMi distances...\n";
+            cerr << "        Calculating pairwise MUMi distances...\n";
        
        
        FILE* mumifile;
No results found