Steffen Möller · Steffen Möller · 53ab0439 · 53ab0439 · 53ab0439 · 4016ff05
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to making participation in our project and
+our community a harassment-free experience for everyone, regardless of age, body
+size, disability, ethnicity, sex characteristics, gender identity and expression,
+level of experience, education, socio-economic status, nationality, personal
+appearance, race, religion, or sexual identity and orientation.
+
+## Our Standards
+
+Examples of behavior that contributes to creating a positive environment
+include:
+
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+
+Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery and unwelcome sexual attention or
+ advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic
+ address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+ professional setting
+
+## Our Responsibilities
+
+Project maintainers are responsible for clarifying the standards of acceptable
+behavior and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behavior.
+
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviors that they deem inappropriate,
+threatening, offensive, or harmful.
+
+## Scope
+
+This Code of Conduct applies both within project spaces and in public spaces
+when an individual is representing the project or its community. Examples of
+representing a project or community include using an official project e-mail
+address, posting via an official social media account, or acting as an appointed
+representative at an online or offline event. Representation of a project may be
+further defined and clarified by project maintainers.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the project team at chapmanb@fastmail.com. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.
+
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+[homepage]: https://www.contributor-covenant.org
+
+For answers to common questions about this code of conduct, see
+https://www.contributor-covenant.org/faq
--- a/HISTORY.md
+++ b/HISTORY.md
+## 1.1.8
+- Add `antibody` configuration option. Setting a specific antibody for ChIP-seq will use appropriate
+  settings for that antibody. See the documentation for supported antibodies.
+- Add `use_lowfreq_filter` for forcing vardict to report variants with low allelic frequency,
+ useful for calling somatic variants in panels with high coverage.
+- Fix for checking for pre-existing inputs with python3.
+- Add `keep_duplicates` option for ChIP/ATAC-seq which does not remove duplicates before peak calling.
+  Defaults to False.
+- Add `keep_multimappers` for ChIP/ATAC-seq which does not remove multimappers before peak calling.
+  Defaults to False.
+- Remove ethnicity as a required column in PED files.
+
+## 1.1.7 (10 October 2019)
+=======
+- hot fix for dataclasses not being supported in 3.6. Use namedtuple instead.
+
+## 1.1.6 (10 October 2019)
+
+- GATK ApplyBQSRSpark: avoid StreamClosed issue with GATK 4.1+
+- RNA-seq: fixes for cufflinks preparation due to python3 transition.
+- RNA-seq: output count tables from tximport for genes and transcripts. These
+  are in `bcbioRNASeq/results/date/genes/counts` and `bcbioRNASeq/results/data/transcripts/counts`.
+- qualimap (RNA-seq): disable stranded mode for qualimap, as it gives incorrect
+  results with the hisat2 aligner and for RNA-seq just setting it to unstranded
+- Add `quantify_genome_alignments` option to use genome alignments to quantify
+  with Salmon.
+- Add `--validateMappings` flag to Salmon read quantification mode.
+- VEP cache is not installing anymore from bcbio run
+- Add support for Salmon SA method when STAR alignments are not available 
+  (for hg38).
+- Add support for the new read model for filtering in Mutect2. This is
+  experimental, and a little flaky, so it can optionally be turned on via:
+  `tools_on: mutect2_readmodel`. Thanks to @lbeltrame for implementing this
+  feature and doing a ton of work debugging.
+- Swap pandas `from_csv` call to `read_csv`.
+- Make STAR respect the `transcriptome_gtf` option.
+- Prefix regular expression with r. Thanks to @smoe for finding all of these.
+- Add informative logging messages at beginning of bcbio run. Includes the version
+  and the configuration files being used.
+- Swap samtools mpileup to use bcftools mpileup as samtools mpileup is being 
+  deprecated (https://github.com/samtools/samtools/releases/tag/1.9).
+- Ensure locale is set to one supporting UTF-8 bcbio-wide. This may need to get
+  reverted if it introduces issues.
+- Added hg38 support for STAR. We did this by taking hg38 and removing the alts,
+  decoys and HLA sequences.
+- Added support for the arriba fusion caller.
+- Added back missing programs from the version provenance file. Fixed formatting
+  problems introduced by switch to python3.
+- Added initial support for whole genome bisulfite sequencing using bismark. Thanks to
+  @hackdna for implementing this and @jnhutchinson for drafting the initial
+  pipeline. This is a work in progress in collaboration with @gcampanella, who
+  has a similar implementation with some extra features that we will be merging
+  in soon.
+- qualimap for RNA-seq runs on the downsampled BAM files by default. Set 
+  `tools_on: [qualimap_full]` to run on the full BAM files.
+- Add STAR junction files to the files captured at the end of a run.
+  
 ## 1.1.5 (12 April 2019)

 - Fixes for Python3 incompatibilities on distributed IPython runs.

--- a/README.rst
+++ b/README.rst
@@ -123,6 +123,7 @@ Contributors
 - `Roman Valls`_, Science for Life Laboratory, Stockholm
 - `Kevin Ying`_, Garvan Institute of Medical Research, Sydney, Australia
 - `Vlad Saveliev`_, Center for Algorithmic Biotechnology, St. Petersburg University
+- `Sergey Naumenko`_, Harvard Chan Bioinformatics Core


 .. _Miika Ahdesmaki: https://github.com/mjafin
@@ -146,6 +147,7 @@ Contributors
 .. _Matt Edwards: https://github.com/matted
 .. _Saket Choudhary: https://github.com/saketkc
 .. _Vlad Saveliev: https://github.com/vladsaveliev
+.. _Sergey Naumenko: https://github.com/naumenko-sa

 License
 -------

--- a/artwork/logo.png
+++ b/artwork/logo.png
--- a/bcbio/bam/__init__.py
+++ b/bcbio/bam/__init__.py
@@ -514,7 +514,7 @@ def estimate_max_mapq(in_bam, nreads=1e6):
    """Guess maximum MAPQ in a BAM file of reads with alignments
    """
    with pysam.Samfile(in_bam, "rb") as work_bam:
-        reads = tz.take(nreads, work_bam)
+        reads = tz.take(int(nreads), work_bam)
        return max([x.mapq for x in reads if not x.is_unmapped])

 def convert_cufflinks_mapq(in_bam, out_bam=None):
@@ -561,3 +561,21 @@ def convert_invalid_mapq(in_bam, out_bam=None):
                    read.mapq = VALIDMAPQ
                out_bam_fh.write(read)
    return out_bam
+
+def remove_duplicates(in_bam, data):
+    """
+    remove duplicates from a duplicate marked BAM file
+    """
+    base, ext = os.path.splitext(in_bam)
+    out_bam = base + "-noduplicates" + ext
+    if utils.file_exists(out_bam):
+        return out_bam
+    num_cores = dd.get_num_cores(data)
+    sambamba = config_utils.get_program("sambamba", data)
+    with file_transaction(out_bam) as tx_out_bam:
+        cmd = (f'{sambamba} view -h --nthreads {num_cores} -f bam -F "not duplicate" '
+               f'{in_bam} > {tx_out_bam}')
+        message = f"Removing duplicates from {in_bam}, saving as {out_bam}."
+        do.run(cmd, message)
+    index(out_bam, dd.get_config(data))
+    return out_bam
--- a/bcbio/bam/coverage.py
+++ b/bcbio/bam/coverage.py
@@ -41,7 +41,7 @@ def _calc_regional_coverage(in_bam, chrom, start, end, samplename, work_dir, dat
           "{bedtools} coverage -a {region_file} -b - -d > {tx_tmp_file}")
    do.run(cmd.format(**locals()), "Plotting coverage for %s %s" % (samplename, coords))
    names = ["chom", "start", "end", "offset", "coverage"]
-    df = pd.io.parsers.read_table(tx_tmp_file, sep="\t", header=None,
+    df = pd.io.parsers.read_csv(tx_tmp_file, sep="\t", header=None,
                                  names=names).dropna()
    os.remove(tx_tmp_file)
    df["sample"] = samplename

--- a/bcbio/bam/fastq.py
+++ b/bcbio/bam/fastq.py
@@ -221,8 +221,9 @@ def is_fastq(in_file, bzip=True):
    else:
        return False

-def downsample(f1, f2, data, N, quick=False):
-    """ get N random headers from a fastq file without reading the
+
+def downsample(f1, f2, N, quick=False):
+    """Get N random headers from a fastq file without reading the
    whole thing into memory
    modified from: http://www.biostars.org/p/6544/
    quick=True will just grab the first N reads rather than do a true
@@ -231,9 +232,9 @@ def downsample(f1, f2, data, N, quick=False):
    if quick:
        rand_records = range(N)
    else:
-        records = sum(1 for _ in open(f1)) / 4
+        records = int(sum(1 for _ in open_possible_gzip(f1)) / 4)
        N = records if N > records else N
-        rand_records = sorted(random.sample(xrange(records), N))
+        rand_records = sorted(random.sample(range(records), N))

    fh1 = open_possible_gzip(f1)
    fh2 = open_possible_gzip(f2) if f2 else None
@@ -275,6 +276,7 @@ def downsample(f1, f2, data, N, quick=False):

    return outf1, outf2

+
 def estimate_read_length(fastq_file, quality_format="fastq-sanger", nreads=1000):
    """
    estimate average read length of a fastq file

--- a/bcbio/broad/__init__.py
+++ b/bcbio/broad/__init__.py
@@ -72,7 +72,7 @@ def _clean_java_out(version_str):
    Java will report information like _JAVA_OPTIONS environmental variables in the output.
    """
    out = []
-    for line in version_str.decode().split("\n"):
+    for line in version_str.split("\n"):
        if line.startswith("Picked up"):
            pass
        if line.find("setlocale") > 0:
@@ -94,7 +94,8 @@ def get_gatk_version(gatk_jar=None, config=None):
        cl = gatk_cmd("gatk", ["-Xms128m", "-Xmx256m"] + get_default_jvm_opts(), ["-version"], config=config)
    with closing(subprocess.Popen(cl, stdout=subprocess.PIPE,
                                  stderr=subprocess.STDOUT, shell=True).stdout) as stdout:
-        out = _clean_java_out(stdout.read().strip())
+        stdout = stdout.read().decode().strip()
+        out = _clean_java_out(stdout)
        # Historical GATK version (2.4) and newer versions (4.1.0.0)
        # have a flag in front of output version
        version = out
@@ -113,7 +114,8 @@ def get_mutect_version(mutect_jar):
    """
    cl = ["java", "-Xms128m", "-Xmx256m"] + get_default_jvm_opts() + ["-jar", mutect_jar, "-h"]
    with closing(subprocess.Popen(cl, stdout=subprocess.PIPE, stderr=subprocess.STDOUT).stdout) as stdout:
-        if "SomaticIndelDetector" in stdout.read().strip():
+        stdout = stdout.read().decode().strip()
+        if "SomaticIndelDetector" in stdout:
            mutect_type = "-appistry"
        else:
            mutect_type = ""
@@ -243,7 +245,7 @@ class BroadRunner:
        cl += ["--version"]
        p = subprocess.Popen(cl, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        # fix for issue #494
-        pat = re.compile('([\d|\.]*)(\(\d*\)$)')  # matches '1.96(1510)'
+        pat = re.compile(r'([\d|\.]*)(\(\d*\)$)')  # matches '1.96(1510)'
        m = pat.search(p.stdout.read())
        version = m.group(1)
        self._picard_version = version
@@ -258,7 +260,7 @@ class BroadRunner:
                return True
            else:
                try:
-                    stdout = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True)
+                    stdout = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True, encoding="UTF-8")
                    return stdout.find("GATK jar file not found") == -1
                except subprocess.CalledProcessError:
                    return False

--- a/bcbio/chipseq/__init__.py
+++ b/bcbio/chipseq/__init__.py
 import os
+import subprocess
 import sys
 import toolz as tz
 from bcbio import utils
@@ -9,11 +10,33 @@ from bcbio.ngsalign import bowtie2, bwa
 from bcbio.distributed.transaction import file_transaction
 from bcbio.provenance import do
 from bcbio.log import logger
+from bcbio.heterogeneity.chromhacks import get_mitochondrial_chroms

 def clean_chipseq_alignment(data):
+    # lcr_bed = utils.get_in(data, ("genome_resources", "variation", "lcr"))
+    work_bam = dd.get_work_bam(data)
+    clean_bam = remove_nonassembled_chrom(work_bam, data)
+    if not dd.get_keep_multimapped(data):
+        clean_bam = remove_multimappers(clean_bam, data)
+    if not dd.get_keep_duplicates(data):
+        clean_bam = bam.remove_duplicates(clean_bam, data)
+    data["work_bam"] = clean_bam
+    encode_bed = tz.get_in(["genome_resources", "variation", "encode_blacklist"], data)
+    if encode_bed:
+        data["work_bam"] = remove_blacklist_regions(dd.get_work_bam(data), encode_bed, data['config'])
+        bam.index(data["work_bam"], data['config'])
+    try:
+        data["bigwig"] = _normalized_bam_coverage(dd.get_sample_name(data),
+                                                  dd.get_work_bam(data), data)
+    except subprocess.CalledProcessError:
+        logger.warning(f"{dd.get_work_bam(data)} was too sparse to normalize, "
+                       f" falling back to non-normalized coverage.")
+        data["bigwig"] = _bam_coverage(dd.get_sample_name(data),
+                                       dd.get_work_bam(data), data)
+    return [[data]]
+
+def remove_multimappers(bam_file, data):
    aligner = dd.get_aligner(data)
-    data["align_bam"] = dd.get_work_bam(data)
-    if dd.get_mark_duplicates(data):
    if aligner:
        if aligner == "bowtie2":
            filterer = bowtie2.filter_multimappers
@@ -22,24 +45,19 @@ def clean_chipseq_alignment(data):
        else:
            logger.error("ChIP-seq only supported for bowtie2 and bwa.")
            sys.exit(-1)
-            unique_bam = filterer(dd.get_work_bam(data), data)
-            data["work_bam"] = unique_bam
+        unique_bam = filterer(bam_file, data)
    else:
-            logger.info("Warning: When BAM file is given as input, bcbio skips multimappers removal."
-                        "If BAM is not cleaned for peak calling, can result in downstream errors.")
-    # lcr_bed = utils.get_in(data, ("genome_resources", "variation", "lcr"))
-    data["work_bam"] = _keep_assembled_chrom(dd.get_work_bam(data), dd.get_ref_file(data),
-                                             data["config"])
-    encode_bed = tz.get_in(["genome_resources", "variation", "encode_blacklist"], data)
-    if encode_bed:
-        data["work_bam"] = _prepare_bam(dd.get_work_bam(data), encode_bed, data['config'])
-        bam.index(data["work_bam"], data['config'])
-    data["bigwig"] = _bam_coverage(dd.get_sample_name(data), dd.get_work_bam(data), data)
-    return [[data]]
+        unique_bam = bam_file
+        logger.warn("When a BAM file is given as input, bcbio skips removal of "
+                    "multimappers.")
+    logger.warn("ChIP/ATAC-seq usually requires duplicate marking, but it was disabled.")
+    return unique_bam

-def _keep_assembled_chrom(bam_file, genome, config):
-    """Remove contigs from the BAM file"""
-    fai = "%s.fai" % genome
+def remove_nonassembled_chrom(bam_file, data):
+    """Remove non-assembled contigs from the BAM file"""
+    ref_file =  dd.get_ref_file(data)
+    config = dd.get_config(data)
+    fai = "%s.fai" % ref_file
    chrom = []
    with open(fai) as inh:
        for line in inh:
@@ -56,9 +74,8 @@ def _keep_assembled_chrom(bam_file, genome, config):
        bam.index(out_file, config)
    return out_file

-
-def _prepare_bam(bam_file, bed_file, config):
-    """Remove regions from bed files"""
+def remove_blacklist_regions(bam_file, bed_file, config):
+    """Remove blacklist regions from a BAM file"""
    if not bam_file or not bed_file:
        return bam_file
    out_file = utils.append_stem(bam_file, '_filter')
@@ -71,7 +88,31 @@ def _prepare_bam(bam_file, bed_file, config):

 def _bam_coverage(name, bam_input, data):
    """Run bamCoverage from deeptools"""
-    cmd = ("{bam_coverage} -b {bam_input} -o {bw_output} "
+    cmd = ("{bam_coverage} --bam {bam_input} --outFileName {bw_output} "
+          "--binSize 20 --effectiveGenomeSize {size} "
+          "--smoothLength 60 --extendReads 150 --centerReads -p {cores} ")
+    size = bam.fasta.total_sequence_length(dd.get_ref_file(data))
+    cores = dd.get_num_cores(data)
+    try:
+        bam_coverage = config_utils.get_program("bamCoverage", data)
+    except config_utils.CmdNotFound:
+        logger.info("No bamCoverage found, skipping bamCoverage.")
+        return None
+    resources = config_utils.get_resources("bamCoverage", data["config"])
+    if resources:
+        options = resources.get("options")
+        if options:
+            cmd += " %s" % " ".join([str(x) for x in options])
+    bw_output = os.path.join(os.path.dirname(bam_input), "%s.bw" % name)
+    if utils.file_exists(bw_output):
+        return bw_output
+    with file_transaction(bw_output) as out_tx:
+        do.run(cmd.format(**locals()), "Run bamCoverage in %s" % name)
+    return bw_output
+
+def _normalized_bam_coverage(name, bam_input, data):
+    """Run bamCoverage from deeptools but produce normalized bigWig files"""
+    cmd = ("{bam_coverage} --bam {bam_input} --outFileName {bw_output} "
          "--binSize 20 --effectiveGenomeSize {size} "
          "--smoothLength 60 --extendReads 150 --centerReads -p {cores} ")
    size = bam.fasta.total_sequence_length(dd.get_ref_file(data))
@@ -81,6 +122,12 @@ def _bam_coverage(name, bam_input, data):
    except config_utils.CmdNotFound:
        logger.info("No bamCoverage found, skipping bamCoverage.")
        return None
+    method = dd.get_chip_method(data)
+    cmd += "--normalizeUsing CPM "
+    toignore = get_mitochondrial_chroms(data)
+    if toignore:
+        ignorenormflag = f"--ignoreForNormalization {' '.join(toignore)} "
+        cmd += ignorenormflag
    resources = config_utils.get_resources("bamCoverage", data["config"])
    if resources:
        options = resources.get("options")
@@ -92,3 +139,35 @@ def _bam_coverage(name, bam_input, data):
    with file_transaction(bw_output) as out_tx:
        do.run(cmd.format(**locals()), "Run bamCoverage in %s" % name)
    return bw_output
+
+def _compute_deeptools_matrix(data):
+    pass
+
+def extract_NF_regions(data):
+    """
+    extract the nucleosome free regions from the work_bam. These regions will
+    be < 100 bases
+    """
+    MAX_FRAG_LENGTH = 100
+    sieve = config_utils.get_program("alignmentSieve", data)
+    work_bam = dd.get_work_bam(data)
+    num_cores = dd.get_num_cores(data)
+    out_file = os.path.splitext(work_bam)[0] + "-NF.bam"
+    log_file = os.path.splitext(work_bam)[0] + "-NF.log"
+    if file_exists(out_file):
+        data["NF_bam"] = out_file
+        return data
+
+    with file_transaction(out_file) as tx_out_file, \
+         file_transaction(log_file) as tx_log_file:
+        tx_unsorted_bam = tx_out_file + ".unsorted"
+        cmd = (
+            f"{sieve} --bam ${work_bam} --outFile {tx_unsorted_bam} --ATACshift "
+            f"--numberOfProcessors {num_cores} --maxFragmentLength {MAX_FRAG_LENGTH} "
+            f"--minMappingQuality 10 "
+            f"--filterMetrics {tx_log_file} ")
+        do.run(cmd, "Extract NF regions from {work_bam} to {tx_unsorted_bam}.")
+        tx_out_file = bam.sort(tx_unsorted_bam)
+
+    data["NF_bam"] = out_file
+    return data
--- a/bcbio/chipseq/antibodies.py
+++ b/bcbio/chipseq/antibodies.py
+#from dataclasses import dataclass
+from collections import namedtuple
+
+VALID_PEAKTYPES = ["narrow", "broad"]
+
+# @dataclass
+# class Antibody:
+#     """
+#     ChIP-seq antibody
+#     """
+#     name: str
+#     # call narrow or broad peaks
+#     peaktype: str
+#     # remove duplicates?
+#     rmdup: bool = True
+
+#     def __post_init__(self):
+#         if self.peaktype not in VALID_PEAKTYPES:
+#             raise TypeError(f"peaktype {self.peatktype} is not one of {VALID_PEAKTYPES}")
+
+Antibody = namedtuple('Antibody', 'name peaktype rmdup')
+
+_ANTIBODIES = [
+    Antibody("h3f3a", "broad", True),
+    Antibody("h3k27me3", "broad", True),
+    Antibody("h3k36me3", "broad", True),
+    Antibody("h3k4me1", "broad", True),
+    Antibody("h3k79me2", "broad", True),
+    Antibody("h3k79me3", "broad", True),
+    Antibody("h3k9me1", "broad", True),
+    Antibody("h3k9me2", "broad", True),
+    Antibody("h4k20me1", "broad", True),
+    Antibody("h2afz", "narrow", True),
+    Antibody("h3ac", "narrow", True),
+    Antibody("h3k27ac", "narrow", True),
+    Antibody("h3k4me2", "narrow", True),
+    Antibody("h3k4me3", "narrow", True),
+    Antibody("h3k9ac", "narrow", True),
+    Antibody("h3k9me3", "broad", False)
+]
+
+ANTIBODIES = {x.name: x for x in _ANTIBODIES}
+SUPPORTED_ANTIBODIES = {x.name for x in _ANTIBODIES}
--- a/bcbio/chipseq/macs2.py
+++ b/bcbio/chipseq/macs2.py
@@ -7,6 +7,8 @@ from bcbio.provenance import do
 from bcbio.pipeline import config_utils
 from bcbio.pipeline import datadict as dd
 from bcbio import bam
+from bcbio.chipseq import antibodies
+from bcbio.log import logger

 def run(name, chip_bam, input_bam, genome_build, out_dir, method, resources, data):
    """
@@ -18,15 +20,22 @@ def run(name, chip_bam, input_bam, genome_build, out_dir, method, resources, dat
    out_file = os.path.join(out_dir, name + "_peaks_macs2.xls")
    macs2_file = os.path.join(out_dir, name + "_peaks.xls")
    if utils.file_exists(out_file):
-        _compres_bdg_files(out_dir)
+        _compress_bdg_files(out_dir)
        return _get_output_files(out_dir)
    macs2 = config_utils.get_program("macs2", config)
+    antibody = antibodies.ANTIBODIES.get(dd.get_antibody(data).lower(), None)
+    if antibody:
+        logger.info(f"{antibody.name} specified, using {antibody.peaktype} peak settings.")
+        peaksettings = select_peak_parameters(antibody)
+    else:
+        peaksettings = ""
    options = " ".join(resources.get("macs2", {}).get("options", ""))
    genome_size = bam.fasta.total_sequence_length(dd.get_ref_file(data))
    genome_size = "" if options.find("-g") > -1 else "-g %s" % genome_size
    paired = "-f BAMPE" if bam.is_paired(chip_bam) else ""
    with utils.chdir(out_dir):
-        cmd = _macs2_cmd(method)
+        cmd = _macs2_cmd(data)
+        cmd += peaksettings
        try:
            do.run(cmd.format(**locals()), "macs2 for %s" % name)
            utils.move_safe(macs2_file, out_file)
@@ -37,7 +46,7 @@ def run(name, chip_bam, input_bam, genome_build, out_dir, method, resources, dat
                                 "You can add specific options for the sample "
                                 "setting resources as explained in docs: "
                                 "https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#sample-specific-resources")
-    _compres_bdg_files(out_dir)
+    _compress_bdg_files(out_dir)
    return _get_output_files(out_dir)

 def _get_output_files(out_dir):
@@ -52,20 +61,29 @@ def _get_output_files(out_dir):
            break
    return {"main": peaks, "macs2": fns}

-def _compres_bdg_files(out_dir):
+def _compress_bdg_files(out_dir):
    for fn in glob.glob(os.path.join(out_dir, "*bdg")):
        cmd = "gzip  %s" % fn
        do.run(cmd, "compress bdg file: %s" % fn)

-def _macs2_cmd(method="chip"):
+def _macs2_cmd(data):
    """Main command for macs2 tool."""
+    method = dd.get_chip_method(data)
    if method.lower() == "chip":
        cmd = ("{macs2} callpeak -t {chip_bam} -c {input_bam} {paired} "
-                " {genome_size} -n {name} -B {options}")
+               "{genome_size} -n {name} --bdg {options} ")
    elif method.lower() == "atac":
        cmd = ("{macs2} callpeak -t {chip_bam} --nomodel "
-               " {paired} {genome_size} -n {name} -B {options}"
+               " {paired} {genome_size} -n {name} --bdg {options}"
               " --nolambda --keep-dup all")
    else:
        raise ValueError("chip_method should be chip or atac.")
    return cmd
+
+def select_peak_parameters(antibody):
+    if antibody.peaktype == "broad":
+        return " --broad --broad-cutoff 0.05"
+    elif antibody.peaktype == "narrow":
+        return ""
+    else:
+        raise ValueError(f"{antibody.peaktype} not recognized.")
--- a/bcbio/chipseq/peaks.py
+++ b/bcbio/chipseq/peaks.py
@@ -109,7 +109,7 @@ def greylisting(data):
    """
    input_bam = data.get("work_bam_input", None)
    if not input_bam:
-        logger.info("No input BAM file detected, skipping greylisting.")
+        logger.info("No input control BAM file detected, skipping greylisting.")
        return None
    try:
        greylister = config_utils.get_program("chipseq-greylist", data)

--- a/bcbio/distributed/ipythontasks.py
+++ b/bcbio/distributed/ipythontasks.py
@@ -8,7 +8,7 @@ try:
 except ImportError:
    from IPython.parallel import require

-from bcbio import heterogeneity, hla, chipseq, structural, upload
+from bcbio import heterogeneity, hla, chipseq, structural, upload, utils
 from bcbio.bam import callable
 from bcbio.rnaseq import (sailfish, rapmap, salmon, umi, kallisto, spikein,
                          bcbiornaseq)
@@ -17,6 +17,7 @@ from bcbio.ngsalign import alignprep
 from bcbio.srna import sample as srna
 from bcbio.srna import group as seqcluster
 from bcbio.chipseq import peaks
+from bcbio.wgbsseq import cpg_caller, deduplication, trimming
 from bcbio.pipeline import (archive, config_utils, disambiguate, sample,
                            qcsummary, shared, variation, run_info, rnaseq)
 from bcbio.provenance import system
@@ -29,9 +30,10 @@ from bcbio.log import logger, setup_local_logging
 @contextlib.contextmanager
 def _setup_logging(args):
    # Set environment to standard to use periods for decimals and avoid localization
-    os.environ["LC_ALL"] = "C"
-    os.environ["LC"] = "C"
-    os.environ["LANG"] = "C"
+    locale_to_use = utils.get_locale()
+    os.environ["LC_ALL"] = locale_to_use
+    os.environ["LC"] = locale_to_use
+    os.environ["LANG"] = locale_to_use
    config = None
    if len(args) == 1 and isinstance(args[0], (list, tuple)):
        args = args[0]
@@ -103,12 +105,21 @@ def trim_srna_sample(*args):
    with _setup_logging(args) as config:
        return ipython.zip_args(apply(srna.trim_srna_sample, *args))

+
+@require(trimming)
+def trim_bs_sample(*args):
+    args = ipython.unzip_args(args)
+    with _setup_logging(args):
+        return ipython.zip_args(apply(trimming.trim, *args))
+
+
 @require(srna)
 def srna_annotation(*args):
    args = ipython.unzip_args(args)
    with _setup_logging(args) as config:
        return ipython.zip_args(apply(srna.sample_annotation, *args))

+
 @require(seqcluster)
 def seqcluster_prepare(*args):
    args = ipython.unzip_args(args)
@@ -133,6 +144,35 @@ def peakcalling(* args):
    with _setup_logging(args) as config:
        return ipython.zip_args(apply(peaks.calling, *args))

+
+@require(cpg_caller)
+def cpg_calling(*args):
+    args = ipython.unzip_args(args)
+    with _setup_logging(args) as config:
+        return ipython.zip_args(apply(cpg_caller.calling, *args))
+
+
+@require(cpg_caller)
+def cpg_processing(*args):
+    args = ipython.unzip_args(args)
+    with _setup_logging(args) as config:
+        return ipython.zip_args(apply(cpg_caller.cpg_postprocessing, *args))
+
+
+@require(cpg_caller)
+def cpg_stats(*args):
+    args = ipython.unzip_args(args)
+    with _setup_logging(args) as config:
+        return ipython.zip_args(apply(cpg_caller.cpg_stats, *args))
+
+
+@require(deduplication)
+def deduplicate_bismark(*args):
+    args = ipython.unzip_args(args)
+    with _setup_logging(args):
+        return ipython.zip_args(apply(deduplication.dedup_bismark, *args))
+
+
 @require(sailfish)
 def run_sailfish(*args):
    args = ipython.unzip_args(args)
@@ -223,6 +263,12 @@ def run_salmon_bam(*args):
    with _setup_logging(args):
        return ipython.zip_args(apply(salmon.run_salmon_bam, *args))

+@require(salmon)
+def run_salmon_decoy(*args):
+    args = ipython.unzip_args(args)
+    with _setup_logging(args):
+        return ipython.zip_args(apply(salmon.run_salmon_decoy, *args))
+
 @require(salmon)
 def run_salmon_reads(*args):
    args = ipython.unzip_args(args)

--- a/bcbio/distributed/multitasks.py
+++ b/bcbio/distributed/multitasks.py
@@ -5,6 +5,7 @@ from bcbio.bam import callable
 from bcbio.srna import sample as srna
 from bcbio.srna import group as seqcluster
 from bcbio.chipseq import peaks
+from bcbio.wgbsseq import cpg_caller, deduplication, trimming
 from bcbio.cwl import create as cwl_create
 from bcbio.cwl import cwlutils
 from bcbio.rnaseq import (sailfish, rapmap, salmon, umi, kallisto, spikein,
@@ -58,6 +59,10 @@ def run_kallisto_index(*args):
 def run_kallisto_rnaseq(*args):
    return kallisto.run_kallisto_rnaseq(*args)

+@utils.map_wrap
+def run_salmon_decoy(*args):
+    return salmon.run_salmon_decoy(*args)
+
 @utils.map_wrap
 def run_salmon_reads(*args):
    return salmon.run_salmon_reads(*args)
@@ -175,6 +180,32 @@ def srna_alignment(*args):
 def peakcalling(*args):
    return peaks.calling(*args)

+
+@utils.map_wrap
+def trim_bs_sample(*args):
+    return trimming.trim(*args)
+
+
+@utils.map_wrap
+def cpg_calling(*args):
+    return cpg_caller.calling(*args)
+
+
+@utils.map_wrap
+def cpg_processing(*args):
+    return cpg_caller.cpg_postprocessing(*args)
+
+
+@utils.map_wrap
+def cpg_stats(*args):
+    return cpg_caller.cpg_stats(*args)
+
+
+@utils.map_wrap
+def deduplicate_bismark(*args):
+    return deduplication.dedup_bismark(*args)
+
+
 @utils.map_wrap
 def prep_align_inputs(*args):
    return alignprep.create_inputs(*args)

--- a/bcbio/distributed/runfn.py
+++ b/bcbio/distributed/runfn.py
@@ -14,7 +14,7 @@ import six
 import toolz as tz
 import yaml

-from bcbio import log, utils, setpath
+from bcbio import log, utils, setpath, utils
 from bcbio.log import logger
 from bcbio.cwl import cwlutils
 from bcbio.distributed import multitasks
@@ -24,9 +24,10 @@ def process(args):
    """Run the function in args.name given arguments in args.argfile.
    """
    # Set environment to standard to use periods for decimals and avoid localization
-    os.environ["LC_ALL"] = "C"
-    os.environ["LC"] = "C"
-    os.environ["LANG"] = "C"
+    locale_to_use = utils.get_locale()
+    os.environ["LC_ALL"] = locale_to_use
+    os.environ["LC"] = locale_to_use
+    os.environ["LANG"] = locale_to_use
    setpath.prepend_bcbiopath()
    try:
        fn = getattr(multitasks, args.name)

--- a/bcbio/heterogeneity/bubbletree.py
+++ b/bcbio/heterogeneity/bubbletree.py
@@ -78,7 +78,7 @@ def _run_bubbletree(vcf_csv, cnv_csv, data, wide_lrr=False, do_plots=True,
    with open(r_file, "w") as out_handle:
        out_handle.write(_script.format(**locals()))
    if not utils.file_exists(freqs_out):
-        cmd = "%s && %s --no-environ %s" % (utils.get_R_exports(), utils.Rscript_cmd(), r_file)
+        cmd = "%s && %s --vanilla %s" % (utils.get_R_exports(), utils.Rscript_cmd(), r_file)
        try:
            do.run(cmd, "Assess heterogeneity with BubbleTree")
        except subprocess.CalledProcessError as msg:

--- a/bcbio/heterogeneity/chromhacks.py
+++ b/bcbio/heterogeneity/chromhacks.py
@@ -7,6 +7,8 @@ import os

 from bcbio import utils
 from bcbio.distributed.transaction import file_transaction
+from bcbio.pipeline import datadict as dd
+from bcbio.bam import ref

 def is_autosomal(chrom):
    """Keep chromosomes that are a digit 1-22, or chr prefixed digit chr1-chr22
@@ -51,3 +53,8 @@ def bed_to_standardonly(in_file, data, headers=None, include_sex_chroms=False, o
                        if checkfn(line.split()[0]) or (headers and line.startswith(headers)):
                            out_handle.write(line)
    return out_file
+
+def get_mitochondrial_chroms(data):
+    ref_file = dd.get_ref_file(data)
+    mito = [c.name for c in ref.file_contigs(ref_file) if is_mitochondrial(c.name)]
+    return mito
--- a/bcbio/hla/optitype.py
+++ b/bcbio/hla/optitype.py
@@ -25,7 +25,7 @@ def run(data):
    """
    hlas = []
    for hla_fq in tz.get_in(["hla", "fastq"], data, []):
-        hla_type = re.search("[.-](?P<hlatype>HLA-[\w-]+).fq", hla_fq).group("hlatype")
+        hla_type = re.search(r"[.-](?P<hlatype>HLA-[\w-]+).fq", hla_fq).group("hlatype")
        if hla_type in SUPPORTED_HLAS:
            if utils.file_exists(hla_fq):
                hlas.append((hla_type, hla_fq))

--- a/bcbio/illumina/flowcell.py
+++ b/bcbio/illumina/flowcell.py
@@ -13,7 +13,7 @@ def parse_dirname(fc_dir):
    name = None
    date = None
    for p in parts:
-        if p.endswith(("XX", "xx", "XY", "X2")):
+        if p.endswith(("XX", "xx", "XY", "X2", "X3")):
            name = p
        elif len(p) == 6:
            try:

--- a/bcbio/illumina/machine.py
+++ b/bcbio/illumina/machine.py
@@ -106,7 +106,7 @@ def _find_unprocessed(config):

 def _get_directories(config):
    for directory in config["dump_directories"]:
-        for dname in sorted(glob.glob(os.path.join(directory, "*[Aa]*[Xx][XxYy2]"))):
+        for dname in sorted(glob.glob(os.path.join(directory, "*[Aa]*[Xx][XxYy23]"))):
            if os.path.isdir(dname):
                yield dname