# Contributor Covenant Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at chapmanb@fastmail.com. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
## 1.1.8
- Add `antibody` configuration option. Setting a specific antibody for ChIP-seq will use appropriate
  peak-calling settings for that antibody (see the sketch after this list). See the documentation for supported antibodies.
- Add `use_lowfreq_filter` for forcing vardict to report variants with low allelic frequency,
useful for calling somatic variants in panels with high coverage.
- Fix for checking for pre-existing inputs with python3.
- Add `keep_duplicates` option for ChIP/ATAC-seq which, when set, skips removing duplicates before
  peak calling. Defaults to False.
- Add `keep_multimappers` option for ChIP/ATAC-seq which, when set, skips removing multimappers before
  peak calling. Defaults to False.
- Remove ethnicity as a required column in PED files.
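For quick reference, a minimal sketch (assuming a working bcbio Python environment; the `ANTIBODIES` registry ships in `bcbio.chipseq.antibodies`, shown later in this diff) of how to list the supported antibodies and the settings each one implies:

```python
# Sketch: enumerate supported antibodies and their implied peak settings.
# ANTIBODIES maps antibody name -> Antibody(name, peaktype, rmdup).
from bcbio.chipseq import antibodies

for name, ab in sorted(antibodies.ANTIBODIES.items()):
    print(f"{name}: {ab.peaktype} peaks, remove duplicates: {ab.rmdup}")
```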
## 1.1.7 (10 October 2019)
- Hot fix for `dataclasses` not being supported in Python 3.6. Use a `namedtuple` instead.
## 1.1.6 (10 October 2019)
- GATK ApplyBQSRSpark: avoid StreamClosed issue with GATK 4.1+
- RNA-seq: fixes for cufflinks preparation due to python3 transition.
- RNA-seq: output count tables from tximport for genes and transcripts. These
  are in `bcbioRNASeq/results/date/genes/counts` and `bcbioRNASeq/results/date/transcripts/counts`.
- qualimap (RNA-seq): disable stranded mode, as it gives incorrect results with the
  hisat2 aligner; RNA-seq now just runs qualimap in unstranded mode.
- Add `quantify_genome_alignments` option to use genome alignments to quantify
with Salmon.
- Add `--validateMappings` flag to Salmon read quantification mode.
- The VEP cache is no longer installed as part of a bcbio run.
- Add support for Salmon SA method when STAR alignments are not available
(for hg38).
- Add support for the new read model for filtering in Mutect2. This is
experimental, and a little flaky, so it can optionally be turned on via:
`tools_on: mutect2_readmodel`. Thanks to @lbeltrame for implementing this
feature and doing a ton of work debugging.
- Swap pandas `from_csv` call to `read_csv`.
- Make STAR respect the `transcriptome_gtf` option.
- Prefix regular expressions with `r` (raw strings). Thanks to @smoe for finding all of these.
- Add informative logging messages at beginning of bcbio run. Includes the version
and the configuration files being used.
- Swap samtools mpileup to use bcftools mpileup as samtools mpileup is being
deprecated (https://github.com/samtools/samtools/releases/tag/1.9).
- Ensure locale is set to one supporting UTF-8 bcbio-wide. This may need to get
reverted if it introduces issues.
- Added hg38 support for STAR. We did this by taking hg38 and removing the alts,
decoys and HLA sequences.
- Added support for the arriba fusion caller.
- Added back missing programs from the version provenance file. Fixed formatting
problems introduced by switch to python3.
- Added initial support for whole genome bisulfite sequencing using bismark. Thanks to
@hackdna for implementing this and @jnhutchinson for drafting the initial
pipeline. This is a work in progress in collaboration with @gcampanella, who
has a similar implementation with some extra features that we will be merging
in soon.
- qualimap for RNA-seq runs on the downsampled BAM files by default. Set
`tools_on: [qualimap_full]` to run on the full BAM files.
- Add STAR junction files to the files captured at the end of a run.
## 1.1.5 (12 April 2019)
- Fixes for Python3 incompatibilities on distributed IPython runs.
@@ -123,6 +123,7 @@ Contributors
 - `Roman Valls`_, Science for Life Laboratory, Stockholm
 - `Kevin Ying`_, Garvan Institute of Medical Research, Sydney, Australia
 - `Vlad Saveliev`_, Center for Algorithmic Biotechnology, St. Petersburg University
+- `Sergey Naumenko`_, Harvard Chan Bioinformatics Core
.. _Miika Ahdesmaki: https://github.com/mjafin
@@ -146,6 +147,7 @@ Contributors
 .. _Matt Edwards: https://github.com/matted
 .. _Saket Choudhary: https://github.com/saketkc
 .. _Vlad Saveliev: https://github.com/vladsaveliev
+.. _Sergey Naumenko: https://github.com/naumenko-sa
License
-------
artwork/logo.png: image updated (14.9 KiB → 17.3 KiB)
@@ -514,7 +514,7 @@ def estimate_max_mapq(in_bam, nreads=1e6):
    """Guess maximum MAPQ in a BAM file of reads with alignments
    """
    with pysam.Samfile(in_bam, "rb") as work_bam:
-        reads = tz.take(nreads, work_bam)
+        reads = tz.take(int(nreads), work_bam)
        return max([x.mapq for x in reads if not x.is_unmapped])
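`toolz.take` delegates to `itertools.islice`, which requires an integer count, so the float default `nreads=1e6` needs the explicit `int()` conversion.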
def convert_cufflinks_mapq(in_bam, out_bam=None):
@@ -561,3 +561,21 @@ def convert_invalid_mapq(in_bam, out_bam=None):
                read.mapq = VALIDMAPQ
            out_bam_fh.write(read)
    return out_bam
def remove_duplicates(in_bam, data):
    """
    Remove duplicates from a duplicate-marked BAM file.
    """
    base, ext = os.path.splitext(in_bam)
    out_bam = base + "-noduplicates" + ext
    if utils.file_exists(out_bam):
        return out_bam
    num_cores = dd.get_num_cores(data)
    sambamba = config_utils.get_program("sambamba", data)
    with file_transaction(out_bam) as tx_out_bam:
        cmd = (f'{sambamba} view -h --nthreads {num_cores} -f bam -F "not duplicate" '
               f'{in_bam} > {tx_out_bam}')
        message = f"Removing duplicates from {in_bam}, saving as {out_bam}."
        do.run(cmd, message)
    index(out_bam, dd.get_config(data))
    return out_bam
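For comparison, a minimal standalone sketch of the same filtering using pysam instead of sambamba (a hypothetical helper, not part of this change): it drops reads whose SAM duplicate flag (0x400) is set, which is what the `"not duplicate"` sambamba expression selects against.

```python
import pysam

def remove_duplicates_pysam(in_bam, out_bam):
    """Sketch only: write reads without the PCR/optical duplicate flag."""
    with pysam.AlignmentFile(in_bam, "rb") as inh, \
         pysam.AlignmentFile(out_bam, "wb", template=inh) as outh:
        for read in inh:
            if not read.is_duplicate:
                outh.write(read)
    return out_bam
```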
@@ -41,7 +41,7 @@ def _calc_regional_coverage(in_bam, chrom, start, end, samplename, work_dir, data):
           "{bedtools} coverage -a {region_file} -b - -d > {tx_tmp_file}")
    do.run(cmd.format(**locals()), "Plotting coverage for %s %s" % (samplename, coords))
    names = ["chom", "start", "end", "offset", "coverage"]
-    df = pd.io.parsers.read_table(tx_tmp_file, sep="\t", header=None,
+    df = pd.io.parsers.read_csv(tx_tmp_file, sep="\t", header=None,
                                  names=names).dropna()
    os.remove(tx_tmp_file)
    df["sample"] = samplename
@@ -221,8 +221,9 @@ def is_fastq(in_file, bzip=True):
    else:
        return False

-def downsample(f1, f2, data, N, quick=False):
-    """ get N random headers from a fastq file without reading the
+def downsample(f1, f2, N, quick=False):
+    """Get N random headers from a fastq file without reading the
    whole thing into memory
    modified from: http://www.biostars.org/p/6544/
    quick=True will just grab the first N reads rather than do a true
@@ -231,9 +232,9 @@ def downsample(f1, f2, data, N, quick=False):
    if quick:
        rand_records = range(N)
    else:
-        records = sum(1 for _ in open(f1)) / 4
+        records = int(sum(1 for _ in open_possible_gzip(f1)) / 4)
        N = records if N > records else N
-        rand_records = sorted(random.sample(xrange(records), N))
+        rand_records = sorted(random.sample(range(records), N))
    fh1 = open_possible_gzip(f1)
    fh2 = open_possible_gzip(f2) if f2 else None
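Two Python 3 fixes in one hunk: `/` now returns a float, hence the `int()` around the record count, and `xrange` is gone, hence `range`. Counting via `open_possible_gzip` also makes the estimate work on gzipped FASTQ inputs.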
@@ -275,6 +276,7 @@ def downsample(f1, f2, data, N, quick=False):
    return outf1, outf2

def estimate_read_length(fastq_file, quality_format="fastq-sanger", nreads=1000):
    """
    estimate average read length of a fastq file
@@ -72,7 +72,7 @@ def _clean_java_out(version_str):
    Java will report information like _JAVA_OPTIONS environmental variables in the output.
    """
    out = []
-    for line in version_str.decode().split("\n"):
+    for line in version_str.split("\n"):
        if line.startswith("Picked up"):
            pass
        if line.find("setlocale") > 0:
@@ -94,7 +94,8 @@ def get_gatk_version(gatk_jar=None, config=None):
    cl = gatk_cmd("gatk", ["-Xms128m", "-Xmx256m"] + get_default_jvm_opts(), ["-version"], config=config)
    with closing(subprocess.Popen(cl, stdout=subprocess.PIPE,
                                  stderr=subprocess.STDOUT, shell=True).stdout) as stdout:
-        out = _clean_java_out(stdout.read().strip())
+        stdout = stdout.read().decode().strip()
+        out = _clean_java_out(stdout)
    # Historical GATK version (2.4) and newer versions (4.1.0.0)
    # have a flag in front of output version
    version = out
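Under Python 3, `Popen` pipes return bytes, so the callers now decode once at the boundary and pass a `str` through; `_clean_java_out` above correspondingly drops its own `.decode()`. The Mutect and Picard version checks below follow the same pattern.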
@@ -113,7 +114,8 @@ def get_mutect_version(mutect_jar):
    """
    cl = ["java", "-Xms128m", "-Xmx256m"] + get_default_jvm_opts() + ["-jar", mutect_jar, "-h"]
    with closing(subprocess.Popen(cl, stdout=subprocess.PIPE, stderr=subprocess.STDOUT).stdout) as stdout:
-        if "SomaticIndelDetector" in stdout.read().strip():
+        stdout = stdout.read().decode().strip()
+        if "SomaticIndelDetector" in stdout:
            mutect_type = "-appistry"
        else:
            mutect_type = ""
@@ -243,7 +245,7 @@ class BroadRunner:
        cl += ["--version"]
        p = subprocess.Popen(cl, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        # fix for issue #494
-        pat = re.compile('([\d|\.]*)(\(\d*\)$)')  # matches '1.96(1510)'
+        pat = re.compile(r'([\d|\.]*)(\(\d*\)$)')  # matches '1.96(1510)'
        m = pat.search(p.stdout.read())
        version = m.group(1)
        self._picard_version = version
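The `r` prefix makes this a raw string: in Python 3.6+, escape sequences like `\d` in plain strings raise a DeprecationWarning (and are slated to become errors) without changing what the regex matches.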
@@ -258,7 +260,7 @@ class BroadRunner:
            return True
        else:
            try:
-                stdout = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True)
+                stdout = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True, encoding="UTF-8")
                return stdout.find("GATK jar file not found") == -1
            except subprocess.CalledProcessError:
                return False
import os
import subprocess
import sys
import toolz as tz
from bcbio import utils
@@ -9,11 +10,33 @@ from bcbio.ngsalign import bowtie2, bwa
from bcbio.distributed.transaction import file_transaction
from bcbio.provenance import do
from bcbio.log import logger
from bcbio.heterogeneity.chromhacks import get_mitochondrial_chroms
def clean_chipseq_alignment(data):
    # lcr_bed = utils.get_in(data, ("genome_resources", "variation", "lcr"))
    work_bam = dd.get_work_bam(data)
    clean_bam = remove_nonassembled_chrom(work_bam, data)
    if not dd.get_keep_multimapped(data):
        clean_bam = remove_multimappers(clean_bam, data)
    if not dd.get_keep_duplicates(data):
        clean_bam = bam.remove_duplicates(clean_bam, data)
    data["work_bam"] = clean_bam
    encode_bed = tz.get_in(["genome_resources", "variation", "encode_blacklist"], data)
    if encode_bed:
        data["work_bam"] = remove_blacklist_regions(dd.get_work_bam(data), encode_bed, data['config'])
        bam.index(data["work_bam"], data['config'])
    try:
        data["bigwig"] = _normalized_bam_coverage(dd.get_sample_name(data),
                                                  dd.get_work_bam(data), data)
    except subprocess.CalledProcessError:
        logger.warning(f"{dd.get_work_bam(data)} was too sparse to normalize, "
                       f" falling back to non-normalized coverage.")
        data["bigwig"] = _bam_coverage(dd.get_sample_name(data),
                                       dd.get_work_bam(data), data)
    return [[data]]
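The rewritten cleaning flow is: drop non-assembled contigs, then strip multimappers and duplicates unless `keep_multimappers`/`keep_duplicates` are set, remove ENCODE blacklist regions, index, and finally build a normalized bigWig, falling back to plain coverage when normalization fails on sparse data.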
def remove_multimappers(bam_file, data):
    aligner = dd.get_aligner(data)
-    data["align_bam"] = dd.get_work_bam(data)
-    if dd.get_mark_duplicates(data):
    if aligner:
        if aligner == "bowtie2":
            filterer = bowtie2.filter_multimappers
@@ -22,24 +45,19 @@ def clean_chipseq_alignment(data):
        else:
            logger.error("ChIP-seq only supported for bowtie2 and bwa.")
            sys.exit(-1)
-        unique_bam = filterer(dd.get_work_bam(data), data)
-        data["work_bam"] = unique_bam
+        unique_bam = filterer(bam_file, data)
    else:
-        logger.info("Warning: When BAM file is given as input, bcbio skips multimappers removal."
-                    "If BAM is not cleaned for peak calling, can result in downstream errors.")
-    # lcr_bed = utils.get_in(data, ("genome_resources", "variation", "lcr"))
-    data["work_bam"] = _keep_assembled_chrom(dd.get_work_bam(data), dd.get_ref_file(data),
-                                             data["config"])
-    encode_bed = tz.get_in(["genome_resources", "variation", "encode_blacklist"], data)
-    if encode_bed:
-        data["work_bam"] = _prepare_bam(dd.get_work_bam(data), encode_bed, data['config'])
-        bam.index(data["work_bam"], data['config'])
-    data["bigwig"] = _bam_coverage(dd.get_sample_name(data), dd.get_work_bam(data), data)
-    return [[data]]
+        unique_bam = bam_file
+        logger.warn("When a BAM file is given as input, bcbio skips removal of "
+                    "multimappers.")
+        logger.warn("ChIP/ATAC-seq usually requires duplicate marking, but it was disabled.")
+    return unique_bam
-def _keep_assembled_chrom(bam_file, genome, config):
-    """Remove contigs from the BAM file"""
-    fai = "%s.fai" % genome
+def remove_nonassembled_chrom(bam_file, data):
+    """Remove non-assembled contigs from the BAM file"""
+    ref_file = dd.get_ref_file(data)
+    config = dd.get_config(data)
+    fai = "%s.fai" % ref_file
    chrom = []
    with open(fai) as inh:
        for line in inh:
@@ -56,9 +74,8 @@ def _keep_assembled_chrom(bam_file, genome, config):
        bam.index(out_file, config)
    return out_file
-def _prepare_bam(bam_file, bed_file, config):
-    """Remove regions from bed files"""
+def remove_blacklist_regions(bam_file, bed_file, config):
+    """Remove blacklist regions from a BAM file"""
    if not bam_file or not bed_file:
        return bam_file
    out_file = utils.append_stem(bam_file, '_filter')
@@ -71,7 +88,31 @@ def _prepare_bam(bam_file, bed_file, config):

def _bam_coverage(name, bam_input, data):
    """Run bamCoverage from deeptools"""
-    cmd = ("{bam_coverage} -b {bam_input} -o {bw_output} "
+    cmd = ("{bam_coverage} --bam {bam_input} --outFileName {bw_output} "
           "--binSize 20 --effectiveGenomeSize {size} "
           "--smoothLength 60 --extendReads 150 --centerReads -p {cores} ")
    size = bam.fasta.total_sequence_length(dd.get_ref_file(data))
    cores = dd.get_num_cores(data)
    try:
        bam_coverage = config_utils.get_program("bamCoverage", data)
    except config_utils.CmdNotFound:
        logger.info("No bamCoverage found, skipping bamCoverage.")
        return None
    resources = config_utils.get_resources("bamCoverage", data["config"])
    if resources:
        options = resources.get("options")
        if options:
            cmd += " %s" % " ".join([str(x) for x in options])
    bw_output = os.path.join(os.path.dirname(bam_input), "%s.bw" % name)
    if utils.file_exists(bw_output):
        return bw_output
    with file_transaction(bw_output) as out_tx:
        do.run(cmd.format(**locals()), "Run bamCoverage in %s" % name)
    return bw_output
def _normalized_bam_coverage(name, bam_input, data):
    """Run bamCoverage from deeptools but produce normalized bigWig files"""
    cmd = ("{bam_coverage} --bam {bam_input} --outFileName {bw_output} "
           "--binSize 20 --effectiveGenomeSize {size} "
           "--smoothLength 60 --extendReads 150 --centerReads -p {cores} ")
    size = bam.fasta.total_sequence_length(dd.get_ref_file(data))
@@ -81,6 +122,12 @@ def _bam_coverage(name, bam_input, data):
    except config_utils.CmdNotFound:
        logger.info("No bamCoverage found, skipping bamCoverage.")
        return None
+    method = dd.get_chip_method(data)
+    cmd += "--normalizeUsing CPM "
+    toignore = get_mitochondrial_chroms(data)
+    if toignore:
+        ignorenormflag = f"--ignoreForNormalization {' '.join(toignore)} "
+        cmd += ignorenormflag
    resources = config_utils.get_resources("bamCoverage", data["config"])
    if resources:
        options = resources.get("options")
@@ -92,3 +139,35 @@ def _bam_coverage(name, bam_input, data):
    with file_transaction(bw_output) as out_tx:
        do.run(cmd.format(**locals()), "Run bamCoverage in %s" % name)
    return bw_output
def _compute_deeptools_matrix(data):
    pass
def extract_NF_regions(data):
    """
    extract the nucleosome free regions from the work_bam. These regions will
    be < 100 bases
    """
    MAX_FRAG_LENGTH = 100
    sieve = config_utils.get_program("alignmentSieve", data)
    work_bam = dd.get_work_bam(data)
    num_cores = dd.get_num_cores(data)
    out_file = os.path.splitext(work_bam)[0] + "-NF.bam"
    log_file = os.path.splitext(work_bam)[0] + "-NF.log"
    if file_exists(out_file):
        data["NF_bam"] = out_file
        return data

    with file_transaction(out_file) as tx_out_file, \
         file_transaction(log_file) as tx_log_file:
        tx_unsorted_bam = tx_out_file + ".unsorted"
        # Fixed: the original had a stray "$" in the f-string ({work_bam} was
        # rendered as "${work_bam}") and the do.run message lacked its f prefix.
        cmd = (
            f"{sieve} --bam {work_bam} --outFile {tx_unsorted_bam} --ATACshift "
            f"--numberOfProcessors {num_cores} --maxFragmentLength {MAX_FRAG_LENGTH} "
            f"--minMappingQuality 10 "
            f"--filterMetrics {tx_log_file} ")
        do.run(cmd, f"Extract NF regions from {work_bam} to {tx_unsorted_bam}.")
        tx_out_file = bam.sort(tx_unsorted_bam)

    data["NF_bam"] = out_file
    return data
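`alignmentSieve` is the deepTools read filter: `--ATACshift` applies the standard Tn5 +4/-5 bp offset correction and `--maxFragmentLength 100` keeps only sub-nucleosomal (nucleosome-free) fragments.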
# from dataclasses import dataclass
from collections import namedtuple

VALID_PEAKTYPES = ["narrow", "broad"]
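# NOTE: the dataclass version below is kept for reference only; dataclasses
# require Python 3.7+, and the 1.1.7 hot fix swapped it for the namedtuple
# that follows to keep Python 3.6 support.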
# @dataclass
# class Antibody:
#     """
#     ChIP-seq antibody
#     """
#     name: str
#     # call narrow or broad peaks
#     peaktype: str
#     # remove duplicates?
#     rmdup: bool = True
#
#     def __post_init__(self):
#         if self.peaktype not in VALID_PEAKTYPES:
#             raise TypeError(f"peaktype {self.peaktype} is not one of {VALID_PEAKTYPES}")
Antibody = namedtuple('Antibody', 'name peaktype rmdup')
_ANTIBODIES = [
    Antibody("h3f3a", "broad", True),
    Antibody("h3k27me3", "broad", True),
    Antibody("h3k36me3", "broad", True),
    Antibody("h3k4me1", "broad", True),
    Antibody("h3k79me2", "broad", True),
    Antibody("h3k79me3", "broad", True),
    Antibody("h3k9me1", "broad", True),
    Antibody("h3k9me2", "broad", True),
    Antibody("h4k20me1", "broad", True),
    Antibody("h2afz", "narrow", True),
    Antibody("h3ac", "narrow", True),
    Antibody("h3k27ac", "narrow", True),
    Antibody("h3k4me2", "narrow", True),
    Antibody("h3k4me3", "narrow", True),
    Antibody("h3k9ac", "narrow", True),
    Antibody("h3k9me3", "broad", False)
]
ANTIBODIES = {x.name: x for x in _ANTIBODIES}
SUPPORTED_ANTIBODIES = {x.name for x in _ANTIBODIES}
@@ -7,6 +7,8 @@ from bcbio.provenance import do
from bcbio.pipeline import config_utils
from bcbio.pipeline import datadict as dd
from bcbio import bam
+from bcbio.chipseq import antibodies
+from bcbio.log import logger

def run(name, chip_bam, input_bam, genome_build, out_dir, method, resources, data):
    """
@@ -18,15 +20,22 @@ def run(name, chip_bam, input_bam, genome_build, out_dir, method, resources, data):
    out_file = os.path.join(out_dir, name + "_peaks_macs2.xls")
    macs2_file = os.path.join(out_dir, name + "_peaks.xls")
    if utils.file_exists(out_file):
-        _compres_bdg_files(out_dir)
+        _compress_bdg_files(out_dir)
        return _get_output_files(out_dir)
    macs2 = config_utils.get_program("macs2", config)
+    antibody = antibodies.ANTIBODIES.get(dd.get_antibody(data).lower(), None)
+    if antibody:
+        logger.info(f"{antibody.name} specified, using {antibody.peaktype} peak settings.")
+        peaksettings = select_peak_parameters(antibody)
+    else:
+        peaksettings = ""
    options = " ".join(resources.get("macs2", {}).get("options", ""))
    genome_size = bam.fasta.total_sequence_length(dd.get_ref_file(data))
    genome_size = "" if options.find("-g") > -1 else "-g %s" % genome_size
    paired = "-f BAMPE" if bam.is_paired(chip_bam) else ""
    with utils.chdir(out_dir):
-        cmd = _macs2_cmd(method)
+        cmd = _macs2_cmd(data)
+        cmd += peaksettings
        try:
            do.run(cmd.format(**locals()), "macs2 for %s" % name)
            utils.move_safe(macs2_file, out_file)
@@ -37,7 +46,7 @@ def run(name, chip_bam, input_bam, genome_build, out_dir, method, resources, data):
              "You can add specific options for the sample "
              "setting resources as explained in docs: "
              "https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#sample-specific-resources")
-    _compres_bdg_files(out_dir)
+    _compress_bdg_files(out_dir)
    return _get_output_files(out_dir)
def _get_output_files(out_dir):
@@ -52,20 +61,29 @@ def _get_output_files(out_dir):
            break
    return {"main": peaks, "macs2": fns}
-def _compres_bdg_files(out_dir):
+def _compress_bdg_files(out_dir):
    for fn in glob.glob(os.path.join(out_dir, "*bdg")):
        cmd = "gzip %s" % fn
        do.run(cmd, "compress bdg file: %s" % fn)
def _macs2_cmd(method="chip"):
def _macs2_cmd(data):
"""Main command for macs2 tool."""
method = dd.get_chip_method(data)
if method.lower() == "chip":
cmd = ("{macs2} callpeak -t {chip_bam} -c {input_bam} {paired} "
" {genome_size} -n {name} -B {options}")
"{genome_size} -n {name} --bdg {options} ")
elif method.lower() == "atac":
cmd = ("{macs2} callpeak -t {chip_bam} --nomodel "
" {paired} {genome_size} -n {name} -B {options}"
" {paired} {genome_size} -n {name} --bdg {options}"
" --nolambda --keep-dup all")
else:
raise ValueError("chip_method should be chip or atac.")
return cmd
def select_peak_parameters(antibody):
    if antibody.peaktype == "broad":
        return " --broad --broad-cutoff 0.05"
    elif antibody.peaktype == "narrow":
        return ""
    else:
        raise ValueError(f"{antibody.peaktype} not recognized.")
@@ -109,7 +109,7 @@ def greylisting(data):
    """
    input_bam = data.get("work_bam_input", None)
    if not input_bam:
-        logger.info("No input BAM file detected, skipping greylisting.")
+        logger.info("No input control BAM file detected, skipping greylisting.")
        return None
    try:
        greylister = config_utils.get_program("chipseq-greylist", data)
@@ -8,7 +8,7 @@ try:
except ImportError:
    from IPython.parallel import require

-from bcbio import heterogeneity, hla, chipseq, structural, upload
+from bcbio import heterogeneity, hla, chipseq, structural, upload, utils
from bcbio.bam import callable
from bcbio.rnaseq import (sailfish, rapmap, salmon, umi, kallisto, spikein,
                          bcbiornaseq)
@@ -17,6 +17,7 @@ from bcbio.ngsalign import alignprep
from bcbio.srna import sample as srna
from bcbio.srna import group as seqcluster
from bcbio.chipseq import peaks
+from bcbio.wgbsseq import cpg_caller, deduplication, trimming
from bcbio.pipeline import (archive, config_utils, disambiguate, sample,
                            qcsummary, shared, variation, run_info, rnaseq)
from bcbio.provenance import system
@@ -29,9 +30,10 @@ from bcbio.log import logger, setup_local_logging

@contextlib.contextmanager
def _setup_logging(args):
    # Set environment to standard to use periods for decimals and avoid localization
-    os.environ["LC_ALL"] = "C"
-    os.environ["LC"] = "C"
-    os.environ["LANG"] = "C"
+    locale_to_use = utils.get_locale()
+    os.environ["LC_ALL"] = locale_to_use
+    os.environ["LC"] = locale_to_use
+    os.environ["LANG"] = locale_to_use
    config = None
    if len(args) == 1 and isinstance(args[0], (list, tuple)):
        args = args[0]
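`utils.get_locale()` replaces the hard-coded `C` locale, which cannot represent UTF-8 output under Python 3. The actual bcbio implementation is not shown in this diff; a rough sketch of the idea:

```python
import locale

def get_locale():
    """Sketch only: pick a UTF-8 capable locale, falling back to plain C."""
    for loc in ("C.UTF-8", "en_US.UTF-8"):
        try:
            locale.setlocale(locale.LC_ALL, loc)
            return loc
        except locale.Error:
            continue
    return "C"
```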
@@ -103,12 +105,21 @@ def trim_srna_sample(*args):
    with _setup_logging(args) as config:
        return ipython.zip_args(apply(srna.trim_srna_sample, *args))

+@require(trimming)
+def trim_bs_sample(*args):
+    args = ipython.unzip_args(args)
+    with _setup_logging(args):
+        return ipython.zip_args(apply(trimming.trim, *args))

@require(srna)
def srna_annotation(*args):
    args = ipython.unzip_args(args)
    with _setup_logging(args) as config:
        return ipython.zip_args(apply(srna.sample_annotation, *args))

@require(seqcluster)
def seqcluster_prepare(*args):
    args = ipython.unzip_args(args)
@@ -133,6 +144,35 @@ def peakcalling(* args):
    with _setup_logging(args) as config:
        return ipython.zip_args(apply(peaks.calling, *args))

+@require(cpg_caller)
+def cpg_calling(*args):
+    args = ipython.unzip_args(args)
+    with _setup_logging(args) as config:
+        return ipython.zip_args(apply(cpg_caller.calling, *args))

+@require(cpg_caller)
+def cpg_processing(*args):
+    args = ipython.unzip_args(args)
+    with _setup_logging(args) as config:
+        return ipython.zip_args(apply(cpg_caller.cpg_postprocessing, *args))

+@require(cpg_caller)
+def cpg_stats(*args):
+    args = ipython.unzip_args(args)
+    with _setup_logging(args) as config:
+        return ipython.zip_args(apply(cpg_caller.cpg_stats, *args))

+@require(deduplication)
+def deduplicate_bismark(*args):
+    args = ipython.unzip_args(args)
+    with _setup_logging(args):
+        return ipython.zip_args(apply(deduplication.dedup_bismark, *args))

@require(sailfish)
def run_sailfish(*args):
    args = ipython.unzip_args(args)
@@ -223,6 +263,12 @@ def run_salmon_bam(*args):
    with _setup_logging(args):
        return ipython.zip_args(apply(salmon.run_salmon_bam, *args))

+@require(salmon)
+def run_salmon_decoy(*args):
+    args = ipython.unzip_args(args)
+    with _setup_logging(args):
+        return ipython.zip_args(apply(salmon.run_salmon_decoy, *args))

@require(salmon)
def run_salmon_reads(*args):
    args = ipython.unzip_args(args)
@@ -5,6 +5,7 @@ from bcbio.bam import callable
from bcbio.srna import sample as srna
from bcbio.srna import group as seqcluster
from bcbio.chipseq import peaks
+from bcbio.wgbsseq import cpg_caller, deduplication, trimming
from bcbio.cwl import create as cwl_create
from bcbio.cwl import cwlutils
from bcbio.rnaseq import (sailfish, rapmap, salmon, umi, kallisto, spikein,
@@ -58,6 +59,10 @@ def run_kallisto_index(*args):
def run_kallisto_rnaseq(*args):
    return kallisto.run_kallisto_rnaseq(*args)

+@utils.map_wrap
+def run_salmon_decoy(*args):
+    return salmon.run_salmon_decoy(*args)

@utils.map_wrap
def run_salmon_reads(*args):
    return salmon.run_salmon_reads(*args)
@@ -175,6 +180,32 @@ def srna_alignment(*args):
def peakcalling(*args):
    return peaks.calling(*args)

+@utils.map_wrap
+def trim_bs_sample(*args):
+    return trimming.trim(*args)

+@utils.map_wrap
+def cpg_calling(*args):
+    return cpg_caller.calling(*args)

+@utils.map_wrap
+def cpg_processing(*args):
+    return cpg_caller.cpg_postprocessing(*args)

+@utils.map_wrap
+def cpg_stats(*args):
+    return cpg_caller.cpg_stats(*args)

+@utils.map_wrap
+def deduplicate_bismark(*args):
+    return deduplication.dedup_bismark(*args)

@utils.map_wrap
def prep_align_inputs(*args):
    return alignprep.create_inputs(*args)
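Each new distributed step is registered twice: in ipythontasks.py with `@require` plus the zip/unzip argument handling for IPython-parallel runs, and here in multitasks.py with `@utils.map_wrap` for local multicore runs.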
@@ -14,7 +14,7 @@ import six
import toolz as tz
import yaml

-from bcbio import log, utils, setpath
+from bcbio import log, utils, setpath, utils
from bcbio.log import logger
from bcbio.cwl import cwlutils
from bcbio.distributed import multitasks
@@ -24,9 +24,10 @@ def process(args):
    """Run the function in args.name given arguments in args.argfile.
    """
    # Set environment to standard to use periods for decimals and avoid localization
-    os.environ["LC_ALL"] = "C"
-    os.environ["LC"] = "C"
-    os.environ["LANG"] = "C"
+    locale_to_use = utils.get_locale()
+    os.environ["LC_ALL"] = locale_to_use
+    os.environ["LC"] = locale_to_use
+    os.environ["LANG"] = locale_to_use
    setpath.prepend_bcbiopath()
    try:
        fn = getattr(multitasks, args.name)
@@ -78,7 +78,7 @@ def _run_bubbletree(vcf_csv, cnv_csv, data, wide_lrr=False, do_plots=True,
    with open(r_file, "w") as out_handle:
        out_handle.write(_script.format(**locals()))
    if not utils.file_exists(freqs_out):
-        cmd = "%s && %s --no-environ %s" % (utils.get_R_exports(), utils.Rscript_cmd(), r_file)
+        cmd = "%s && %s --vanilla %s" % (utils.get_R_exports(), utils.Rscript_cmd(), r_file)
        try:
            do.run(cmd, "Assess heterogeneity with BubbleTree")
        except subprocess.CalledProcessError as msg:
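`--vanilla` is a superset of `--no-environ` (it also implies `--no-save`, `--no-restore`, `--no-site-file` and `--no-init-file`), so user `.Rprofile`/`.Renviron` settings can no longer leak into the BubbleTree run.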
@@ -7,6 +7,8 @@ import os
from bcbio import utils
from bcbio.distributed.transaction import file_transaction
from bcbio.pipeline import datadict as dd
+from bcbio.bam import ref

def is_autosomal(chrom):
    """Keep chromosomes that are a digit 1-22, or chr prefixed digit chr1-chr22
@@ -51,3 +53,8 @@ def bed_to_standardonly(in_file, data, headers=None, include_sex_chroms=False, out_file=None):
        if checkfn(line.split()[0]) or (headers and line.startswith(headers)):
            out_handle.write(line)
    return out_file

def get_mitochondrial_chroms(data):
    ref_file = dd.get_ref_file(data)
    mito = [c.name for c in ref.file_contigs(ref_file) if is_mitochondrial(c.name)]
    return mito
@@ -25,7 +25,7 @@ def run(data):
    """
    hlas = []
    for hla_fq in tz.get_in(["hla", "fastq"], data, []):
-        hla_type = re.search("[.-](?P<hlatype>HLA-[\w-]+).fq", hla_fq).group("hlatype")
+        hla_type = re.search(r"[.-](?P<hlatype>HLA-[\w-]+).fq", hla_fq).group("hlatype")
        if hla_type in SUPPORTED_HLAS:
            if utils.file_exists(hla_fq):
                hlas.append((hla_type, hla_fq))
@@ -13,7 +13,7 @@ def parse_dirname(fc_dir):
    name = None
    date = None
    for p in parts:
-        if p.endswith(("XX", "xx", "XY", "X2")):
+        if p.endswith(("XX", "xx", "XY", "X2", "X3")):
            name = p
        elif len(p) == 6:
            try:
@@ -106,7 +106,7 @@ def _find_unprocessed(config):

def _get_directories(config):
    for directory in config["dump_directories"]:
-        for dname in sorted(glob.glob(os.path.join(directory, "*[Aa]*[Xx][XxYy2]"))):
+        for dname in sorted(glob.glob(os.path.join(directory, "*[Aa]*[Xx][XxYy23]"))):
            if os.path.isdir(dname):
                yield dname
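Both flowcell changes extend run-folder detection to IDs ending in `X3`, mirroring the earlier `X2` handling; `parse_dirname` and `_get_directories` need to stay in sync so newly dumped runs are picked up.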