# Contributor Covenant Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at chapmanb@fastmail.com. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
## 1.1.7
- Hot fix for dataclasses not being supported in Python 3.6; use a namedtuple instead (see the sketch below).
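A minimal sketch of the workaround (field names mirror the ChIP-seq antibody record in the diffs further down; the instance is illustrative):

```python
from collections import namedtuple

# Python 3.6 predates dataclasses (added in 3.7); a namedtuple provides the
# same lightweight, immutable record without the extra dependency.
Antibody = namedtuple("Antibody", "name peaktype rmdup")
ab = Antibody("h3k27ac", "narrow", True)
```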
## 1.1.6
- GATK ApplyBQSRSpark: avoid StreamClosed issue with GATK 4.1+
- RNA-seq: fixes for cufflinks preparation due to python3 transition.
- RNA-seq: output count tables from tximport for genes and transcripts. These
are in `bcbioRNASeq/results/date/genes/counts` and
`bcbioRNASeq/results/date/transcripts/counts`.
- qualimap (RNA-seq): disable stranded mode for qualimap, as it gives incorrect
results with the hisat2 aligner; RNA-seq qualimap now always runs unstranded.
- Add `quantify_genome_alignments` option to use genome alignments to quantify
with Salmon.
- Add `--validateMappings` flag to Salmon read quantification mode.
- The VEP cache is no longer installed during a bcbio run.
- Add support for Salmon SA method when STAR alignments are not available
(for hg38).
- Add support for the new read model for filtering in Mutect2. This is
experimental, and a little flaky, so it can optionally be turned on via:
`tools_on: mutect2_readmodel`. Thanks to @lbeltrame for implementing this
feature and doing a ton of work debugging.
- Swap pandas `from_csv` call to `read_csv`.
- Make STAR respect the `transcriptome_gtf` option.
- Prefix regular expressions with r to avoid Python 3 invalid escape sequence warnings. Thanks to @smoe for finding all of these.
- Add informative logging messages at beginning of bcbio run. Includes the version
and the configuration files being used.
- Swap samtools mpileup to use bcftools mpileup as samtools mpileup is being
deprecated (https://github.com/samtools/samtools/releases/tag/1.9).
- Ensure locale is set to one supporting UTF-8 bcbio-wide. This may need to get
reverted if it introduces issues.
- Added hg38 support for STAR. We did this by taking hg38 and removing the alts,
decoys and HLA sequences.
- Added support for the arriba fusion caller.
- Added back missing programs from the version provenance file. Fixed formatting
problems introduced by switch to python3.
- Added initial support for whole genome bisulfite sequencing using bismark. Thanks to
@hackdna for implementing this and @jnhutchinson for drafting the initial
pipeline. This is a work in progress in collaboration with @gcampanella, who
has a similar implementation with some extra features that we will be merging
in soon.
- qualimap for RNA-seq runs on the downsampled BAM files by default. Set
`tools_on: [qualimap_full]` to run on the full BAM files (see the config sketch
after this list).
- Add STAR junction files to the files captured at the end of a run.
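A hedged sketch of how several of the options mentioned above could appear in a sample YAML configuration; the `details`/`algorithm` layout follows the standard bcbio sample format, but treat the analysis names and paths as illustrative rather than validated:

```yaml
details:
  - analysis: variant2
    algorithm:
      tools_on: [mutect2_readmodel]     # experimental Mutect2 read-model filtering
  - analysis: RNA-seq
    algorithm:
      quantify_genome_alignments: true  # quantify with Salmon from genome alignments
      transcriptome_gtf: /path/to/transcriptome.gtf
      tools_on: [qualimap_full]         # qualimap on full rather than downsampled BAMs
```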
## 1.1.5 (12 April 2019)
- Fixes for Python3 incompatibilities on distributed IPython runs.
......
@@ -514,7 +514,7 @@ def estimate_max_mapq(in_bam, nreads=1e6):
"""Guess maximum MAPQ in a BAM file of reads with alignments
"""
with pysam.Samfile(in_bam, "rb") as work_bam:
reads = tz.take(nreads, work_bam)
reads = tz.take(int(nreads), work_bam)
return max([x.mapq for x in reads if not x.is_unmapped])
def convert_cufflinks_mapq(in_bam, out_bam=None):
......
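For context on the `int(nreads)` change above: the default `nreads=1e6` is a float, and `toolz.take` is backed by `itertools.islice`, which rejects non-integer counts on Python 3. A minimal illustration:

```python
import toolz as tz

# islice-backed take() raises ValueError for float counts on Python 3,
# so the 1e6 default must be cast to int first.
reads = list(tz.take(int(1e6), iter(range(10))))
assert len(reads) == 10
```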
@@ -41,7 +41,7 @@ def _calc_regional_coverage(in_bam, chrom, start, end, samplename, work_dir, dat
"{bedtools} coverage -a {region_file} -b - -d > {tx_tmp_file}")
do.run(cmd.format(**locals()), "Plotting coverage for %s %s" % (samplename, coords))
names = ["chom", "start", "end", "offset", "coverage"]
df = pd.io.parsers.read_table(tx_tmp_file, sep="\t", header=None,
df = pd.io.parsers.read_csv(tx_tmp_file, sep="\t", header=None,
names=names).dropna()
os.remove(tx_tmp_file)
df["sample"] = samplename
......
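The `read_table` to `read_csv` swap above tracks the pandas deprecation of `read_table`; `read_csv` with an explicit tab separator is the drop-in replacement. A standalone sketch (the file path is illustrative):

```python
import pandas as pd

# read_csv(sep="\t") replaces the deprecated read_table with identical parsing.
names = ["chrom", "start", "end", "offset", "coverage"]
df = pd.read_csv("region_coverage.tsv", sep="\t", header=None,
                 names=names).dropna()
```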
@@ -221,8 +221,9 @@ def is_fastq(in_file, bzip=True):
else:
return False
def downsample(f1, f2, data, N, quick=False):
""" get N random headers from a fastq file without reading the
def downsample(f1, f2, N, quick=False):
"""Get N random headers from a fastq file without reading the
whole thing into memory
modified from: http://www.biostars.org/p/6544/
quick=True will just grab the first N reads rather than do a true
@@ -231,9 +232,9 @@ def downsample(f1, f2, data, N, quick=False):
if quick:
rand_records = range(N)
else:
records = sum(1 for _ in open(f1)) / 4
records = int(sum(1 for _ in open_possible_gzip(f1)) / 4)
N = records if N > records else N
rand_records = sorted(random.sample(xrange(records), N))
rand_records = sorted(random.sample(range(records), N))
fh1 = open_possible_gzip(f1)
fh2 = open_possible_gzip(f2) if f2 else None
@@ -275,6 +276,7 @@ def downsample(f1, f2, data, N, quick=False):
return outf1, outf2
def estimate_read_length(fastq_file, quality_format="fastq-sanger", nreads=1000):
"""
estimate average read length of a fastq file
......
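Two Python 3 changes drive the downsampling hunks above: `/` is now true division (so the line count divided by 4 becomes a float without an explicit `int()`), and `xrange` is gone. A self-contained sketch with an illustrative file name:

```python
import random

# A FASTQ record is 4 lines; cast the true-division result to int before
# using the count with range()/random.sample().
n_wanted = 1000
with open("reads.fastq") as handle:
    records = int(sum(1 for _ in handle) / 4)
rand_records = sorted(random.sample(range(records), min(n_wanted, records)))
```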
@@ -72,7 +72,7 @@ def _clean_java_out(version_str):
Java will report information like _JAVA_OPTIONS environment variables in the output.
"""
out = []
for line in version_str.decode().split("\n"):
for line in version_str.split("\n"):
if line.startswith("Picked up"):
pass
if line.find("setlocale") > 0:
@@ -94,7 +94,8 @@ def get_gatk_version(gatk_jar=None, config=None):
cl = gatk_cmd("gatk", ["-Xms128m", "-Xmx256m"] + get_default_jvm_opts(), ["-version"], config=config)
with closing(subprocess.Popen(cl, stdout=subprocess.PIPE,
stderr=subprocess.STDOUT, shell=True).stdout) as stdout:
out = _clean_java_out(stdout.read().strip())
stdout = stdout.read().decode().strip()
out = _clean_java_out(stdout)
# Historical GATK version (2.4) and newer versions (4.1.0.0)
# have a flag in front of output version
version = out
@@ -113,7 +114,8 @@ def get_mutect_version(mutect_jar):
"""
cl = ["java", "-Xms128m", "-Xmx256m"] + get_default_jvm_opts() + ["-jar", mutect_jar, "-h"]
with closing(subprocess.Popen(cl, stdout=subprocess.PIPE, stderr=subprocess.STDOUT).stdout) as stdout:
if "SomaticIndelDetector" in stdout.read().strip():
stdout = stdout.read().decode().strip()
if "SomaticIndelDetector" in stdout:
mutect_type = "-appistry"
else:
mutect_type = ""
@@ -243,7 +245,7 @@ class BroadRunner:
cl += ["--version"]
p = subprocess.Popen(cl, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
# fix for issue #494
pat = re.compile('([\d|\.]*)(\(\d*\)$)') # matches '1.96(1510)'
pat = re.compile(r'([\d|\.]*)(\(\d*\)$)') # matches '1.96(1510)'
m = pat.search(p.stdout.read())
version = m.group(1)
self._picard_version = version
@@ -258,7 +260,7 @@ class BroadRunner:
return True
else:
try:
stdout = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True)
stdout = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True, encoding="UTF-8")
return stdout.find("GATK jar file not found") == -1
except subprocess.CalledProcessError:
return False
......
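The Java version-probing hunks above all address the same Python 3 change: `subprocess` pipes yield bytes, so output must be decoded, or requested as text via `encoding=`, before string comparisons. A minimal sketch assuming a `java` on the PATH:

```python
import subprocess

# encoding="UTF-8" makes check_output return str rather than bytes, so
# substring checks like the GATK/MuTect version probes work unchanged.
out = subprocess.check_output("java -version", stderr=subprocess.STDOUT,
                              shell=True, encoding="UTF-8")
print("version" in out)
```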
#from dataclasses import dataclass
from collections import namedtuple
VALID_PEAKTYPES = ["narrow", "broad"]
# @dataclass
# class Antibody:
# """
# ChIP-seq antibody
# """
# name: str
# # call narrow or broad peaks
# peaktype: str
# # remove duplicates?
# rmdup: bool = True
# def __post_init__(self):
# if self.peaktype not in VALID_PEAKTYPES:
# raise TypeError(f"peaktype {self.peaktype} is not one of {VALID_PEAKTYPES}")
Antibody = namedtuple('Antibody', 'name peaktype rmdup')
_ANTIBODIES = [
Antibody("h3f3a", "broad", True),
Antibody("h3k27me3", "broad", True),
Antibody("h3k36me3", "broad", True),
Antibody("h3k4me1", "broad", True),
Antibody("h3k79me2", "broad", True),
Antibody("h3k79me3", "broad", True),
Antibody("h3k9me1", "broad", True),
Antibody("h3k9me2", "broad", True),
Antibody("h4k20me1", "broad", True),
Antibody("h2afz", "narrow", True),
Antibody("h3ac", "narrow", True),
Antibody("h3k27ac", "narrow", True),
Antibody("h3k4me2", "narrow", True),
Antibody("h3k4me3", "narrow", True),
Antibody("h3k9ac", "narrow", True),
Antibody("h3k9me3", "broad", False)
]
ANTIBODIES = {x.name: x for x in _ANTIBODIES}
SUPPORTED_ANTIBODIES = {x.name for x in _ANTIBODIES}
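A short usage sketch for the registry above; the lookup mirrors what the macs2 caller in the next hunks does:

```python
# Look up peak-calling settings by (lower-cased) antibody name.
ab = ANTIBODIES.get("h3k27me3")
if ab is not None:
    print(f"{ab.name}: {ab.peaktype} peaks, remove duplicates: {ab.rmdup}")
```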
@@ -7,6 +7,9 @@ from bcbio.provenance import do
from bcbio.pipeline import config_utils
from bcbio.pipeline import datadict as dd
from bcbio import bam
from bcbio.chipseq import antibodies
from bcbio.log import logger
def run(name, chip_bam, input_bam, genome_build, out_dir, method, resources, data):
"""
@@ -18,15 +21,22 @@ def run(name, chip_bam, input_bam, genome_build, out_dir, method, resources, dat
out_file = os.path.join(out_dir, name + "_peaks_macs2.xls")
macs2_file = os.path.join(out_dir, name + "_peaks.xls")
if utils.file_exists(out_file):
_compres_bdg_files(out_dir)
_compress_bdg_files(out_dir)
return _get_output_files(out_dir)
macs2 = config_utils.get_program("macs2", config)
antibody = antibodies.ANTIBODIES.get(dd.get_chipseq_antibody(data).lower(), None)
if antibody:
logger.info(f"{antibody.name} specified, using {antibody.peaktype} peak settings.")
peaksettings = select_peak_parameters(antibody)
else:
peaksettings = ""
options = " ".join(resources.get("macs2", {}).get("options", ""))
genome_size = bam.fasta.total_sequence_length(dd.get_ref_file(data))
genome_size = "" if options.find("-g") > -1 else "-g %s" % genome_size
paired = "-f BAMPE" if bam.is_paired(chip_bam) else ""
with utils.chdir(out_dir):
cmd = _macs2_cmd(method)
cmd += peaksettings
try:
do.run(cmd.format(**locals()), "macs2 for %s" % name)
utils.move_safe(macs2_file, out_file)
@@ -37,7 +47,7 @@ def run(name, chip_bam, input_bam, genome_build, out_dir, method, resources, dat
"You can add specific options for the sample "
"setting resources as explained in docs: "
"https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#sample-specific-resources")
_compres_bdg_files(out_dir)
_compress_bdg_files(out_dir)
return _get_output_files(out_dir)
def _get_output_files(out_dir):
@@ -52,7 +62,7 @@ def _get_output_files(out_dir):
break
return {"main": peaks, "macs2": fns}
def _compres_bdg_files(out_dir):
def _compress_bdg_files(out_dir):
for fn in glob.glob(os.path.join(out_dir, "*bdg")):
cmd = "gzip %s" % fn
do.run(cmd, "compress bdg file: %s" % fn)
@@ -61,11 +71,19 @@ def _macs2_cmd(method="chip"):
"""Main command for macs2 tool."""
if method.lower() == "chip":
cmd = ("{macs2} callpeak -t {chip_bam} -c {input_bam} {paired} "
" {genome_size} -n {name} -B {options}")
" {genome_size} -n {name} --bdg {options}")
elif method.lower() == "atac":
cmd = ("{macs2} callpeak -t {chip_bam} --nomodel "
" {paired} {genome_size} -n {name} -B {options}"
" {paired} {genome_size} -n {name} --bdg {options}"
" --nolambda --keep-dup all")
else:
raise ValueError("chip_method should be chip or atac.")
return cmd
def select_peak_parameters(antibody):
if antibody.peaktype == "broad":
return " --broad --broad-cutoff 0.05"
elif antibody.peaktype == "narrow":
return ""
else:
raise ValueError(f"{antibody.peaktype} not recognized.")
@@ -8,7 +8,7 @@ try:
except ImportError:
from IPython.parallel import require
from bcbio import heterogeneity, hla, chipseq, structural, upload
from bcbio import heterogeneity, hla, chipseq, structural, upload, utils
from bcbio.bam import callable
from bcbio.rnaseq import (sailfish, rapmap, salmon, umi, kallisto, spikein,
bcbiornaseq)
@@ -17,6 +17,7 @@ from bcbio.ngsalign import alignprep
from bcbio.srna import sample as srna
from bcbio.srna import group as seqcluster
from bcbio.chipseq import peaks
from bcbio.wgbsseq import cpg_caller, deduplication, trimming
from bcbio.pipeline import (archive, config_utils, disambiguate, sample,
qcsummary, shared, variation, run_info, rnaseq)
from bcbio.provenance import system
@@ -29,9 +30,10 @@ from bcbio.log import logger, setup_local_logging
@contextlib.contextmanager
def _setup_logging(args):
# Set environment to standard to use periods for decimals and avoid localization
os.environ["LC_ALL"] = "C"
os.environ["LC"] = "C"
os.environ["LANG"] = "C"
locale_to_use = utils.get_locale()
os.environ["LC_ALL"] = locale_to_use
os.environ["LC"] = locale_to_use
os.environ["LANG"] = locale_to_use
config = None
if len(args) == 1 and isinstance(args[0], (list, tuple)):
args = args[0]
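The hardcoded `"C"` locale broke UTF-8 handling; the replacement asks a helper for a UTF-8-capable locale. The real `bcbio.utils.get_locale` may differ from this; a minimal sketch of the idea:

```python
import subprocess

def get_locale():
    """Prefer a UTF-8-capable C locale, falling back to plain C."""
    try:
        available = subprocess.check_output(["locale", "-a"], encoding="utf-8",
                                            errors="replace")
    except (OSError, subprocess.CalledProcessError):
        return "C"
    candidates = set(available.split())
    for cand in ("C.UTF-8", "C.utf8", "en_US.UTF-8", "en_US.utf8"):
        if cand in candidates:
            return cand
    return "C"
```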
@@ -103,12 +105,21 @@ def trim_srna_sample(*args):
with _setup_logging(args) as config:
return ipython.zip_args(apply(srna.trim_srna_sample, *args))
@require(trimming)
def trim_bs_sample(*args):
args = ipython.unzip_args(args)
with _setup_logging(args):
return ipython.zip_args(apply(trimming.trim, *args))
@require(srna)
def srna_annotation(*args):
args = ipython.unzip_args(args)
with _setup_logging(args) as config:
return ipython.zip_args(apply(srna.sample_annotation, *args))
@require(seqcluster)
def seqcluster_prepare(*args):
args = ipython.unzip_args(args)
@@ -133,6 +144,35 @@ def peakcalling(* args):
with _setup_logging(args) as config:
return ipython.zip_args(apply(peaks.calling, *args))
@require(cpg_caller)
def cpg_calling(*args):
args = ipython.unzip_args(args)
with _setup_logging(args) as config:
return ipython.zip_args(apply(cpg_caller.calling, *args))
@require(cpg_caller)
def cpg_processing(*args):
args = ipython.unzip_args(args)
with _setup_logging(args) as config:
return ipython.zip_args(apply(cpg_caller.cpg_postprocessing, *args))
@require(cpg_caller)
def cpg_stats(*args):
args = ipython.unzip_args(args)
with _setup_logging(args) as config:
return ipython.zip_args(apply(cpg_caller.cpg_stats, *args))
@require(deduplication)
def deduplicate_bismark(*args):
args = ipython.unzip_args(args)
with _setup_logging(args):
return ipython.zip_args(apply(deduplication.dedup_bismark, *args))
@require(sailfish)
def run_sailfish(*args):
args = ipython.unzip_args(args)
@@ -223,6 +263,12 @@ def run_salmon_bam(*args):
with _setup_logging(args):
return ipython.zip_args(apply(salmon.run_salmon_bam, *args))
@require(salmon)
def run_salmon_decoy(*args):
args = ipython.unzip_args(args)
with _setup_logging(args):
return ipython.zip_args(apply(salmon.run_salmon_decoy, *args))
@require(salmon)
def run_salmon_reads(*args):
args = ipython.unzip_args(args)
......
@@ -5,6 +5,7 @@ from bcbio.bam import callable
from bcbio.srna import sample as srna
from bcbio.srna import group as seqcluster
from bcbio.chipseq import peaks
from bcbio.wgbsseq import cpg_caller, deduplication, trimming
from bcbio.cwl import create as cwl_create
from bcbio.cwl import cwlutils
from bcbio.rnaseq import (sailfish, rapmap, salmon, umi, kallisto, spikein,
@@ -58,6 +59,10 @@ def run_kallisto_index(*args):
def run_kallisto_rnaseq(*args):
return kallisto.run_kallisto_rnaseq(*args)
@utils.map_wrap
def run_salmon_decoy(*args):
return salmon.run_salmon_decoy(*args)
@utils.map_wrap
def run_salmon_reads(*args):
return salmon.run_salmon_reads(*args)
@@ -175,6 +180,32 @@ def srna_alignment(*args):
def peakcalling(*args):
return peaks.calling(*args)
@utils.map_wrap
def trim_bs_sample(*args):
return trimming.trim(*args)
@utils.map_wrap
def cpg_calling(*args):
return cpg_caller.calling(*args)
@utils.map_wrap
def cpg_processing(*args):
return cpg_caller.cpg_postprocessing(*args)
@utils.map_wrap
def cpg_stats(*args):
return cpg_caller.cpg_stats(*args)
@utils.map_wrap
def deduplicate_bismark(*args):
return deduplication.dedup_bismark(*args)
@utils.map_wrap
def prep_align_inputs(*args):
return alignprep.create_inputs(*args)
......
@@ -14,7 +14,7 @@ import six
import toolz as tz
import yaml
from bcbio import log, utils, setpath
from bcbio import log, utils, setpath, utils
from bcbio.log import logger
from bcbio.cwl import cwlutils
from bcbio.distributed import multitasks
@@ -24,9 +24,10 @@ def process(args):
"""Run the function in args.name given arguments in args.argfile.
"""
# Set environment to standard to use periods for decimals and avoid localization
os.environ["LC_ALL"] = "C"
os.environ["LC"] = "C"
os.environ["LANG"] = "C"
locale_to_use = utils.get_locale()
os.environ["LC_ALL"] = locale_to_use
os.environ["LC"] = locale_to_use
os.environ["LANG"] = locale_to_use
setpath.prepend_bcbiopath()
try:
fn = getattr(multitasks, args.name)
......
@@ -78,7 +78,7 @@ def _run_bubbletree(vcf_csv, cnv_csv, data, wide_lrr=False, do_plots=True,
with open(r_file, "w") as out_handle:
out_handle.write(_script.format(**locals()))
if not utils.file_exists(freqs_out):
cmd = "%s && %s --no-environ %s" % (utils.get_R_exports(), utils.Rscript_cmd(), r_file)
cmd = "%s && %s --vanilla %s" % (utils.get_R_exports(), utils.Rscript_cmd(), r_file)
try:
do.run(cmd, "Assess heterogeneity with BubbleTree")
except subprocess.CalledProcessError as msg:
......
@@ -25,7 +25,7 @@ def run(data):
"""
hlas = []
for hla_fq in tz.get_in(["hla", "fastq"], data, []):
hla_type = re.search("[.-](?P<hlatype>HLA-[\w-]+).fq", hla_fq).group("hlatype")
hla_type = re.search(r"[.-](?P<hlatype>HLA-[\w-]+).fq", hla_fq).group("hlatype")
if hla_type in SUPPORTED_HLAS:
if utils.file_exists(hla_fq):
hlas.append((hla_type, hla_fq))
......
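The raw-string prefix above keeps backslashes literal, avoiding the invalid escape sequence DeprecationWarning that Python 3.6+ emits for patterns like `\w`. A quick check:

```python
import re

# r"..." passes \w through to the regex engine without Python first
# interpreting (and warning about) the backslash escape.
pat = re.compile(r"[.-](?P<hlatype>HLA-[\w-]+).fq")
assert pat.search("sample.HLA-A.fq").group("hlatype") == "HLA-A"
```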
@@ -13,7 +13,7 @@ def parse_dirname(fc_dir):
name = None
date = None
for p in parts:
if p.endswith(("XX", "xx", "XY", "X2")):
if p.endswith(("XX", "xx", "XY", "X2", "X3")):
name = p
elif len(p) == 6:
try:
......
@@ -106,7 +106,7 @@ def _find_unprocessed(config):
def _get_directories(config):
for directory in config["dump_directories"]:
for dname in sorted(glob.glob(os.path.join(directory, "*[Aa]*[Xx][XxYy2]"))):
for dname in sorted(glob.glob(os.path.join(directory, "*[Aa]*[Xx][XxYy23]"))):
if os.path.isdir(dname):
yield dname
......
@@ -231,7 +231,7 @@ def _update_bcbiovm():
"""Update or install a local bcbiovm install with tools and dependencies.
"""
print("## CWL support with bcbio-vm")
python_env = "python=3"
python_env = "python=3.6"
conda_bin, env_name = _add_environment("bcbiovm", python_env)
channels = _get_conda_channels(conda_bin)
base_cmd = [conda_bin, "install", "--yes", "--name", env_name] + channels
@@ -588,7 +588,7 @@ def _install_kraken_db(datadir, args):
db = os.path.join(kraken, base)
tooldir = args.tooldir or get_defaults()["tooldir"]
requests.packages.urllib3.disable_warnings()
last_mod = urllib.request.urlopen(url).info().getheader('Last-Modified')
last_mod = urllib.request.urlopen(url).info().get('Last-Modified')
last_mod = dateutil.parser.parse(last_mod).astimezone(dateutil.tz.tzutc())
if os.path.exists(os.path.join(tooldir, "bin", "kraken")):
if not os.path.exists(db):
@@ -710,7 +710,7 @@ def _datatarget_defaults(args, default_args):
val = None
if x == "data":
val = "gemini"
elif x in ["cadd", "dbnsfp", "dbscsnv", "kraken", "gnomad"]:
elif x in ["dbnsfp", "dbscsnv", "kraken", "gnomad"]:
val = x
if val and val not in default_data:
default_data.append(val)
@@ -768,7 +768,7 @@ def add_subparser(subparsers):
action="append", default=[], type=_check_toolplus)
parser.add_argument("--datatarget", help="Data to install. Allows customization or install of extra data.",
action="append", default=[],
choices=["variation", "rnaseq", "smallrna", "gemini", "cadd", "vep", "dbnsfp", "dbscsnv", "battenberg", "kraken", "ericscript", "gnomad"])
choices=["variation", "rnaseq", "smallrna", "gemini", "vep", "dbnsfp", "dbscsnv", "battenberg", "kraken", "ericscript", "gnomad"])
parser.add_argument("--genomes", help="Genomes to download",
action="append", default=[], choices=SUPPORTED_GENOMES)
parser.add_argument("--aligners", help="Aligner indexes to download",
......
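The `getheader` to `get` swap reflects the Python 3 API: `urlopen(...).info()` returns an `email.message.Message`-style object whose header accessor is `get` (`getheader` was the Python 2 httplib spelling). A minimal sketch with an illustrative URL:

```python
import urllib.request

# .info() returns an email.message.Message in Python 3; read headers
# with .get() instead of the removed getheader().
info = urllib.request.urlopen("https://example.org").info()
print(info.get("Last-Modified"))
```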
import os
import glob
import sys
import shutil
import pysam
from bcbio.pipeline import config_utils
from bcbio.distributed.transaction import file_transaction, tx_tmpdir
from bcbio.utils import (safe_makedir, file_exists)
from bcbio.provenance import do
from bcbio import utils
from bcbio.log import logger
from bcbio.pipeline import datadict as dd
from bcbio import bam
from bcbio import broad
def align(fastq_file, pair_file, ref_file, names, align_dir, data):
assert data["analysis"].lower().startswith("wgbs-seq"), "No compatible alignment."
config = data["config"]
sample = dd.get_sample_name(data)
out_prefix = os.path.join(align_dir, dd.get_lane(data))
out_dir = os.path.join(align_dir, "%s_bismark" % dd.get_lane(data))
if not ref_file:
logger.error("bismark index not found. We don't provide the STAR indexes "
"by default because they are very large. You can install "
"the index for your genome with: bcbio_nextgen.py upgrade "
"--aligners bismark --genomes genome-build-name --data")
sys.exit(1)
final_out = os.path.join(align_dir, "{0}.bam".format(sample))
if file_exists(final_out):
data = dd.set_work_bam(data, final_out)
data["bam_report"] = glob.glob(os.path.join(out_dir, "*report.txt"))[0]
return data
bismark = config_utils.get_program("bismark", config)
fastq_files = " ".join([fastq_file, pair_file]) if pair_file else fastq_file
num_cores = dd.get_num_cores(data)
n = 1 if num_cores < 5 else 2
safe_makedir(align_dir)
cmd = "{bismark} --bowtie2 --temp_dir {tx_out_dir} --gzip --multicore {n} -o {tx_out_dir} --unmapped {ref_file} {fastq_file}"
if pair_file:
fastq_file = "-1 %s -2 %s" % (fastq_file, pair_file)
raw_bam = glob.glob(out_dir + "/*bismark*bt2*bam")
if not raw_bam:
with tx_tmpdir() as tx_out_dir:
run_message = "Running Bismark aligner on %s and %s" % (fastq_file, ref_file)
do.run(cmd.format(**locals()), run_message, None)
shutil.move(tx_out_dir, out_dir)
raw_bam = glob.glob(out_dir + "/*bismark*bt2*bam")
process_bam = _process_bam(raw_bam[0], fastq_files, sample, dd.get_sam_ref(data), config)
utils.symlink_plus(process_bam, final_out)
data = dd.set_work_bam(data, final_out)
data["bam_report"] = glob.glob(os.path.join(out_dir, "*report.txt"))[0]
return data
def _process_bam(bam_file, in_fastq, sample, reference, config):
broad_runner = broad.runner_from_config(config)
names = {'rg': in_fastq, 'library': 'WGBS_LIB', 'pl': 'Illumina', 'pu': 'R1', 'sm': in_fastq, 'sample': sample}
out_fix_bam = broad_runner.run_fn("picard_fix_rgs", bam_file, names)
order_bam = utils.append_stem(out_fix_bam, "_order")
broad_runner.run_fn("picard_reorder", out_fix_bam, reference, order_bam)
bam.index(order_bam, config)
# order_bam = _set_quality(order_bam)
# bam.index(order_bam, config)
return order_bam
def remap_index_fn(ref_file):
"""Map sequence references to equivalent bismark indexes
"""
return os.path.join(os.path.dirname(os.path.dirname(ref_file)), "bismark")
def _set_quality(in_bam):
"""
change all quality to 255
"""
bam = pysam.AlignmentFile(in_bam, "rb")
out_file = utils.append_stem(in_bam, "_normqual")
if file_exists(out_file):
return out_file
with file_transaction(out_file) as tx_out:
with pysam.AlignmentFile(tx_out, "wb", template=bam) as out_handle:
for read in bam.fetch():
read.mapping_quality = 255
out_handle.write(read)
return out_file
def index(ref_file, out_dir, data):
"""Create a bismark index in the defined reference directory.
"""
(ref_dir, local_file) = os.path.split(ref_file)
gtf_file = dd.get_gtf_file(data)
bismark = config_utils.find_program("bismark", data["config"])
if not utils.file_exists(gtf_file):
raise ValueError("%s not found, could not create a star index." % (gtf_file))
if not utils.file_exists(out_dir):
with tx_tmpdir(data, os.path.dirname(out_dir)) as tx_out_dir:
num_cores = dd.get_cores(data)
cmd = "{bismark} --bowtie2 -p {num_cores} -n 1 -o {tx_out_dir} --basename {sample} --unmapped {ref_file} {in_fastq}"
do.run(cmd.format(**locals()), "Index STAR")
if os.path.exists(out_dir):
shutil.rmtree(out_dir)
shutil.move(tx_out_dir, out_dir)
return out_dir
import os
from bcbio.pipeline import config_utils
from bcbio.distributed.transaction import file_transaction, tx_tmpdir
from bcbio.utils import (safe_makedir, file_exists)
from bcbio.provenance import do
from bcbio import utils
from bcbio.log import logger
from bcbio.pipeline import datadict as dd
from bcbio.ngsalign import postalign
from bcbio import bam
from bcbio import broad
def align(fastq_file, pair_file, ref_file, names, align_dir, data):
assert data["analysis"].lower().startswith("wgbs-seq"), "No compatible alignment"
config = data["config"]
sample = dd.get_sample_name(data)
out_prefix = os.path.join(align_dir, dd.get_lane(data))
ref_file = dd.get_sam_ref(data)
final_out = os.path.join(align_dir, "{0}.bam".format(sample))
if file_exists(final_out):
data = dd.set_work_bam(data, final_out)
return data
bsmap = config_utils.get_program("bsmap", config)
fastq_files = " -a %s" % fastq_file
num_cores = dd.get_num_cores(data)
num_cores = "-p %d" % num_cores
safe_makedir(align_dir)
cmd = "{bsmap} {num_cores} -w 100 -v 0.07 -m 10 -x 300 -o {tx_out_bam} -d {ref_file} {fastq_files}"
if pair_file:
fastq_files = "-a %s -b %s" % (fastq_file, pair_file)
if not file_exists(final_out):
with file_transaction(final_out) as tx_out_bam:
run_message = "Running BSMAP aligner on %s and %s" % (fastq_file, ref_file)
do.run(cmd.format(**locals()), run_message, None)
data = dd.set_work_bam(data, final_out)
return data
def _process_bam(bam_file, in_fastq, sample, reference, config):
broad_runner = broad.runner_from_config(config)
names = {'rg': in_fastq, 'library': 'WGBS_LIB', 'pl': 'Illumina', 'pu': 'R1', 'sm': in_fastq, 'sample': sample}
out_fix_bam = broad_runner.run_fn("picard_fix_rgs", bam_file, names)
order_bam = utils.append_stem(out_fix_bam, "_order")
broad_runner.run_fn("picard_reorder", out_fix_bam, reference, order_bam)
bam.index(order_bam, config)
# order_bam = _set_quality(order_bam)
# bam.index(order_bam, config)
return order_bam
@@ -185,6 +185,29 @@ def _estimate_fgbio_defaults(avg_coverage):
out["--min-reads"] = 1
return out
def correct_umis(data):
"""Correct umis against the whitelist in correct_umi_file
http://fulcrumgenomics.github.io/fgbio/tools/latest/CorrectUmis.html
"""
input_bam = dd.get_work_bam(data)
output_bam = os.path.join(utils.safe_makedir(os.path.join(os.getcwd(),
"align", dd.get_sample_name(data))),
"%s-umis_corrected%s" % utils.splitext_plus(os.path.basename(input_bam)))
jvm_opts = _get_fgbio_jvm_opts(data, os.path.dirname(output_bam), 2)
# Improve speeds by avoiding compression read/write bottlenecks
io_opts = "--async-io=true --compression=0"
umis_whitelist = tz.get_in(["config", "algorithm", "correct_umis"], data)
umi_method, umi_tag = _check_umi_type(input_bam)
cmd = ("unset JAVA_HOME && "
"fgbio {jvm_opts} {io_opts} CorrectUmis "
"-t {umi_tag} -m 3 -d 1 -x "
"-U {umis_whitelist} "
"-i {input_bam} -o {output_bam}")
do.run(cmd.format(**locals()), "Correcting UMIs")
return output_bam
def umi_consensus(data):
"""Convert UMI grouped reads into fastq pair for re-alignment.
"""
......
@@ -55,6 +55,8 @@ def align(fastq_file, pair_file, ref_file, names, align_dir, data):
fastq_files = (" ".join([_unpack_fastq(fastq_file), _unpack_fastq(pair_file)])
if pair_file else _unpack_fastq(fastq_file))
num_cores = dd.get_num_cores(data)
gtf_file = dd.get_transcriptome_gtf(data)
if not gtf_file:
gtf_file = dd.get_gtf_file(data)
if ref_file.endswith("chrLength"):
ref_file = os.path.dirname(ref_file)
@@ -75,6 +77,14 @@ def align(fastq_file, pair_file, ref_file, names, align_dir, data):
cmd += _add_sj_index_commands(fastq_file, ref_file, gtf_file) if not srna else ""
cmd += _read_group_option(names)
if dd.get_fusion_caller(data):
if "arriba" in dd.get_fusion_caller(data):
cmd += (
"--chimSegmentMin 10 --chimOutType WithinBAM SoftClip Junctions "
"--chimJunctionOverhangMin 10 --chimScoreMin 1 --chimScoreDropMax 30 "
"--chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 "
"--alignSJstitchMismatchNmax 5 -1 5 5 "
"--chimSegmentReadGapMax 3 ")
else:
cmd += (" --chimSegmentMin 12 --chimJunctionOverhangMin 12 "
"--chimScoreDropMax 30 --chimSegmentReadGapMax 5 "
"--chimScoreSeparation 5 ")
@@ -141,8 +151,11 @@ def _update_data(align_file, out_dir, names, data):
data = dd.set_transcriptome_bam(data, transcriptome_file)
sjfile = get_splicejunction_file(out_dir, data)
if sjfile:
data = dd.set_starjunction(data, sjfile)
sjbed = junction2bed(sjfile)
data = dd.set_junction_bed(data, sjbed)
sjchimfile = get_chimericjunction_file(out_dir, data)
data = dd.set_chimericjunction(data, sjchimfile)
return data
def _move_transcriptome_file(out_dir, names):
@@ -204,12 +217,23 @@ def get_star_version(data):
version = line.split("STAR_")[1].strip()
return version
def get_chimericjunction_file(out_dir, data):
"""
locate the chimeric splice junction file starting from the alignment directory
"""
samplename = dd.get_sample_name(data)
sjfile = os.path.join(out_dir, os.pardir, f"{samplename}Chimeric.out.junction")
if file_exists(sjfile):
return sjfile
else:
return None
def get_splicejunction_file(out_dir, data):
"""
locate the splicejunction file starting from the alignment directory
"""
samplename = dd.get_sample_name(data)
sjfile = os.path.join(out_dir, os.pardir, "{0}SJ.out.tab").format(samplename)
sjfile = os.path.join(out_dir, os.pardir, f"{samplename}SJ.out.tab")
if file_exists(sjfile):
return sjfile
else:
......