Skip to content
GitLab
Explore
Sign in
Register
Commits on Source (2)
New upstream version 1.1.7
· b50e6f19
Steffen Möller
authored
Oct 19, 2019
b50e6f19
New upstream version 1.1.8
· 53ab0439
Steffen Möller
authored
Nov 26, 2019
53ab0439
Show whitespace changes
Inline
Side-by-side
CODE_OF_CONDUCT.md
0 → 100644
View file @
53ab0439
# Contributor Covenant Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
*
Using welcoming and inclusive language
*
Being respectful of differing viewpoints and experiences
*
Gracefully accepting constructive criticism
*
Focusing on what is best for the community
*
Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
*
The use of sexualized language or imagery and unwelcome sexual attention or
advances
*
Trolling, insulting/derogatory comments, and personal or political attacks
*
Public or private harassment
*
Publishing others' private information, such as a physical or electronic
address, without explicit permission
*
Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at chapmanb@fastmail.com. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the
[
Contributor Covenant
][
homepage
]
, version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[
homepage
]:
https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
HISTORY.md
View file @
53ab0439
## 1.1.8
-
Add
`antibody`
configuration option. Setting a specific antibody for ChIP-seq will use appropriate
settings for that antibody. See the documentation for supported antibodies.
-
Add
`use_lowfreq_filter`
for forcing vardict to report variants with low allelic frequency,
useful for calling somatic variants in panels with high coverage.
-
Fix for checking for pre-existing inputs with python3.
-
Add
`keep_duplicates`
option for ChIP/ATAC-seq which does not remove duplicates before peak calling.
Defaults to False.
-
Add
`keep_multimappers`
for ChIP/ATAC-seq which does not remove multimappers before peak calling.
Defaults to False.
-
Remove ethnicity as a required column in PED files.
## 1.1.7 (10 October 2019)
=======
-
hot fix for dataclasses not being supported in 3.6. Use namedtuple instead.
## 1.1.6 (10 October 2019)
-
GATK ApplyBQSRSpark: avoid StreamClosed issue with GATK 4.1+
-
RNA-seq: fixes for cufflinks preparation due to python3 transition.
-
RNA-seq: output count tables from tximport for genes and transcripts. These
are in
`bcbioRNASeq/results/date/genes/counts`
and
`bcbioRNASeq/results/data/transcripts/counts`
.
-
qualimap (RNA-seq): disable stranded mode for qualimap, as it gives incorrect
results with the hisat2 aligner and for RNA-seq just setting it to unstranded
-
Add
`quantify_genome_alignments`
option to use genome alignments to quantify
with Salmon.
-
Add
`--validateMappings`
flag to Salmon read quantification mode.
-
VEP cache is not installing anymore from bcbio run
-
Add support for Salmon SA method when STAR alignments are not available
(for hg38).
-
Add support for the new read model for filtering in Mutect2. This is
experimental, and a little flaky, so it can optionally be turned on via:
`tools_on: mutect2_readmodel`
. Thanks to @lbeltrame for implementing this
feature and doing a ton of work debugging.
-
Swap pandas
`from_csv`
call to
`read_csv`
.
-
Make STAR respect the
`transcriptome_gtf`
option.
-
Prefix regular expression with r. Thanks to @smoe for finding all of these.
-
Add informative logging messages at beginning of bcbio run. Includes the version
and the configuration files being used.
-
Swap samtools mpileup to use bcftools mpileup as samtools mpileup is being
deprecated (https://github.com/samtools/samtools/releases/tag/1.9).
-
Ensure locale is set to one supporting UTF-8 bcbio-wide. This may need to get
reverted if it introduces issues.
-
Added hg38 support for STAR. We did this by taking hg38 and removing the alts,
decoys and HLA sequences.
-
Added support for the arriba fusion caller.
-
Added back missing programs from the version provenance file. Fixed formatting
problems introduced by switch to python3.
-
Added initial support for whole genome bisulfite sequencing using bismark. Thanks to
@hackdna for implementing this and @jnhutchinson for drafting the initial
pipeline. This is a work in progress in collaboration with @gcampanella, who
has a similar implementation with some extra features that we will be merging
in soon.
-
qualimap for RNA-seq runs on the downsampled BAM files by default. Set
`tools_on: [qualimap_full]`
to run on the full BAM files.
-
Add STAR junction files to the files captured at the end of a run.
## 1.1.5 (12 April 2019)
-
Fixes for Python3 incompatibilities on distributed IPython runs.
...
...
README.rst
View file @
53ab0439
...
...
@@ -123,6 +123,7 @@ Contributors
- `Roman Valls`_, Science for Life Laboratory, Stockholm
- `Kevin Ying`_, Garvan Institute of Medical Research, Sydney, Australia
- `Vlad Saveliev`_, Center for Algorithmic Biotechnology, St. Petersburg University
- `Sergey Naumenko`_, Harvard Chan Bioinformatics Core
.. _Miika Ahdesmaki: https://github.com/mjafin
...
...
@@ -146,6 +147,7 @@ Contributors
.. _Matt Edwards: https://github.com/matted
.. _Saket Choudhary: https://github.com/saketkc
.. _Vlad Saveliev: https://github.com/vladsaveliev
.. _Sergey Naumenko: https://github.com/naumenko-sa
License
-------
...
...
artwork/logo.png
View replaced file @
4016ff05
View file @
53ab0439
14.9 KiB
|
W:
|
H:
17.3 KiB
|
W:
|
H:
2-up
Swipe
Onion skin
bcbio/bam/__init__.py
View file @
53ab0439
...
...
@@ -514,7 +514,7 @@ def estimate_max_mapq(in_bam, nreads=1e6):
"""
Guess maximum MAPQ in a BAM file of reads with alignments
"""
with
pysam
.
Samfile
(
in_bam
,
"
rb
"
)
as
work_bam
:
reads
=
tz
.
take
(
nreads
,
work_bam
)
reads
=
tz
.
take
(
int
(
nreads
)
,
work_bam
)
return
max
([
x
.
mapq
for
x
in
reads
if
not
x
.
is_unmapped
])
def
convert_cufflinks_mapq
(
in_bam
,
out_bam
=
None
):
...
...
@@ -561,3 +561,21 @@ def convert_invalid_mapq(in_bam, out_bam=None):
read
.
mapq
=
VALIDMAPQ
out_bam_fh
.
write
(
read
)
return
out_bam
def
remove_duplicates
(
in_bam
,
data
):
"""
remove duplicates from a duplicate marked BAM file
"""
base
,
ext
=
os
.
path
.
splitext
(
in_bam
)
out_bam
=
base
+
"
-noduplicates
"
+
ext
if
utils
.
file_exists
(
out_bam
):
return
out_bam
num_cores
=
dd
.
get_num_cores
(
data
)
sambamba
=
config_utils
.
get_program
(
"
sambamba
"
,
data
)
with
file_transaction
(
out_bam
)
as
tx_out_bam
:
cmd
=
(
f
'
{
sambamba
}
view -h --nthreads
{
num_cores
}
-f bam -F
"
not duplicate
"
'
f
'
{
in_bam
}
>
{
tx_out_bam
}
'
)
message
=
f
"
Removing duplicates from
{
in_bam
}
, saving as
{
out_bam
}
.
"
do
.
run
(
cmd
,
message
)
index
(
out_bam
,
dd
.
get_config
(
data
))
return
out_bam
bcbio/bam/coverage.py
View file @
53ab0439
...
...
@@ -41,7 +41,7 @@ def _calc_regional_coverage(in_bam, chrom, start, end, samplename, work_dir, dat
"
{bedtools} coverage -a {region_file} -b - -d > {tx_tmp_file}
"
)
do
.
run
(
cmd
.
format
(
**
locals
()),
"
Plotting coverage for %s %s
"
%
(
samplename
,
coords
))
names
=
[
"
chom
"
,
"
start
"
,
"
end
"
,
"
offset
"
,
"
coverage
"
]
df
=
pd
.
io
.
parsers
.
read_
table
(
tx_tmp_file
,
sep
=
"
\t
"
,
header
=
None
,
df
=
pd
.
io
.
parsers
.
read_
csv
(
tx_tmp_file
,
sep
=
"
\t
"
,
header
=
None
,
names
=
names
).
dropna
()
os
.
remove
(
tx_tmp_file
)
df
[
"
sample
"
]
=
samplename
...
...
bcbio/bam/fastq.py
View file @
53ab0439
...
...
@@ -221,8 +221,9 @@ def is_fastq(in_file, bzip=True):
else
:
return
False
def
downsample
(
f1
,
f2
,
data
,
N
,
quick
=
False
):
"""
get N random headers from a fastq file without reading the
def
downsample
(
f1
,
f2
,
N
,
quick
=
False
):
"""
Get N random headers from a fastq file without reading the
whole thing into memory
modified from: http://www.biostars.org/p/6544/
quick=True will just grab the first N reads rather than do a true
...
...
@@ -231,9 +232,9 @@ def downsample(f1, f2, data, N, quick=False):
if
quick
:
rand_records
=
range
(
N
)
else
:
records
=
sum
(
1
for
_
in
open
(
f1
))
/
4
records
=
int
(
sum
(
1
for
_
in
open
_possible_gzip
(
f1
))
/
4
)
N
=
records
if
N
>
records
else
N
rand_records
=
sorted
(
random
.
sample
(
x
range
(
records
),
N
))
rand_records
=
sorted
(
random
.
sample
(
range
(
records
),
N
))
fh1
=
open_possible_gzip
(
f1
)
fh2
=
open_possible_gzip
(
f2
)
if
f2
else
None
...
...
@@ -275,6 +276,7 @@ def downsample(f1, f2, data, N, quick=False):
return
outf1
,
outf2
def
estimate_read_length
(
fastq_file
,
quality_format
=
"
fastq-sanger
"
,
nreads
=
1000
):
"""
estimate average read length of a fastq file
...
...
bcbio/broad/__init__.py
View file @
53ab0439
...
...
@@ -72,7 +72,7 @@ def _clean_java_out(version_str):
Java will report information like _JAVA_OPTIONS environmental variables in the output.
"""
out
=
[]
for
line
in
version_str
.
decode
().
split
(
"
\n
"
):
for
line
in
version_str
.
split
(
"
\n
"
):
if
line
.
startswith
(
"
Picked up
"
):
pass
if
line
.
find
(
"
setlocale
"
)
>
0
:
...
...
@@ -94,7 +94,8 @@ def get_gatk_version(gatk_jar=None, config=None):
cl
=
gatk_cmd
(
"
gatk
"
,
[
"
-Xms128m
"
,
"
-Xmx256m
"
]
+
get_default_jvm_opts
(),
[
"
-version
"
],
config
=
config
)
with
closing
(
subprocess
.
Popen
(
cl
,
stdout
=
subprocess
.
PIPE
,
stderr
=
subprocess
.
STDOUT
,
shell
=
True
).
stdout
)
as
stdout
:
out
=
_clean_java_out
(
stdout
.
read
().
strip
())
stdout
=
stdout
.
read
().
decode
().
strip
()
out
=
_clean_java_out
(
stdout
)
# Historical GATK version (2.4) and newer versions (4.1.0.0)
# have a flag in front of output version
version
=
out
...
...
@@ -113,7 +114,8 @@ def get_mutect_version(mutect_jar):
"""
cl
=
[
"
java
"
,
"
-Xms128m
"
,
"
-Xmx256m
"
]
+
get_default_jvm_opts
()
+
[
"
-jar
"
,
mutect_jar
,
"
-h
"
]
with
closing
(
subprocess
.
Popen
(
cl
,
stdout
=
subprocess
.
PIPE
,
stderr
=
subprocess
.
STDOUT
).
stdout
)
as
stdout
:
if
"
SomaticIndelDetector
"
in
stdout
.
read
().
strip
():
stdout
=
stdout
.
read
().
decode
().
strip
()
if
"
SomaticIndelDetector
"
in
stdout
:
mutect_type
=
"
-appistry
"
else
:
mutect_type
=
""
...
...
@@ -243,7 +245,7 @@ class BroadRunner:
cl
+=
[
"
--version
"
]
p
=
subprocess
.
Popen
(
cl
,
stdout
=
subprocess
.
PIPE
,
stderr
=
subprocess
.
STDOUT
)
# fix for issue #494
pat
=
re
.
compile
(
'
([\d|\.]*)(\(\d*\)$)
'
)
# matches '1.96(1510)'
pat
=
re
.
compile
(
r
'
([\d|\.]*)(\(\d*\)$)
'
)
# matches '1.96(1510)'
m
=
pat
.
search
(
p
.
stdout
.
read
())
version
=
m
.
group
(
1
)
self
.
_picard_version
=
version
...
...
@@ -258,7 +260,7 @@ class BroadRunner:
return
True
else
:
try
:
stdout
=
subprocess
.
check_output
(
cmd
,
stderr
=
subprocess
.
STDOUT
,
shell
=
True
)
stdout
=
subprocess
.
check_output
(
cmd
,
stderr
=
subprocess
.
STDOUT
,
shell
=
True
,
encoding
=
"
UTF-8
"
)
return
stdout
.
find
(
"
GATK jar file not found
"
)
==
-
1
except
subprocess
.
CalledProcessError
:
return
False
...
...
bcbio/chipseq/__init__.py
View file @
53ab0439
import
os
import
subprocess
import
sys
import
toolz
as
tz
from
bcbio
import
utils
...
...
@@ -9,11 +10,33 @@ from bcbio.ngsalign import bowtie2, bwa
from
bcbio.distributed.transaction
import
file_transaction
from
bcbio.provenance
import
do
from
bcbio.log
import
logger
from
bcbio.heterogeneity.chromhacks
import
get_mitochondrial_chroms
def
clean_chipseq_alignment
(
data
):
# lcr_bed = utils.get_in(data, ("genome_resources", "variation", "lcr"))
work_bam
=
dd
.
get_work_bam
(
data
)
clean_bam
=
remove_nonassembled_chrom
(
work_bam
,
data
)
if
not
dd
.
get_keep_multimapped
(
data
):
clean_bam
=
remove_multimappers
(
clean_bam
,
data
)
if
not
dd
.
get_keep_duplicates
(
data
):
clean_bam
=
bam
.
remove_duplicates
(
clean_bam
,
data
)
data
[
"
work_bam
"
]
=
clean_bam
encode_bed
=
tz
.
get_in
([
"
genome_resources
"
,
"
variation
"
,
"
encode_blacklist
"
],
data
)
if
encode_bed
:
data
[
"
work_bam
"
]
=
remove_blacklist_regions
(
dd
.
get_work_bam
(
data
),
encode_bed
,
data
[
'
config
'
])
bam
.
index
(
data
[
"
work_bam
"
],
data
[
'
config
'
])
try
:
data
[
"
bigwig
"
]
=
_normalized_bam_coverage
(
dd
.
get_sample_name
(
data
),
dd
.
get_work_bam
(
data
),
data
)
except
subprocess
.
CalledProcessError
:
logger
.
warning
(
f
"
{
dd
.
get_work_bam
(
data
)
}
was too sparse to normalize,
"
f
"
falling back to non-normalized coverage.
"
)
data
[
"
bigwig
"
]
=
_bam_coverage
(
dd
.
get_sample_name
(
data
),
dd
.
get_work_bam
(
data
),
data
)
return
[[
data
]]
def
remove_multimappers
(
bam_file
,
data
):
aligner
=
dd
.
get_aligner
(
data
)
data
[
"
align_bam
"
]
=
dd
.
get_work_bam
(
data
)
if
dd
.
get_mark_duplicates
(
data
):
if
aligner
:
if
aligner
==
"
bowtie2
"
:
filterer
=
bowtie2
.
filter_multimappers
...
...
@@ -22,24 +45,19 @@ def clean_chipseq_alignment(data):
else
:
logger
.
error
(
"
ChIP-seq only supported for bowtie2 and bwa.
"
)
sys
.
exit
(
-
1
)
unique_bam
=
filterer
(
dd
.
get_work_bam
(
data
),
data
)
data
[
"
work_bam
"
]
=
unique_bam
unique_bam
=
filterer
(
bam_file
,
data
)
else
:
logger
.
info
(
"
Warning: When BAM file is given as input, bcbio skips multimappers removal.
"
"
If BAM is not cleaned for peak calling, can result in downstream errors.
"
)
# lcr_bed = utils.get_in(data, ("genome_resources", "variation", "lcr"))
data
[
"
work_bam
"
]
=
_keep_assembled_chrom
(
dd
.
get_work_bam
(
data
),
dd
.
get_ref_file
(
data
),
data
[
"
config
"
])
encode_bed
=
tz
.
get_in
([
"
genome_resources
"
,
"
variation
"
,
"
encode_blacklist
"
],
data
)
if
encode_bed
:
data
[
"
work_bam
"
]
=
_prepare_bam
(
dd
.
get_work_bam
(
data
),
encode_bed
,
data
[
'
config
'
])
bam
.
index
(
data
[
"
work_bam
"
],
data
[
'
config
'
])
data
[
"
bigwig
"
]
=
_bam_coverage
(
dd
.
get_sample_name
(
data
),
dd
.
get_work_bam
(
data
),
data
)
return
[[
data
]]
unique_bam
=
bam_file
logger
.
warn
(
"
When a BAM file is given as input, bcbio skips removal of
"
"
multimappers.
"
)
logger
.
warn
(
"
ChIP/ATAC-seq usually requires duplicate marking, but it was disabled.
"
)
return
unique_bam
def
_keep_assembled_chrom
(
bam_file
,
genome
,
config
):
"""
Remove contigs from the BAM file
"""
fai
=
"
%s.fai
"
%
genome
def
remove_nonassembled_chrom
(
bam_file
,
data
):
"""
Remove non-assembled contigs from the BAM file
"""
ref_file
=
dd
.
get_ref_file
(
data
)
config
=
dd
.
get_config
(
data
)
fai
=
"
%s.fai
"
%
ref_file
chrom
=
[]
with
open
(
fai
)
as
inh
:
for
line
in
inh
:
...
...
@@ -56,9 +74,8 @@ def _keep_assembled_chrom(bam_file, genome, config):
bam
.
index
(
out_file
,
config
)
return
out_file
def
_prepare_bam
(
bam_file
,
bed_file
,
config
):
"""
Remove regions from bed files
"""
def
remove_blacklist_regions
(
bam_file
,
bed_file
,
config
):
"""
Remove blacklist regions from a BAM file
"""
if
not
bam_file
or
not
bed_file
:
return
bam_file
out_file
=
utils
.
append_stem
(
bam_file
,
'
_filter
'
)
...
...
@@ -71,7 +88,31 @@ def _prepare_bam(bam_file, bed_file, config):
def
_bam_coverage
(
name
,
bam_input
,
data
):
"""
Run bamCoverage from deeptools
"""
cmd
=
(
"
{bam_coverage} -b {bam_input} -o {bw_output}
"
cmd
=
(
"
{bam_coverage} --bam {bam_input} --outFileName {bw_output}
"
"
--binSize 20 --effectiveGenomeSize {size}
"
"
--smoothLength 60 --extendReads 150 --centerReads -p {cores}
"
)
size
=
bam
.
fasta
.
total_sequence_length
(
dd
.
get_ref_file
(
data
))
cores
=
dd
.
get_num_cores
(
data
)
try
:
bam_coverage
=
config_utils
.
get_program
(
"
bamCoverage
"
,
data
)
except
config_utils
.
CmdNotFound
:
logger
.
info
(
"
No bamCoverage found, skipping bamCoverage.
"
)
return
None
resources
=
config_utils
.
get_resources
(
"
bamCoverage
"
,
data
[
"
config
"
])
if
resources
:
options
=
resources
.
get
(
"
options
"
)
if
options
:
cmd
+=
"
%s
"
%
"
"
.
join
([
str
(
x
)
for
x
in
options
])
bw_output
=
os
.
path
.
join
(
os
.
path
.
dirname
(
bam_input
),
"
%s.bw
"
%
name
)
if
utils
.
file_exists
(
bw_output
):
return
bw_output
with
file_transaction
(
bw_output
)
as
out_tx
:
do
.
run
(
cmd
.
format
(
**
locals
()),
"
Run bamCoverage in %s
"
%
name
)
return
bw_output
def
_normalized_bam_coverage
(
name
,
bam_input
,
data
):
"""
Run bamCoverage from deeptools but produce normalized bigWig files
"""
cmd
=
(
"
{bam_coverage} --bam {bam_input} --outFileName {bw_output}
"
"
--binSize 20 --effectiveGenomeSize {size}
"
"
--smoothLength 60 --extendReads 150 --centerReads -p {cores}
"
)
size
=
bam
.
fasta
.
total_sequence_length
(
dd
.
get_ref_file
(
data
))
...
...
@@ -81,6 +122,12 @@ def _bam_coverage(name, bam_input, data):
except
config_utils
.
CmdNotFound
:
logger
.
info
(
"
No bamCoverage found, skipping bamCoverage.
"
)
return
None
method
=
dd
.
get_chip_method
(
data
)
cmd
+=
"
--normalizeUsing CPM
"
toignore
=
get_mitochondrial_chroms
(
data
)
if
toignore
:
ignorenormflag
=
f
"
--ignoreForNormalization
{
'
'
.
join
(
toignore
)
}
"
cmd
+=
ignorenormflag
resources
=
config_utils
.
get_resources
(
"
bamCoverage
"
,
data
[
"
config
"
])
if
resources
:
options
=
resources
.
get
(
"
options
"
)
...
...
@@ -92,3 +139,35 @@ def _bam_coverage(name, bam_input, data):
with
file_transaction
(
bw_output
)
as
out_tx
:
do
.
run
(
cmd
.
format
(
**
locals
()),
"
Run bamCoverage in %s
"
%
name
)
return
bw_output
def
_compute_deeptools_matrix
(
data
):
pass
def
extract_NF_regions
(
data
):
"""
extract the nucleosome free regions from the work_bam. These regions will
be < 100 bases
"""
MAX_FRAG_LENGTH
=
100
sieve
=
config_utils
.
get_program
(
"
alignmentSieve
"
,
data
)
work_bam
=
dd
.
get_work_bam
(
data
)
num_cores
=
dd
.
get_num_cores
(
data
)
out_file
=
os
.
path
.
splitext
(
work_bam
)[
0
]
+
"
-NF.bam
"
log_file
=
os
.
path
.
splitext
(
work_bam
)[
0
]
+
"
-NF.log
"
if
file_exists
(
out_file
):
data
[
"
NF_bam
"
]
=
out_file
return
data
with
file_transaction
(
out_file
)
as
tx_out_file
,
\
file_transaction
(
log_file
)
as
tx_log_file
:
tx_unsorted_bam
=
tx_out_file
+
"
.unsorted
"
cmd
=
(
f
"
{
sieve
}
--bam $
{
work_bam
}
--outFile
{
tx_unsorted_bam
}
--ATACshift
"
f
"
--numberOfProcessors
{
num_cores
}
--maxFragmentLength
{
MAX_FRAG_LENGTH
}
"
f
"
--minMappingQuality 10
"
f
"
--filterMetrics
{
tx_log_file
}
"
)
do
.
run
(
cmd
,
"
Extract NF regions from {work_bam} to {tx_unsorted_bam}.
"
)
tx_out_file
=
bam
.
sort
(
tx_unsorted_bam
)
data
[
"
NF_bam
"
]
=
out_file
return
data
bcbio/chipseq/antibodies.py
0 → 100644
View file @
53ab0439
#from dataclasses import dataclass
from
collections
import
namedtuple
VALID_PEAKTYPES
=
[
"
narrow
"
,
"
broad
"
]
# @dataclass
# class Antibody:
# """
# ChIP-seq antibody
# """
# name: str
# # call narrow or broad peaks
# peaktype: str
# # remove duplicates?
# rmdup: bool = True
# def __post_init__(self):
# if self.peaktype not in VALID_PEAKTYPES:
# raise TypeError(f"peaktype {self.peatktype} is not one of {VALID_PEAKTYPES}")
Antibody
=
namedtuple
(
'
Antibody
'
,
'
name peaktype rmdup
'
)
_ANTIBODIES
=
[
Antibody
(
"
h3f3a
"
,
"
broad
"
,
True
),
Antibody
(
"
h3k27me3
"
,
"
broad
"
,
True
),
Antibody
(
"
h3k36me3
"
,
"
broad
"
,
True
),
Antibody
(
"
h3k4me1
"
,
"
broad
"
,
True
),
Antibody
(
"
h3k79me2
"
,
"
broad
"
,
True
),
Antibody
(
"
h3k79me3
"
,
"
broad
"
,
True
),
Antibody
(
"
h3k9me1
"
,
"
broad
"
,
True
),
Antibody
(
"
h3k9me2
"
,
"
broad
"
,
True
),
Antibody
(
"
h4k20me1
"
,
"
broad
"
,
True
),
Antibody
(
"
h2afz
"
,
"
narrow
"
,
True
),
Antibody
(
"
h3ac
"
,
"
narrow
"
,
True
),
Antibody
(
"
h3k27ac
"
,
"
narrow
"
,
True
),
Antibody
(
"
h3k4me2
"
,
"
narrow
"
,
True
),
Antibody
(
"
h3k4me3
"
,
"
narrow
"
,
True
),
Antibody
(
"
h3k9ac
"
,
"
narrow
"
,
True
),
Antibody
(
"
h3k9me3
"
,
"
broad
"
,
False
)
]
ANTIBODIES
=
{
x
.
name
:
x
for
x
in
_ANTIBODIES
}
SUPPORTED_ANTIBODIES
=
{
x
.
name
for
x
in
_ANTIBODIES
}
bcbio/chipseq/macs2.py
View file @
53ab0439
...
...
@@ -7,6 +7,8 @@ from bcbio.provenance import do
from
bcbio.pipeline
import
config_utils
from
bcbio.pipeline
import
datadict
as
dd
from
bcbio
import
bam
from
bcbio.chipseq
import
antibodies
from
bcbio.log
import
logger
def
run
(
name
,
chip_bam
,
input_bam
,
genome_build
,
out_dir
,
method
,
resources
,
data
):
"""
...
...
@@ -18,15 +20,22 @@ def run(name, chip_bam, input_bam, genome_build, out_dir, method, resources, dat
out_file
=
os
.
path
.
join
(
out_dir
,
name
+
"
_peaks_macs2.xls
"
)
macs2_file
=
os
.
path
.
join
(
out_dir
,
name
+
"
_peaks.xls
"
)
if
utils
.
file_exists
(
out_file
):
_compres_bdg_files
(
out_dir
)
_compres
s
_bdg_files
(
out_dir
)
return
_get_output_files
(
out_dir
)
macs2
=
config_utils
.
get_program
(
"
macs2
"
,
config
)
antibody
=
antibodies
.
ANTIBODIES
.
get
(
dd
.
get_antibody
(
data
).
lower
(),
None
)
if
antibody
:
logger
.
info
(
f
"
{
antibody
.
name
}
specified, using
{
antibody
.
peaktype
}
peak settings.
"
)
peaksettings
=
select_peak_parameters
(
antibody
)
else
:
peaksettings
=
""
options
=
"
"
.
join
(
resources
.
get
(
"
macs2
"
,
{}).
get
(
"
options
"
,
""
))
genome_size
=
bam
.
fasta
.
total_sequence_length
(
dd
.
get_ref_file
(
data
))
genome_size
=
""
if
options
.
find
(
"
-g
"
)
>
-
1
else
"
-g %s
"
%
genome_size
paired
=
"
-f BAMPE
"
if
bam
.
is_paired
(
chip_bam
)
else
""
with
utils
.
chdir
(
out_dir
):
cmd
=
_macs2_cmd
(
method
)
cmd
=
_macs2_cmd
(
data
)
cmd
+=
peaksettings
try
:
do
.
run
(
cmd
.
format
(
**
locals
()),
"
macs2 for %s
"
%
name
)
utils
.
move_safe
(
macs2_file
,
out_file
)
...
...
@@ -37,7 +46,7 @@ def run(name, chip_bam, input_bam, genome_build, out_dir, method, resources, dat
"
You can add specific options for the sample
"
"
setting resources as explained in docs:
"
"
https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#sample-specific-resources
"
)
_compres_bdg_files
(
out_dir
)
_compres
s
_bdg_files
(
out_dir
)
return
_get_output_files
(
out_dir
)
def
_get_output_files
(
out_dir
):
...
...
@@ -52,20 +61,29 @@ def _get_output_files(out_dir):
break
return
{
"
main
"
:
peaks
,
"
macs2
"
:
fns
}
def
_compres_bdg_files
(
out_dir
):
def
_compres
s
_bdg_files
(
out_dir
):
for
fn
in
glob
.
glob
(
os
.
path
.
join
(
out_dir
,
"
*bdg
"
)):
cmd
=
"
gzip %s
"
%
fn
do
.
run
(
cmd
,
"
compress bdg file: %s
"
%
fn
)
def
_macs2_cmd
(
method
=
"
chip
"
):
def
_macs2_cmd
(
data
):
"""
Main command for macs2 tool.
"""
method
=
dd
.
get_chip_method
(
data
)
if
method
.
lower
()
==
"
chip
"
:
cmd
=
(
"
{macs2} callpeak -t {chip_bam} -c {input_bam} {paired}
"
"
{genome_size} -n {name} -
B
{options}
"
)
"
{genome_size} -n {name} -
-bdg
{options}
"
)
elif
method
.
lower
()
==
"
atac
"
:
cmd
=
(
"
{macs2} callpeak -t {chip_bam} --nomodel
"
"
{paired} {genome_size} -n {name} -
B
{options}
"
"
{paired} {genome_size} -n {name} -
-bdg
{options}
"
"
--nolambda --keep-dup all
"
)
else
:
raise
ValueError
(
"
chip_method should be chip or atac.
"
)
return
cmd
def
select_peak_parameters
(
antibody
):
if
antibody
.
peaktype
==
"
broad
"
:
return
"
--broad --broad-cutoff 0.05
"
elif
antibody
.
peaktype
==
"
narrow
"
:
return
""
else
:
raise
ValueError
(
f
"
{
antibody
.
peaktype
}
not recognized.
"
)
bcbio/chipseq/peaks.py
View file @
53ab0439
...
...
@@ -109,7 +109,7 @@ def greylisting(data):
"""
input_bam
=
data
.
get
(
"
work_bam_input
"
,
None
)
if
not
input_bam
:
logger
.
info
(
"
No input BAM file detected, skipping greylisting.
"
)
logger
.
info
(
"
No input
control
BAM file detected, skipping greylisting.
"
)
return
None
try
:
greylister
=
config_utils
.
get_program
(
"
chipseq-greylist
"
,
data
)
...
...
bcbio/distributed/ipythontasks.py
View file @
53ab0439
...
...
@@ -8,7 +8,7 @@ try:
except
ImportError
:
from
IPython.parallel
import
require
from
bcbio
import
heterogeneity
,
hla
,
chipseq
,
structural
,
upload
from
bcbio
import
heterogeneity
,
hla
,
chipseq
,
structural
,
upload
,
utils
from
bcbio.bam
import
callable
from
bcbio.rnaseq
import
(
sailfish
,
rapmap
,
salmon
,
umi
,
kallisto
,
spikein
,
bcbiornaseq
)
...
...
@@ -17,6 +17,7 @@ from bcbio.ngsalign import alignprep
from
bcbio.srna
import
sample
as
srna
from
bcbio.srna
import
group
as
seqcluster
from
bcbio.chipseq
import
peaks
from
bcbio.wgbsseq
import
cpg_caller
,
deduplication
,
trimming
from
bcbio.pipeline
import
(
archive
,
config_utils
,
disambiguate
,
sample
,
qcsummary
,
shared
,
variation
,
run_info
,
rnaseq
)
from
bcbio.provenance
import
system
...
...
@@ -29,9 +30,10 @@ from bcbio.log import logger, setup_local_logging
@contextlib.contextmanager
def
_setup_logging
(
args
):
# Set environment to standard to use periods for decimals and avoid localization
os
.
environ
[
"
LC_ALL
"
]
=
"
C
"
os
.
environ
[
"
LC
"
]
=
"
C
"
os
.
environ
[
"
LANG
"
]
=
"
C
"
locale_to_use
=
utils
.
get_locale
()
os
.
environ
[
"
LC_ALL
"
]
=
locale_to_use
os
.
environ
[
"
LC
"
]
=
locale_to_use
os
.
environ
[
"
LANG
"
]
=
locale_to_use
config
=
None
if
len
(
args
)
==
1
and
isinstance
(
args
[
0
],
(
list
,
tuple
)):
args
=
args
[
0
]
...
...
@@ -103,12 +105,21 @@ def trim_srna_sample(*args):
with
_setup_logging
(
args
)
as
config
:
return
ipython
.
zip_args
(
apply
(
srna
.
trim_srna_sample
,
*
args
))
@require
(
trimming
)
def
trim_bs_sample
(
*
args
):
args
=
ipython
.
unzip_args
(
args
)
with
_setup_logging
(
args
):
return
ipython
.
zip_args
(
apply
(
trimming
.
trim
,
*
args
))
@require
(
srna
)
def
srna_annotation
(
*
args
):
args
=
ipython
.
unzip_args
(
args
)
with
_setup_logging
(
args
)
as
config
:
return
ipython
.
zip_args
(
apply
(
srna
.
sample_annotation
,
*
args
))
@require
(
seqcluster
)
def
seqcluster_prepare
(
*
args
):
args
=
ipython
.
unzip_args
(
args
)
...
...
@@ -133,6 +144,35 @@ def peakcalling(* args):
with
_setup_logging
(
args
)
as
config
:
return
ipython
.
zip_args
(
apply
(
peaks
.
calling
,
*
args
))
@require
(
cpg_caller
)
def
cpg_calling
(
*
args
):
args
=
ipython
.
unzip_args
(
args
)
with
_setup_logging
(
args
)
as
config
:
return
ipython
.
zip_args
(
apply
(
cpg_caller
.
calling
,
*
args
))
@require
(
cpg_caller
)
def
cpg_processing
(
*
args
):
args
=
ipython
.
unzip_args
(
args
)
with
_setup_logging
(
args
)
as
config
:
return
ipython
.
zip_args
(
apply
(
cpg_caller
.
cpg_postprocessing
,
*
args
))
@require
(
cpg_caller
)
def
cpg_stats
(
*
args
):
args
=
ipython
.
unzip_args
(
args
)
with
_setup_logging
(
args
)
as
config
:
return
ipython
.
zip_args
(
apply
(
cpg_caller
.
cpg_stats
,
*
args
))
@require
(
deduplication
)
def
deduplicate_bismark
(
*
args
):
args
=
ipython
.
unzip_args
(
args
)
with
_setup_logging
(
args
):
return
ipython
.
zip_args
(
apply
(
deduplication
.
dedup_bismark
,
*
args
))
@require
(
sailfish
)
def
run_sailfish
(
*
args
):
args
=
ipython
.
unzip_args
(
args
)
...
...
@@ -223,6 +263,12 @@ def run_salmon_bam(*args):
with
_setup_logging
(
args
):
return
ipython
.
zip_args
(
apply
(
salmon
.
run_salmon_bam
,
*
args
))
@require
(
salmon
)
def
run_salmon_decoy
(
*
args
):
args
=
ipython
.
unzip_args
(
args
)
with
_setup_logging
(
args
):
return
ipython
.
zip_args
(
apply
(
salmon
.
run_salmon_decoy
,
*
args
))
@require
(
salmon
)
def
run_salmon_reads
(
*
args
):
args
=
ipython
.
unzip_args
(
args
)
...
...
bcbio/distributed/multitasks.py
View file @
53ab0439
...
...
@@ -5,6 +5,7 @@ from bcbio.bam import callable
from
bcbio.srna
import
sample
as
srna
from
bcbio.srna
import
group
as
seqcluster
from
bcbio.chipseq
import
peaks
from
bcbio.wgbsseq
import
cpg_caller
,
deduplication
,
trimming
from
bcbio.cwl
import
create
as
cwl_create
from
bcbio.cwl
import
cwlutils
from
bcbio.rnaseq
import
(
sailfish
,
rapmap
,
salmon
,
umi
,
kallisto
,
spikein
,
...
...
@@ -58,6 +59,10 @@ def run_kallisto_index(*args):
def
run_kallisto_rnaseq
(
*
args
):
return
kallisto
.
run_kallisto_rnaseq
(
*
args
)
@utils.map_wrap
def
run_salmon_decoy
(
*
args
):
return
salmon
.
run_salmon_decoy
(
*
args
)
@utils.map_wrap
def
run_salmon_reads
(
*
args
):
return
salmon
.
run_salmon_reads
(
*
args
)
...
...
@@ -175,6 +180,32 @@ def srna_alignment(*args):
def
peakcalling
(
*
args
):
return
peaks
.
calling
(
*
args
)
@utils.map_wrap
def
trim_bs_sample
(
*
args
):
return
trimming
.
trim
(
*
args
)
@utils.map_wrap
def
cpg_calling
(
*
args
):
return
cpg_caller
.
calling
(
*
args
)
@utils.map_wrap
def
cpg_processing
(
*
args
):
return
cpg_caller
.
cpg_postprocessing
(
*
args
)
@utils.map_wrap
def
cpg_stats
(
*
args
):
return
cpg_caller
.
cpg_stats
(
*
args
)
@utils.map_wrap
def
deduplicate_bismark
(
*
args
):
return
deduplication
.
dedup_bismark
(
*
args
)
@utils.map_wrap
def
prep_align_inputs
(
*
args
):
return
alignprep
.
create_inputs
(
*
args
)
...
...
bcbio/distributed/runfn.py
View file @
53ab0439
...
...
@@ -14,7 +14,7 @@ import six
import
toolz
as
tz
import
yaml
from
bcbio
import
log
,
utils
,
setpath
from
bcbio
import
log
,
utils
,
setpath
,
utils
from
bcbio.log
import
logger
from
bcbio.cwl
import
cwlutils
from
bcbio.distributed
import
multitasks
...
...
@@ -24,9 +24,10 @@ def process(args):
"""
Run the function in args.name given arguments in args.argfile.
"""
# Set environment to standard to use periods for decimals and avoid localization
os
.
environ
[
"
LC_ALL
"
]
=
"
C
"
os
.
environ
[
"
LC
"
]
=
"
C
"
os
.
environ
[
"
LANG
"
]
=
"
C
"
locale_to_use
=
utils
.
get_locale
()
os
.
environ
[
"
LC_ALL
"
]
=
locale_to_use
os
.
environ
[
"
LC
"
]
=
locale_to_use
os
.
environ
[
"
LANG
"
]
=
locale_to_use
setpath
.
prepend_bcbiopath
()
try
:
fn
=
getattr
(
multitasks
,
args
.
name
)
...
...
bcbio/heterogeneity/bubbletree.py
View file @
53ab0439
...
...
@@ -78,7 +78,7 @@ def _run_bubbletree(vcf_csv, cnv_csv, data, wide_lrr=False, do_plots=True,
with
open
(
r_file
,
"
w
"
)
as
out_handle
:
out_handle
.
write
(
_script
.
format
(
**
locals
()))
if
not
utils
.
file_exists
(
freqs_out
):
cmd
=
"
%s && %s --
no-environ
%s
"
%
(
utils
.
get_R_exports
(),
utils
.
Rscript_cmd
(),
r_file
)
cmd
=
"
%s && %s --
vanilla
%s
"
%
(
utils
.
get_R_exports
(),
utils
.
Rscript_cmd
(),
r_file
)
try
:
do
.
run
(
cmd
,
"
Assess heterogeneity with BubbleTree
"
)
except
subprocess
.
CalledProcessError
as
msg
:
...
...
bcbio/heterogeneity/chromhacks.py
View file @
53ab0439
...
...
@@ -7,6 +7,8 @@ import os
from
bcbio
import
utils
from
bcbio.distributed.transaction
import
file_transaction
from
bcbio.pipeline
import
datadict
as
dd
from
bcbio.bam
import
ref
def
is_autosomal
(
chrom
):
"""
Keep chromosomes that are a digit 1-22, or chr prefixed digit chr1-chr22
...
...
@@ -51,3 +53,8 @@ def bed_to_standardonly(in_file, data, headers=None, include_sex_chroms=False, o
if
checkfn
(
line
.
split
()[
0
])
or
(
headers
and
line
.
startswith
(
headers
)):
out_handle
.
write
(
line
)
return
out_file
def
get_mitochondrial_chroms
(
data
):
ref_file
=
dd
.
get_ref_file
(
data
)
mito
=
[
c
.
name
for
c
in
ref
.
file_contigs
(
ref_file
)
if
is_mitochondrial
(
c
.
name
)]
return
mito
bcbio/hla/optitype.py
View file @
53ab0439
...
...
@@ -25,7 +25,7 @@ def run(data):
"""
hlas
=
[]
for
hla_fq
in
tz
.
get_in
([
"
hla
"
,
"
fastq
"
],
data
,
[]):
hla_type
=
re
.
search
(
"
[.-](?P<hlatype>HLA-[\w-]+).fq
"
,
hla_fq
).
group
(
"
hlatype
"
)
hla_type
=
re
.
search
(
r
"
[.-](?P<hlatype>HLA-[\w-]+).fq
"
,
hla_fq
).
group
(
"
hlatype
"
)
if
hla_type
in
SUPPORTED_HLAS
:
if
utils
.
file_exists
(
hla_fq
):
hlas
.
append
((
hla_type
,
hla_fq
))
...
...
bcbio/illumina/flowcell.py
View file @
53ab0439
...
...
@@ -13,7 +13,7 @@ def parse_dirname(fc_dir):
name
=
None
date
=
None
for
p
in
parts
:
if
p
.
endswith
((
"
XX
"
,
"
xx
"
,
"
XY
"
,
"
X2
"
)):
if
p
.
endswith
((
"
XX
"
,
"
xx
"
,
"
XY
"
,
"
X2
"
,
"
X3
"
)):
name
=
p
elif
len
(
p
)
==
6
:
try
:
...
...
bcbio/illumina/machine.py
View file @
53ab0439
...
...
@@ -106,7 +106,7 @@ def _find_unprocessed(config):
def
_get_directories
(
config
):
for
directory
in
config
[
"
dump_directories
"
]:
for
dname
in
sorted
(
glob
.
glob
(
os
.
path
.
join
(
directory
,
"
*[Aa]*[Xx][XxYy2]
"
))):
for
dname
in
sorted
(
glob
.
glob
(
os
.
path
.
join
(
directory
,
"
*[Aa]*[Xx][XxYy2
3
]
"
))):
if
os
.
path
.
isdir
(
dname
):
yield
dname
...
...
Prev
1
2
3
4
5
…
7
Next