Copyright (c) 2013, Tao Liu lab at UB and Xiaole Shirley Liu lab at DFCI
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of Tao Liu lab nor Xiaole Shirley Liu lab nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL COPYRIGHT
HOLDERS AND CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
DAMAGE.
\ No newline at end of file
2019-09-19 Tao Liu <vladimir.liu@gmail.com>
MACS version 2.1.3.3
* Features added
1) Support Docker auto-deploy. PR #309
2) Support Travis CI auto-testing, update unit-testing
scripts, and enable subcommand testing on small datasets.
3) Update README documents. #297 PR #306
4) `cmbreps` supports more than 2 replicates. Merged from PR #304
@Maarten-vd-Sande and PR #307 (our own chi-sq test code)
5) `--d-min` option is added to `callpeak` and `predictd` to
exclude predicted fragment sizes smaller than the given
value. Merged from PR #267 @shouldsee.
6) `--buffer-size` option is added to the `predictd`, `filterdup`,
`pileup` and `refinepeak` subcommands. Users can use this option
to decrease memory usage when there are a large number of contigs
in the data. Also, `callpeak`, `predictd`, `filterdup`,
`pileup` and `refinepeak` will now suggest that users tweak
`--buffer-size` when a MemoryError is caught. #313 PR #314
* Bugs fixed
1) #265 Fixed a bug where the pseudocount had not been applied
while calculating the p-value score in the ScoreTrack object
(a sketch of the idea follows this list).
2) Fixed `bdgbroadcall` so that it reports broad peaks that contain
no strong peak inside, consistent with the behavior of `callpeak
--broad`.
3) Renamed COPYING to LICENSE.
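A minimal sketch of the pseudocount idea behind bug 1), assuming a
Poisson model for the pileup-versus-lambda score; the function name,
the cap value, and the use of scipy are illustrative assumptions, not
MACS2's actual implementation:

```python
# Illustrative only -- MACS2's real scoring lives in its ScoreTrack code.
from math import log10
from scipy.stats import poisson

def pscore(obs, exp, pseudocount=1.0):
    """-log10 Poisson p-value of observing >= obs reads given the
    expected count exp, with the pseudocount applied to BOTH values
    before scoring (the essence of the #265 fix)."""
    p = poisson.sf(obs + pseudocount - 1, exp + pseudocount)  # P(X >= obs')
    return -log10(p) if p > 0 else 3100.0  # cap underflowed p-values

# e.g. pscore(10, 2.5) computes -log10 P(X >= 11) under Poisson(3.5)
```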
2018-10-17 Tao Liu <vladimir.liu@gmail.com>
MACS version 2.1.2
......
# INSTALL Guide For MACS
Time-stamp: <2018-10-17 16:18:48 Tao Liu>
Time-stamp: <2019-09-20 15:08:35 taoliu>
Please check the following instructions to complete your installation.
## Prerequisites
Python version must be equal to *2.7* to run MACS. I recommend
using the version *2.7.9*.
using the version *2.7.15*.
[Numpy](http://www.scipy.org/Download) (>=1.6) is required to run MACS v2.
[Numpy](http://www.scipy.org/Download) (>=1.15) is required to run MACS v2.
GCC is required to compile `.c` code in the MACS v2 package, and Python
header files are needed. If you are using Mac OSX, I recommend you
install Xcode; if you are using Linux, you need to make sure
`python-dev` is installed.
[Cython](http://cython.org/) (>=0.18) is required *only if* you want to regenerate `.c`
files from `.pyx` files using `setup_w_cython.py` script.
[Cython](http://cython.org/) (>=0.25) is required to generate `.c`
files from `.pyx` files using `setup.py` script.
## Easy installation through PyPI
......@@ -28,19 +28,12 @@ access to system-wide numpy and scipy libraries so that you don't need
to install them again.
Then under the command line, type `pip install MACS2`. PyPI will
install Numpy and Scipy automatically if they are absent.
install Numpy and Cython automatically if they are absent.
To upgrade MACS2, type `pip install -U MACS2`. It will check the
currently installed MACS2, compare its version with the one in the PyPI
repository, and download and install the newer version when necessary.
Note: you may not want pip to fix dependencies. For example, if you
already have workable Scipy and Numpy, and during `pip install -U
MACS2` pip downloads the newest Scipy and Numpy but fails to compile
and install them, the whole installation will fail. In that case, pass
the `--no-deps` option to let pip skip all dependencies: type
`pip install -U --no-deps MACS2`.
## Install from source
MACS uses Python's distutils tools for source installations. To
......@@ -65,15 +58,6 @@ directory, use this command:
`$ python setup.py install --prefix /home/taoliu/`
If you want to re-generate `.c` files from `.pyx` files, you need
to install Cython first, then use the `setup_w_cython.py` script in
place of the `setup.py` script in the previous commands, such as:
`$ python setup_w_cython.py install`
or:
`$ python setup_w_cython.py install --prefix /home/taoliu/`
## Configure environment variables
......
BSD 3-Clause License
Copyright (c) 2019, Tao Liu lab at RPCCC and Xiaole Shirley Liu lab at DFCI
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
COPYING
ChangeLog
INSTALL.md
LICENSE
MANIFEST.in
README.md
setup.py
setup_w_cython.py
MACS2/Constants.py
MACS2/OptValidator.py
MACS2/OutputWriter.py
......
numpy>=1.6
numpy>=1.15
cython>=0.25
MACS_VERSION = "2.1.2"
MACS_VERSION = "2.1.3.3"
#MACSDIFF_VERSION = "1.0.4 20110212 (tag:alpha)"
FILTERDUP_VERSION = "1.0.0 20140616"
RANDSAMPLE_VERSION = "1.0.0 20120703"
......
# Time-stamp: <2016-02-12 00:39:32 Tao Liu>
# Time-stamp: <2019-09-20 11:37:21 taoliu>
"""Module for Feature IO classes.
Copyright (c) 2010,2011 Tao Liu <taoliu@jimmy.harvard.edu>
"""Module for BedGraph data class.
This code is free software; you can redistribute it and/or modify it
under the terms of the BSD License (see the file COPYING included
with the distribution).
@status: experimental
@version: $Revision$
@author: Tao Liu
@contact: taoliu@jimmy.harvard.edu
under the terms of the BSD License (see the file LICENSE included with
the distribution).
"""
# ------------------------------------
......@@ -22,7 +15,6 @@ import logging
from array import array
import numpy as np
np_convolve = np.convolve
from libc.math cimport sqrt
from libc.math cimport log
......@@ -31,7 +23,7 @@ from cpython cimport bool
from MACS2.Constants import *
from MACS2.IO.ScoreTrack import scoreTrackII,CombinedTwoTrack
from MACS2.IO.PeakIO import PeakIO, BroadPeakIO
from MACS2.Prob import chisq_logp_e
# ------------------------------------
# constants
......@@ -500,97 +492,6 @@ cdef class bedGraphTrackI:
)
return True
# def __close_peak2 (self, peak_content, peaks, int min_length, str chrom, int smoothlen=50):
# # this is where the summits are called, need to fix this
# end, start = peak_content[ -1 ][ 1 ], peak_content[ 0 ][ 0 ]
# if end - start < min_length: return # if the peak is too small, reject it
# #for (start,end,value,summitvalue,index) in peak_content:
# peakdata = np.zeros(end - start, dtype='float32')
# peakindices = np.zeros(end - start, dtype='int32')
# for (tmpstart,tmpend,tmpvalue,tmpsummitvalue, tmpindex) in peak_content:
# i, j = tmpstart-start, tmpend-start
# peakdata[i:j] = self.data[chrom]['sample'][tmpindex]
# peakindices[i:j] = tmpindex
# # apply smoothing window of tsize / 2
# w = np.ones(smoothlen, dtype='float32')
# smoothdata = np_convolve(w/w.sum(), peakdata, mode='same')
# # find maxima and minima
# local_extrema = np.where(np.diff(np.sign(np.diff(smoothdata))))[0]+1
# # get only maxima by requiring it be greater than the mean
# # might be better to take another derivative instead
# plateau_offsets = np.intersect1d(local_extrema,
# np.where(peakdata>peakdata.mean())[0])
# # sometimes peak summits are plateaus, so check for adjacent coordinates
# # and take the middle ones if needed
# if len(plateau_offsets)==0:
# #####################################################################
# # ***failsafe if no summits so far*** #
# summit_offset_groups = [[(end - start) / 2]] #
# #####################################################################
# elif len(plateau_offsets) == 1:
# summit_offset_groups = [[plateau_offsets[0]]]
# else:
# previous_offset = plateau_offsets[0]
# summit_offset_groups = [[previous_offset]]
# for offset in plateau_offsets:
# if offset == previous_offset + 1:
# summit_offset_groups[-1].append(offset)
# else:
# summit_offset_groups.append([offset])
# summit_offsets = []
# for offset_group in summit_offset_groups:
# summit_offsets.append(offset_group[len(offset_group) / 2])
# summit_indices = peakindices[summit_offsets]
# # also purge offsets that have the same summit_index
# unique_offsets = []
# summit_offsets = np.fromiter(summit_offsets, dtype='int32')
# for index in np.unique(summit_indices):
# those_index_indices = np.where(summit_indices == index)[0]
# those_offsets = summit_offsets[those_index_indices]
# unique_offsets.append(int(those_offsets.mean()))
# # also require a valley of at least 0.6 * taller peak
# # in every adjacent two peaks or discard the lesser one
# # this behavior is like PeakSplitter
# better_offsets = []
# previous_offset = unique_offsets.pop()
# while True:
# if len(unique_offsets) == 0:
# better_offsets.append(previous_offset)
# break
# else:
# this_offset = unique_offsets.pop()
# this_h, prev_h = peakdata[[this_offset, previous_offset]]
# if this_h > prev_h:
# prev_is_taller = False
# min_valley = 0.6 * this_h
# else:
# prev_is_taller = True
# min_valley = 0.6 * prev_h
# s = slice(this_offset, previous_offset)
# valley = np.where(peakdata[s] < min_valley)[0]
# if len(valley) > 0: better_offsets.append(previous_offset)
# else:
# if prev_is_taller: continue # discard this peak
# # else: discard previous peak by ignoring it
# previous_offset = this_offset
# better_offsets.reverse()
# better_indices = peakindices[better_offsets]
# assert len(better_offsets) > 0, "Lost peak summit(s) near %s %d" % (chrom, start)
# for summit_offset, summit_index in zip(better_offsets, better_indices):
# peaks.add( chrom,
# start,
# end,
# summit = start + summit_offset,
# peak_score = peakdata[summit_offset],
# pileup = 0,
# pscore = 0,
# fold_change = 0,
# qscore = 0,
# )
# # start a new peak
# return True
def call_broadpeaks (self, double lvl1_cutoff=500, double lvl2_cutoff=100, int min_length=200,
int lvl1_max_gap=50, int lvl2_max_gap=400):
"""This function try to find enriched regions within which,
......@@ -662,25 +563,36 @@ cdef class bedGraphTrackI:
start = lvl2peak["start"]
end = lvl2peak["end"]
if not lvl1peakset:
bpeaks.add(chrom, start, end, score=lvl2peak["score"], thickStart=".", thickEnd=".",
blockNum = 0, blockSizes = ".", blockStarts = ".")
#try:
# will complement by adding 1bp blocks at the start and end of this region
# may change in the future if the gappedPeak format is improved.
bpeaks.add(chrom, start, end, score=lvl2peak["score"], thickStart=str(start), thickEnd=str(end),
blockNum = 2, blockSizes = "1,1", blockStarts = "0,"+str(end-start-1), pileup = lvl2peak["pileup"],
pscore = lvl2peak["pscore"], fold_change = lvl2peak["fc"],
qscore = lvl2peak["qscore"] )
return bpeaks
thickStart = str(lvl1peakset[0]["start"])
thickEnd = str(lvl1peakset[-1]["end"])
blockNum = len(lvl1peakset)
blockSizes = ",".join( map(lambda x:str(x["length"]),lvl1peakset) )
blockStarts = ",".join( map(lambda x:str(x["start"]-start),lvl1peakset) )
#if lvl2peak["start"] != thickStart:
# # add 1bp mark for the start of lvl2 peak
# blockNum += 1
# blockSizes = "1,"+blockSizes
# blockStarts = "0,"+blockStarts
#if lvl2peak["end"] != thickEnd:
# # add 1bp mark for the end of lvl2 peak
# blockNum += 1
# blockSizes = blockSizes+",1"
# blockStarts = blockStarts+","+str(end-start-1)
# add 1bp left and/or right block if necessary
if int(thickStart) != start:
# add 1bp left block
thickStart = str(start)
blockNum += 1
blockSizes = "1,"+blockSizes
blockStarts = "0,"+blockStarts
if int(thickEnd) != end:
# add 1bp right block
thickEnd = str(end)
blockNum += 1
blockSizes = blockSizes+",1"
blockStarts = blockStarts+","+str(end-start-1)
bpeaks.add(chrom, start, end, score=lvl2peak["score"], thickStart=thickStart, thickEnd=thickEnd,
blockNum = blockNum, blockSizes = blockSizes, blockStarts = blockStarts)
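An illustrative sketch of the block bookkeeping above, assuming
sorted, non-overlapping strong (level-1) regions inside the broad
(level-2) region; `gapped_blocks` is a hypothetical helper, not part
of MACS2:

```python
def gapped_blocks(start, end, subpeaks):
    """Build gappedPeak blockNum/blockSizes/blockStarts for a broad
    region [start, end) holding sorted strong sub-regions `subpeaks`,
    padding with 1bp blocks when the first/last sub-region does not
    reach the broad region's edges."""
    sizes  = [e - s for s, e in subpeaks]
    starts = [s - start for s, e in subpeaks]
    if subpeaks[0][0] != start:           # add 1bp left block
        sizes, starts = [1] + sizes, [0] + starts
    if subpeaks[-1][1] != end:            # add 1bp right block
        sizes, starts = sizes + [1], starts + [end - start - 1]
    return len(sizes), ",".join(map(str, sizes)), ",".join(map(str, starts))

# A broad region 100..1000 with strong peaks at 300..400 and 600..800
# yields (4, "1,100,200,1", "0,200,500,899").
```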
......@@ -716,7 +628,7 @@ cdef class bedGraphTrackI:
ret.add_loc(chrom,0,max_p,new_value)
return ret
def overlie (self, bdgTrack2, func="max" ):
def overlie (self, bdgTracks, func="max" ):
"""Calculate two bedGraphTrackI objects by letting self
overlying bdgTrack2, with user-defined functions.
......@@ -744,6 +656,7 @@ cdef class bedGraphTrackI:
Then the given 'func' will be applied on the values from all tracks as func(#1,#2,...)
Supported 'func' are "max", "mean" and "fisher".
Return value is a bedGraphTrackI object.
"""
......@@ -752,66 +665,62 @@ cdef class bedGraphTrackI:
double v1, v2
str chrom
assert isinstance(bdgTrack2,bedGraphTrackI), "bdgTrack2 is not a bedGraphTrackI object"
nr_tracks = len(bdgTracks) + 1 # +1 for self
assert nr_tracks >= 2, "Specify at least two replicates"
for i, bdgTrack in enumerate(bdgTracks):
assert isinstance(bdgTrack, bedGraphTrackI), "bdgTrack{} is not a bedGraphTrackI object".format(i + 1)
if func == "max":
f = max
elif func == "mean":
f = lambda x, y: ( x + y ) / 2
#f = np_mean
def f(*args):
return sum(args)/nr_tracks
elif func == "fisher":
f = lambda p1, p2: ( p1 + p2 ) - log1p( ( p1 + p2 ) / LOG10_E ) * LOG10_E
# combine -log10pvalues
def f(*args):
# chisq statistics = sum(-log10p)/log10(e)*2, chisq df = 2*number_of_reps
return chisq_logp_e(2*sum(args)/LOG10_E, 2*nr_tracks, log10=True)
else:
raise Exception("Invalid function")
ret = bedGraphTrackI()
retadd = ret.add_loc
chr1 = set(self.get_chr_names())
chr2 = set(bdgTrack2.get_chr_names())
common_chr = chr1.intersection(chr2)
common_chr = set(self.get_chr_names())
for track in bdgTracks:
common_chr = common_chr.intersection(set(track.get_chr_names()))
for chrom in common_chr:
(p1s,v1s) = self.get_data_by_chr(chrom) # arrays for position and values
p1n = iter(p1s).next # assign the next function to a variable to speed up
v1n = iter(v1s).next
datas = [self.get_data_by_chr(chrom)]
datas.extend([bdgTracks[i].get_data_by_chr(chrom) for i in range(len(bdgTracks))])
(p2s,v2s) = bdgTrack2.get_data_by_chr(chrom) # arrays for position and values
p2n = iter(p2s).next # assign the next function to a variable to speed up
v2n = iter(v2s).next
ps, vs, pn, vn = [], [], [], []
for data in datas:
ps.append(data[0])
pn.append(iter(ps[-1]).next)
vs.append(data[1])
vn.append(iter(vs[-1]).next)
pre_p = 0 # remember the previous position in the new bedGraphTrackI object ret
try:
p1 = p1n()
v1 = v1n()
p2 = p2n()
v2 = v2n()
ps_cur = [pn[i]() for i in range(len(pn))]
vs_cur = [vn[i]() for i in range(len(pn))]
while True:
if p1 < p2:
# clip a region from pre_p to p1, then set pre_p as p1.
retadd(chrom,pre_p,p1,f(v1,v2))
pre_p = p1
# call for the next p1 and v1
p1 = p1n()
v1 = v1n()
elif p2 < p1:
# clip a region from pre_p to p2, then set pre_p as p2.
retadd(chrom,pre_p,p2,f(v1,v2))
pre_p = p2
# call for the next p2 and v2
p2 = p2n()
v2 = v2n()
elif p1 == p2:
# from pre_p to p1 or p2, then set pre_p as p1 or p2.
retadd(chrom,pre_p,p1,f(v1,v2))
pre_p = p1
# call for the next p1, v1, p2, v2.
p1 = p1n()
v1 = v1n()
p2 = p2n()
v2 = v2n()
# get the lowest position
lowest_p = min(ps_cur)
# at least one track ends at the lowest position; possibly several do
locations = [i for i in range(len(ps_cur)) if ps_cur[i] == lowest_p]
# emit the interval from pre_p up to the lowest position
ret.add_loc(chrom, pre_p, ps_cur[locations[0]], f(*vs_cur))
pre_p = ps_cur[locations[0]]
for index in locations:
ps_cur[index] = pn[index]()
vs_cur[index] = vn[index]()
except StopIteration:
# reached the end of one of the bedGraphTrackI objects; simply exit
pass
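The loop above is a k-way sweep over the per-track iterators: emit the
interval up to the smallest current end position with `f` applied
across the current values, then advance every iterator sitting at that
position. A minimal pure-Python sketch under those assumptions, with a
Fisher-method combiner matching the chi-square comment above
(`overlie_sketch` and `fisher_combine` are illustrative names, and
scipy is assumed for the chi-square survival function):

```python
from math import log10
from scipy.stats import chi2

LOG10_E = 0.43429448190325176  # log10(e)

def fisher_combine(*neglog10_ps):
    """Fisher's method on -log10 p-values: 2 * sum(-ln p) follows a
    chi-square distribution with df = 2 * number of tracks."""
    stat = 2 * sum(neglog10_ps) / LOG10_E        # convert log10 to ln
    return -log10(chi2.sf(stat, df=2 * len(neglog10_ps)))

def overlie_sketch(tracks, f=max):
    """tracks: one [(end, value), ...] list per replicate, all covering
    the same chromosome span. Yields (start, end, combined_value)."""
    its = [iter(t) for t in tracks]
    cur = [next(it) for it in its]               # current (end, value)
    pre_p = 0
    while True:
        lowest = min(p for p, v in cur)          # smallest current end
        yield pre_p, lowest, f(*[v for p, v in cur])
        pre_p = lowest
        try:
            for i, (p, v) in enumerate(cur):
                if p == lowest:                  # advance ended tracks
                    cur[i] = next(its[i])
        except StopIteration:
            return                               # one track is exhausted

# e.g. combining three replicate p-score tracks:
#   list(overlie_sketch([t1, t2, t3], f=fisher_combine))
```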
......@@ -1145,80 +1054,6 @@ cdef class bedGraphTrackI:
ret += "%.2f\t%d\t%d\t%.2f\n" % ( cutoff, cutoff_npeaks[ cutoff ], cutoff_lpeaks[ cutoff ], cutoff_lpeaks[ cutoff ]/cutoff_npeaks[ cutoff ] )
return ret
# def make_scoreTrack_for_macs2diff (self, bdgTrack2 ):
# """A modified overlie function for MACS v2 differential call.
# Return value is a bedGraphTrackI object.
# """
# cdef:
# int pre_p, p1, p2
# double v1, v2
# str chrom
# assert isinstance(bdgTrack2,bedGraphTrackI), "bdgTrack2 is not a bedGraphTrackI object"
# ret = CombinedTwoTrack()
# retadd = ret.add
# chr1 = set(self.get_chr_names())
# chr2 = set(bdgTrack2.get_chr_names())
# common_chr = chr1.intersection(chr2)
# for chrom in common_chr:
# (p1s,v1s) = self.get_data_by_chr(chrom) # arrays for position and values
# p1n = iter(p1s).next # assign the next function to a variable to speed up
# v1n = iter(v1s).next
# (p2s,v2s) = bdgTrack2.get_data_by_chr(chrom) # arrays for position and values
# p2n = iter(p2s).next # assign the next function to a variable to speed up
# v2n = iter(v2s).next
# chrom_max_len = len(p1s)+len(p2s) # this is the maximum number of locations needed to be recorded in scoreTrackI for this chromosome.
# ret.add_chromosome(chrom,chrom_max_len)
# pre_p = 0 # remember the previous position in the new bedGraphTrackI object ret
# try:
# p1 = p1n()
# v1 = v1n()
# p2 = p2n()
# v2 = v2n()
# while True:
# if p1 < p2:
# # clip a region from pre_p to p1, then set pre_p as p1.
# retadd( chrom, p1, v1, v2 )
# pre_p = p1
# # call for the next p1 and v1
# p1 = p1n()
# v1 = v1n()
# elif p2 < p1:
# # clip a region from pre_p to p2, then set pre_p as p2.
# retadd( chrom, p2, v1, v2 )
# pre_p = p2
# # call for the next p2 and v2
# p2 = p2n()
# v2 = v2n()
# elif p1 == p2:
# # from pre_p to p1 or p2, then set pre_p as p1 or p2.
# retadd( chrom, p1, v1, v2 )
# pre_p = p1
# # call for the next p1, v1, p2, v2.
# p1 = p1n()
# v1 = v1n()
# p2 = p2n()
# v2 = v2n()
# except StopIteration:
# # meet the end of either bedGraphTrackI, simply exit
# pass
# ret.finalize()
# #ret.merge_regions()
# return ret
def scoreTracktoBedGraph (scoretrack, str colname):
"""Produce a bedGraphTrackI object with certain column as scores.
......
# Time-stamp: <2015-03-13 12:50:14 Tao Liu>
# Time-stamp: <2019-09-20 11:37:42 taoliu>
"""Module Description: IO Module for bedGraph file
Copyright (c) 2011 Tao Liu <taoliu@jimmy.harvard.edu>
This code is free software; you can redistribute it and/or modify it
under the terms of the BSD License (see the file COPYING included with
under the terms of the BSD License (see the file LICENSE included with
the distribution).
@status: experimental
@version: $Revision$
@author: Tao Liu
@contact: taoliu@jimmy.harvard.edu
"""
# ------------------------------------
......
# Time-stamp: <2011-03-14 17:52:00 Tao Liu>
# Time-stamp: <2019-09-20 11:38:14 taoliu>
"""Module Description: BinKeeper for Wiggle-like tracks.
Copyright (c) 2008 Tao Liu <taoliu@jimmy.harvard.edu>
"""Module Description: BinKeeper for Wiggle-like tracks. (obsolete)
This code is free software; you can redistribute it and/or modify it
under the terms of the BSD License (see the file COPYING included with
under the terms of the BSD License (see the file LICENSE included with
the distribution).
@status: experimental
@version: $Revision$
@author: Tao Liu
@contact: taoliu@jimmy.harvard.edu
"""
# ------------------------------------
......
# Time-stamp: <2016-05-19 10:22:22 Tao Liu>
# Time-stamp: <2019-09-20 11:38:30 taoliu>
"""Module for Calculate Scores.
Copyright (c) 2013 Tao Liu <vladimir.liu@gmail.com>
This code is free software; you can redistribute it and/or modify it
under the terms of the BSD License (see the file COPYING included
with the distribution).
@status: experimental
@version: $Revision$
@author: Tao Liu
@contact: vladimir.liu@gmail.com
under the terms of the BSD License (see the file LICENSE included with
the distribution).
"""
# ------------------------------------
......
# Time-stamp: <2015-06-03 00:49:19 Tao Liu>
# Time-stamp: <2019-09-20 11:38:45 taoliu>
"""Module for FWTrack classes.
Copyright (c) 2010,2011 Tao Liu <taoliu@jimmy.harvard.edu>
This code is free software; you can redistribute it and/or modify it
under the terms of the BSD License (see the file COPYING included
with the distribution).
@status: experimental
@version: $Revision$
@author: Tao Liu
@contact: taoliu@jimmy.harvard.edu
under the terms of the BSD License (see the file LICENSE included with
the distribution).
"""
# ------------------------------------
......@@ -446,7 +439,7 @@ cdef class FWTrack:
return 0
@cython.boundscheck(False) # do not check that np indices are valid
cpdef filter_dup ( self, int32_t maxnum = -1):
cpdef unsigned long filter_dup ( self, int32_t maxnum = -1):
"""Filter the duplicated reads.
Run it right after you add all data into this object.
......@@ -462,7 +455,7 @@ cdef class FWTrack:
str k
np.ndarray[np.int32_t, ndim=1] plus, new_plus, minus, new_minus
if maxnum < 0: return # do nothing
if maxnum < 0: return self.total # do nothing
if not self.__sorted:
self.sort()
......@@ -558,83 +551,7 @@ cdef class FWTrack:
self.__locations[k]=[new_plus,new_minus]
self.length = self.fw * self.total
return
@cython.boundscheck(False) # do not check that np indices are valid
cpdef filter_dup_dryrun ( self, int32_t maxnum = -1):
"""Filter the duplicated reads. (dry run) only return number of remaining reads
Run it right after you add all data into this object.
Note, unlike `filter_dup`, this dry-run function does not remove
duplicates permanently; it only counts the reads that would remain
after filtering.
"""
cdef:
int p, m, n, current_loc, i_chrom, total
# index for old array, and index for new one
unsigned long i_old, size
str k
np.ndarray[np.int32_t, ndim=1] plus, minus
if maxnum < 0: return # do nothing
if not self.__sorted:
self.sort()
total = 0
chrnames = self.get_chr_names()
for i_chrom in range( len(chrnames) ):
# for each chromosome.
# This loop body is too big, I may need to split code later...
k = chrnames[ i_chrom ]
# + strand
plus = self.__locations[k][0]
size = plus.shape[0]
if len(plus) < 1:
pass
else:
total += 1
n = 1 # the number of tags in the current location
current_loc = plus[0]
for i_old in range( 1, size ):
p = plus[ i_old ]
if p == current_loc:
n += 1
if n <= maxnum:
total += 1
else:
logging.debug("Duplicate reads found at %s:%d at + strand" % (k,p) )
else:
current_loc = p
total += 1
n = 1
# - strand
minus = self.__locations[k][1]
size = minus.shape[0]
if len(minus) < 1:
pass
else:
total += 1
n = 1 # the number of tags in the current location
current_loc = minus[0]
for i_old in range( 1, size ):
p = minus[ i_old ]
if p == current_loc:
n += 1
if n <= maxnum:
total += 1
else:
logging.debug("Duplicate reads found at %s:%d at + strand" % (k,p) )
else:
current_loc = p
total += 1
n = 1
return total
return self.total
cpdef sample_percent (self, float percent, int seed = -1 ):
"""Sample the tags for a given percentage.
......
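For reference, a compact NumPy sketch of the duplicate-filtering
logic, assuming a sorted coordinate array for one strand of one
chromosome; `filter_dup_sketch` is an illustrative stand-in for the
Cython implementation and, like the new `filter_dup`, returns the
number of reads kept:

```python
import numpy as np

def filter_dup_sketch(locs, maxnum):
    """Keep at most `maxnum` reads per position in the sorted 1-D int
    array `locs`; return the filtered array and its size."""
    if maxnum < 0:
        return locs, len(locs)        # maxnum=-1 means no filtering
    keep = np.ones(len(locs), dtype=bool)
    n = 1                             # reads seen at the current position
    for i in range(1, len(locs)):
        n = n + 1 if locs[i] == locs[i - 1] else 1
        if n > maxnum:
            keep[i] = False           # drop duplicates beyond maxnum
    new = locs[keep]
    return new, len(new)

# e.g. filter_dup_sketch(np.array([5, 5, 5, 9]), 1) -> (array([5, 9]), 2)
```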
# Time-stamp: <2018-10-16 12:15:12 Tao Liu>
# Time-stamp: <2019-09-20 11:39:05 taoliu>
"""Module for filter duplicate tags from paired-end data
Copyright (c) 2010,2011 Tao Liu <taoliu@jimmy.harvard.edu>
This code is free software; you can redistribute it and/or modify it
under the terms of the BSD License (see the file COPYING included
with the distribution).
@status: experimental
@version: $Revision$
@author: Tao Liu
@contact: taoliu@jimmy.harvard.edu
under the terms of the BSD License (see the file LICENSE included with
the distribution).
"""
from MACS2.Constants import *
from logging import debug, info
import numpy as np
......@@ -357,7 +351,7 @@ cdef class PETrackI:
return
@cython.boundscheck(False) # do not check that np indices are valid
def filter_dup ( self, int maxnum=-1):
cpdef unsigned long filter_dup ( self, int maxnum=-1):
"""Filter the duplicated reads.
Run it right after you add all data into this object.
......@@ -371,7 +365,7 @@ cdef class PETrackI:
str k
np.ndarray locs, new_locs
if maxnum < 0: return # condition to return if not filtering
if maxnum < 0: return self.total # condition to return if not filtering
if not self.__sorted: self.sort()
......@@ -432,7 +426,7 @@ cdef class PETrackI:
self.__locations[k] = new_locs
self.average_template_length = float( self.length ) / self.total
return
return self.total
def sample_percent (self, float percent, int seed = -1):
"""Sample the tags for a given percentage.
......