Commits on Source (3)
mmseqs2 (9-d36de-1) UNRELEASED; urgency=medium
* Initial release (Closes: #<bug>)
* Initial release (Closes: #932369)
-- Shayan Doust <hello@shayandoust.me> Mon, 15 Jul 2019 13:45:03 +0100
......@@ -37,20 +37,6 @@ Copyright: 2019 Shayan Doust <hello@shayandoust.me>
License: GPLv3
License: GPLv3
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
License: LGPL
This version of the GNU Lesser General Public License incorporates
the terms and conditions of version 3 of the GNU General Public
License, supplemented by the additional permissions listed below.
......
# Code Style
## Braces
Braces are also used where they would be optional.
Braces are used with if, else, for, do and while statements, even when the body is empty or contains only a single statement.
For example, a preprocessor macro that could go wrong when leaving out braces:
```
#define MACRO test1; \
              test2;
if (true)
    MACRO
```
This would effectively get expanded to the following, where `test2` always runs:
```
if (true) {
    test1;
}
test2;
```
Keep the beginning brace in the same line as control structures, functions, etc.:
```
if (true) {
```
Please avoid:
```
if (true)
{
```
## Naming guideline
Write your names as descriptive and long as necessary, but also as short as possible.
Example: Use `weights` instead of `wg`. Do not unnecessarily expand the name to `sequenceWeights` if it's clear from the context that you are dealing with a sequence. However, if you are dealing with `profileWeights` and `sequenceWeights` in the same context, feel free to use longer names.
Only iterator variables are supposed to be one letter long (`i`, `j`, `k`).
### Class names
Class names are written in UpperCamelCase.
Example: `ClusterAlgorithm`
### Method names
Method names are written in lowerCamelCase.
Method names are typically verbs or verb phrases.
Example: `sendMessage`, `stop`, `clusterMethod`
### Constant names
Constant names use CONSTANT_CASE: all uppercase letters, with each word separated from the next by a single underscore.
Example: `CLUSTER_ALGORITHM`
### Non-constant field names
Non-constant field names (static or otherwise) are written in lowerCamelCase.
These names are typically nouns or noun phrases.
Example: `computedValues` or `index`
### Local variable names
Local variable names are written in lowerCamelCase.
Even when final and immutable, local variables are not considered to be constants, and should not be styled as constants.
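The naming rules above can be sketched in one small class. Everything in this block (`WeightAccumulator`, `MAX_WEIGHT_COUNT`, `addWeight`, `computeMean`) is a hypothetical name made up for illustration and is not part of the MMseqs2 codebase:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example, not part of MMseqs2.
// Constant name: CONSTANT_CASE.
static const size_t MAX_WEIGHT_COUNT = 1024;

// Class name: UpperCamelCase.
class WeightAccumulator {
public:
    WeightAccumulator() {
        weights.reserve(MAX_WEIGHT_COUNT);
    }

    // Method names: lowerCamelCase, verbs or verb phrases.
    void addWeight(float weight) {
        weights.push_back(weight);
    }

    float computeMean() const {
        // the caller has to add at least one weight first
        assert(!weights.empty());
        // Local variable: lowerCamelCase, not styled as a constant.
        float weightSum = 0.0f;
        // Iterator variables may be one letter long.
        for (size_t i = 0; i < weights.size(); ++i) {
            weightSum += weights[i];
        }
        return weightSum / static_cast<float>(weights.size());
    }

private:
    // Non-constant field name: lowerCamelCase, a noun.
    std::vector<float> weights;
};
```

After calling `addWeight(1.0f)` and `addWeight(3.0f)` on a `WeightAccumulator`, `computeMean()` returns `2.0f`.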
## Whitespace
Be generous with white spacing horizontally, but try to keep code compact vertically.
Here, a `_` character indicates where you should place a space character:
```
if_(int_name)_{
____int_test_=_1_+_(1_+_x);
}
```
Use empty lines to structure code in logical blocks or tasks.
# Programming Practices
## Logging errors or info messages
Do not use `printf`, `std::cout`, `std::cerr` (etc.) for printing messages. All output has to go through the `Debug` logging class.
## Error handling
We do not use exceptions in our code; they are disabled by a compile flag. We have two types of errors in MMseqs2.
### 1) Errors which stop the run completely
Write a descriptive error message with `Debug(Debug::ERROR)` and exit immediately with the `EXIT` macro.
Do not use `exit` directly. `EXIT` handles cleaning up remaining MPI instances (if compiled with MPI).
```
size_t written = write(dataFilefd, buffer, bufferSize);
if (written != bufferSize) {
    Debug(Debug::ERROR) << "Could not write to data file " << dataFileNames[0] << "\n";
    EXIT(EXIT_FAILURE);
}
```
### 2) Warnings which can be handled
Write to `Debug(Debug::WARNING)` and continue with the next loop iteration or whatever is appropriate.
```
if (std::remove(dataFileNames[i]) != 0) {
    Debug(Debug::WARNING) << "Could not remove file " << dataFileNames[i] << "\n";
}
```
## Parallel Computing
We use OpenMP to run multiple threads for parallel computing.
Do not use `pthreads` or `std::thread`.
The standard pattern for doing anything with OpenMP looks something like this:
```
// Declare only thread-safe stuff here
#pragma omp parallel
{
    // PER THREAD VARIABLE DECLARATION
    unsigned int threadIdx = 0;
#ifdef OPENMP
    threadIdx = (unsigned int) omp_get_thread_num();
#endif
    // sometimes you want schedule(static)
    #pragma omp for schedule(dynamic, 1)
    for (...) {
        // DO YOUR WORK
    }
    // CLEAN UP MEMORY IF NECESSARY
}
```
Try to avoid `#pragma omp critical` and `#pragma omp atomic`. Consider using atomic instructions instead (e.g. `__sync_fetch_and_add`).
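As a minimal sketch of this advice, the following function keeps a per-thread counter and merges it into the shared result with a single `__sync_fetch_and_add` per thread. The function `countEvenValues` is a hypothetical helper invented for this example, and the sketch assumes a compiler that provides the `__sync` builtins (GCC or Clang). Without OpenMP enabled, the pragmas are ignored and the code runs single-threaded with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of MMseqs2.
// Counts the even entries of `values` using all available threads.
size_t countEvenValues(const std::vector<int> &values) {
    const size_t valueCount = values.size();
    size_t evenCount = 0;
#pragma omp parallel
    {
        // per-thread counter, needs no synchronization inside the loop
        size_t localCount = 0;
#pragma omp for schedule(static)
        for (size_t i = 0; i < valueCount; ++i) {
            if (values[i] % 2 == 0) {
                localCount++;
            }
        }
        // one atomic add per thread instead of a critical section
        __sync_fetch_and_add(&evenCount, localCount);
    }
    assert(evenCount <= valueCount);
    return evenCount;
}
```

Accumulating locally and synchronizing once per thread keeps the shared variable out of the hot loop.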
## Advice on memory allocation
Allocate memory as early as possible. Try not to allocate memory on the heap inside your hot loops:
```
#pragma omp parallel
{
    // try to allocate once here
    char MEMORY[1024 * 1024 * 1024];
    // also for containers
    std::vector<int> results;
    results.reserve(1024);
    #pragma omp for schedule(static)
    for (...) {
        // not here
    }
}
```
# C++ Standard
Try to avoid using too many C++ features. MMseqs2 is written in a way that does not use many concepts from modern C++.
Generally, you have to support GCC 4.8; this is enforced by the continuous integration system.
It is more like C-style C++. We do use classes to organize code. Some STL functionality should be used: `std::string`, `std::vector`, and sometimes also `std::map` (careful!).
However, weigh any new C++ concept carefully and try to avoid them as much as possible.
Especially, do not use:
* `auto`
* streams (they can be extremely slow, instead use `std::string s; s.reserve(10000);` outside a loop and inside `s.append(...); s.clear();`)
* smart pointers (try to use RAII for allocation as much as possible)
* functional programming
* inheritance (think about it very carefully, it's usually a lot less useful than it appears)
You will still find some `std::stringstream` littered throughout our codebase. We are trying to progressively get rid of those and not to add any new ones.
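The `std::string` pattern recommended in place of streams can be sketched as follows. `writeHeaders` and its `output` sink are hypothetical names chosen for this example; in real code the buffer would typically be handed to whatever writes the data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of MMseqs2.
// Appends one ">name\n" line per entry to `output` and returns the number of bytes written.
size_t writeHeaders(const std::vector<std::string> &names, std::vector<char> &output) {
    // reserve once, outside of the loop
    std::string buffer;
    buffer.reserve(10000);
    size_t written = 0;
    for (size_t i = 0; i < names.size(); ++i) {
        buffer.append(1, '>');
        buffer.append(names[i]);
        buffer.append(1, '\n');
        output.insert(output.end(), buffer.begin(), buffer.end());
        written += buffer.size();
        // clear() resets the length; common standard library
        // implementations keep the reserved capacity for the next iteration
        buffer.clear();
        assert(buffer.empty());
    }
    return written;
}
```

For the names `seq1` and `seq22`, the function writes `>seq1\n>seq22\n` and returns 13.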
Some modern C++ features are very useful.
For example, `std::vector::emplace_back` can avoid memory allocations:
```
// two allocations
vector.push_back(Struct(1, 2, 3));
// one allocation
vector.emplace_back(1, 2, 3);
```
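To make the difference observable, the following sketch counts how often an element is copied or moved on its way into the vector. The struct `Hit` and both helper functions are hypothetical and exist only for this illustration. Note that the counter tracks constructor calls; whether a saved temporary also saves a heap allocation depends on what the struct's members allocate.

```cpp
#include <cassert>
#include <vector>

// Hypothetical struct, not part of MMseqs2; it counts copy and move constructions.
struct Hit {
    static int copiedOrMoved;
    int query;
    int target;
    int score;

    Hit(int queryId, int targetId, int scoreValue)
        : query(queryId), target(targetId), score(scoreValue) {}
    Hit(const Hit &other)
        : query(other.query), target(other.target), score(other.score) {
        copiedOrMoved++;
    }
    Hit(Hit &&other)
        : query(other.query), target(other.target), score(other.score) {
        copiedOrMoved++;
    }
};
int Hit::copiedOrMoved = 0;

// push_back: builds a temporary Hit, then moves it into the vector.
int countPushBackCopies() {
    Hit::copiedOrMoved = 0;
    std::vector<Hit> hits;
    hits.reserve(1); // rule out reallocation as a source of extra moves
    hits.push_back(Hit(1, 2, 3));
    assert(hits.size() == 1);
    return Hit::copiedOrMoved;
}

// emplace_back: constructs the Hit directly inside the vector's storage.
int countEmplaceBackCopies() {
    Hit::copiedOrMoved = 0;
    std::vector<Hit> hits;
    hits.reserve(1);
    hits.emplace_back(1, 2, 3);
    assert(hits.size() == 1);
    return Hit::copiedOrMoved;
}
```

`countPushBackCopies()` returns 1 and `countEmplaceBackCopies()` returns 0.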
# MMseqs2 specific advice
## Code reuse
Take a look at all the classes in the `src/common` subfolder.
They contain a lot of useful stuff like `Util`, `FileUtil`, `MathUtil`, `Itoa`, etc.
Try not to reimplement stuff that exists already.
For bioinformatics, understand how to use the `Sequence`, `QueryMatcher`, `Matcher`, etc. classes.
## Development of modules
To add a workflow or a utility tool to MMseqs2, you need to register your workflow or module in the `src/mmseqs.cpp` file.
A new command generally looks something like this:
```
{"search", search, &par.searchworkflow, COMMAND_MAIN,
"Search with query sequence or profile DB (iteratively) through target sequence DB",
"Searches with the sequences or profiles query DB through the target sequence DB by running the prefilter tool and the align tool for Smith-Waterman alignment. For each query a results file with sequence matches is written as entry into a database of search results (alignmentDB).\nIn iterative profile search mode, the detected sequences satisfying user-specified criteria are aligned to the query MSA, and the resulting query profile is used for the next search iteration. Iterative profile searches are usually much more sensitive than (and at least as sensitive as) searches with single query sequences.",
"Martin Steinegger <martin.steinegger@mpibpc.mpg.de>",
"<i:queryDB> <i:targetDB> <o:alignmentDB> <tmpDir>",
CITATION_MMSEQS2},
```
# Before committing code
## Compiler warnings
Do not leave any compiler warnings in your code. Most of the time they might be false positives.
However, sometimes they hide real issues. Continuous integration runs with `-Werror` and will fail when it finds any warnings.
Since the CI system runs on many compilers and compiler versions, the kind of warnings reported might differ between your local environment and the CI.
### Shellcheck
[Shellcheck](https://www.shellcheck.net) runs on all workflow shell scripts and will fail in the continuous integration if it finds any issues.
Make sure not to use Bash-specific features. `#!/bin/sh` means that the scripts have to be POSIX shell compliant.
The MMseqs2 Windows builds run with the busybox ash shell. If you are a bit careful about your scripts, you will automatically gain Windows support.
## Regression test
The regression test runs most workflows such as search, profile search, profile-profile, target-profile, clustering, linclust, etc. after every commit.
It compares their values against known good ones and fails if they don't match.
To run a search regression test execute the following steps:

    # download the runner script and set permissions
    $ wget https://bitbucket.org/martin_steinegger/mmseqs-benchmark/raw/master/scripts/run_codeship_pipeline.sh
    $ chmod +x run_codeship_pipeline.sh
    # edit the following three variables in this file:
    # If you don't have AVX2 on the machine, just comment out all lines containing MMSEQSAVX
    BASE_DIR="$HOME/clone/regression_test"
    MMSEQSSSE="$HOME/clone/build/src/mmseqs"
    MMSEQSAVX="$HOME/clone/build_avx2/src/mmseqs"
    # run the script and set CI_COMMIT_ID to some non-empty string (in our CI system this is automatically set to the git commit).
    $ CI_COMMIT_ID="TESTING" ./run_codeship_pipeline.sh
    # The script will return an error code != 0 if there is a regression in sensitivity of MMseqs2. The error code can be checked with "echo $?".
    $ [ $? -ne 0 ] && echo "Error"

It will print a report with the sensitivity AUCs it achieved and then error out if it did not achieve the minimum AUCs, currently 0.235 for normal sequence searches and 0.331 for profile searches.
You can also use our Docker images to run this benchmark:

    cd mmseqs-folder
    docker build -t mmseqs2 .
    git clone https://bitbucket.org/martin_steinegger/mmseqs-benchmark.git
    cd mmseqs-benchmark
    docker build -t mmseqs-benchmark .

The regression test passed if the second image exits cleanly.
Please note, some users don't have permissions to access the unix socket to communicate with the Docker engine.
In such a case, run the commands above as "sudo docker build ..."
The script that runs the regression test is found here:
https://bitbucket.org/martin_steinegger/mmseqs-benchmark/raw/master/scripts/run_codeship_pipeline.sh
There is also a Dockerfile that runs the regression test:
https://bitbucket.org/martin_steinegger/mmseqs-benchmark/raw/master/Dockerfile
## Inspecting crashes on real data
MMseqs2 is designed for large-scale data analysis, so if a crash occurs on real data, it is often not possible to reproduce the run and debug it in a source-code editor (e.g., Visual Studio Code). It is therefore recommended to compile MMseqs2 with
```
-DCMAKE_BUILD_TYPE=RelWithDebInfo
```
Any post-crash core dump file can then be inspected by running:
```
gdb /path/to/mmseqs path/to/core/file
```
You can first inspect the stack trace with 'bt'. This should give you an idea of the mmseqs function and line of code that started the trouble. Using 'frame number' lets you zoom in on a particular frame. Other useful options include re-running the code in gdb and setting breakpoints. For example, 'b abort' and 'b exit' will set breakpoints on any call to abort or exit in the code.
To run gdb on mmseqs2 with its arguments type:
```
gdb --args /path/to/mmseqs mmseqs2-arg1 mmseqs2-arg2 mmseqs2-arg3 ...
```
## Sanitizers
MMseqs2 can be built with [ASan](https://github.com/google/sanitizers/wiki/AddressSanitizer)/[MSan](https://github.com/google/sanitizers/wiki/MemorySanitizer)/[UBSan](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html)/[TSan](https://clang.llvm.org/docs/ThreadSanitizer.html) support by calling:
```
cmake -DHAVE_SANITIZER=1 -DCMAKE_BUILD_TYPE=ASan ..
```
Replace ASan with MSan, UBSan or TSan for the other sanitizers. CMake will error and abort if your compiler does not support the respective sanitizer.
\ No newline at end of file
#!/bin/sh -e
cat Home.md \
| sed '1,/<!--- TOC END -->/d' \
| cat .pandoc/meta.yaml - \
| pandoc \
--from=markdown \
-o userguide.pdf \
--template=.pandoc/eisvogel.tex \
--toc
chmod a+r userguide.pdf