# Writing New Modules
## Introduction
Writing a new module can at first seem a daunting task. However, MultiQC
has been written _(and refactored)_ to provide a lot of functionality
as common functions.
Provided that you are familiar with writing Python and you have a read
through the guide below, you should be on your way in no time!
If you have any problems, feel free to contact the author - details
here: [@ewels](https://github.com/ewels)
## Core modules / plugins
New modules can either be written as part of MultiQC or in a stand-alone
plugin. If your module is for a publicly available tool, please add it
to the main program and contribute your code back when complete via a
pull request.
If your module is for something _very_ niche, which no-one else can use,
you can write it as part of a custom plugin. The process is almost identical,
though it keeps the code bases separate. For more information about this,
see the docs about _MultiQC Plugins_ below.
## Linting
MultiQC has been developed to be as forgiving as possible and will handle lots of
invalid or ignored code. This is useful most of the time but can make problems hard to
spot when writing new MultiQC modules (especially during pull-request reviews).
To help with this, you can run with the `--lint` flag, which will give explicit
warnings about anything that is not optimally configured. For example:
```
multiqc --lint test_data
```
Note that the automated MultiQC continuous integration testing runs in this mode,
so you will need to pass all lint tests for those checks to pass. This is required
for any pull-requests.
## Initial setup
### Submodule
MultiQC modules are Python submodules - as such, they need their own
directory in `multiqc/modules/` with an `__init__.py` file. The directory should
share its name with the module. To follow common practice, the module
code usually then goes in a separate python file (also with the same name)
which is then imported by `__init__.py`:
```python
from __future__ import absolute_import
from .modname import MultiqcModule
```
### Entry points
Once your submodule files are in place, you need to tell MultiQC that they
are available as an analysis module. This is done within `setup.py` using
[entry points](http://setuptools.readthedocs.io/en/latest/setuptools.html#dynamic-discovery-of-services-and-plugins).
In `setup.py` you will see some code that looks like this:
```python
entry_points = {
    'multiqc.modules.v1': [
        'bismark = multiqc.modules.bismark:MultiqcModule',
        [...]
    ]
}
```
Copy one of the existing module lines and change it to use your module name.
The order is irrelevant, so stick to alphabetical if in doubt.
Once this is done, you will need to update your installation of MultiQC:
```
pip install -e .
```
### MultiQC config
So that MultiQC knows what order modules should be run in, you need to add
your module to the core config file.
In `multiqc/utils/config_defaults.yaml` you should see a list variable called `module_order`.
This contains the name of modules in order of precedence. Add your module here
in an appropriate position.
### Documentation
Next up, you need to create a documentation file for your module. The reason
for this is twofold: firstly, docs are important to help people to use, debug
and extend MultiQC (you're reading this, aren't you?). Secondly,
having the file there with the appropriate YAML front matter will make the
module show up on the [MultiQC homepage](http://multiqc.info) so that everyone
knows it exists. This process is automated once the file is added to the core
repository.
This docs file should be placed in `docs/modules/<your_module_name>.md` and
should have the following structure:
```
---
Name: Tool Name
URL: http://www.amazing-bfx-tool.com
Description: >
  This amazing tool does some really cool stuff. You can describe it
  here and split onto multiple lines if you want. Not too long though!
---
Your documentation goes here. Feel free to use markdown and write whatever
you think would be helpful. Please avoid using heading levels 1 to 3.
```
Make a reference to this in the YAML frontmatter at the top of
`docs/README.md` - this allows the website to find the file to build
the documentation.
### Changelog
Last but not least, remember to add your new module to the `CHANGELOG.md`,
so that people know that it's there.
### MultiqcModule Class
If you've copied one of the other entry point statements, it will have
ended in `:MultiqcModule` - this tells MultiQC to try to execute a class or
function called `MultiqcModule`.
To use the helper functions bundled with MultiQC, you should extend this
class from `multiqc.modules.base_module.BaseMultiqcModule`. This will give
you access to a number of functions on the `self` namespace. For example:
```python
from multiqc.modules.base_module import BaseMultiqcModule

class MultiqcModule(BaseMultiqcModule):
    def __init__(self):
        # Initialise the parent object
        super(MultiqcModule, self).__init__(name='My Module', anchor='mymod',
            href="http://www.awesome_bioinfo.com/my_module",
            info="is an example analysis module used for writing documentation.")
```
Ok, that should be it! The `__init__()` function will now be executed every
time MultiQC runs. Try adding a `print("Hello World!")` statement and see
if it appears in the MultiQC logs at the appropriate time...
Note that the `__init__` variables are used to create the header, URL link,
analysis module credits and description in the report.
### Logging
Last thing - MultiQC modules have a standardised way of producing output,
so you shouldn't really use `print()` statements for your `Hello World` ;)
Instead, use the Python `logging` module as follows:
```python
import logging
log = logging.getLogger(__name__)
# Initialise your class and so on
log.info('Hello World!')
```
Log messages can come in a range of formats:
* `log.debug`
  * These only show if MultiQC is run in `-v`/`--verbose` mode
* `log.info`
  * For more important status updates
* `log.warning`
  * Alert user about problems that don't halt execution
* `log.error` and `log.critical`
  * Not often used, these are for show-stopping problems
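To see how these levels interact, here is a small stand-alone sketch that uses only the Python standard library. The logger name `multiqc.modules.mymod`, the message format and the in-memory stream are assumptions made for illustration; MultiQC configures its own log handlers.

```python
import io
import logging

# Hypothetical logger name; inside a real module you would use __name__
log = logging.getLogger("multiqc.modules.mymod")
log.propagate = False  # Keep this demo's output away from the root logger

# Send messages to an in-memory stream so that they can be inspected
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
log.addHandler(handler)

# At INFO level (similar to a non-verbose run) debug messages are dropped
log.setLevel(logging.INFO)
log.debug("Parsing file sample_1.txt")
log.info("Found 1 reports")
log.warning("Skipping sample_2: no data found")

output = stream.getvalue()
print(output)
```

Setting the level to `logging.DEBUG` instead would also let the first message through.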
## Step 1 - Find log files
The first thing that your module will need to do is to find analysis log
files. You can do this by searching for a filename fragment, or a string
within the file. It's possible to combine both kinds of search and to supply
several alternative patterns; the matching logic is described below.
First, add your default patterns to:
```
MULTIQC_ROOT/multiqc/utils/search_patterns.yaml
```
Each search has a yaml key, with one or more search criteria.
The yaml key must begin with the name of your module. If you have multiple
search patterns for a single module, follow the module name with a forward
slash and then any string. For example, see the `fastqc` module search patterns:
```yaml
fastqc/data:
    fn: 'fastqc_data.txt'
fastqc/zip:
    fn: '_fastqc.zip'
```
The following search criteria sub-keys can then be used:
* `fn`
  * A glob filename pattern, used with the Python [`fnmatch`](https://docs.python.org/2/library/fnmatch.html) function
* `fn_re`
  * A regex filename pattern
* `contents`
  * A string to match within the file contents (checked line by line)
* `contents_re`
  * A regex to match within the file contents (checked line by line)
  * NB: Regex must match entire line (add `.*` to start and end of pattern to avoid this)
* `exclude_fn`
  * A glob filename pattern which will exclude a file if matched
* `exclude_fn_re`
  * A regex filename pattern which will exclude a file if matched
* `exclude_contents`
  * A string which will exclude the file if matched within the file contents (checked line by line)
* `exclude_contents_re`
  * A regex which will exclude the file if matched within the file contents (checked line by line)
* `num_lines`
  * The number of lines to search through for the `contents` string. Default: all lines.
* `shared`
  * By default, once a file has been assigned to a module it is not searched again. Specify `shared: true` when your file can be shared between multiple tools (for example, part of a `stdout` stream).
* `max_filesize`
  * Files larger than the `log_filesize_limit` config key (default: 10MB) are skipped. If you know your files will be smaller than this and need to search by contents, you can specify this value (in bytes) to skip any files larger than this limit.
Please try to use `num_lines` and `max_filesize` where possible as they will speed up
MultiQC execution time.
Note that `exclude_` keys are tested after a file is detected with one or
more of the other patterns.
For example, two typical modules could specify search patterns as follows:
```yaml
mymod:
    fn: '_myprogram.txt'
myothermod:
    contents: 'This is myprogram v1.3'
```
You can also supply a list of different patterns for a single log file type if needed.
If any of the patterns are matched, the file will be returned:
```yaml
mymod:
    - fn: 'mylog.txt'
    - fn: 'different_fn.out'
```
You can use _AND_ logic by specifying keys within a single list item. For example:
```yaml
mymod:
    fn: 'mylog.txt'
    contents: 'mystring'
myothermod:
    - fn: 'different_fn.out'
      contents: 'This is myprogram v1.3'
    - fn: 'another.txt'
      contents: 'What are these files anyway?'
```
Here, a file must have the filename `mylog.txt` _and_ contain the string `mystring`.
You can match subsets of files by using `exclude_` keys as follows:
```yaml
mymod:
    fn: '*.myprog.txt'
    exclude_fn: 'not_these_*'
myothermod:
    fn: 'mylog.txt'
    exclude_contents:
        - 'trimmed'
        - 'sorted'
```
Note that the `exclude_` patterns can have either a single value or a list of values.
They are always considered using OR logic - any matches will reject the file.
Remember that users can overwrite these defaults in their own config files.
This is helpful as people have weird and wonderful processing pipelines with
their own conventions.
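The matching rules described above (AND within a single item, OR across list items, `exclude_` keys tested afterwards with OR logic) can be sketched in plain Python. This is a simplified illustration and not MultiQC's actual implementation: the function names are hypothetical and only the `fn`, `contents`, `exclude_fn` and `exclude_contents` keys are handled.

```python
from fnmatch import fnmatch

def as_list(value):
    """Allow a single value or a list of values."""
    return value if isinstance(value, list) else [value]

def item_matches(item, filename, contents):
    """AND logic: every inclusion key present in one item must match."""
    if "fn" in item and not fnmatch(filename, item["fn"]):
        return False
    if "contents" in item:
        lines = contents.splitlines()
        if not any(item["contents"] in line for line in lines):
            return False
    # An item without any inclusion key never matches
    return "fn" in item or "contents" in item

def is_excluded(item, filename, contents):
    """OR logic: any matching exclude_ pattern rejects the file."""
    for pattern in as_list(item.get("exclude_fn", [])):
        if fnmatch(filename, pattern):
            return True
    for text in as_list(item.get("exclude_contents", [])):
        if any(text in line for line in contents.splitlines()):
            return True
    return False

def file_matches(patterns, filename, contents):
    """OR logic across list items; exclusions are tested after a match."""
    for item in as_list(patterns):
        if item_matches(item, filename, contents):
            if not is_excluded(item, filename, contents):
                return True
    return False

patterns = {"fn": "*.myprog.txt", "exclude_fn": "not_these_*"}
print(file_matches(patterns, "sample1.myprog.txt", ""))
print(file_matches(patterns, "not_these_1.myprog.txt", ""))
```

The first call matches the filename glob; the second matches it too but is then rejected by `exclude_fn`.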
Once your strings are added, you can find files in your module with the
base function `self.find_log_files()`, using the key you set in the YAML:
```python
self.find_log_files('mymod')
```
This function yields a dictionary with various information about each matching
file. The `f` key contains the contents of the matching file:
```python
# Find all files for mymod
for myfile in self.find_log_files('mymod'):
    print( myfile['f'] )       # File contents
    print( myfile['s_name'] )  # Sample name (from cleaned filename)
    print( myfile['fn'] )      # Filename
    print( myfile['root'] )    # Directory file was in
```
If `filehandles=True` is specified, the `f` key contains a file handle
instead:
```python
for f in self.find_log_files('mymod', filehandles=True):
    # f['f'] is now a filehandle instead of contents
    for l in f['f']:
        print( l )
```
This is good if the file is large, as Python doesn't read the entire
file into memory in one go.
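As a rough illustration of why a file handle helps, the sketch below reads a handle line by line and stops as soon as it finds what it needs. The function name and the log format are hypothetical, and `io.StringIO` stands in for the handle that would be found in `f['f']`.

```python
import io

def parse_header_value(handle, key):
    """Return the value on the first line starting with `key`, reading lazily."""
    for line in handle:
        if line.startswith(key):
            parts = line.split(None, 1)
            if len(parts) == 2:
                return parts[1].strip()
    return None

# io.StringIO stands in for the file handle found in f['f']
fake_handle = io.StringIO("version 1.3\ntotal_reads 1000\nmapped_reads 900\n")
total = parse_header_value(fake_handle, "total_reads")
print(total)

# The rest of the file has not been consumed yet
remaining = fake_handle.readline()
print(remaining)
```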
## Step 2 - Parse data from the input files
What most MultiQC modules do once they have found matching analysis files
is to pass the matched file contents to another function, responsible
for parsing the data from the file. How this parsing is done will depend
on the format of the log file and the type of data being read. See below
for a basic example, based loosely on the preseq module:
```python
class MultiqcModule(BaseMultiqcModule):
    def __init__(self):
        # [...]
        self.mod_data = dict()
        for f in self.find_log_files('mymod'):
            self.mod_data[f['s_name']] = self.parse_logs(f['f'])

    def parse_logs(self, f):
        data = {}
        for l in f.splitlines():
            s = l.split()
            data[s[0]] = s[1]
        return data
```
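Real log files often contain comments, blank lines and a mix of numbers and text. The following stand-alone sketch shows one way to make the parsing function above a little more defensive. The `key value` log format and the `#` comment marker are assumptions made for this example.

```python
def parse_logs(contents):
    """Parse 'key value' lines into a dict, skipping lines that don't fit."""
    data = {}
    for line in contents.splitlines():
        fields = line.split()
        # Skip comments, blank lines and lines without a value
        if line.startswith("#") or len(fields) < 2:
            continue
        try:
            # Store numbers as floats so that they can be plotted later
            data[fields[0]] = float(fields[1])
        except ValueError:
            data[fields[0]] = fields[1]
    return data

example = "# my tool v1.3\ntotal_reads 1000\nmapped_reads 900\n\nstatus ok\n"
parsed = parse_logs(example)
print(parsed)
```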
### Filtering by parsed sample names
MultiQC users can use the `--ignore-samples` flag to skip sample names
that match specific patterns. As sample names are generated in a different
way by every module, this filter has to be applied after log parsing.
There is a core function to do this task - assuming that your data is
in a dictionary with the first key as sample name, pass it through the
`self.ignore_samples` function as follows:
```python
self.yourdata = self.ignore_samples(self.yourdata)
```
This will remove any dictionary keys where the sample name matches
a user pattern.
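Conceptually, the filtering works something like the sketch below. This is not the MultiQC implementation: the function name is hypothetical and it assumes that the user patterns are glob-style strings.

```python
from fnmatch import fnmatch

def ignore_samples_sketch(data, ignore_patterns):
    """Return a copy of data without samples whose name matches any pattern."""
    return {
        s_name: values
        for s_name, values in data.items()
        if not any(fnmatch(s_name, pattern) for pattern in ignore_patterns)
    }

data = {
    "sample_1": {"first_col": 91.4},
    "sample_2": {"first_col": 138.3},
    "control_A": {"first_col": 12.0},
}
filtered = ignore_samples_sketch(data, ["control_*"])
print(sorted(filtered))
```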
### No files found
If your module cannot find any matching files, it needs to raise an
exception of type `UserWarning`. This tells the core MultiQC program
that no log files were found for this module. For example:
```python
if len(self.mod_data) == 0:
    raise UserWarning
```
Note that this has to be raised as early as possible, so that it halts
the module progress. For example, if no logs are found then the module
should not create any files or try to do any computation.
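The sketch below illustrates the pattern with a stand-in class and runner. Both `FakeModule` and `run_module` are hypothetical and only mimic how a caller could treat `UserWarning` as "nothing to report"; they are not MultiQC code.

```python
class FakeModule(object):
    """Stand-in for a MultiQC module; takes already-parsed data for simplicity."""
    def __init__(self, found_logs):
        self.mod_data = dict(found_logs)
        # Stop as early as possible: nothing below runs without data
        if len(self.mod_data) == 0:
            raise UserWarning
        self.outputs_written = True

def run_module(found_logs):
    """Hypothetical runner that skips modules raising UserWarning."""
    try:
        FakeModule(found_logs)
        return "ran"
    except UserWarning:
        return "skipped"

print(run_module({}))
print(run_module({"sample_1": {"first_col": 91.4}}))
```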
### Custom sample names
Typically, sample names are taken from cleaned log filenames (the default
`f['s_name']` value returned). However, if possible, it's better to use
the name of the input file (allowing for concatenated log files).
To do this, you should use the `self.clean_s_name()` function, as
this will prepend the directory name if requested on the command line:
```python
input_fname = s[3] # Or parsed however
s_name = self.clean_s_name(input_fname, f['root'])
```
This function has already been applied to the contents of `f['s_name']`.
> `self.clean_s_name()` **must** be used on sample names parsed from the file
> contents. Without it, features such as prepending directories (`--dirs`)
> will not work.
### Identical sample names
If modules find samples with identical names, then the previous sample
is overwritten. It's good to print a log statement when this happens,
for debugging. However, most of the time it makes sense - programs often
create log files _and_ print to `stdout` for example.
```python
if f['s_name'] in self.bowtie_data:
    log.debug("Duplicate sample name found! Overwriting: {}".format(f['s_name']))
```
### Printing to the sources file
Finally, once you've found your file we want to add this information to the
`multiqc_sources.txt` file in the MultiQC report data directory. This lists
every sample name and the file that its data came from. This is especially
useful if sample names are being overwritten as it lists the source used. This code
is typically written immediately after the above warning.
If you've used the `self.find_log_files` function, writing to the sources file
is as simple as passing the log file variable to the `self.add_data_source` function:
```python
for f in self.find_log_files('mymod'):
    self.add_data_source(f)
```
If you have different files for different sections of the module, or are
customising the sample name, you can tweak the fields. The default arguments
are as shown:
```python
self.add_data_source(f=None, s_name=None, source=None, module=None, section=None)
```
## Step 3 - Adding to the general statistics table
Now that you have your parsed data, you can start inserting it into the
MultiQC report. At the top of every report is the 'General Statistics'
table. This contains metrics from all modules, allowing cross-module
comparison.
There is a helper function to add your data to this table. It can take
a lot of configuration options, but most have sensible defaults. At
its simplest, it works as follows:
```python
data = {
    'sample_1': {
        'first_col': 91.4,
        'second_col': '78.2%'
    },
    'sample_2': {
        'first_col': 138.3,
        'second_col': '66.3%'
    }
}
self.general_stats_addcols(data)
```
To give more informative table headers and configure things like
data scales and colour schemes, you can supply an extra dict:
```python
from collections import OrderedDict  # At the top of the module file

headers = OrderedDict()
headers['first_col'] = {
    'title': 'First',
    'description': 'My First Column',
    'scale': 'RdYlGn-rev'
}
headers['second_col'] = {
    'title': 'Second',
    'description': 'My Second Column',
    'max': 100,
    'min': 0,
    'scale': 'Blues',
    'suffix': '%'
}
self.general_stats_addcols(data, headers)
```
Here are all options for headers, with defaults:
```python
headers['name'] = {
    'namespace': '',                # Module name. Auto-generated for core modules in General Statistics.
    'title': '[ dict key ]',        # Short title, table column title
    'description': '[ dict key ]',  # Longer description, goes in mouse hover text
    'max': None,                    # Maximum value in range, for bar / colour coding
    'min': None,                    # Minimum value in range, for bar / colour coding
    'scale': 'GnBu',                # Colour scale for colour coding. Set to False to disable.
    'suffix': None,                 # Suffix for value (eg. '%')
    'format': '{:,.1f}',            # Output format() string
    'shared_key': None,             # See below for description
    'modify': None,                 # Lambda function to modify values
    'hidden': False,                # Set to True to hide the column on page load
    'placement': 1000.0,            # Alter the default ordering of columns in the table
}
```
* `namespace`
  * This prepends the column title in the mouse hover: _Namespace: Title_.
  * The 'Configure Columns' modal displays this under the 'Group' column.
  * It's automatically generated for core modules in the General Statistics table,
    though this can be overwritten (useful for example with custom-content).
* `scale`
  * Colour scales are the names of ColorBrewer palettes. See below for available scales.
  * Add `-rev` to the name of a colour scale to reverse it
  * Set to `False` to disable colouring and background bars
* `shared_key`
  * Any string can be specified here. If other columns are found that share
    the same key, a consistent colour scheme and data scale will be used in
    the table. Typically this is set to things like `read_count`, so that
    the read count in a sample can be seen varying across analysis modules.
* `modify`
  * A python `lambda` function to change the data in some way when it is
    inserted into the table.
* `hidden`
  * Setting this to `True` will hide the column when the report loads. It can
    then be shown through the _Configure Columns_ modal in the report. This is
    useful for data that is only occasionally of interest. For example, some modules
    show "percentage aligned" on page load but hide "number of reads aligned".
* `placement`
  * If you feel that the results from your module should appear at the left side
    of the table, set this value to less than 1000. To move the column right, set
    it to greater than 1000. This value can be any float.
The typical use for the `modify` function is to divide large numbers such as read counts,
to make them easier to interpret. If handling read counts, there are three config variables
that should be used to allow users to change the multiplier for read counts: `read_count_multiplier`,
`read_count_prefix` and `read_count_desc`. For example:
```python
'title': '{} Reads'.format(config.read_count_prefix),
'description': 'Number of reads ({})'.format(config.read_count_desc),
'modify': lambda x: x * config.read_count_multiplier,
```
Similar config options apply for base pairs: `base_count_multiplier`, `base_count_prefix` and
`base_count_desc`.
And for the read count of long reads: `long_read_count_multiplier`, `long_read_count_prefix` and
`long_read_count_desc`.
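To make the effect of `modify` and `format` concrete, here is a stand-alone sketch. The three config values are hard-coded assumptions (counts shown in millions) rather than being read from the MultiQC `config` module, and `render_cell` is a hypothetical helper that assumes the value is modified first and formatted afterwards.

```python
# Assumed values for illustration; a real module reads these from config
read_count_multiplier = 0.000001
read_count_prefix = "M"
read_count_desc = "millions"

header = {
    "title": "{} Reads".format(read_count_prefix),
    "description": "Number of reads ({})".format(read_count_desc),
    "modify": lambda x: x * read_count_multiplier,
    "format": "{:,.1f}",
}

def render_cell(value, header):
    """Hypothetical helper: apply 'modify', then the 'format' string."""
    modify = header.get("modify")
    if modify is not None:
        value = modify(value)
    return header.get("format", "{:,.1f}").format(value)

print(header["title"], render_cell(12345678, header))
```

Because the multiplier, prefix and description come from the same place, a user who changes one setting gets consistent values and labels across all modules.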
A third parameter can be passed to this function, `namespace`. This is usually
not needed - MultiQC automatically takes the name of the module that is calling
the function and uses this. However, sometimes it can be useful to overwrite this.
### Table colour scales
Colour scales are taken from [ColorBrewer2](http://colorbrewer2.org/).
Colour scales can be reversed by adding the suffix `-rev` to the name. For example, `RdYlGn-rev`.
The following scales are available:

## Step 4 - Writing data to a file
In addition to printing data to the General Stats, MultiQC modules typically
also write to text-files to allow people to easily use the data in downstream
applications. This also gives the opportunity to output additional data that
may not be appropriate for the General Statistics table.
Again, there is a base class function to help you with this - just supply it
with a dictionary and a filename:
```python
data = {
    'sample_1': {
        'first_col': 91.4,
        'second_col': '78.2%'
    },
    'sample_2': {
        'first_col': 138.3,
        'second_col': '66.3%'
    }
}
self.write_data_file(data, 'multiqc_mymod')
```
If your output has a lot of columns, you can supply the additional
argument `sort_cols = True` to have the columns alphabetically sorted.
This function will also pay attention to the default / command line
supplied data format and behave accordingly. So the written file could
be a tab-separated file (default), `JSON` or `YAML`.
Note that any keys with more than 2 levels of nesting will be ignored
when being written to tab-separated files.
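For intuition, the sketch below shows how a two-level dictionary maps onto a tab-separated table, with one row per sample and one column per key. It is a simplified stand-in, not the code behind `self.write_data_file()`: the `write_tsv` name and the `Sample` header label are assumptions.

```python
import csv
import io

def write_tsv(data, handle, sort_cols=False):
    """Write a {sample: {column: value}} dict as a tab-separated table."""
    # Collect column names in the order they first appear
    columns = []
    for values in data.values():
        for key in values:
            if key not in columns:
                columns.append(key)
    if sort_cols:
        columns = sorted(columns)
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Sample"] + columns)
    for s_name in sorted(data):
        # Missing values are written as empty cells
        writer.writerow([s_name] + [data[s_name].get(col, "") for col in columns])

data = {
    'sample_1': {'second_col': '78.2%', 'first_col': 91.4},
    'sample_2': {'second_col': '66.3%', 'first_col': 138.3},
}
buffer = io.StringIO()
write_tsv(data, buffer, sort_cols=True)
print(buffer.getvalue())
```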
## Step 5 - Create report sections
Great! It's time to start creating sections of the report with more information.
To do this, use the `self.add_section()` helper function:
```python
self.add_section (
    name = 'Second Module Section',
    anchor = 'mymod-second',
    plot = linegraph.plot(data2)
)
self.add_section (
    name = 'First Module Section',
    anchor = 'mymod-first',
    description = 'My amazing module output, from the first section',
    helptext = "If you're not sure _how_ to interpret the data, we can help!",
    extra = '<blockquote>Some extra custom HTML to put under the description</blockquote>',
    plot = bargraph.plot(data)
)
self.add_section (
    content = '<p>Some custom HTML.</p>'
)
```
These will automatically be labelled and linked in the navigation (unless
the module has only one section or `name` is not specified).
Note that `description` and `helptext` are processed as Markdown by default.
This can be disabled by passing `autoformat=False` to the function.
## Step 6 - Plot some data
Ok, you have some data, now the fun bit - visualising it! Each of the plot
types is described in the _Plotting Functions_ section of the docs.
## Appendices
### User configuration
Instead of hardcoding defaults, it's a great idea to allow users to configure
the behaviour of MultiQC module code.
It's pretty easy to use the built in MultiQC configuration settings to do this,
so that users can set up their config as described
[above in the docs](http://multiqc.info/docs/#configuring-multiqc).
To do this, just assume that your configuration variables are available in the
MultiQC `config` module and have sensible defaults. For example:
```python
from multiqc import config
mymod_config = getattr(config, 'mymod_config', {})
my_custom_config_var = mymod_config.get('my_custom_config_var', 5)
```
You now have a variable `my_custom_config_var` with a default value of 5, but that
can be configured by a user as follows:
```yaml
mymod_config:
    my_custom_config_var: 200
```
Please be sure to use a unique top-level config name to avoid clashes - prefixing
with your module name is a good idea as in the example above. Keep all module config
options under the same top-level name for clarity.
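The fallback behaviour can be tried out without MultiQC installed. In this sketch a `SimpleNamespace` object stands in for the `config` module, and `get_mymod_setting` is a hypothetical helper wrapping the same `getattr` / `.get()` pattern.

```python
from types import SimpleNamespace

def get_mymod_setting(config, key, default):
    """Look up a module option, falling back to a default at both levels."""
    mymod_config = getattr(config, 'mymod_config', {})
    return mymod_config.get(key, default)

# Stand-ins for the config module: without and with a user setting
config_without = SimpleNamespace()
config_with = SimpleNamespace(mymod_config={'my_custom_config_var': 200})

print(get_mymod_setting(config_without, 'my_custom_config_var', 5))
print(get_mymod_setting(config_with, 'my_custom_config_var', 5))
```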
Finally, don't forget to document the usage of your module-specific configuration
in `docs/modules/mymodule.md` so that people know how to use it.
### Profiling Performance
It's important that MultiQC runs quickly and efficiently, especially on big
projects with large numbers of samples. The recommended method to check this is
by using `cProfile` to profile the code execution. To do this, run MultiQC as follows:
```bash
python -m cProfile -o multiqc_profile.prof /path/to/MultiQC/scripts/multiqc -f .
```
You can create a `.bashrc` alias to make this easier to run:
```bash
alias profile_multiqc='python -m cProfile -o multiqc_profile.prof /path/to/MultiQC/scripts/multiqc '
profile_multiqc -f .
```
MultiQC should run as normal, but produce the additional binary file `multiqc_profile.prof`.
This can then be visualised with software such as [SnakeViz](https://jiffyclub.github.io/snakeviz/).
To install SnakeViz and visualise the results, do the following:
```bash
pip install snakeviz
snakeviz multiqc_profile.prof
```
A web page should open where you can explore the execution times of different nested functions.
It's a good idea to run MultiQC with a comparable number of results from other tools (eg. FastQC)
to have a reference to compare against for how long the code should take to run.
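If you prefer a text summary over a web page, profile files can also be read with the standard library `pstats` module. The sketch below profiles a toy function so that it runs on its own; the function and the file name `toy_profile.prof` are placeholders, and you would pass the path to `multiqc_profile.prof` to `pstats.Stats` instead.

```python
import cProfile
import io
import os
import pstats
import tempfile

def slow_sum(n):
    """Toy workload standing in for a MultiQC run."""
    return sum(i * i for i in range(n))

# Profile the toy function and write a binary profile file
prof_path = os.path.join(tempfile.mkdtemp(), "toy_profile.prof")
profiler = cProfile.Profile()
profiler.runcall(slow_sum, 100000)
profiler.dump_stats(prof_path)

# Load the file and list the most expensive calls by cumulative time
report = io.StringIO()
stats = pstats.Stats(prof_path, stream=report)
stats.sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```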
### Adding Custom CSS / JavaScript
If you would like module-specific CSS and / or JavaScript added to the template,
just add to the `self.css` and `self.js` dictionaries that come with the
`BaseMultiqcModule` class. The key should be the filename that you want your file to
have in the generated report folder _(this is ignored in the default template, which
includes the content file directly in the HTML)_. The dictionary value should be
the path to the desired file. For example, see how it's done in the FastQC module:
```python
import os  # At the top of the module file

self.css = {
    'assets/css/multiqc_fastqc.css' :
        os.path.join(os.path.dirname(__file__), 'assets', 'css', 'multiqc_fastqc.css')
}
self.js = {
    'assets/js/multiqc_fastqc.js' :
        os.path.join(os.path.dirname(__file__), 'assets', 'js', 'multiqc_fastqc.js')
}
```