FastCluster
============
## A simple workflow for clustering molecules
- Author: iwatobipen
- Date: 201712010
- The current version clusters on Morgan fingerprints with radius 2.
## Requirements
- python3.x
- bayon https://github.com/fujimizu/bayon
- rdkit
## Description
- Install bayon first; a tutorial is available at https://github.com/fujimizu/bayon/wiki/Tutorial_English
- RDKit must also be installed, because it is used to parse SMILES.
- That's all!
## Basic usage
- The input file is tab-delimited text with one molecule per line: "ID" \t "SMILES" \n .....
- $ python fastcluster.py {input; input file} {N; number of clusters} {--output; output filename} {--centroid; filename for centroid information}
- Example usage
- $ python fastcluster.py cdk2.smi 5 # cluster 47 compounds into 5 clusters.
```
Fastcluster iwatobipen$ python fastcluster.py cdk2.smi 5
real 0m0.015s
user 0m0.006s
sys 0m0.002s
```
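The input format described above can be read with a few lines of Python. This is only a minimal sketch, not the code used in fastcluster.py: the helper name `read_id_smiles` is hypothetical and the two sample rows are made up for illustration (they are not taken from cdk2.smi).

```python
import io


def read_id_smiles(lines):
    """Yield (mol_id, smiles) pairs from tab-delimited ID/SMILES lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or malformed lines instead of failing on them.
        if len(fields) < 2 or not fields[0]:
            continue
        yield fields[0], fields[1]


# Made-up example rows in the "ID" \t "SMILES" layout.
sample = "mol1\tCCO\n\nmol2\tc1ccccc1\n"
records = list(read_id_smiles(io.StringIO(sample)))
print(records)
```

For a real file, pass an open file handle instead of the `io.StringIO` object.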
## Output format
- clusterd.tsv is the default output file and uses bayon's own output format: a list of clusters, each followed by its members and their similarity points.
```
cluster_1 \t molid1 \t point \t molid2 \t point ... \n
cluster_2 \t molid4 \t point \t molid5 \t point ... \n
....
```
- cluser_parse.tsv is the rectangular form of cluster.tsv, with one molecule per row.
```
molid1 \t point \t clusterID1 \n
molid2 \t point \t clusterID2 \n
molid3 \t point \t clusterID3 \n
molid4 \t point \t clusterID4 \n
....
```
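The conversion from the bayon cluster list to the rectangular layout can be sketched as follows. This is an illustration based only on the two formats shown above, not the script's actual implementation; the function name `flatten_clusters` and the sample values are hypothetical.

```python
def flatten_clusters(lines):
    """Turn 'cluster \\t molid \\t point ...' lines into (molid, point, cluster) rows."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # no complete molid/point pair on this line
        cluster_id, members = fields[0], fields[1:]
        # Members alternate: molid, point, molid, point, ...
        for mol_id, point in zip(members[0::2], members[1::2]):
            # Points are kept as text so they are written back unchanged.
            rows.append((mol_id, point, cluster_id))
    return rows


# Made-up cluster lines in the layout shown for the bayon output.
sample = ["1\tmolA\t0.91\tmolB\t0.85\n", "2\tmolC\t0.77\n"]
rows = flatten_clusters(sample)
for row in rows:
    print("\t".join(row))
```

One row per molecule is easier to load into a spreadsheet or a data frame than the variable-width cluster list.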
## Memo
- This script needs more CPU time than running bayon directly, because it first converts the SMILES into a fingerprint dataset and only then performs the clustering.
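The conversion step mentioned in the memo might look roughly like the sketch below. Several things here are assumptions rather than facts about fastcluster.py: the helper names are hypothetical, the 2048-bit vector size is a guess (only radius 2 is stated above), and the output line layout (ID followed by tab-separated key/value pairs) follows the input format described in the bayon tutorial.

```python
def bits_to_bayon_line(mol_id, on_bits):
    """Format one molecule as 'id \\t key \\t value ...' with each set bit as a key of weight 1."""
    fields = [mol_id]
    for bit in on_bits:
        fields.extend([str(bit), "1"])
    return "\t".join(fields)


def morgan_on_bits(smiles, radius=2, n_bits=2048):
    """Return the set bits of a Morgan fingerprint, or None for unparsable SMILES."""
    # Imported here so the formatting helper above works without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # n_bits=2048 is an assumed size, not a value taken from the script.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return list(fp.GetOnBits())


# Formatting only, with made-up bit positions; no RDKit needed for this line.
line = bits_to_bayon_line("mol1", [5, 80, 1057])
print(line)
```

With RDKit installed, `bits_to_bayon_line(mol_id, morgan_on_bits(smiles))` would produce one bayon input line per molecule; this per-molecule fingerprinting is the extra work that accounts for the added CPU time.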