FastCluster
============


## This is a simple workflow for clustering molecules

- Author: iwatobipen
- Date: 201712010
- The current version uses Morgan fingerprints (radius 2) for clustering

## Requirements

- Python 3.x
- bayon https://github.com/fujimizu/bayon
- rdkit

## Description

- Install bayon first. A tutorial is available at https://github.com/fujimizu/bayon/wiki/Tutorial_English
- RDKit must also be installed, because it is used to parse SMILES.
- That's all!

## Basic usage

- The input file is tab-delimited text with one molecule per line: "ID" \t "SMILES" \n .....
- $ python fastcluster.py {input; input file} {N; number of clusters} {--output; filename of output} {--centroid; filename of centroid information}
- Example usage
- $ python fastcluster.py cdk2.smi 5 # cluster 47 compounds into 5 clusters.
```
Fastcluster iwatobipen$ python fastcluster.py cdk2.smi 5

real	0m0.015s
user	0m0.006s
sys	0m0.002s

```
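
The tab-delimited input format described under Basic usage can be checked before running the clustering. The following is a minimal sketch and not part of fastcluster.py: the function name `read_smiles_table` is hypothetical, and it assumes every non-empty line holds exactly two tab-separated fields (ID, then SMILES).

```python
import io


def read_smiles_table(handle):
    """Yield (mol_id, smiles) pairs from tab-delimited "ID<TAB>SMILES" lines."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 fields, got {len(fields)}"
            )
        yield fields[0], fields[1]


# Two made-up records, used only for illustration.
example = io.StringIO("mol1\tCCO\nmol2\tc1ccccc1\n")
records = list(read_smiles_table(example))
print(records)
```

For a real file, pass an open file handle (for example `open("cdk2.smi")`) instead of the `io.StringIO` object.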


## Output format

- clusterd.tsv is written in bayon's default output format: one line per cluster, listing each member with its similarity point.
```
cluster_1 \t molid1 \t point \t molid2 \t point ... \n
cluster_2 \t molid4 \t point \t molid5 \t point ... \n
....
```
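
A line in the cluster layout shown above can be split into its members with a few lines of Python. This is a sketch under the assumption that the first field is the cluster ID and the remaining fields alternate between molecule ID and point; `parse_cluster_line` is a hypothetical helper, not a function from bayon or fastcluster.py.

```python
def parse_cluster_line(line):
    """Split one line into (cluster_id, [(mol_id, point), ...])."""
    fields = line.rstrip("\r\n").split("\t")
    cluster_id, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected alternating mol_id / point fields")
    # Pair up consecutive fields: (mol_id, point), (mol_id, point), ...
    members = [(rest[i], float(rest[i + 1])) for i in range(0, len(rest), 2)]
    return cluster_id, members


# Made-up line in the layout shown above; IDs and points are illustrative.
cluster_id, members = parse_cluster_line("1\tmolA\t0.91\tmolB\t0.75\n")
print(cluster_id, members)
```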

- cluster_parse.tsv is a rectangular version of cluster.tsv, with one row per molecule
```
molid1 \t point \t clusterID1 \n
molid2 \t point \t clusterID2 \n
molid3 \t point \t clusterID3 \n
molid4 \t point \t clusterID4 \n
....
``` 
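
The rectangular layout is convenient for downstream analysis. As a sketch, the rows can be grouped back by cluster to obtain cluster sizes. The column order (molecule ID, point, cluster ID) follows the layout shown above; `group_by_cluster` is a hypothetical name and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def group_by_cluster(handle):
    """Map cluster_id -> list of (mol_id, point) from three-column rows."""
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        mol_id, point, cluster_id = row
        clusters[cluster_id].append((mol_id, float(point)))
    return dict(clusters)


# Made-up rows in the rectangular layout, used only for illustration.
example = io.StringIO("molA\t0.91\t1\nmolB\t0.75\t1\nmolC\t0.88\t2\n")
clusters = group_by_cluster(example)
sizes = {cid: len(members) for cid, members in clusters.items()}
print(sizes)
```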

## Memo

- This script needs more CPU time than running bayon directly, because it first converts the SMILES into a fingerprint dataset and only then performs the clustering.
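
The conversion step mentioned in the memo might look roughly like the sketch below. It is not the code in fastcluster.py. It assumes that bayon reads one document per line as an ID followed by tab-separated feature/value pairs, that each on-bit of the Morgan fingerprint becomes a feature with value 1, and that a 2048-bit fingerprint is used. The bit size, the value 1, and both function names are assumptions made for illustration.

```python
def bayon_line(mol_id, on_bits):
    """Format one molecule as an input line: ID, then feature/value pairs."""
    fields = [mol_id]
    for bit in on_bits:
        fields.extend([str(bit), "1"])  # assumed weight of 1 per on-bit
    return "\t".join(fields)


def morgan_on_bits(smiles, radius=2, n_bits=2048):
    """Return the on-bit indices of a Morgan fingerprint (requires RDKit)."""
    # Imported lazily so the formatting helper works without RDKit.
    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    generator = rdFingerprintGenerator.GetMorganGenerator(
        radius=radius, fpSize=n_bits
    )
    return list(generator.GetFingerprint(mol).GetOnBits())


# The formatting step needs no RDKit; these bit indices are made up.
line = bayon_line("mol1", [3, 17, 1024])
print(repr(line))
```

With RDKit installed, `bayon_line(mol_id, morgan_on_bits(smiles))` would produce one such line per molecule. Parsing every SMILES and computing its fingerprint is the extra work that running bayon directly on a prepared dataset avoids.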