File: README.md

package info (click to toggle)
rdkit 202503.6-4
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 222,000 kB
  • sloc: cpp: 411,111; python: 78,482; ansic: 26,181; java: 8,285; javascript: 4,404; sql: 2,393; yacc: 1,626; lex: 1,267; cs: 1,090; makefile: 581; xml: 229; fortran: 183; sh: 121
file content (162 lines) | stat: -rw-r--r-- 4,653 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
FreeWilson
==========
A Mathematical Contribution to Structurre-Activity Studies
Spencer M, Free, Jr and James W Wilson
Journal of Medicinal Chemistry
Vol 7, Number 4, July 6, 1964

Basic usage: get a scaffold (or scaffolds) some compounds and their scores, 
then run freewilson:


Read some example data:
```
>>> import os, sys
>>> DATA_PATH = "data"
>>> smilesfile = os.path.join(DATA_PATH, "CHEMBL2321810.smi")
>>> scaffoldfile = os.path.join(DATA_PATH, "CHEMBL2321810_scaffold.mol")
>>> csvfile = os.path.join(DATA_PATH, "CHEMBL2321810_act.csv")
>>> mols = []
>>> for line in open(smilesfile):
...     smiles, name = line.strip().split()
...     m = Chem.MolFromSmiles(smiles)
...     m.SetProp("_Name", name)
...     mols.append(m)
>>> scaffold = Chem.MolFromMolBlock(open(scaffoldfile).read())
>>> data = {k:float(v) for k,v in list(csv.reader(open(csvfile)))[1:]}
>>> scores = [data[m.GetProp("_Name")] for m in mols]

```

And do the decomposition:
```
>>> from freewilson import FWDecompose, FWBuild, predictions_to_csv
>>> decomp = FWDecompose(scaffold, mols, scores)
```

Scores need to be in an appropriate form for regrerssion analysis, i.e. pIC50s as opposed to IC50s.
Enumerations are a lot faster if you know a prediction cutoff or molecular weight cutoff.

Scaffolds can be any scaffold or smarts pattern or list of either.  The system automatically 
enumerates matching cores in the analysis.

To see if the decomposition is useful, check the r squared value

```
>>> print(f"Training R^2 is {decomp.r2:0.2f}")
Training R^2 is 0.81

```

Finally you can build the decomposition into new molecules:

```
>>> for pred in FWBuild(decomp):
...     print(pred.smiles, pred.prediction)

```

Now this builds both well and poorly predicted molecules.  To prevent
this you can use the following filters while building:

   1. pred_filter:  given a prediction, return True to keep the molecule
   2. hvy_filter: given a heavy atom count, return True to keep the molecule
   3. mw_filter: given a molecular weight, return True to keep the molecule
   4. mol_filter: given a sanitized molecule, return True to keep the molecule

Here are some examples (see using Molecular Filters below)

```
>>> preds = FWBuild(decomp, 
...                 pred_filter=lambda x: x > 8, 
...                 mw_filter=lambda mw: 100<mw<550)
>>> predictions_to_csv(sys.stdout, decomp, preds)

```

Using Molecule filters
----------------------
Finally, the molecule filter can be used to prune greasy or otherwise undesirable
molecules:

```
>>> from rdkit.Chem import Descriptors
>>> for pred in FWBuild(decomp, pred_filter=lambda x: x > 8,
...                             mol_filter=lambda mol: -3 < Descriptors.MolLogP(mol) < 3):
...   print(pred.smiles, pred.prediction)

```

More Info
---------
You can also get some more information by setting logging to INFO

```
>>> import logging
>>> logging.getLogger().setLevel(logging.INFO)
>>> preds = list(FWBuild(decomp, pred_filter=lambda x: x > 8))

```

You'll see something like this:

```
INFO:root:Enumerating rgroups with no broken cycles...
INFO:root:	Core	1
INFO:root:	R1	73
INFO:root:	R10	2
INFO:root:	R3	445
100%|███████████████████████████████████████████████████████| 64970/64970 [00:05<00:00, 11247.38it/s]
INFO:root:Wrote 3 results out of 64970
	In Training set: 628
	Bad MW: 0
	Bad Pred: 64339
	Bad Filters: 0
	Bad smi: 0
	min mw: 380.429
	max mw: 772.4030000000001
	
	min pred: 2.8927324757327666
	max pred: 8.148182660251193
```

Using FMCS to find a scaffold
-----------------------------

If you don't have a good feel for the dataset, you can generate the scaffold by using
the rdFMCS package.  The following tries to find a decent MCS that covers 80% of the
input structures


```
>>> from rdkit.Chem import rdFMCS
>>> mcs = rdFMCS.FindMCS(mols, threshold=0.8, atomCompare=rdFMCS.AtomCompare.CompareAny,
...                      completeRingsOnly=True)
>>> decomp = FWDecompose(mcs.queryMol, mols, scores)

```

Note that the MCS returned can generate multiple Cores, this freewilson implementatino
explicitly supports this.

Using Molecule filters
----------------------
Finally, the molecule filter can be used to prune greasy or otherrwise undesirable
molecules:

```
>>> from rdkit.Chem import Descriptors
>>> for pred in FWBuild(decomp, pred_filter=lambda x: x > 8,
...                             mol_filter=lambda mol: -3 < Descriptors.MolLogP(mol) < 3):
...   print(pred.smiles, pred.prediction)
```


Testing
=======

From this directory:

```
PYTHONPATH=`pwd` pytest
```