# Retrieving CAS registry numbers

PubChem does not officially support CAS registry numbers as compound metadata or as a property type. In many cases, however, they appear among the name synonyms associated with a compound. They can therefore usually be retrieved by fetching a compound's synonyms and keeping only those that match the CAS registry number format.

In [None]:
import re
import time

import pubchempy as pcp

In [None]:
# Optionally enable debug logging to make it easier to see what is going on:
# import logging
# logging.basicConfig(level=logging.DEBUG)

A function to get CAS registry numbers by filtering a list of compound synonyms:

In [None]:
def filter_cas_rns(synonyms: list[str]) -> list[str]:
    """Filter a list of synonyms to just those that look like CAS registry numbers."""
    cas_rns = []
    for syn in synonyms:
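        # CAS RN shape: 2-7 digits, 2 digits, 1 check digit, separated by hyphens.
        # re.match anchors only at the start, so trailing text after the number is ignored.
        match = re.match(r"(\d{2,7}-\d\d-\d)", syn)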
        if match:
            cas_rns.append(match.group(1))
    return cas_rns
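
The pattern above checks only the shape of the string, and `re.match` anchors only at the start. A synonym that merely begins with a CAS-like prefix is therefore captured too, and the same number can appear more than once. As a hedged sketch, a stricter filter could require a full match, verify the CAS check digit, and drop duplicates. The check digit is the final digit: it equals the sum of the other digits, each weighted by its position counted from the right, modulo 10. The helper names `is_valid_cas_rn` and `filter_valid_cas_rns` are hypothetical and not part of PubChemPy.

```python
import re

CAS_RN_PATTERN = re.compile(r"(\d{2,7})-(\d\d)-(\d)")


def is_valid_cas_rn(candidate: str) -> bool:
    """Return True if candidate has the CAS RN shape and a correct check digit."""
    match = CAS_RN_PATTERN.fullmatch(candidate.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight each digit by its position counted from the right (1, 2, 3, ...).
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def filter_valid_cas_rns(synonyms: list[str]) -> list[str]:
    """Keep unique synonyms that are valid CAS RNs, preserving their order."""
    seen: dict[str, None] = {}
    for syn in synonyms:
        candidate = syn.strip()
        if is_valid_cas_rn(candidate):
            seen.setdefault(candidate, None)
    return list(seen)


print(filter_valid_cas_rns(["7440-05-3", "Palladium", "7440-05-3", "90-38-7"]))
# ['7440-05-3']
```

Here `90-38-7` is rejected because its weighted digit sum is 50, which gives a check digit of 0 rather than 7. This stricter variant is an alternative to `filter_cas_rns`, not a replacement used elsewhere in this notebook, so the recorded outputs below still reflect the original function.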

## CAS Registry Numbers for a given PubChem CID

In [None]:
for result in pcp.get_synonyms(2242):
    cid = result["CID"]
    cas_rns = filter_cas_rns(result.get("Synonym", []))
    print(f"CAS registry numbers for CID {cid}: {cas_rns}")

CAS registry numbers for CID 2242: ['25548-16-7']


## CAS Registry Numbers for a batch of PubChem CIDs

In [None]:
for result in pcp.get_synonyms([2242, 134601, 6992065]):
    cid = result["CID"]
    cas_rns = filter_cas_rns(result.get("Synonym", []))
    print(f"CAS registry numbers for CID {cid}: {cas_rns}")

CAS registry numbers for CID 2242: ['25548-16-7']
CAS registry numbers for CID 134601: ['22839-47-0', '53906-69-7', '7421-84-3']
CAS registry numbers for CID 6992065: ['22839-65-2']
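
When the numbers are needed later rather than just printed, the results can be collected into a dictionary keyed by CID. The following is a minimal sketch, assuming each result is a dict with a `CID` key and an optional `Synonym` list, as the loop above reads them. `cas_rns_by_cid` is a hypothetical helper, and the sample records are hand-written stand-ins rather than real PubChem responses; in practice the input would be the list returned by `pcp.get_synonyms(...)`.

```python
import re


def cas_rns_by_cid(results: list[dict]) -> dict[int, list[str]]:
    """Map each CID to the CAS-like synonyms found in its result record."""
    pattern = re.compile(r"\d{2,7}-\d\d-\d")
    mapping: dict[int, list[str]] = {}
    for result in results:
        synonyms = result.get("Synonym", [])
        mapping[result["CID"]] = [s for s in synonyms if pattern.fullmatch(s)]
    return mapping


# Hand-written sample records for illustration, not real PubChem responses.
sample_results = [
    {"CID": 2242, "Synonym": ["25548-16-7", "placeholder name"]},
    {"CID": 6992065},  # a record with no "Synonym" key
]
print(cas_rns_by_cid(sample_results))
# {2242: ['25548-16-7'], 6992065: []}
```

Keeping an empty list for CIDs without a match makes it easy to tell "looked up, nothing found" apart from "never looked up".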


## CAS Registry Numbers for a PubChemPy Compound object

In [None]:
compound = pcp.Compound.from_cid(2242)
cas_rns = filter_cas_rns(compound.synonyms)
print(f"CAS registry numbers for CID 2242: {cas_rns}")

CAS registry numbers for CID 2242: ['25548-16-7']


## CAS Registry Numbers for substructure search results

In [None]:
count = 0
for result in pcp.get_synonyms(
    "COC(=O)C(CC1=CC=CC=C1)NC(=O)C(CC(=O)O)N", "smiles", searchtype="substructure"
):
    cid = result["CID"]
    cas_rns = filter_cas_rns(result.get("Synonym", []))
    print(f"CAS registry numbers for CID {cid}: {cas_rns}")
    count += 1
    if count >= 10:
        break

CAS registry numbers for CID 134601: ['22839-47-0', '53906-69-7', '7421-84-3']
CAS registry numbers for CID 9810996: ['165450-17-9']
CAS registry numbers for CID 2242: ['25548-16-7']
CAS registry numbers for CID 6992066: []
CAS registry numbers for CID 56843846: ['714229-20-6', '245650-17-3']
CAS registry numbers for CID 3804937: []
CAS registry numbers for CID 25130065: ['106372-55-8']
CAS registry numbers for CID 6992065: ['22839-65-2']
CAS registry numbers for CID 14060789: []
CAS registry numbers for CID 44364601: []


A substructure search that returns many results can raise a TimeoutError. In that case, it may be better to run the substructure search first and then fetch the synonyms for each result individually or in batches:

In [None]:
cids = pcp.get_cids("[Pd]", "smiles", searchtype="substructure")
batch_size = 5
for i in range(0, len(cids), batch_size):
    batch = cids[i : i + batch_size]
    print(f"Getting synonyms for batch: {batch}")
    results = pcp.get_synonyms(batch)
    for result in results:
        cid = result["CID"]
        cas_rns = filter_cas_rns(result.get("Synonym", []))
        print(f"CAS registry numbers for CID {cid}: {cas_rns}")
    time.sleep(1)  # Respect PubChem's rate limits
    # Stop after the first five batches (i = 0, 5, 10, 15, 20) to keep the example short.
    if i >= 20:
        break

Getting synonyms for batch: [24290, 23938, 5702536, 62732, 167845]
CAS registry numbers for CID 24290: ['7647-10-1']
CAS registry numbers for CID 23938: ['7440-05-3', '7440-05-3', '7440-05-3']
CAS registry numbers for CID 5702536: ['12107-56-1', '12193-11-2']
CAS registry numbers for CID 62732: ['19168-23-1', '13820-55-8']
CAS registry numbers for CID 167845: ['3375-31-3', '19807-27-3']
Getting synonyms for batch: [11979704, 9811564, 74855, 24932, 61732]
CAS registry numbers for CID 11979704: ['14221-01-3']
CAS registry numbers for CID 9811564: ['51364-51-3', '60748-47-2']
CAS registry numbers for CID 74855: ['2035-66-7']
CAS registry numbers for CID 24932: ['10102-05-3']
CAS registry numbers for CID 61732: ['14323-43-4', '13782-33-7']
Getting synonyms for batch: [73974, 161205, 424947, 153931, 166846]
CAS registry numbers for CID 73974: ['11113-77-2']
CAS registry numbers for CID 161205: ['16970-55-1']
CAS registry numbers for CID 424947: ['7790-38-7', '90-38-7']
CAS registry numbers 