File: README.md

# tldextract [![PyPI version](https://badge.fury.io/py/tldextract.svg)](https://badge.fury.io/py/tldextract) [![Build Status](https://github.com/john-kurkowski/tldextract/actions/workflows/ci.yml/badge.svg)](https://github.com/john-kurkowski/tldextract/actions/workflows/ci.yml)

`tldextract` accurately separates a URL's subdomain, domain, and public suffix,
using [the Public Suffix List (PSL)](https://publicsuffix.org).

**Why?** Naive URL parsing like splitting on dots fails for domains like
`forums.bbc.co.uk` (gives "co" instead of "bbc"). `tldextract` handles the edge
cases, so you don't have to.
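
For example, a naive split on dots picks the wrong label:

```python
>>> 'forums.bbc.co.uk'.split('.')[-2]  # looks like the domain, but isn't
'co'
```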

## Quick Start

```python
>>> import tldextract

>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com', is_private=False)

>>> tldextract.extract('http://forums.bbc.co.uk/')
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk', is_private=False)

>>> # Access the parts you need
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> ext.domain
'bbc'
>>> ext.top_domain_under_public_suffix
'bbc.co.uk'
>>> ext.fqdn
'forums.bbc.co.uk'
```

## Install

```zsh
pip install tldextract
```

## How-to Guides

### How to disable HTTP suffix list fetching for production

```python
no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=())
no_fetch_extract('http://www.google.com')
```

### How to set a custom cache location

Via environment variable:

```zsh
export TLDEXTRACT_CACHE="/path/to/cache"
```

Or in code:

```python
custom_cache_extract = tldextract.TLDExtract(cache_dir='/path/to/cache/')
```

### How to update TLD definitions

Command line:

```zsh
tldextract --update
```

Or delete the cache folder to force a fresh fetch on next use:

```zsh
rm -rf $HOME/.cache/python-tldextract
```
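
You can also trigger an update from Python. A minimal sketch, assuming the `update` method on `TLDExtract` instances (the `fetch_now` flag is an assumption based on recent releases):

```python
import tldextract

extract = tldextract.TLDExtract()
extract.update(fetch_now=True)  # clear cached data and re-fetch the list immediately
```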

### How to treat private domains as suffixes

```python
extract = tldextract.TLDExtract(include_psl_private_domains=True)
extract('waiterrant.blogspot.com')
# ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
```

### How to use a local suffix list

```python
extract = tldextract.TLDExtract(
    suffix_list_urls=["file:///path/to/your/list.dat"],
    cache_dir='/path/to/cache/',
    fallback_to_snapshot=False)
```

### How to use a remote suffix list

```python
extract = tldextract.TLDExtract(
    suffix_list_urls=["https://myserver.com/suffix-list.dat"])
```

### How to add extra suffixes

```python
extract = tldextract.TLDExtract(
    extra_suffixes=["foo", "bar.baz"])
```
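
Continuing that example, the added suffixes should then behave like ordinary PSL entries (expected output, assuming defaults otherwise):

```python
>>> extract('somewhere.foo')
ExtractResult(subdomain='', domain='somewhere', suffix='foo', is_private=False)
```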

### How to validate URLs before extraction

`tldextract` itself does not validate its input. For strict validation, parse the URL with the standard library first, then pass the parsed result in:
```python
import tldextract
from urllib.parse import urlsplit

# Pre-parse with the standard library, then extract from the parsed parts
extractor = tldextract.TLDExtract()
split_url = urlsplit("https://example.com:8080/path")
result = extractor.extract_urllib(split_url)
```

## Command Line

```zsh
$ tldextract http://forums.bbc.co.uk
forums bbc co.uk

$ tldextract --update  # Update cached suffix list
$ tldextract --help    # See all options
```

## Understanding Domain Parsing

### Public Suffix List

`tldextract` uses the [Public Suffix List](https://publicsuffix.org), a
community-maintained list of domain suffixes. The PSL contains both:

- **Public suffixes**: Where anyone can register a domain (`.com`, `.co.uk`,
  `.org.kg`)
- **Private suffixes**: Operated by companies for customer subdomains
  (`blogspot.com`, `github.io`)

Web browsers use this same list for security decisions like cookie scoping.
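
Once private domains are enabled, the `is_private` flag on the result tells you which section a matched suffix came from:

```python
>>> import tldextract
>>> extract = tldextract.TLDExtract(include_psl_private_domains=True)
>>> extract('http://waiterrant.blogspot.com').is_private
True
>>> extract('http://forums.bbc.co.uk').is_private
False
```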

### Suffix vs. TLD

While `.com` is a top-level domain (TLD), many suffixes like `.co.uk` are
technically second-level. The PSL uses "public suffix" to cover both.
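
The distinction shows up directly in the `suffix` field:

```python
>>> import tldextract
>>> tldextract.extract('http://example.com').suffix    # a true TLD
'com'
>>> tldextract.extract('http://example.co.uk').suffix  # a second-level public suffix
'co.uk'
```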

### Default behavior with private domains

By default, `tldextract` treats private suffixes as regular domains:

```python
>>> tldextract.extract('waiterrant.blogspot.com')
ExtractResult(subdomain='waiterrant', domain='blogspot', suffix='com', is_private=False)
```

To treat them as suffixes instead, see
[How to treat private domains as suffixes](#how-to-treat-private-domains-as-suffixes).

### Caching behavior

By default, `tldextract` fetches the latest Public Suffix List on first use and
caches it indefinitely in `$HOME/.cache/python-tldextract`.

### URL validation

`tldextract` accepts any string and is lenient by design: it prioritizes ease of
use over strict validation, attempting extraction on whatever it is given, even
partial URLs or non-URLs.
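
For instance, scheme-less input works, and a string with no known suffix lands entirely in `domain`:

```python
>>> tldextract.extract('bbc.co.uk')  # no scheme required
ExtractResult(subdomain='', domain='bbc', suffix='co.uk', is_private=False)
>>> tldextract.extract('http://localhost:8080/deployed/')
ExtractResult(subdomain='', domain='localhost', suffix='', is_private=False)
```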

## FAQ

### Can you add/remove suffix \_\_\_\_?

`tldextract` doesn't maintain the suffix list. Submit changes to
[the Public Suffix List](https://publicsuffix.org/submit/).

Meanwhile, use the `extra_suffixes` parameter, or fork the PSL and pass it to
this library with the `suffix_list_urls` parameter.

### My suffix is in the PSL but not extracted correctly

Check if it's in the "PRIVATE" section. See
[How to treat private domains as suffixes](#how-to-treat-private-domains-as-suffixes).

### Why does it parse invalid URLs?

See [URL validation](#url-validation) and
[How to validate URLs before extraction](#how-to-validate-urls-before-extraction).

## Contribute

### Setting up

1. `git clone` this repository.
2. Change into the new directory.
3. `pip install --upgrade --editable '.[testing]'`

### Running tests

```zsh
tox --parallel       # Test all Python versions
tox -e py311         # Test specific Python version
ruff format .        # Format code
```

## History

This package started from a
[StackOverflow answer](http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url/569219#569219)
about regex-based domain extraction. The regex approach fails for many domains,
so this library switched to the Public Suffix List for accuracy.