.. _advanced.base:

Relative Link Resolution
========================
Many feed elements and attributes are :abbr:`URI (Uniform Resource Identifier)`\s.
:program:`Universal Feed Parser` resolves relative :abbr:`URI (Uniform Resource Identifier)`\s
according to the `XML:Base <http://www.w3.org/TR/xmlbase/>`_ specification. We'll see how
that works in a minute, but first let's talk about which values are treated as
:abbr:`URI (Uniform Resource Identifier)`\s.

Which Values Are :abbr:`URI (Uniform Resource Identifier)`\s
------------------------------------------------------------

The following feed elements are treated as :abbr:`URI (Uniform Resource Identifier)`\s
and are resolved if they are relative:

* :ref:`reference.entry.author_detail.href`
* :ref:`reference.entry.comments`
* :ref:`reference.entry.contributors.href`
* :ref:`reference.entry.enclosures.href`
* :ref:`reference.entry.id`
* :ref:`reference.entry.license`
* :ref:`reference.entry.link`
* :ref:`reference.entry.links.href`
* :ref:`reference.entry.publisher_detail.href`
* :ref:`reference.entry.source.author_detail.href`
* :ref:`reference.entry.source.contributors.href`
* :ref:`reference.entry.source.links.href`
* :ref:`reference.feed.author_detail.href`
* :ref:`reference.feed.contributors.href`
* :ref:`reference.feed.docs`
* :ref:`reference.feed.generator_detail.href`
* :ref:`reference.feed.id`
* :ref:`reference.feed.image.href`
* :ref:`reference.feed.image.link`
* :ref:`reference.feed.license`
* :ref:`reference.feed.link`
* :ref:`reference.feed.links.href`
* :ref:`reference.feed.publisher_detail.href`
* :ref:`reference.feed.textinput.link`

In addition, several feed elements may contain :abbr:`HTML (HyperText Markup Language)`
or :abbr:`XHTML (Extensible HyperText Markup Language)` markup. Certain
:abbr:`HTML (HyperText Markup Language)` attributes can contain relative
:abbr:`URI (Uniform Resource Identifier)`\s, and :program:`Universal Feed Parser`
resolves these :abbr:`URI (Uniform Resource Identifier)`\s according to the same rules
as the feed elements listed above.

The following feed elements may contain :abbr:`HTML (HyperText Markup Language)` or
:abbr:`XHTML (Extensible HyperText Markup Language)` markup. In Atom feeds,
whether these elements are treated as :abbr:`HTML (HyperText Markup Language)`
depends on the value of the ``type`` attribute. In :abbr:`RSS (Rich Site Summary)`
feeds, these values are always treated as :abbr:`HTML (HyperText Markup Language)`.

* :ref:`reference.entry.content.value`
* :ref:`reference.entry.summary` (:ref:`reference.entry.summary_detail.value`)
* :ref:`reference.entry.title` (:ref:`reference.entry.title_detail.value`)
* :ref:`reference.feed.info` (:ref:`reference.feed.info_detail.value`)
* :ref:`reference.feed.rights` (:ref:`reference.feed.rights_detail.value`)
* :ref:`reference.feed.subtitle` (:ref:`reference.feed.subtitle_detail.value`)
* :ref:`reference.feed.title` (:ref:`reference.feed.title_detail.value`)

When any of these feed elements contains :abbr:`HTML (HyperText Markup Language)`
or :abbr:`XHTML (Extensible HyperText Markup Language)` markup, the
following :abbr:`HTML (HyperText Markup Language)` attributes are treated as
:abbr:`URI (Uniform Resource Identifier)`\s and are resolved if they are
relative:

* <a href="...">
* <applet codebase="...">
* <area href="...">
* <audio src="...">
* <blockquote cite="...">
* <body background="...">
* <del cite="...">
* <form action="...">
* <frame longdesc="...">
* <frame src="...">
* <head profile="...">
* <iframe longdesc="...">
* <iframe src="...">
* <img longdesc="...">
* <img src="...">
* <img usemap="...">
* <input src="...">
* <input usemap="...">
* <ins cite="...">
* <link href="...">
* <object classid="...">
* <object codebase="...">
* <object data="...">
* <object usemap="...">
* <q cite="...">
* <script src="...">
* <source src="...">
* <video poster="...">
* <video src="...">
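
As a rough illustration of the list above, the sketch below scans an HTML fragment with Python's standard ``html.parser`` module and resolves a few of the listed attributes against a base URI using ``urllib.parse.urljoin``. It is not :program:`Universal Feed Parser`'s own implementation; the class name ``RelativeURICollector``, the ``URI_ATTRS`` subset, and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# A small, illustrative subset of the (element, attribute) pairs listed above.
URI_ATTRS = {
    ("a", "href"),
    ("img", "src"),
    ("blockquote", "cite"),
    ("video", "poster"),
}


class RelativeURICollector(HTMLParser):
    """Collect (tag, attribute, resolved URI) for URI-bearing attributes."""

    def __init__(self, base):
        super().__init__()
        self.base = base
        self.resolved = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        for name, value in attrs:
            if value is not None and (tag, name) in URI_ATTRS:
                self.resolved.append((tag, name, urljoin(self.base, value)))


collector = RelativeURICollector("http://example.org/archives/")
collector.feed(
    '<p><a href="000001.html">first</a> '
    '<img src="/images/logo.png" alt="logo"></p>'
)
for item in collector.resolved:
    print(item)
```

Attributes that are not in ``URI_ATTRS`` (such as ``alt`` here) are left alone, which mirrors the idea that only the listed attributes are candidates for resolution.
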
How Relative :abbr:`URI (Uniform Resource Identifier)`\s Are Resolved
---------------------------------------------------------------------
:program:`Universal Feed Parser` resolves relative :abbr:`URI (Uniform Resource Identifier)`\s
according to the `XML:Base <http://www.w3.org/TR/xmlbase/>`_ specification.
This defines a hierarchical inheritance system, where one element can define
the base :abbr:`URI (Uniform Resource Identifier)` for itself and all of its
child elements, using an xml:base attribute. A child element can then override
its parent's base :abbr:`URI (Uniform Resource Identifier)` by redeclaring
xml:base to a different value.
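
The inheritance rule can be sketched with the standard library alone. The example below walks a small, made-up Atom document with ``xml.etree.ElementTree`` and tracks the base :abbr:`URI (Uniform Resource Identifier)` in effect for each element. The function name ``resolve_links``, the sample feed, and the starting base URI are hypothetical, and this is only an approximation of the XML:Base rules, not the parser's actual code.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Hypothetical feed: the <entry> overrides the base declared on <feed>.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://example.org/">
  <link href="index.html"/>
  <entry xml:base="http://example.org/archives/">
    <link href="000001.html"/>
  </entry>
</feed>"""


def resolve_links(element, inherited_base, out):
    """Walk the tree, tracking the base URI in effect for each element."""
    # An xml:base value may itself be relative to the inherited base;
    # urljoin() returns the inherited base unchanged when the value is empty.
    base = urljoin(inherited_base, element.get(XML_BASE, ""))
    href = element.get("href")
    if href is not None:
        out.append(urljoin(base, href))
    for child in element:
        resolve_links(child, base, out)
    return out


links = resolve_links(ET.fromstring(FEED), "http://feeds.example.net/atom.xml", [])
print(links)
```

Because the base is computed before the element's own ``href`` is read, an ``xml:base`` declared on an element also applies to that element's other attributes.
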
If no xml:base is specified, the feed has a default base :abbr:`URI (Uniform Resource Identifier)`
defined in the Content-Location :abbr:`HTTP (Hypertext Transfer Protocol)` header.
If no Content-Location :abbr:`HTTP (Hypertext Transfer Protocol)` header is
present, the :abbr:`URL (Uniform Resource Locator)` used to retrieve the feed
itself is the default base :abbr:`URI (Uniform Resource Identifier)` for all
relative links within the feed. If the feed was retrieved via an
:abbr:`HTTP (Hypertext Transfer Protocol)` redirect (any :abbr:`HTTP (Hypertext Transfer Protocol)`
3xx status code), then the final :abbr:`URL (Uniform Resource Locator)` of the
feed is the default base :abbr:`URI (Uniform Resource Identifier)`.
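
The fallback order described above can be written as a small helper. This is a minimal sketch under stated assumptions: the function name ``default_base_uri``, its parameters, and the sample URLs are hypothetical, and resolving a relative ``Content-Location`` value against the final URL is a choice made for this example rather than documented parser behavior.

```python
from urllib.parse import urljoin


def default_base_uri(final_url, headers):
    """Pick the default base URI for a feed that declares no xml:base.

    `final_url` is the URL the feed was ultimately served from (after any
    redirects); `headers` maps lowercase response header names to values.
    """
    content_location = headers.get("content-location")
    if content_location:
        # Assumption: a relative Content-Location is resolved against final_url.
        return urljoin(final_url, content_location)
    return final_url


# Hypothetical responses: one with a Content-Location header, one without.
with_header = default_base_uri(
    "http://feeds.example.net/atom.xml",
    {"content-location": "http://example.org/"},
)
without_header = default_base_uri("http://feeds.example.net/atom.xml", {})

print(urljoin(with_header, "index.html"))
print(urljoin(without_header, "index.html"))
```

Any ``xml:base`` declared inside the feed would then be applied on top of whichever default this helper returns.
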
For example, an xml:base on the root-level element sets the base
:abbr:`URI (Uniform Resource Identifier)` for all :abbr:`URI (Uniform Resource Identifier)`\s in the feed.

xml:base on the root-level element
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    >>> import feedparser
    >>> d = feedparser.parse("http://feedparser.org/docs/examples/base.xml")
    >>> d.feed.link
    u'http://example.org/index.html'
    >>> d.feed.generator_detail.href
    u'http://example.org/generator/'

An xml:base attribute on an <entry> overrides the xml:base on the parent <feed>.

Overriding xml:base on an <entry>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    >>> import feedparser
    >>> d = feedparser.parse("http://feedparser.org/docs/examples/base.xml")
    >>> d.entries[0].link
    u'http://example.org/archives/000001.html'
    >>> d.entries[0].author_detail.href
    u'http://example.org/about/'

An xml:base on <content> overrides the xml:base on the parent <entry>. In
addition, whatever the base :abbr:`URI (Uniform Resource Identifier)` is for
the <content> element (whether defined directly on the <content> element, or
inherited from the parent element) is used as the base :abbr:`URI (Uniform Resource Identifier)`
for the embedded :abbr:`HTML (HyperText Markup Language)`
or :abbr:`XHTML (Extensible HyperText Markup Language)` markup within the
content.

Relative links within embedded :abbr:`HTML (HyperText Markup Language)`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    >>> import feedparser
    >>> d = feedparser.parse("http://feedparser.org/docs/examples/base.xml")
    >>> d.entries[0].content[0].value
    u'<p id="anchor1"><a href="http://example.org/archives/000001.html#anchor2">skip to anchor 2</a></p>
    <p>Some content</p>
    <p id="anchor2">This is anchor 2</p>'

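
To see how different kinds of relative references combine with a base :abbr:`URI (Uniform Resource Identifier)`, the short sketch below resolves a fragment-only reference, a path-relative reference, a root-relative reference, and an absolute URI against the same base using ``urllib.parse.urljoin``. The references other than ``#anchor2`` are hypothetical values chosen for illustration.

```python
from urllib.parse import urljoin

# Base URI in effect for the embedded markup.
base = "http://example.org/archives/000001.html"

# One reference of each kind; only "#anchor2" comes from the example feed.
references = ["#anchor2", "000002.html", "/about/", "http://example.com/api/"]
resolved = {ref: urljoin(base, ref) for ref in references}

for ref, absolute in resolved.items():
    print(ref, "->", absolute)
```

A fragment-only reference keeps the whole base and appends the fragment, while an absolute URI is returned unchanged.
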
An xml:base attribute also applies to the other attributes of the element on which it is declared.

xml:base and sibling attributes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    >>> import feedparser
    >>> d = feedparser.parse("http://feedparser.org/docs/examples/base.xml")
    >>> d.entries[0].links[1].rel
    u'service.edit'
    >>> d.entries[0].links[1].href
    u'http://example.com/api/client/37'

If no xml:base is specified on the root-level element, the default base
:abbr:`URI (Uniform Resource Identifier)` is given in the Content-Location
:abbr:`HTTP (Hypertext Transfer Protocol)` header. This can still be
overridden by any child element that declares an xml:base attribute.

Content-Location :abbr:`HTTP (Hypertext Transfer Protocol)` header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    >>> import feedparser
    >>> d = feedparser.parse("http://feedparser.org/docs/examples/http_base.xml")
    >>> d.feed.link
    u'http://example.org/index.html'
    >>> d.entries[0].link
    u'http://example.org/archives/000001.html'

Finally, if no root-level xml:base is declared, and no Content-Location
:abbr:`HTTP (Hypertext Transfer Protocol)` header is present, the
:abbr:`URL (Uniform Resource Locator)` of the feed itself is the default base
:abbr:`URI (Uniform Resource Identifier)`. Again, this can still be overridden
by any element that declares an xml:base attribute.

Feed :abbr:`URL (Uniform Resource Locator)` as default base :abbr:`URI (Uniform Resource Identifier)`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    >>> import feedparser
    >>> d = feedparser.parse("http://feedparser.org/docs/examples/no_base.xml")
    >>> d.feed.link
    u'http://feedparser.org/docs/examples/index.html'
    >>> d.entries[0].link
    u'http://example.org/archives/000001.html'

.. _advanced.base.disable:

Disabling Relative :abbr:`URI (Uniform Resource Identifier)` Resolution
-------------------------------------------------------------------------

Though not recommended, it is possible to disable :program:`Universal Feed Parser`\'s relative
:abbr:`URI (Uniform Resource Identifier)` resolution by passing ``resolve_relative_uris=False``
to :func:`feedparser.parse()`. This disables resolution within HTML content,
but not in other contexts such as :ref:`reference.entry.link`.

How to disable relative :abbr:`URI (Uniform Resource Identifier)` resolution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
::

    >>> import feedparser
    >>> d = feedparser.parse('http://feedparser.org/docs/examples/base.xml')
    >>> d.entries[0].content[0].base
    u'http://example.org/archives/000001.html'
    >>> print(d.entries[0].content[0].value)
    <p id="anchor1"><a href="http://example.org/archives/000001.html#anchor2">skip to anchor 2</a></p>
    <p>Some content</p>
    <p id="anchor2">This is anchor 2</p>
    >>> # Parse again with relative URI resolution disabled for embedded HTML.
    >>> d2 = feedparser.parse('http://feedparser.org/docs/examples/base.xml', resolve_relative_uris=False)
    >>> d2.entries[0].content[0].base
    u'http://example.org/archives/000001.html'
    >>> print(d2.entries[0].content[0].value)
    <p id="anchor1"><a href="#anchor2">skip to anchor 2</a></p>
    <p>Some content</p>
    <p id="anchor2">This is anchor 2</p>
