=======  =============================
SEP      7
Title    ItemLoader processors library
Author   Ismael Carnales
Created  2009-08-10
Status   Draft
=======  =============================
======================================
SEP-007: ItemLoader processors library
======================================
This SEP proposes a library of ``ItemLoader`` processors to ship with Scrapy.
date.py
=======
``to_date``
-----------
Converts a date string to a YYYY-MM-DD string suitable for ``DateField``.
**Decision**: Obsolete. ``DateField`` no longer exists.
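For reference, a minimal sketch of what such a conversion could look like. It assumes the input format is known in advance; the ``fmt`` parameter and its default value are illustrative assumptions, not part of the original adaptor.

```python
from datetime import datetime


def to_date(value, fmt="%d/%m/%Y"):
    """Parse `value` according to `fmt` and return a YYYY-MM-DD string."""
    # `fmt` is a hypothetical parameter; strptime raises ValueError
    # when `value` does not match it.
    return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
```

With the default format, ``to_date("10/08/2009")`` yields ``"2009-08-10"``.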
extraction.py
=============
``extract``
-----------
This adaptor tries to extract data from the given locations. Any
``XPathSelector`` among them is extracted, and any other data is added
as-is to the result.
**Decision**: Obsolete. Functionality included in ``XpathLoader``.
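The behaviour described above can be sketched roughly as follows. This is only an illustration: ``FakeSelector`` is a hypothetical stand-in for ``XPathSelector``, and the sketch assumes that calling ``extract()`` on a selector returns a list of strings.

```python
def extract(locations):
    """Extract selector-like objects and pass any other data through as-is."""
    result = []
    for location in locations:
        if hasattr(location, "extract"):
            # Assumption: extract() returns a list of strings.
            result.extend(location.extract())
        else:
            result.append(location)
    return result


class FakeSelector:
    """Hypothetical stand-in for an XPathSelector, used only for this demo."""

    def __init__(self, values):
        self._values = values

    def extract(self):
        return list(self._values)
```

Mixing both kinds of input, ``extract([FakeSelector(["a", "b"]), "c"])`` produces a single flat list.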
``ExtractImageLinks``
---------------------
This adaptor may receive either XPathSelectors pointing to the locations where
image URLs should be looked for, or just a list of XPath expressions (which
will be turned into selectors anyway).
**Decision**: XXX
markup.py
=========
``remove_tags``
---------------
Factory that returns an adaptor for removing each tag in the ``tags`` parameter
found in the given value. If no ``tags`` are specified, all of them are
removed.
**Decision**: XXX
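A minimal, regex-based sketch of such a factory is shown below. It is an illustration under simplifying assumptions (no handling of comments or of ``>`` inside attribute values), not the implementation proposed for Scrapy.

```python
import re


def remove_tags(tags=()):
    """Return an adaptor that strips the given tags (all tags if none given)."""
    if tags:
        names = "|".join(re.escape(tag) for tag in tags)
        # Opening, closing and self-closing forms of the listed tags only.
        pattern = re.compile(r"</?(?:%s)(?:\s[^>]*)?/?>" % names, re.IGNORECASE)
    else:
        # No tags specified: strip anything that looks like a tag.
        pattern = re.compile(r"<[^>]+>")

    def adaptor(value):
        return pattern.sub("", value)

    return adaptor
```

For example, ``remove_tags()("<p>Hello <b>world</b></p>")`` leaves only the text, while ``remove_tags(tags=("b",))`` removes the ``<b>`` tags and keeps the rest of the markup.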
``remove_root``
---------------
This adaptor removes the root tag of the given string/unicode, if one is found.
**Decision**: XXX
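One possible way to sketch this is with a single regular expression. The sketch does not check that the opening and closing tags actually match, so it should be read as an approximation only.

```python
import re

# An opening tag at the start, a closing tag at the end, anything in between.
_ROOT_RE = re.compile(r"^\s*<[^>]+>(.*)</[^>]+>\s*$", re.DOTALL)


def remove_root(value):
    """Strip the outermost tag pair from `value`, if there is one."""
    match = _ROOT_RE.match(value)
    return match.group(1) if match else value
```

Input without a surrounding tag pair is returned unchanged.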
``replace_escape``
------------------
Factory that returns an adaptor for removing/replacing each escape character in
the ``which_ones`` parameter found in the given value.
**Decision**: XXX
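A short sketch of how such a factory might look. The parameter is spelled ``which_ones`` here, and both its default value and the ``replace_by`` parameter are assumptions made for the example.

```python
def replace_escape(which_ones=("\n", "\t", "\r"), replace_by=""):
    """Return an adaptor that replaces each character in `which_ones`."""

    def adaptor(value):
        for char in which_ones:
            # `replace_by` is a hypothetical parameter; "" means plain removal.
            value = value.replace(char, replace_by)
        return value

    return adaptor
```

With the defaults the escape characters are simply dropped; passing ``replace_by=" "`` turns them into spaces instead.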
``unquote``
-----------
This factory returns an adaptor that receives a string or unicode, removes all
CDATA sections and entities (except entities inside CDATA sections and those
listed in the ``keep`` parameter), and returns a new string or unicode.
**Decision**: XXX
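The following sketch shows one reading of that description, written for Python 3. It assumes that "removing" a CDATA section means keeping its text without the markers, and that "removing" an entity means replacing it with the character it stands for; both are interpretations, not confirmed behaviour.

```python
import html
import re

_CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
_ENTITY_RE = re.compile(r"&(#?\w+);")


def unquote(keep=()):
    """Return an adaptor that unwraps CDATA and resolves entities outside it."""

    def _replace_entity(match):
        if match.group(1) in keep:
            return match.group(0)  # entity listed in `keep`: leave untouched
        return html.unescape(match.group(0))

    def adaptor(value):
        parts = []
        pos = 0
        for cdata in _CDATA_RE.finditer(value):
            # Text before the CDATA section: resolve its entities.
            parts.append(_ENTITY_RE.sub(_replace_entity, value[pos:cdata.start()]))
            # CDATA content is copied verbatim, without the markers.
            parts.append(cdata.group(1))
            pos = cdata.end()
        parts.append(_ENTITY_RE.sub(_replace_entity, value[pos:]))
        return "".join(parts)

    return adaptor
```

Entity names in ``keep`` are given without the ``&`` and ``;``, for example ``unquote(keep=("lt",))``.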
misc.py
=======
``to_unicode``
--------------
Receives a string and converts it to unicode using the given encoding (UTF-8 if
none is specified), returning a new unicode object. E.g.:

::

   >> to_unicode('it costs 20\xe2\x82\xac, or 30\xc2\xa3')
   [u'it costs 20\u20ac, or 30\xa3']

**Decision**: XXX
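In Python 3 terms, the same idea can be sketched as decoding ``bytes`` into ``str``. Passing values that are already text through unchanged is an assumption of this sketch.

```python
def to_unicode(value, encoding="utf-8"):
    """Decode `value` with `encoding`; text input is returned unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

For instance, ``to_unicode(b'30\xc2\xa3')`` gives ``'30\xa3'``, the pound sign being a single code point once decoded.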
``clean_spaces``
----------------
Collapses runs of multiple spaces into a single space in the given string. E.g.:

::

   >> clean_spaces(u'Hello   sir')
   u'Hello sir'

**Decision**: XXX
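A one-pattern sketch of this processor follows. It treats any run of whitespace (including tabs and newlines) as "multispaces", which is an assumption about the intended scope.

```python
import re

_SPACES_RE = re.compile(r"\s+")


def clean_spaces(value):
    """Collapse every run of whitespace in `value` into a single space."""
    return _SPACES_RE.sub(" ", value)
```

Restricting the pattern to ``" {2,}"`` would leave tabs and newlines alone, if only literal spaces should be collapsed.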
``drop_empty``
--------------
Removes any element that evaluates to false from the provided iterable. E.g.:

::

   >> drop_empty([0, 'this', None, 'is', False, 'an example'])
   ['this', 'is', 'an example']

**Decision**: Obsolete. Functionality included in reducers.
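Going by the example, where ``0`` and ``False`` are dropped along with ``None``, the behaviour can be sketched as a plain truthiness filter. Returning a list is an assumption of the sketch.

```python
def drop_empty(iterable):
    """Return a list with every falsy element of `iterable` removed."""
    return [item for item in iterable if item]
```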
``delist``
----------
This factory returns an adaptor that joins an iterable with the specified
delimiter.
**Decision**: Obsolete. Functionality included in reducers.
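As a sketch, such a factory is little more than a closure around ``str.join``. The default empty delimiter is an assumption.

```python
def delist(delimiter=""):
    """Return an adaptor that joins an iterable of strings with `delimiter`."""

    def adaptor(values):
        return delimiter.join(values)

    return adaptor
```

For example, ``delist(", ")(["a", "b"])`` returns ``"a, b"``.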
``Regex``
----------
This adaptor receives either a list of strings or an XPathSelector and returns
a new list with the matches of the given regular expression against those
strings. The regular expression is passed as a keyword argument and is
mandatory for this adaptor.
**Decision**: XXX
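A simplified sketch of the list-of-strings case is given below. Selector input is deliberately left out, and the keyword name ``regex`` is an assumption, since the text only says the expression is passed by keyword.

```python
import re


class Regex:
    """Adaptor returning all matches of `regex` in a list of strings."""

    def __init__(self, *, regex):
        # Keyword-only and without a default, so it is mandatory.
        self.regex = re.compile(regex)

    def __call__(self, values):
        matches = []
        for value in values:
            matches.extend(self.regex.findall(value))
        return matches
```

Strings without any match simply contribute nothing, so ``Regex(regex=r"\d+")(["a1b22", "c", "333"])`` returns the three numbers found.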