File: ignore.list

websec 1.8.0-1
[General]
all rights reserved
an error occurred
click here
comments
copyright
daily articles for
details
discussion forum
downloads
in issues
last modified
last updated
maintained
posted
posted at
previous cartoon
search by
special offer
the current week
total votes
visits
votes

[Date_Time]
\d+ Jan(uary)? \d+
\d+ Feb(ruary)? \d+
\d+ Mar(ch)? \d+
\d+ Apr(il)? \d+
\d+ May \d+
\d+ June? \d+
\d+ July? \d+
\d+ Aug(ust)? \d+
\d+ Sep(tember)? \d+
\d+ Oct(ober)? \d+
\d+ Nov(ember)? \d+
\d+ Dec(ember)? \d+
# 28-03-2005 28/03/2005 28.3.2005 2005-03-28
\d+[\/\-.]\d+[\/\-.]\d+
# 02:24 PST
\d{2}:\d{2} [A-Z]{3}

[Adverts]
http://www.news.com/cgi-bin/acc_clickthru
http://ads2.zdnet.com/adverts/
http://doublclick4.net

[VIM]
[\d,]+ scripts, [\d,]+ downloads
[\d,]+ tips, [\d,]+ tip views

[cvsweb]
\d+ (years?|months?|weeks?|days?|hours?|minutes?)

[Slashdot]
\d+ of \d+

__END__

=head1 NAME

ignore.list - websec url monitoring configuration

=head1 DESCRIPTION

=head2 IGNORE KEYWORDS

When determining which parts of a web page have changed, you may want to
skip paragraphs that contain certain predefined words. For example, pages
like InfoWorld, PC Magazine and PC Week often contain the current date/time
regardless of whether there is new or changed content. In such cases, you
can use IGNORE KEYWORDS to skip the paragraphs that contain date/time
information.

Ignore keywords are stored in a file called "ignore.list" in the same
directory as websec. Like the URL list, the ignore keywords are partitioned
into different sections. Each section has a user-defined name. An example is
shown below:

        [General]
        all rights reserved
        an error occurred
        click here
        comments
        copyright

        [Date_Time]
        January\s+\d{1,2}
        February\s+\d{1,2}
        March\s+\d{1,2}
        April\s+\d{1,2}
        May\s+\d{1,2}

In the example above, there are two sections: "General" and "Date_Time".
You can use them in the URL list as follows:

    Ignore = General

You can also use multiple sections at one go:

    Ignore = General,Date_Time

If you use certain ignore keywords regularly, you might want to add them to
a defaults section in the URL list.
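
The section layout described above can be illustrated with a short sketch.
It is written in Python purely for illustration (websec itself is a Perl
program), and it is not websec's actual parser. The function names
parse_ignore_list and resolve_ignore are hypothetical, and the sketch
assumes that lines starting with "#" are comments and that everything after
"__END__" is documentation.

```python
import re

# Assumption: a section header is a bare name in square brackets on its own line.
SECTION_HEADER = re.compile(r"^\[(\w+)\]$")


def parse_ignore_list(text):
    """Split ignore.list text into {section name: [patterns]} (hypothetical helper)."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "__END__":
            break  # assumption: the rest of the file is documentation
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and "#" lines carry no patterns
        header = SECTION_HEADER.match(line)
        if header:
            current = header.group(1)
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


def resolve_ignore(value, sections):
    """Expand a value such as 'General,Date_Time' into one combined pattern list."""
    patterns = []
    for name in value.split(","):
        patterns.extend(sections[name.strip()])
    return patterns


SAMPLE = r"""[General]
all rights reserved
click here

[Date_Time]
# 02:24 PST
\d{2}:\d{2} [A-Z]{3}

[VIM]
[\d,]+ tips, [\d,]+ tip views

__END__

=head1 NAME
"""

sections = parse_ignore_list(SAMPLE)
combined = resolve_ignore("General,Date_Time", sections)
print(combined)
```

Matching the header against the whole line keeps a pattern such as
"[\d,]+ tips, [\d,]+ tip views" from being mistaken for a section name, even
though it also begins with a square bracket.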

Ignore keywords can contain regular expressions. For example, the ignore
keyword "January\s+\d{1,2}" tells websec to look for the string "January",
followed by one or more whitespace characters, followed by one or two
digits.
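
As a rough illustration of how such a keyword behaves, the sketch below
applies the pattern to a few made-up paragraphs. It uses Python's re module,
which accepts this particular pattern with the same meaning as Perl. The
helper name is_ignored and the sample paragraphs are hypothetical, and
case-insensitive matching is an assumption made for the example, not
something this document states about websec.

```python
import re


def is_ignored(paragraph, patterns):
    """Return True if any ignore keyword matches somewhere in the paragraph."""
    # Assumption: matching is case-insensitive and unanchored.
    return any(re.search(p, paragraph, re.IGNORECASE) for p in patterns)


patterns = [r"January\s+\d{1,2}", "all rights reserved"]

paragraphs = [
    "Last updated January  28 by the editors.",
    "A new compiler release fixes two long-standing bugs.",
    "Copyright 2005, All Rights Reserved.",
]

# Keep only the paragraphs that no ignore keyword matches.
kept = [p for p in paragraphs if not is_ignored(p, patterns)]
print(kept)

# The pattern is not anchored, so it also matches the start of a year.
year_match = re.search(r"January\s+\d{1,2}", "January 2005")
print(year_match.group(0))
```

Because the pattern is unanchored, "\d{1,2}" does not stop a longer number
from matching: in "January 2005" the match simply covers "January 20".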

Two sections of ignore keywords are supplied in this distribution. "General"
contains some general ignore keywords which you may want to use. "Date_Time"
contains date/time detectors coded using regular expressions. Feel free to
add your own!


=head2 IGNORE URLS

Most advertisements in webpages are of the following form:

        <A HREF="http://page.url.com/advert/cgi-bin/" ...>
        <IMG SRC="advert.animated.gif" ...>
        Click here for free beer!
        </A>

Such advertisements can be ignored when running webdiff using ignore URLs.

Ignore URLs are also stored in "ignore.list". Each entry contains all or
part of the URL referenced by the <A HREF> tag that you want to ignore. An
example is shown below:

        [Adverts]
        page.url.com/advert/cgi-bin/

Use the "Adverts" section in the URL list as follows:

    IgnoreURL = Adverts

You can also use multiple sections at one go:

    IgnoreURL = Adverts1,Adverts2

If you use certain ignore URLs regularly, you might want to add them
to a defaults section in the URL list.

Like ignore keywords, ignore URLs can contain regular expressions.
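
Since an ignore URL is treated as a regular expression, an unescaped "."
matches any character, not only a literal dot. The sketch below shows this
with Python's re module. It is an illustration under that assumption, not
websec's implementation, and the helper name is_ignored_url and the sample
links are hypothetical.

```python
import re


def is_ignored_url(href, url_patterns):
    """Return True if any ignore URL pattern matches part of the link target."""
    return any(re.search(p, href) for p in url_patterns)


# The entry from the "Adverts" example above, used as a regular expression.
adverts = ["page.url.com/advert/cgi-bin/"]

hrefs = [
    "http://page.url.com/advert/cgi-bin/click?id=7",
    "http://page.url.com/news/today.html",
]

ignored = [h for h in hrefs if is_ignored_url(h, adverts)]
print(ignored)

# An unescaped "." also matches other characters, so this look-alike host matches too.
loose = is_ignored_url("http://pagexurl.com/advert/cgi-bin/", adverts)

# Escaping the dots restricts the pattern to the literal host name.
strict = is_ignored_url("http://pagexurl.com/advert/cgi-bin/",
                        [r"page\.url\.com/advert/cgi-bin/"])
print(loose, strict)
```

In practice the loose form is usually harmless for advert hosts, but writing
"\." for each dot makes the entry match only the intended URL.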

An "Adverts" section is supplied in this distribution. Feel free to add your
own!


=head1 SEE ALSO

L<url.list(5)>


=head1 AUTHOR

Baruch Even <websec@ev-en.org> maintains this program.

=cut