HTML::TableParser

HTML::TableParser uses HTML::Parser to extract data from an HTML table.
The data is returned via a series of user defined callback functions or
methods. Specific tables may be selected either by matching a unique
table id or by matching against the column names. Multiple (even nested)
tables may be parsed in a document in one pass.
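As a minimal sketch of the description above, the following parses a
small in-memory document and collects the rows of the first top-level
table. The HTML, the @rows array and the callback body are illustrative
only. The request keys (id, row) and the Trim attribute are used as the
module's documentation describes them; check the POD of your installed
version before relying on the details.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Illustrative input; any HTML containing a table would do.
my $html = <<'HTML';
<table>
  <tr><th>No</th><th>Object</th></tr>
  <tr><td>1</td><td>M31</td></tr>
  <tr><td>2</td><td>M33</td></tr>
</table>
HTML

my @rows;

# One request: select the table with id 1 and keep a copy of each row.
my @reqs = (
    {
        id  => 1,
        row => sub {
            my ( $tbl_id, $line_no, $data, $udata ) = @_;
            push @rows, [@$data];
        },
    },
);

my $p = HTML::TableParser->new( \@reqs, { Trim => 1 } );
$p->parse($html);    # parse() and eof() are inherited from HTML::Parser
$p->eof;

print join( ' | ', @$_ ), "\n" for @rows;
```

The header row is not delivered to the row callback; only the cells
delimited by <td> tags are.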

  Table Identification

Each table is given a unique id, relative to its parent, based upon its
order and nesting. The first top level table has id 1, the second 2,
etc. The first table nested in table 1 has id 1.1, the second 1.2, etc.
The first table nested in table 1.1 has id 1.1.1, etc. These, as well as
the tables' column names, may be used to identify which tables to parse.
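The numbering rule can be illustrated without the module. The sketch
below is not HTML::TableParser code: the nested-array model of a
document and the helper name table_ids are assumptions made purely for
this example.

```perl
use strict;
use warnings;

# Hypothetical model: each table is an array ref holding its nested
# tables in document order. This document has two top-level tables;
# the first contains two tables, and the first of those contains one.
my @doc = ( [ [ [] ], [] ], [] );

# Assign ids by position among siblings, prefixed by the parent's id.
sub table_ids {
    my ( $tables, $prefix ) = @_;
    my @ids;
    my $n = 0;
    for my $nested (@$tables) {
        $n++;
        my $id = defined $prefix ? "$prefix.$n" : "$n";
        push @ids, $id, table_ids( $nested, $id );
    }
    return @ids;
}

print join( ' ', table_ids( \@doc ) ), "\n";    # prints "1 1.1 1.1.1 1.2 2"
```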

  Data Extraction

As the parser traverses a selected table, it will pass data to user
provided callback functions or methods after it has digested particular
structures in the table. All functions are passed the table id (as
described above), the line number in the HTML source where the table was
found, and a reference to any table specific user provided data.

Table Start
        The start callback is invoked when a matched table has been
        found.

Table End
        The end callback is invoked after a matched table has been
        parsed.

Header  The hdr callback is invoked after the table header has been read
        in. Some tables do not use the <th> tag to indicate a header, so
        this function may not be called. It is passed the column names.

Row     The row callback is invoked after a row in the table has been
        read. It is passed the column data.

Warn    The warn callback is invoked when a non-fatal error occurs
        during parsing. Fatal errors croak.

New     This is the class method to call to create a new object when
        HTML::TableParser is supposed to create new objects upon table
        start.
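To make the order of the callbacks concrete, here is a sketch that
records each event as it happens. The sample HTML and the @events list
are illustrative; the list is handed to the callbacks through the udata
request key, which the module passes back as the last argument.

```perl
use strict;
use warnings;
use HTML::TableParser;

my $html = <<'HTML';
<table>
  <tr><th>No</th><th>Object</th></tr>
  <tr><td>1</td><td>M31</td></tr>
</table>
HTML

# Each callback appends a short description to the list passed as udata.
my @events;
my @reqs = (
    {
        id    => 1,
        udata => \@events,
        start => sub { my ( $id, $line, $udata ) = @_; push @$udata, "start $id" },
        hdr   => sub { my ( $id, $line, $cols, $udata ) = @_; push @$udata, "hdr @$cols" },
        row   => sub { my ( $id, $line, $data, $udata ) = @_; push @$udata, "row @$data" },
        end   => sub { my ( $id, $line, $udata ) = @_; push @$udata, "end $id" },
        warn  => sub { my ( $id, $line, $msg, $udata ) = @_; push @$udata, "warn $msg" },
    },
);

my $p = HTML::TableParser->new( \@reqs, { Trim => 1 } );
$p->parse($html);
$p->eof;

print "$_\n" for @events;
```

For well-formed input such as this, the start event should come first
and the end event last, with the header reported before any row.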

  Callback API

Callbacks may be functions or methods or a mixture of both. If any of
the callbacks are methods, an object must be passed to the constructor.
(See the module documentation for details.)

The callbacks are invoked as follows:

  start( $tbl_id, $line_no, $udata );

  end( $tbl_id, $line_no, $udata );

  hdr( $tbl_id, $line_no, \@col_names, $udata );

  row( $tbl_id, $line_no, \@data, $udata );

  warn( $tbl_id, $line_no, $message, $udata );

  new( $tbl_id, $udata );
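When methods are used, each one receives the object as its first
argument, followed by the arguments shown above. The sketch below
assumes the request key obj for supplying the object, and assumes that
methods named after the callbacks are picked up by default, as the
module's documentation describes; My::Collector is a hypothetical class
written for this example.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Hypothetical collector class. Its method names match the callback
# names, so the request needs no explicit name mapping.
package My::Collector;

sub new   { my ($class) = @_; return bless { names => [], rows => [] }, $class }
sub start { }
sub end   { }
sub hdr   { my ( $self, $tbl_id, $line_no, $col_names, $udata ) = @_; $self->{names} = [@$col_names] }
sub row   { my ( $self, $tbl_id, $line_no, $data, $udata ) = @_; push @{ $self->{rows} }, [@$data] }
sub warn  { my ( $self, $tbl_id, $line_no, $message, $udata ) = @_; print STDERR $message }

package main;

my $collector = My::Collector->new;
my $p = HTML::TableParser->new( [ { id => 1, obj => $collector } ], { Trim => 1 } );
$p->parse('<table><tr><th>No</th><th>Object</th></tr><tr><td>1</td><td>M31</td></tr></table>');
$p->eof;

print "@{ $collector->{names} }\n";
print "@$_\n" for @{ $collector->{rows} };
```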

  Data Cleanup

There are several cleanup operations that may be performed
automatically:

Chomp   chomp() the data

Decode  Run the data through HTML::Entities::decode.

DecodeNBSP
        Normally HTML::Entities::decode changes a non-breaking space
        into a character which doesn't seem to be matched by Perl's
        whitespace regexp. Setting this attribute changes the HTML
        "nbsp" character to a plain old blank.

Trim    Remove leading and trailing white space.
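The cleanup operations are requested as attributes in the hash passed
as the second argument to the constructor. A short sketch, with an
illustrative one-cell table whose text has surrounding blanks and an
entity:

```perl
use strict;
use warnings;
use HTML::TableParser;

# The cell text carries leading and trailing blanks and an entity.
my $html = '<table><tr><th>Name</th></tr><tr><td>  R&amp;D  </td></tr></table>';

my @cells;
my $p = HTML::TableParser->new(
    [ { id => 1, row => sub { push @cells, @{ $_[2] } } } ],
    { Chomp => 1, Decode => 1, DecodeNBSP => 1, Trim => 1 },
);
$p->parse($html);
$p->eof;

# Brackets make any remaining white space visible.
print "[$_]\n" for @cells;
```

With Trim and Decode enabled the cell should come back as R&D with no
surrounding blanks.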

  Data Organization

Column names are derived from cells delimited by the <th> and </th>
tags. Some tables have header cells which span more than one column or
row to make things look nice. HTML::TableParser determines the actual
number of columns used and provides column names for each column,
repeating names for spanned columns and concatenating spanned rows and
columns. For example, if the table header looks like this:

 +----+--------+----------+-------------+-------------------+
 |    |        | Eq J2000 |             | Velocity/Redshift |
 | No | Object |----------| Object Type |-------------------|
 |    |        | RA | Dec |             | km/s |  z  | Qual |
 +----+--------+----------+-------------+-------------------+

The columns will be:

  No
  Object
  Eq J2000 RA
  Eq J2000 Dec
  Object Type
  Velocity/Redshift km/s
  Velocity/Redshift z
  Velocity/Redshift Qual

Row data are derived from cells delimited by the <td> and </td> tags.
Cells which span more than one column or row are handled correctly, i.e.
the values are duplicated in the appropriate places.
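The header from the diagram above can be written out as HTML and fed to
the parser to see the derived names. In this sketch the rowspan and
colspan layout is one plausible encoding of that diagram, and the data
row holds illustrative values only.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Two header rows with spanning cells, followed by one data row.
my $html = <<'HTML';
<table>
  <tr>
    <th rowspan="2">No</th>
    <th rowspan="2">Object</th>
    <th colspan="2">Eq J2000</th>
    <th rowspan="2">Object Type</th>
    <th colspan="3">Velocity/Redshift</th>
  </tr>
  <tr>
    <th>RA</th><th>Dec</th>
    <th>km/s</th><th>z</th><th>Qual</th>
  </tr>
  <tr>
    <td>1</td><td>M31</td><td>00h42m</td><td>+41d16m</td>
    <td>Galaxy</td><td>-300</td><td>-0.001</td><td>1</td>
  </tr>
</table>
HTML

my ( @names, @row );
my $p = HTML::TableParser->new(
    [
        {
            id  => 1,
            hdr => sub { @names = @{ $_[2] } },    # third argument: \@col_names
            row => sub { @row   = @{ $_[2] } },    # third argument: \@data
        },
    ],
    { Trim => 1 },
);
$p->parse($html);
$p->eof;

print "$_\n" for @names;
```

Per the description above, this should print the eight column names
listed, one per line, and the data row should line up with them one
value per column.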

INSTALLATION

This is a Perl module distribution. It should be installed with whichever
tool you use to manage your installation of Perl, e.g. any of

  cpanm .
  cpan  .
  cpanp -i .

Consult http://www.cpan.org/modules/INSTALL.html for further instruction.
Should you wish to install this module manually, the procedure is

  perl Makefile.PL
  make
  make test
  make install

COPYRIGHT AND LICENSE

This software is Copyright (c) 2018 by Smithsonian Astrophysical
Observatory.

This is free software, licensed under:

  The GNU General Public License, Version 3, June 2007