File: hacking-page.html

<html>
<head>
<title>netrik hacker's manual: page handling</title>
</head>
<body>

<h1 align="center">netrik hacker's manual<br />&gt;========================&lt;</h1>

<p>
[This file contains a description of the page handling system. See hacking.txt
or <a href="hacking.html">hacking.html</a> for an overview of the
manual.]
</p>

<p>
The page handling system is responsible for the central browser functionality:
Loading pages to be displayed, loading new pages when links are followed etc.
This also includes all page history handling, as all page loads either affect
the page list (history), or depend on it, or both; and all history commands
involve a page load.
</p>

<p>
The page handling is also a central component, because it invokes all of the
other modules: The file loader
(<a href="hacking-load.html">hacking-load.*</a>) is used to fetch a
new document (file) if necessary, and the layout engine
(<a href="hacking-layout.html">hacking-layout.*</a>) to prepare it for
rendering; the pager (<a href="hacking-pager.html">hacking-pager.*</a>)
is (indirectly) invoked to display the new page; and the link handling
mechanism tells when and what page to load. Moreover, most of the modules are
also coupled to the page handling, because they use the page structure handled
by the page loading mechanism.
</p>

<p>
So the page handling is really the central component, controlling the program
flow. Thus it can only partially be located in an own file (page.c); part of
the page handling has to be done in main() directly.
</p>

<a name="loadPage" id="loadPage">

<h2>load_page()</h2>

<p>
load_page() is the main function of page.c, and does most of the work. It is
responsible for loading a page so that it will be shown in the pager next time
<a href="hacking-pager.html#display">display()</a> (see hacking-pager.*) is
called.
</p>

<p>
The exact actions necessary to achieve that depend on the nature of the load
operation. In the most basic operation mode it requires adding a new entry to
the page list and loading a new document to use. Sometimes, however, only the
page list changes while the document is reused, or the page history stays
unchanged while some existing entry is reactivated and the settings reloaded.
Reloading a document without altering the page list is possible as well.
</p>

<p>
load_page() takes almost the same arguments as layout(): a base URL, usually
the URL of the current page, which serves as the base when following a
relative link or the like; a main URL (as a string), which can be absolute or
relative, in the latter case to be combined with the base URL to form an
absolute target URL; an optional form item, which tells where to find the form
data of a form to submit while loading a new document; the page width, telling
how to lay out loaded pages; and an error handle informing whether some problem
occurred while loading the document.
</p>

<p>
In case a new document needs to be loaded, all of these parameters are simply
passed on to layout(), which is responsible for actually loading a document
from a file or an HTTP server, as well as invoking all necessary layouting
passes so that the document can be actually rendered and displayed in the pager.
This process is described in
<a href="hacking-layout.html">hacking-layout.*</a>.
</p>

<p>
load_page() takes an additional "reference" parameter. This one may refer to
some entry in the page list (normally the currently used entry), which contains an
already loaded document to reuse. In this case no document load needs to be
performed (i.e. layout() isn't called), as the layouting data of the reference
page is reused. (Most of the other parameters aren't used in this case.) This
feature is used when jumping to an anchor inside a document -- it doesn't
require loading a new document.
</p>

<p>
First action (even before loading the document) is creating a page descriptor
-- regardless of whether a new document is loaded or existing layouting data is
reused.
</p>

<p>
This however is left out if the page is reloaded from history (indicated by
"url" being NULL), and thus already has a descriptor somewhere in the page
list.
</p>

<h3>Page List Handling</h3>

<a name="pageList" id="pageList">

<p>
The page list (history) is a global variable (of type "struct Page_list"),
which basically consists of an array of pointers to individual page
descriptors. It also stores the current number of entries in the list, as well
as the currently active entry (the one that describes the page visible in the
pager).
</p>

<p>
The list has one entry for each page in the page history. Normally the last
entry is the visible page, but other pages are also possible after going back
in history.
</p>

<a name="pageStruct" id="pageStruct">

<p>
Each page descriptor is of type "struct Page". This struct contains:
</p>

<ul> <li>
 The "layout" pointer, which points to a descriptor with the document's
 layouting data created by <a href="hacking-layout.html#layout">layout()</a>.
</li> <li>
 The page URL in the split URL structure "url"
</li> <li>
 The current position of the pager, stored as the index of the first line
 visible on the screen ("pager_pos")
</li> <li>
 The cursor position in "cursor_x" and "cursor_y"
</li> <li>
 The hashes "url_fingerprint" and "text_fingerprint", used to recognize
 whether the link is identical to a link in the page (necessary when reloading
 a page from history)
</li> <li>
 The link number of the link currently active in the pager ("active_link")
</li> <li>
 The optional anchor number of an anchor to jump to when starting the pager
 ("active_anchor")
</li> <li>
 The "mark" flag indicating that a return mark has been set on this page
</li> </ul>

</a>    <!-- pageStruct -->
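<p>
The descriptor described above can be pictured roughly as follows. This is only
a sketch: the field names are the ones mentioned in this manual, but all types,
the -1 "none" convention, the opaque "struct Layout" and "struct Url"
declarations and the current_page() helper are assumptions made for
illustration, not copied from the netrik sources.
</p>

```c
#include <assert.h>
#include <stddef.h>

struct Layout;  /* opaque here: layouting data created by layout() */
struct Url;     /* opaque here: split URL structure */

/* One history entry; field types are assumed. */
struct Page {
    struct Layout *layout;      /* the document's layouting data */
    struct Url *url;            /* page URL in split form */
    int pager_pos;              /* index of the first line visible on screen */
    int cursor_x, cursor_y;     /* cursor position */
    unsigned url_fingerprint;   /* hash of the active link's URL */
    unsigned text_fingerprint;  /* hash of the active link's text */
    int active_link;            /* assumed convention: -1 means no link */
    int active_anchor;          /* assumed convention: -1 means no anchor */
    int mark;                   /* nonzero if a return mark is set */
};

/* The page list (history). */
struct Page_list {
    int num;             /* number of entries currently stored */
    int pos;             /* index of the currently visible page */
    struct Page **page;  /* array of pointers to page descriptors */
};

/* Hypothetical helper: descriptor of the visible page, NULL if list is empty. */
static struct Page *current_page(const struct Page_list *list)
{
    assert(list != NULL);
    if (list->num == 0)
        return NULL;
    assert(list->pos >= 0 && list->pos < list->num);
    return list->page[list->pos];
}
```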

<p>
The page list is manipulated exclusively by load_page() and its helper
functions. load_page() is also responsible for keeping the list up to date, so
it always contains exactly those entries that shall be used by the history
commands.
</p>

<p>
Thus, before a new page descriptor is created, the page list needs to be
adjusted.
</p>

<p>
Normally a new entry is simply appended at the end of the list; however, there
are several other cases.
</p>

<p>
When loading some new page while not being at the last entry in
the page list (after going back to some older page from the page history), all
entries after the current one have to be discarded.
</p>

<p>
Moreover, if the current page is an internal page (either a page loaded from
stdin, or an error page), it isn't to be kept in history; in that case, we go
further back until we find the last normal page entry.
</p>

<p>
All entries after the current or the last normal one are then cleared by
calling free_page() in a loop.
</p>

<p>
Afterwards, the new page descriptor is created.
</p>
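<p>
The adjustment of the page list before adding a new entry might look like the
following sketch. The "internal" flag, the truncate_history() name and the
reduced structs are assumptions for illustration; free_page() is only a
stand-in here that releases the descriptor itself.
</p>

```c
#include <assert.h>
#include <stdlib.h>

struct Page { int internal; };  /* assumed flag: stdin page or error page */

struct Page_list { int num; int pos; struct Page **page; };

/* Stand-in for free_page(): here it only releases the descriptor. */
static void free_page(struct Page *page)
{
    free(page);
}

/* Discard all entries after the last one worth keeping; return the new count. */
static int truncate_history(struct Page_list *list)
{
    int keep;

    assert(list != NULL);
    if (list->num == 0)
        return 0;
    assert(list->pos >= 0 && list->pos < list->num);

    /* Start at the current entry and go further back over internal pages. */
    keep = list->pos;
    while (keep >= 0 && list->page[keep]->internal)
        --keep;

    /* Clear everything behind the last normal entry. */
    for (int i = list->num - 1; i > keep; --i) {
        free_page(list->page[i]);
        list->page[i] = NULL;
    }

    list->num = keep + 1;
    list->pos = keep;  /* assumption: caller advances "pos" before add_page() */
    return list->num;
}
```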

<p>
When revisiting some page from history, it may also be necessary to delete
internal pages, if we are leaving one. The last non-internal page is determined
(starting from the end of the list), and all following entries are deleted.
</p>

</a>    <!-- pageList -->

<h4>add_page()</h4>

<p>
The add_page() function (in url-history.c) is responsible for actually creating
the new page list entry. The list is a "struct Page_list", and contains the
following information:
</p>

<ul> <li>
 "num" is the number of entries currently stored in the list
</li> <li>
 "pos" is the entry number of the entry corresponding to the currently visible
 page
</li> <li>
 The history entries themselves are stored in "page", which is an array of
 pointers to the page descriptors of all pages.
</li> </ul>

<p>
The new entry is added at the position indicated by "page_list.pos".
</p>

<p>
First the array is resized to the new history size -- the history now will end
with the new entry generated. Then a new page descriptor is created and the
pointer to it is stored at the proper position (the last list entry).
</p>

<p>
After creating the descriptor, some default values are set. (Pager position etc.)
</p>
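<p>
A minimal sketch of these steps, under stated assumptions: the caller has
already truncated the list and set "pos" to the index of the new entry, the
default values shown are guesses, and the structs are reduced to what the
example needs. The real add_page() may differ in signature and error handling.
</p>

```c
#include <assert.h>
#include <stdlib.h>

struct Page {
    void *layout;
    int pager_pos;
    int active_link;
    int active_anchor;
    int mark;
};

struct Page_list { int num; int pos; struct Page **page; };

/* Sketch of add_page(): returns the new descriptor, or NULL if out of memory. */
static struct Page *add_page(struct Page_list *list)
{
    struct Page **grown;
    struct Page *page;
    int new_num;

    /* Assumed precondition: the new entry goes right behind the kept ones. */
    assert(list != NULL && list->pos >= 0 && list->pos == list->num);

    /* Resize the array so that the history ends with the new entry. */
    new_num = list->pos + 1;
    grown = realloc(list->page, (size_t)new_num * sizeof *grown);
    if (grown == NULL)
        return NULL;
    list->page = grown;

    /* Create the descriptor and set (assumed) defaults. */
    page = calloc(1, sizeof *page);
    if (page == NULL)
        return NULL;
    page->active_link = -1;
    page->active_anchor = -1;

    list->page[list->pos] = page;
    list->num = new_num;
    return page;
}
```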

<h3>Loading</h3>

<p>
Now having the page descriptor, we need to get the layouting data somehow so that
the page can be displayed in the pager.
</p>

<p>
Again, the standard case is loading a new document using
<a href="hacking-layout.html#layout">layout()</a>. The pointer to the
layouting data returned by that function is simply stored in the page
descriptor. The URL is extracted from the layouting data and stored in the
explicit "url" pointer of the page descriptor, which is necessary in case the
layout data will be discarded (when loading another document), but the page is kept
in history.
</p>

<h4>Local Links</h4>

<p>
As mentioned before, instead of loading a new document it's also possible to
reuse existing layouting data of another one by passing a "reference" page. The
primary application of that is when following a link that points to some anchor
in the same document -- that doesn't require reloading the whole document, but
just jumping to the anchor. Of course, it is also used when returning to the
previous page after following such a local link, or going forward again.
</p>

<p>
The (pointer to the) layout data descriptor is simply copied from the page list
entry indicated by the given "reference" parameter.
</p>

<p>
As layout() and thus also
<a href="hacking-load.html#initLoad">init_load()</a> (see
hacking-load.*) aren't used in this case,
<a href="hacking-load.html#mergeUrls">merge_urls()</a> has to be
called directly. If URL merging fails here, load_page() returns immediately; no
page descriptor is created and nothing else is changed.
</p>

<p>
Also, if some anchor is active in the reference page,
<a href="hacking-links.html#highlight">highlight_link()</a> needs to
be used to remove the highlighting, to get a "clean" item tree.
</p>

<a name="reactivating" id="reactivating">

<h3>Reactivating Link</h3>

<p>
When revisiting a page from history, and some link was active when we last left
that page, that link is reactivated for convenience. This is non-trivial in
case the page contents changed in the meantime -- we need a heuristic to
find out which is the right link.
</p>

<p>
The best matching link is determined by examining several properties of the
links: The link number, the position on the page (x and y), and two hashes
identifying the link url and the text describing the link.
</p>

<p>
The properties of each link are stored in its link structure while they are
extracted in
<a href="hacking-layout.html#parseStruct">parse_struct()</a> (see
hacking-layout.*): The position is stored in each link in "x_start" and
"y_start" anyways. The hashes are extracted with fingerprint() (which just
generates a very simple (fast!) checksum of a string part (memory range)
specified by its starting and ending address), and stored in "url_fingerprint"
and "text_fingerprint". The link number is implicit.
</p>

<p>
Each time a link is activated with
<a href="hacking-links.html">activate_link</a> (described in
hacking-links.*), the properties of the link are copied to the page descriptor,
so that they can be compared against the actual links when revisiting the page
later. The hashes are stored in dedicated "url_fingerprint" and
"text_fingerprint" again. The position is stored in "cursor_x" and "cursor_y".
(Presently, those values are set *only* when a link is activated, and solely
for the sake of the link retrieval -- this will change as soon as real cursor
handling is implemented...)
</p>

<p>
When revisiting a page, as a first step we do a quick test whether the link
with the old number is still the same. If all five conditions are fulfilled
(i.e. the link number is still valid, and the other four values are identical),
we know nothing has changed, and need no further processing.
</p>

<p>
If any of the conditions fails, on the other hand, we need to test all links
on the page using the heuristics. First, deviations are calculated separately
for each property: "dev_url" and "dev_text" are flags, which equal 0 if the hash
in the tested link is identical to the one of the old link, and 1 otherwise.
"dev_x" is also a flag, being 0 if the horizontal position is identical.
"dev_y" and "dev_num" are different: These are float numbers (also in the range
0 to 1) describing a relative deviation, i.e. the ratio between the number of
lines/links off and the total number of lines/links in the (new) page.
</p>

<p>
All of these are finally combined to form the total deviation, which is just a
weighted sum of all; the lower the value, the better the link matches.
</p>

<p>
All links are tested in a loop, and the best match is then stored as the new
"active_link". (Instead of testing all, we could limit the search to a smaller
range by pre-testing some of the criteria. However, a situation where this would
make a difference is unrealistic; it would be just a waste of code, meaning
wasted development time, increased binary size, and most importantly additional
error sources and worse maintainability.)
</p>

<p>
At start "best" is set to -1 (i.e. no link), and "best_deviation" to 1.0. Thus,
only links with a deviation below or equal to 1 are considered at all; if none was
found, "active_link" is assigned -1, and no link will be activated. This is
important to ensure that only links will be activated which are actually
*similar* to the original one, not activating some unrelated random link which
happens to be nearest the old one.
</p>
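<p>
The search loop can be sketched as follows. The deviation names and the start
values of "best" and "best_deviation" follow the text; the weights, the struct
layouts and the find_link() name are assumptions, since the actual values are
not given here. The quick test for an unchanged link is left out, as the loop
finds that link anyway (its deviation is 0).
</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Properties of a link in the newly loaded page. */
struct Link {
    int x_start, y_start;
    unsigned url_fingerprint, text_fingerprint;
};

/* Values saved in the page descriptor when the link was last activated. */
struct Saved {
    int active_link;
    int cursor_x, cursor_y;
    unsigned url_fingerprint, text_fingerprint;
};

/* Assumed weights, chosen so that a link with different URL and text fails. */
#define W_URL  0.6
#define W_TEXT 0.6
#define W_X    0.1
#define W_Y    1.0
#define W_NUM  1.0

/* Return the index of the best matching link, or -1 if none is similar. */
static int find_link(const struct Link *links, int link_count, int line_count,
                     const struct Saved *old)
{
    int best = -1;
    double best_deviation = 1.0;

    assert(links != NULL && old != NULL && line_count > 0);

    for (int i = 0; i < link_count; ++i) {
        /* Flags: 0 if identical, 1 otherwise. */
        double dev_url  = links[i].url_fingerprint != old->url_fingerprint;
        double dev_text = links[i].text_fingerprint != old->text_fingerprint;
        double dev_x    = links[i].x_start != old->cursor_x;
        /* Relative deviations: lines/links off, divided by the page totals. */
        double dev_y    = (double)abs(links[i].y_start - old->cursor_y) / line_count;
        double dev_num  = (double)abs(i - old->active_link) / link_count;

        double deviation = W_URL * dev_url + W_TEXT * dev_text + W_X * dev_x
                         + W_Y * dev_y + W_NUM * dev_num;

        if (deviation <= best_deviation) {
            best = i;
            best_deviation = deviation;
        }
    }
    return best;
}
```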

<p>
Of course, this heuristic isn't perfect by any means; however, it is certainly
more than sufficient. Anything better would be a waste of performance and
development effort. Recognition whether a candidate is worth considering
should be robust. Choosing the best if there are multiple similar candidates is
suboptimal, but such a situation is really unlikely; and if it occurs, the
question which is really the best link is disputable anyways...
</p>

<p>
The files test/mod[12].html are provided for testing the heuristics. These need
to be loaded alternately; this is easy with Hurd, but not possible with
classical Unix systems; thus, it's necessary to use a web server with a CGI
script, or tcplisten:
</p>

<p><pre>
while true; do for i in test/mod*.html; do cat test/stat_only.http "$i" | tcplisten 8000; done; done
</pre></p>

<p>
(Point netrik to http://localhost:8000 and use ^R command.)
</p>

</a>    <!-- reactivating -->

<h3>Anchors</h3>

<p>
If the URL contains a fragment identifier, the corresponding anchor is
retrieved from the anchor list, and stored in "page->active_anchor"; this is
described under <a href="hacking-links.html#anchors">Anchors</a>
in hacking-links.*. The pager then jumps to the anchor position and highlights
it upon startup.
</p>

</a>    <!--loadPage-->

<h2>Handling in main()</h2>

<p>
Although load_page() does a great part of the work, some things have to be
taken care of in main(); particularly determining in which manner the new page
is to be loaded (reuse current document or load new one), and clearing the
layout data of an old document before loading a new one. Maybe that could be
done in load_page() too; however, it's not worth considering that now, as it
will probably need to be handled completely differently with the planned new
basic program structure. (Using an event queue and a main dispatcher.)
</p>

<p>
Another thing that needs to be handled in main() is initiating a page load in
reaction to commands given by the user inside the pager (or on the command
prompt), which often can't be clearly separated from the load operation itself.
</p>

<p>
If the user activates the command prompt (by typing ':' inside the pager) and
issues the ":e" or ":E" command, the URL is extracted from the command, and
load_page() is called to load the desired new page. The URL of the current page
is used as base for a relative URL with ":e". With ":E" (and also for ":e", if
the current page is internal), no base is used; the URL is always interpreted
absolutely.
</p>

<p>
These commands never pass a reference page, i.e. always involve loading a new
document. (Even though it is possible to jump to a local anchor using ":e
#anchor".)
</p>

<p>
If a link/form control was activated by pressing &lt;return&gt; on a selected link
inside the pager, the action depends on the link or form element type. For
normal links load_page() is used with the current URL as base, just as with
":e". The link URL is extracted from the text item containing the link with
<a href="hacking-links.html#getLink">get_link()</a>, by help of the
"<a href="hacking-links.html#linkList">link_list</a>" structure. This
process is described under
<a href="hacking-links.html#followLink">Following Links</a> in
hacking-links.*.
</p>

<p>
The only difference to the ":e" command (except for the way of getting the
target URL) is that the current page is passed as "reference" to load_page() if
the URL starts with '#' (i.e. points to an anchor in the same document), so
that the document isn't reloaded in that case, but only the anchor is
activated.
</p>
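<p>
The decisions described so far (which base to use, and whether to pass a
reference) can be summarized in a small sketch. The enum, the "struct
Load_plan" and plan_load() are hypothetical names made up for this example, and
the reference is represented as a page list index, which is an assumption as
well.
</p>

```c
#include <assert.h>
#include <stddef.h>

enum Trigger { CMD_E_LOWER, CMD_E_UPPER, FOLLOW_LINK };  /* ":e", ":E", link */

struct Load_plan {
    const char *base;  /* NULL: interpret the URL absolutely */
    int reference;     /* page list index to reuse, -1: load a new document */
};

static struct Load_plan plan_load(enum Trigger trigger, const char *url,
                                  const char *current_url, int current_pos,
                                  int current_is_internal)
{
    struct Load_plan plan = { NULL, -1 };

    assert(url != NULL);

    /* ":E" never uses a base; neither does anything on an internal page. */
    if (trigger != CMD_E_UPPER && !current_is_internal)
        plan.base = current_url;

    /* Only followed links reuse the document, and only for local anchors. */
    if (trigger == FOLLOW_LINK && url[0] == '#')
        plan.reference = current_pos;

    return plan;
}
```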

<p>
Form submit buttons are quite similar to normal links. First,
<a href="hacking-links.html#getForm">get_form_item()</a> (also in
hacking-links.*) is used to retrieve the item (from the structure tree) that
describes the form in which the button resides. This is used to get the
form's submit address ("action") first. Having this, load_page() is used to do
the submit; the form item is passed as the "form" argument. (And passed on to
<a href="hacking-load.html#initLoad">init_load()</a> there; see
hacking-load.*.) This is both to tell init_load() that a form is to be
submitted, and where to find the form data. init_load() (and its sub-functions)
then take care of extracting the form data (using
<a href="hacking-links.html#urlEncode">url_encode()</a> or
<a href="hacking-links.html#mimeEncode">mime_encode()</a> from
forms.c, also described in hacking-links.*) and submitting it to the server.
The resulting response page is loaded just like any other document.
</p>

<p>
Other form controls do not issue a load operation, but only adjust the form
value appropriately. (This is described under
<a href="hacking-links.html#manipulating">Manipulating</a> in
hacking-links.*.)
</p>

<p>
The 'u', 'U' and 'c' pager commands also aren't really load operations, but
they are described here because they involve similar actions as the
preparations for a page load.
</p>

<p>
If 'u' was typed, the link URL is retrieved the same way as when following a
link, and then just printed to the screen.
</p>

<p>
'U' is similar. Instead of printing the (relative) link URL directly, it merges
it with the current page URL, thus getting the same absolute target URL that
would be used if the link was actually followed.
</p>

<p>
'c' simply prints the "full_url" component of the current page URL.
</p>

<p>
If a history command was given, load_page() is called with the (split) URL
taken from the requested "page_list" entry. We know which entry to take by
"page_list.pos", which is set to the desired new value before returning from the
pager.
</p>

<p>
If the history entry refers to the same HTML document as the one displayed up
to now, the current page descriptor is passed as "reference". To determine
whether it is the same document, we need to check if all entries between the
old and the new one (regardless of whether the new one is before or after the old
one in history) have "local" URLs, i.e. if the newer of the two entries was
created only by following links to local anchors from the older one.
</p>

</body>
</html>