File: BUGS

pdftohtml 0.36-13
Known bugs (and TODOs) in pdftohtml 0.34

* encoding not stored in xml output

* Complex output will not work correctly when there are pages of different sizes
(the large background image will come out wrong).
I think that in order to fix this I will have to keep some info about each page
and then run ghostscript multiple times, once for each range of pages that share
a page size. On the positive side, this can also be used to keep track of pages
with text only. These pages do not have to be processed by ghostscript at all.

* Related to previous: we need to keep track of rotation of each page and
specify rotation to ghostscript when needed.

* Plain (non-complex) output does not always preserve the order of text.

* For demo1.pdf the first font is there twice. This is because we keep separate
fonts for <b> and <i> even though they are otherwise identical. Not really
a problem, just ugly. The solution is not to keep italic and bold attributes
with text, but rather keep them in HtmlString. Makes more sense.

* command line option for the directory where extracted images are put

* move all the GBool settings for pdf to html conversion into GlobalParams for
consistency

* with -c -noframes, text that falls outside the bounding box should obviously be hidden

* xml output is broken because of <i> and <b>s... Not sure what to do with them.

* order of <i> and <b> is broken.

* with -c -noframes -stdout, output could probably go to stdout, but does not

* in xml output, give an option to preserve more information about fonts
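
Sketch for the page size and rotation items above. This is only a rough idea
of how the per-page info could be grouped into ranges, with one ghostscript
run per range and text-only pages skipped entirely. PageInfo, PageRange and
groupPages are hypothetical names and do not exist in pdftohtml or xpdf.
Sizes are compared exactly, on the assumption that they come straight from
the page's media box.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-page record, filled in while the PDF is scanned.
struct PageInfo {
    double width;   // page width in points
    double height;  // page height in points
    int rotation;   // page rotation in degrees (0, 90, 180 or 270)
    bool textOnly;  // true if the page needs no background image
};

// Inclusive, 1-based run of consecutive pages with the same size and rotation.
struct PageRange {
    int first;
    int last;
    double width;
    double height;
    int rotation;
};

// Group consecutive pages that share size and rotation. Text-only pages
// are left out of every range, so they also split a run in two.
std::vector<PageRange> groupPages(const std::vector<PageInfo> &pages) {
    std::vector<PageRange> ranges;
    for (std::size_t i = 0; i < pages.size(); ++i) {
        const PageInfo &p = pages[i];
        if (p.textOnly) {
            continue;  // no background image needed for this page
        }
        const int pageNum = static_cast<int>(i) + 1;
        const bool extendsLast =
            !ranges.empty() &&
            ranges.back().last == pageNum - 1 &&
            ranges.back().width == p.width &&
            ranges.back().height == p.height &&
            ranges.back().rotation == p.rotation;
        if (extendsLast) {
            ranges.back().last = pageNum;
        } else {
            ranges.push_back({pageNum, pageNum, p.width, p.height, p.rotation});
        }
    }
    return ranges;
}
```

Each resulting range could then be handed to its own ghostscript run, for
example by passing first and last as -dFirstPage and -dLastPage. How the
rotation is best passed on is still open.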