<!-- vim:tabstop=4:shiftwidth=4:noexpandtab:textwidth=80
-->
<chapter id="kernel_part_layers_ch">
<title>Changes to pdf document</title>
<para>
As mentioned before, a PDF document is represented by a set of
objects. Most of these objects (with the exception of the document
trailer) are so-called indirect objects. They are accessible through
the cross reference table, which maps an object and generation number
to the file offset where the object is stored.
</para>
<para>
The PDF format natively supports document changes through the so-called
<emphasis>incremental</emphasis> update mechanism. A file may contain
several cross reference tables, each describing the objects specific to
its revision and overriding older values. All objects which have to be
changed are simply written to the end of the file, together with a new
cross reference table which describes them. All PDF viewers should be
aware of incremental updates: if an object is accessible from several
cross reference tables, the newest one is always used.
</para>
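<para>
The "newest table wins" rule can be illustrated with a minimal sketch.
The types and the <code>lookupOffset</code> function below are
hypothetical and are not part of xpdf or pdfedit; the sketch only
assumes that cross reference sections are kept ordered from the oldest
revision to the newest one.
</para>
<programlisting>
<![CDATA[
```cpp
#include <map>
#include <utility>
#include <vector>

// Hypothetical types for illustration only (not xpdf or pdfedit code).
typedef std::pair<int, int> Ref;          // (object number, generation number)
typedef std::map<Ref, long> XRefSection;  // reference -> file offset

// Sections are assumed to be ordered from the oldest revision to the
// newest one. The search starts at the newest section, so an entry from
// a later revision hides entries for the same reference in older ones.
// Returns false if no section knows the reference.
bool lookupOffset(const std::vector<XRefSection>& sections,
                  const Ref& ref, long& offset)
{
    for (std::vector<XRefSection>::const_reverse_iterator it = sections.rbegin();
         it != sections.rend(); ++it) {
        XRefSection::const_iterator found = it->find(ref);
        if (found != it->end()) {
            offset = found->second;
            return true;
        }
    }
    return false;
}
```
]]>
</programlisting>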
<para>
The short description above implies that making changes requires
taking over cross reference table manipulation. This has to be done
transparently, so that users of an object need not know that it has
been changed and always get its most recent version. We also want to
control who can make changes and who is only a consumer of objects and
therefore is not supposed to change them. This is handled by the 3 layer
model described in the following section.
</para>
<sect1 id="kernel_3_layer_model">
<title>3 layer model</title>
<sect2 id="lowest_layer">
<title>The lowest layer</title>
<para>
The cross reference table is maintained by the XRef class in xpdf code.
This class is responsible for parsing cross reference sections according
to the PDF specification, keeping a table of this information internally
and fetching indirect objects.
<footnote>
Object fetching means parsing of an indirect object according to its
reference (object and generation pair).
</footnote>
The XRef class is not designed to be easily extensible for making
changes, so we have reused it as the lowest layer of our 3 layer
model designed to enable making changes to a document. This XRef layer
keeps the logic of PDF file parsing and of the correct assignment of
objects according to their references, so its basic idea and
responsibility are preserved.
</para>
<para>
To enable this reuse in C++ we had to make some minor changes to the
xpdf code, basically preparing it for transparent dynamic inheritance:
all necessary methods were made virtual and data fields were changed
from private to protected (see also
<xref linkend="general_xpdf_changes"/>).
</para>
</sect2>
<sect2 id="middle_layer">
<title>The middle layer</title>
<para>
The second (middle) layer of the model is formed by CXref (see <xref linkend="kernel_part_cxref"/>),
a descendant of the XRef class. Its prime responsibility is to provide
methods which register changes and to keep all changed objects in its
internal state. All methods which enable making changes are protected
rather than public: this hides them from normal usage while still
allowing descendants to reuse them. Keeping them protected also
prevents inconsistencies and makes usage and implementation easier.
CXref overrides the public methods of XRef and always uses changed
objects if they are available; otherwise it delegates to the lower
layer (the XRef implementation). This approach allows CXref to be used
transparently anywhere an XRef instance is required (e. g. in the rest
of the xpdf code which may be reusable), with the advantage of access
to the most recent values without any special logic on the class
user's side. The change methods themselves are implemented without any
special logic: all changes are stored in a mapping where they can be
looked up, and no checking is performed. It is safe to hand out a CXref
instance, because it is guaranteed that nobody can use this class to
make changes. Pdfedit code uses CXref so that it does not depend on the
xpdf XRef class.
</para>
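<para>
The layering described above can be sketched as follows. This is only
an illustration of the idea, not the real interface: the class names
<code>LowerXRef</code>, <code>ChangeTrackingXRef</code> and
<code>WritingXRef</code>, the string based object representation and
the <code>addParsed</code> helper are all assumptions made for the
example.
</para>
<programlisting>
<![CDATA[
```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins; the real XRef, CXref and
// XRefWriter interfaces differ.
typedef std::pair<int, int> Ref;  // (object number, generation number)

// Lowest layer: knows only what was parsed from the file.
class LowerXRef {
public:
    virtual ~LowerXRef() {}
    // Helper for the example: pretends that an object was parsed.
    void addParsed(const Ref& ref, const std::string& value) { parsed[ref] = value; }
    virtual std::string fetch(const Ref& ref) const {
        std::map<Ref, std::string>::const_iterator it = parsed.find(ref);
        return it != parsed.end() ? it->second : std::string("null");
    }
protected:
    std::map<Ref, std::string> parsed;
};

// Middle layer: overrides fetch so that changed objects win, and keeps
// the change method protected so plain users cannot modify anything.
class ChangeTrackingXRef : public LowerXRef {
public:
    virtual std::string fetch(const Ref& ref) const {
        std::map<Ref, std::string>::const_iterator it = changed.find(ref);
        if (it != changed.end())
            return it->second;        // most recent value
        return LowerXRef::fetch(ref); // delegate to the lower layer
    }
protected:
    // No checking here: the value is simply stored.
    void changeObject(const Ref& ref, const std::string& value) { changed[ref] = value; }
private:
    std::map<Ref, std::string> changed;
};

// Highest layer: the only place where changes are exposed publicly.
class WritingXRef : public ChangeTrackingXRef {
public:
    void change(const Ref& ref, const std::string& value) { changeObject(ref, value); }
};
```
]]>
</programlisting>
<para>
Code which holds only a reference to the lowest or middle layer type
sees the changed value through the virtual <code>fetch</code> call, but
has no way to register a change itself.
</para>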
<para>
CXref is also responsible for handling new objects. This means that it
provides methods to reserve a new reference and to add new objects. All
new references are stored in the newStorage container, where each new
reference is mapped to its current state. When a new reference is
reserved by the reserveRef method, it is marked as RESERVED_REF. After
changeObject is called with this reference for the first time, its
state changes to INITIALIZED_REF. This state separation enables correct
object counting, because only references in the INITIALIZED_REF state
are counted, and it also ensures that a RESERVED_REF reference is not
returned twice. This functionality is also protected, and therefore
invisible to instance users, and is used by the 3rd layer.
</para>
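<para>
The two reference states can be sketched as a small state machine. The
names reserveRef, changeObject, newStorage, RESERVED_REF and
INITIALIZED_REF come from the text above; the
<code>NewRefStorage</code> class, the counter based numbering and the
<code>initializedCount</code> method are assumptions of this sketch and
do not describe how CXref really picks free object numbers.
</para>
<programlisting>
<![CDATA[
```cpp
#include <map>
#include <utility>

typedef std::pair<int, int> Ref;  // (object number, generation number)

enum RefState { RESERVED_REF, INITIALIZED_REF };

// Hypothetical container illustrating the state handling only.
class NewRefStorage {
public:
    // Assumption: new numbers are handed out from a simple counter.
    explicit NewRefStorage(int firstFreeNum) : nextNum(firstFreeNum) {}

    // Every call returns a different reference marked as RESERVED_REF.
    Ref reserveRef() {
        Ref ref(nextNum++, 0);
        newStorage[ref] = RESERVED_REF;
        return ref;
    }

    // The first change of a reserved reference initializes it; later
    // changes keep the state. Returns false for unknown references.
    bool changeObject(const Ref& ref) {
        std::map<Ref, RefState>::iterator it = newStorage.find(ref);
        if (it == newStorage.end())
            return false;
        it->second = INITIALIZED_REF;
        return true;
    }

    // Only initialized references count as real new objects.
    int initializedCount() const {
        int count = 0;
        for (std::map<Ref, RefState>::const_iterator it = newStorage.begin();
             it != newStorage.end(); ++it)
            if (it->second == INITIALIZED_REF)
                ++count;
        return count;
    }

private:
    int nextNum;
    std::map<Ref, RefState> newStorage;
};
```
]]>
</programlisting>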
<para>
The class also implements a simple type checking method
<programlisting>
virtual bool typeSafe(Object * obj1, Object * obj2);
</programlisting>
which is public and performs the following tests to guarantee that obj2
can replace obj1 and that the result is syntactically correct:
<itemizedlist>
<listitem>obj2 has to have the same type as obj1 if neither of them is a
reference
</listitem>
<listitem>if at least one of them is a reference, the fetched objects have
to have the same type
</listitem>
<listitem>obj2 may have a different type only if obj1 is the (PDF) null
object.
</listitem>
</itemizedlist>
Note that CXref doesn't use this method internally (as mentioned before, it
doesn't do any checking on values at all), but exports it, so that an
instance user can do the checking on their own (XRefWriter in the 3rd layer
uses this method in paranoid mode).
</para>
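<para>
The three rules can be expressed in a few lines. The following sketch
uses a hypothetical <code>Obj</code> structure instead of the xpdf
Object class, and it assumes that the null test is applied to obj1
after a possible reference has been resolved; the real method may treat
that case differently.
</para>
<programlisting>
<![CDATA[
```cpp
// Hypothetical object model for illustration; xpdf's Object differs.
enum ObjType { T_NULL, T_INT, T_STRING, T_DICT, T_REF };

struct Obj {
    ObjType type;
    const Obj* target;  // used only when type == T_REF
};

static const Obj NULL_OBJ = { T_NULL, nullptr };

// "Fetches" the object behind a reference; other objects are returned
// as they are. A reference without a target resolves to null.
const Obj& resolve(const Obj& obj)
{
    if (obj.type != T_REF)
        return obj;
    return obj.target ? *obj.target : NULL_OBJ;
}

// Returns true if newObj may replace oldObj.
bool typeSafe(const Obj& oldObj, const Obj& newObj)
{
    const Obj& oldValue = resolve(oldObj);
    const Obj& newValue = resolve(newObj);
    // Assumption: null is tested on the resolved old value.
    if (oldValue.type == T_NULL)
        return true;  // anything may replace null
    return oldValue.type == newValue.type;
}
```
]]>
</programlisting>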
</sect2>
<sect2 id="highest_layer">
<title>The highest layer</title>
<para>
The highest layer is represented by the <xref linkend="kernel_part_xrefwriter"/>
class, an extension of the <xref linkend="kernel_part_cxref"/> class. Its
responsibility is to keep the logic on top of changes, to enable writing
them to the file and to maintain revisions of the document. Logic on top
of changes means type checking which prevents object type inconsistency.
</para>
</sect2>
<para>
For more information about the separation of responsibilities and
functionality, see the following figure.
<mediaobject>
<imageobject>
<imagedata fileref="kernel/images/xref_layer_diagram.png" format="PNG"/>
</imageobject>
<caption><para>3 layers diagram</para></caption>
</mediaobject>
</para>
</sect1>
<sect1 id="document_saving">
<title>Document saving</title>
<para>
As mentioned above, the PDF format supports changes to a document through
the so-called incremental update (all changed objects are appended to the
end of the document together with a new cross reference section for the
changed objects). This means that each set of changes forms a new
revision, which raises a question: what should be stored in one revision
and which changes are not worth a new revision?
Users usually want to save everything for fear of data loss and do not
think about revisions. If each save created a new revision, it would lead
to a mess with a huge number of revisions without any meaning.
</para>
<sect2 id="revision_saving">
<title>Revision saving</title>
<para>
XRefWriter provides save functionality with a flag. This flag says how data
should be stored with respect to revisions:
<itemizedlist>
<listitem>temporary saving, which dumps all changes with a correct cross
reference table and trailer at the end of the document but otherwise
ignores them (no internal structures are touched and they are kept
as if no save had been done). If any problem occurs, the changed
data are already stored, so no data loss happens. Whenever a save
is done again, it overwrites the older temporarily saved changes.
</listitem>
<listitem>revision saving, which does the very same as the previous one
except that all internal structures are brought to the state they
would have if the document were opened again after saving. This
means that after saving we are working on a freshly created
revision. This makes sense when the user knows that the changes
made so far belong together in one revision and nothing else
should be mixed with them. The implementation is straightforward,
because we just need to force CXref to reopen (call the
CXref::reopen method) and move storePos behind the stored data.
</listitem>
</itemizedlist>
It is up to the user to choose how to save changes. However, temporary
saving is the default and a new revision is saved only if it is
explicitly requested.
</para>
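<para>
The difference between the two modes can be sketched on an in-memory
"file". The <code>RevisionBuffer</code> class below is purely
hypothetical: it stores the file as a string and changes as text
fragments, and it models the reopen step simply by forgetting the
pending changes. Only the role of storePos is taken from the
description above.
</para>
<programlisting>
<![CDATA[
```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of the two saving modes; not the XRefWriter code.
class RevisionBuffer {
public:
    explicit RevisionBuffer(const std::string& original)
        : file(original), storePos(original.size()) {}

    void addChange(const std::string& data) { changes.push_back(data); }

    void save(bool newRevision) {
        // Everything behind storePos comes from an earlier temporary
        // save and is overwritten.
        file.resize(storePos);
        for (std::size_t i = 0; i < changes.size(); ++i)
            file += changes[i];
        if (newRevision) {
            // Behave as if the file was opened again: the saved data
            // become part of the document and storePos moves behind them.
            storePos = file.size();
            changes.clear();
        }
    }

    const std::string& content() const { return file; }
    std::size_t pendingChanges() const { return changes.size(); }

private:
    std::string file;                  // whole file content
    std::size_t storePos;              // where the next save starts writing
    std::vector<std::string> changes;  // changes not yet part of a revision
};
```
]]>
</programlisting>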
</sect2>
<sect2 id="content_writing">
<title>Content writing and IPdfWriter</title>
<para>
XRefWriter uses the abstract IPdfWriter class to write changed content when
the save method is called. This separates the implementation from the
design. All saving is delegated to the pdfWriter implementation holder and
it is up to this implementation how the content is written (see <xref linkend="kernel_part_pdfwriter"/>).
We have implemented the <classname>OldStylePdfWriter</classname> pdf writer,
which writes objects according to the PDF specification and creates an old
style cross reference table (the standard in PDF specifications prior to
version 1.5, see <xref linkend="crossref_table"/>).
</para>
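<para>
The delegation to an abstract writer can be sketched as follows. The
<code>ContentWriter</code> interface and its single method are
hypothetical and much smaller than the real IPdfWriter; the
implementation only shows how one entry of an old style cross reference
table is formatted (a 10 digit offset, a 5 digit generation number, the
<code>n</code> or <code>f</code> flag and a two character line end, 20
bytes in total).
</para>
<programlisting>
<![CDATA[
```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for an abstract writer interface.
class ContentWriter {
public:
    virtual ~ContentWriter() {}
    virtual std::string xrefEntry(unsigned long offset, unsigned gen,
                                  bool inUse) const = 0;
};

// Writes entries in the old style (table based) format.
class ClassicTableWriter : public ContentWriter {
public:
    virtual std::string xrefEntry(unsigned long offset, unsigned gen,
                                  bool inUse) const {
        char buf[32];
        // Space followed by line feed is one of the allowed 2 byte line ends.
        std::snprintf(buf, sizeof(buf), "%010lu %05u %c \n",
                      offset, gen, inUse ? 'n' : 'f');
        return buf;
    }
};

// The caller depends only on the abstract interface, so another writer
// implementation can be plugged in without changing this function.
std::string writeEntry(const ContentWriter& writer,
                       unsigned long offset, unsigned gen)
{
    return writer.xrefEntry(offset, gen, true);
}
```
]]>
</programlisting>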
</sect2>
<sect2 id="document_cloning">
<title>Document cloning</title>
<para>
To effectively work around the inability of PDF to branch a document,
and so to make changes to older revisions, XRefWriter brings a so-called
cloning capability (this has nothing to do with the object cloning
mentioned in other chapters). Cloning means copying the document content
up to and including the current revision. If the user wants to change
something in an older revision, they can switch to that revision and
clone it to a different file. Changes are enabled in the created
document, because the current revision of the original document is the
newest one in the cloned document. Nevertheless, document merging is not
implemented yet, so there is no way to get those changes back to the main
document (by any pdfedit component).
</para>
</sect2>
</sect1>
<sect1 id="linearized_pdf_documents">
<title>Linearized pdf documents</title>
<para>
All previously mentioned functionality depends on the <emphasis>incremental
update</emphasis> mechanism. However, a PDF document may have a slightly
different format. Such documents are called <emphasis>Linearized</emphasis>
and are designed for environments where it may be a problem (e. g. a time
problem) to wait until the whole document is read before parsing from the
end of the file can start (see PDF specification Appendix F for more
information).
</para>
<para>
Such documents have special requirements and they are not designed for
making changes. The 3rd layer handles this situation rather strictly:
XRefWriter checks whether the given file is linearized during
initialization. Some operations, such as revision handling, are not
implemented for linearized documents, and document saving may produce an
incorrect document (PDF viewers which strictly rely on the linearization
information may display different output).
</para>
<para>
Because many documents (especially those from the internet) are linearized,
we have provided the Delinearizator class, placed in utils. It is able to
get rid of the linearized structures and create a new PDF document which
has the same objects but a normal structure. Usage of the class is very
simple, see the following example:
<programlisting>
<code>
<![CDATA[
// writer decides how the output content is written
IPdfWriter * writer=new OldStylePdfWriter();
Delinearizator *delinearizator=Delinearizator::getInstance(fileName.c_str(), writer);
if(!delinearizator)
{
    // no instance is available for this file
    printf("\t%s is not suitable because it is not linearized.\n",
        fileName.c_str());
    return;
}
// write the delinearized copy next to the original file
string outputFile=fileName+"-delinearizator.pdf";
printf("\tDelinearized output is in %s file\n", outputFile.c_str());
delinearizator->delinearize(outputFile.c_str());
delete delinearizator;
]]>
</code>
</programlisting>
</para>
</sect1>
</chapter>