1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230
|
Gnumeric Structured Text Format (STF) Parser
============================================
by Almer S. Tigelaar
1. Creation/destruction
2. Row separators
3. Trimming
4. Modification determination
5. CSV Parsing
5.1 General
5.2 What is the string indicator?
5.3 How are adjacent string indicators handled?
5.4 What does "duplicates" mean?
6. Fixed width parsing
7. Cached parsing
1. Creation/Destruction
=======================
To parse you first need to create a StfParseOptions_t struct.
this can be done like :
StfParseOptions_t *parseoptions;
parsoptions = stf_parse_options_new ();
After using a parse options struct you must free it by calling :
stf_parse_options_free (parseoptions);
You _HAVE_ to set the parsing method you want to use, either csv or
fixed width, you do this with :
stf_parse_options_set_type (parseoptions, parsetype);
where parsetype is PARSE_TYPE_CSV or PARSE_TYPE_FIXED.
2. Row separators
=================
Normally the newline (\n) character is used to separate rows, however it can
in some cases be desirable to set this to something else, e.g. a return character
(\r). You can archieve this by typing :
stf_parse_options_set_line_terminator (parseoptions, '\r');
3. Trimming
===========
An common problem is to get excess spaces on left and/or right sides
of parsed text, e.g :
" Example "
This is in most cases undesirable. Therefore the stf parser removes
these spaces on _both_ sides by default :
"Example"
You can turn this off or change this with :
stf_parse_options_set_trim_spaces (parseoptions, TRIM_TYPE_NEVER);
stf_parse_options_set_trim_spaces (parseoptions, TRIM_TYPE_LEFT);
stf_parse_options_set_trim_spaces (parseoptions, TRIM_TYPE_RIGHT);
stf_parse_options_set_trim_spaces (parseoptions, TRIM_TYPE_LEFT | TRIM_TYPE_RIGHT);
4. Modification determination
=============================
If you want to know, between several functions calls, if the actual contents of
the StfParseOptions_t struct have been modified you can use the before and after
modification calls, like :
stf_parse_options_before_modification (parseoptions);
/* make some changes to the parseoptions with the set functions calls */
if (stf_parse_options_after_modification (parseoptions)) {
/* The parse options contents have changed...do something */
}
Note that this will keep track of content changes, for example :
(Splitpositions contains the numbers 5, 7 and 9 before modification)
stf_parse_options_before_modification (parseoptions);
stf_parse_options_fixed_splitpositions_clear (parseoptions);
stf_parse_options_fixed_splitpositions_add (parseoptions, 5);
stf_parse_options_fixed_splitpositions_add (parseoptions, 7);
stf_parse_options_fixed_splitpositions_add (parseoptions, 9);
stf_parse_options_after_modification (parseoptions);
stf_parse_options_after_modification WILL return FALSE, even though you cleared the splitpositions
and re-added them the contents (5, 7 and 9) remain the same, so the parseoptions have not been
modified.
5.1 CSV Parsing -> General
==========================
CSV parsing is parsing data like :
hello;this;is;data
this;is;the;second;row;of;data
So lines with columns separated by separators, in this case a colon.
The general way to parse CSV data is :
stf_parse_options_set_type (parseoptions, PARSE_TYPE_CSV);
stf_parse_options_csv_set_separators (parseoptions, ";", NULL);
stf_parse_options_csv_set_stringindicator (parseoptions, '\"');
stf_parse_options_csv_set_duplicates (parseoptions, FALSE);
This code will set the tab and colon characters to be recognized as column separators and
it will set the " characters as the string indicator and it sets no duplicates.
after that we'll call a parsing routine (for the example I'll call the general one)
(normally you don't call the stf_parse_general and stf_parse_general_cached directly,
you create a separate function which parses the GSList returned by stf_parse_general or
stf_parse_general_cached into a custom datastructure)
GSList *mylist;
mylist = stf_parse_general (parseoptions);
5.2 CSV Parsing -> What is the string indicator?
================================================
If you have data where the column separator(s) also appear within
the column themselves the string indicator can be quite handy.
Say you have the following data to parse :
"some";"example;data"
if you would set the string indicator to none and the column separator to
colon you would get three cells :
"some"
"example
data"
This is not what we want ofcourse, "example;data" should be in one cell, so we
can force this be setting the string indicator to " , then the result would be
two cells :
some
example;data
5.3 CSV Parsing -> How are adjacent string indicators handled?
============================================================
When parsing fields which are bounded by string indicators the convention
of doubling the indicator is used to encode an indicator that is NOT
the termination of the string.
eg
"a""b" encodes either the string
a"b
or
ab
To turn this off (it is turned on by default) you can use the following function :
stf_parse_options_csv_set_indicator_2x_is_single (parseoptions, FALSE);
5.4 CSV Parsing -> What does duplicates mean?
=============================================
This means that several column separators are seen as one.
Say we have this data :
this;is;some;;data
(Notice the double colon)
If we would parse this with duplicates set to FALSE, we would get 5 cells :
this
is
some
<-- empty
data
However if we would parse this with duplicates set to TRUE, we would get only 4 cells :
this
is
some
data
So the two colons between "some" and "data" are seen as one.
6. Fixed width parsing
======================
Fixed width means that each column consists of a fixed number of characters.
if we have :
hello this is a test
is this really a test
yes it is a test
Now you can see that each column has a certain, fixed, width.
The widths of the sample data are : 6, 5, 7, 2, 4
But we always have to give absolute positions so the list will
become : 6, 11, 18, 20, 24
If we want to parse this we'll do it like this :
stf_parse_options_fixed_splitpositions_clear (parseoptions);
stf_parse_options_fixed_splitpositions_add (parseoptions, 6);
stf_parse_options_fixed_splitpositions_add (parseoptions, 11);
stf_parse_options_fixed_splitpositions_add (parseoptions, 18);
stf_parse_options_fixed_splitpositions_add (parseoptions, 20);
stf_parse_options_fixed_splitpositions_add (parseoptions, 24);
Alternatively you can also call the autodiscovery function :
stf_parse_options_fixed_autodiscover (parseoptions, lines, text);
This function will try to recognize columns in the text and adjust the
splitpositions accordingly.
after that we'll call a parsing routine (for the example I'll call the general one)
(normally you don't call the stf_parse_general and stf_parse_general_cached directly,
you create a separate function which parses the GSList returned by stf_parse_general or
stf_parse_general_cached into a custom datastructure)
GSList *mylist;
mylist = stf_parse_general (parseoptions);
|