1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318
|
<?xml version="1.0" encoding="UTF-8"?>
<chapter id="performance_tips">
<title>Performance tips</title>
<sect1 id="small_tables_large_objects">
<title>Small tables of large geometries</title>
<sect2>
<title>Problem description</title>
<para>Current PostgreSQL versions (including 8.0) suffer from a query
optimizer weakness regarding TOAST tables. TOAST tables are a kind of
"extension room" used to store large (in the sense of data size) values
that do not fit into normal data pages (like long texts, images or
complex geometries with lots of vertices), see
<ulink url="http://www.postgresql.org/docs/current/static/storage-toast.html">the PostgreSQL Documentation for TOAST</ulink> for more
information).</para>
<para>The problem appears if you happen to have a table with rather
large geometries, but not too much rows of them (like a table containing
the boundaries of all European countries in high resolution). Then the
table itself is small, but it uses lots of TOAST space. In our example
case, the table itself had about 80 rows and used only 3 data pages, but
the TOAST table used 8225 pages.</para>
<para>Now issue a query where you use the geometry operator &&
to search for a bounding box that matches only very few of those rows.
Now the query optimizer sees that the table has only 3 pages and 80
rows. He estimates that a sequential scan on such a small table is much
faster than using an index. And so he decides to ignore the GIST index.
Usually, this estimation is correct. But in our case, the &&
operator has to fetch every geometry from disk to compare the bounding
boxes, thus reading all TOAST pages, too.</para>
<para>To see whether your suffer from this bug, use the "EXPLAIN
ANALYZE" postgresql command. For more information and the technical
details, you can read the thread on the postgres performance mailing
list:
http://archives.postgresql.org/pgsql-performance/2005-02/msg00030.php</para>
</sect2>
<sect2>
<title>Workarounds</title>
<para>The PostgreSQL people are trying to solve this issue by making the
query estimation TOAST-aware. For now, here are two workarounds:</para>
<para>The first workaround is to force the query planner to use the
index. Send "SET enable_seqscan TO off;" to the server before issuing
the query. This basically forces the query planner to avoid sequential
scans whenever possible. So it uses the GIST index as usual. But this
flag has to be set on every connection, and it causes the query planner
to make misestimations in other cases, so you should "SET enable_seqscan
TO on;" after the query.</para>
<para>The second workaround is to make the sequential scan as fast as
the query planner thinks. This can be achieved by creating an additional
column that "caches" the bbox, and matching against this. In our
example, the commands are like:</para>
<programlisting>SELECT AddGeometryColumn('myschema','mytable','bbox','4326','GEOMETRY','2');
UPDATE mytable SET bbox = ST_Envelope(ST_Force2D(the_geom));</programlisting>
<para>Now change your query to use the && operator against bbox
instead of geom_column, like:</para>
<programlisting>SELECT geom_column
FROM mytable
WHERE bbox && ST_SetSRID('BOX3D(0 0,1 1)'::box3d,4326);</programlisting>
<para>Of course, if you change or add rows to mytable, you have to keep
the bbox "in sync". The most transparent way to do this would be
triggers, but you also can modify your application to keep the bbox
column current or run the UPDATE query above after every
modification.</para>
</sect2>
</sect1>
<sect1 id="database_clustering">
<title>CLUSTERing on geometry indices</title>
<para>For tables that are mostly read-only, and where a single index is
used for the majority of queries, PostgreSQL offers the CLUSTER command.
This command physically reorders all the data rows in the same order as
the index criteria, yielding two performance advantages: First, for index
range scans, the number of seeks on the data table is drastically reduced.
Second, if your working set concentrates to some small intervals on the
indices, you have a more efficient caching because the data rows are
spread along fewer data pages. (Feel invited to read the CLUSTER command
documentation from the PostgreSQL manual at this point.)</para>
<para>However, currently PostgreSQL does not allow clustering on PostGIS
GIST indices because GIST indices simply ignores NULL values, you get an
error message like:</para>
<programlisting>lwgeom=# CLUSTER my_geom_index ON my_table;
ERROR: cannot cluster when index access method does not handle null values
HINT: You may be able to work around this by marking column "the_geom" NOT NULL.</programlisting>
<para>As the HINT message tells you, one can work around this deficiency
by adding a "not null" constraint to the table:</para>
<programlisting>lwgeom=# ALTER TABLE my_table ALTER COLUMN the_geom SET not null;
ALTER TABLE</programlisting>
<para>Of course, this will not work if you in fact need NULL values in
your geometry column. Additionally, you must use the above method to add
the constraint, using a CHECK constraint like "ALTER TABLE blubb ADD CHECK
(geometry is not null);" will not work.</para>
</sect1>
<sect1 id="avoiding_dimension_conversion">
<title>Avoiding dimension conversion</title>
<para>Sometimes, you happen to have 3D or 4D data in your table, but
always access it using OpenGIS compliant ST_AsText() or ST_AsBinary()
functions that only output 2D geometries. They do this by internally
calling the ST_Force2D() function, which introduces a significant
overhead for large geometries. To avoid this overhead, it may be feasible
to pre-drop those additional dimensions once and forever:</para>
<programlisting>UPDATE mytable SET the_geom = ST_Force2D(the_geom);
VACUUM FULL ANALYZE mytable;</programlisting>
<para>Note that if you added your geometry column using
AddGeometryColumn() there'll be a constraint on geometry dimension. To
bypass it you will need to drop the constraint. Remember to update the
entry in the geometry_columns table and recreate the constraint
afterwards.</para>
<para>In case of large tables, it may be wise to divide this UPDATE into
smaller portions by constraining the UPDATE to a part of the table via a
WHERE clause and your primary key or another feasible criteria, and
running a simple "VACUUM;" between your UPDATEs. This drastically reduces
the need for temporary disk space. Additionally, if you have mixed
dimension geometries, restricting the UPDATE by "WHERE
dimension(the_geom)>2" skips re-writing of geometries that already are
in 2D.</para>
</sect1>
<sect1 id="database_tuning_configuration">
<title>Tuning your configuration</title>
<para>Tuning for PostGIS is much like tuning for any PostgreSQL workload. The only additional note to keep in mind is that
geometries and rasters are heavy so memory related optimizations generally have more of an impact on PostGIS than other types of PostgreSQL queries.</para>
<para>For general details about optimizing PostgreSQL, refer to <ulink url="https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server">Tuning your PostgreSQL Server</ulink>.</para>
<para>For PostgreSQL 9.4+ all these can be set at the server level without touching postgresql.conf or postgresql.auto.conf
by using the <code>ALTER SYSTEM..</code> command.</para>
<programlisting>ALTER SYSTEM SET work_mem = '256MB';
-- this will force, non-startup configs to take effect for new connections
SELECT pg_reload_conf();
-- show current setting value
-- use SHOW ALL to see all settings
SHOW work_mem;</programlisting>
<para>In addition to these settings, PostGIS also has some custom settings which you can find listed in <xref linkend="PostGIS_GUC" />.</para>
<sect2>
<title>Startup</title>
<para>
These settings are configured in postgresql.conf:
</para>
<para>
<ulink url="http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-CONSTRAINT-EXCLUSION">constraint_exclusion</ulink>
</para>
<itemizedlist>
<listitem>
<para>
Default: partition
</para>
</listitem>
<listitem>
<para>
This is generally used for table partitioning. The default for this is set to "partition" which is ideal for PostgreSQL 8.4 and above since
it will force the planner to only analyze tables for constraint consideration if they are in an inherited hierarchy
and not pay the planner penalty otherwise.
</para>
</listitem>
</itemizedlist>
<para>
<ulink url="http://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-SHARED-BUFFERS">shared_buffers</ulink>
</para>
<itemizedlist>
<listitem>
<para>
Default: ~32MB
</para>
</listitem>
<listitem>
<para>
Set to about 1/3 to 3/4 of available RAM
</para>
</listitem>
</itemizedlist>
<para>
<ulink url="https://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-MAX-WORKER-PROCESSES">max_worker_processes</ulink>
This setting is only available for PostgreSQL 9.4+. For PostgreSQL 9.6+ this setting has additional importance in that it controls the
max number of processes you can have for parallel queries.
</para>
<itemizedlist>
<listitem>
<para>
Default: 8
</para>
</listitem>
<listitem>
<para>
Sets the maximum number of background processes that
the system can support. This parameter can only be set at
server start.
</para>
</listitem>
</itemizedlist>
</sect2>
<sect2>
<title>Runtime</title>
<para>
<ulink url="http://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-WORK-MEM">work_mem</ulink> (the memory used for sort operations and complex queries)
</para>
<itemizedlist>
<listitem>
<para>
Default: 1-4MB
</para>
</listitem>
<listitem>
<para>
Adjust up for large dbs, complex queries, lots of RAM
</para>
</listitem>
<listitem>
<para>
Adjust down for many concurrent users or low RAM.
</para>
</listitem>
<listitem>
<para>
If you have lots of RAM and few developers:
<programlisting>
SET work_mem TO '256MB';;
</programlisting>
</para>
</listitem>
</itemizedlist>
<para>
<ulink url="http://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-MAINTENANCE-WORK-MEM">maintenance_work_mem</ulink> (used for VACUUM, CREATE INDEX, etc.)
</para>
<itemizedlist>
<listitem>
<para>
Default: 16-64MB
</para>
</listitem>
<listitem>
<para>
Generally too low - ties up I/O, locks objects while swapping memory
</para>
</listitem>
<listitem>
<para>
Recommend 32MB to 1GB on production servers w/lots of RAM, but depends
on the # of concurrent users. If you have lots of RAM and few developers:
<programlisting>
SET maintenance_work_mem TO '1GB';
</programlisting>
</para>
</listitem>
</itemizedlist>
<para>
<ulink url="https://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-MAX-PARALLEL-WORKERS-PER-GATHER">max_parallel_workers_per_gather</ulink>
This setting is only available for PostgreSQL 9.6+ and will only affect PostGIS 2.3+, since only PostGIS 2.3+ supports parallel queries.
If set to higher than 0, then some queries such as those involving relation functions like <code>ST_Intersects</code> can use multiple processes and can run
more than twice as fast when doing so. If you have a lot of processors to spare, you should change the value of this to as many processors as you have.
Also make sure to bump up <code>max_worker_processes</code> to at least as high as this number.
</para>
<itemizedlist>
<listitem>
<para>
Default: 0
</para>
</listitem>
<listitem>
<para>
Sets the maximum number of workers that can be started
by a single <varname>Gather</varname> node.
Parallel workers are taken from the pool of processes
established by <varname>max_worker_processes</varname>.
Note that the requested number of workers may not
actually be available at run time. If this occurs, the
plan will run with fewer workers than expected, which may
be inefficient. Setting this value to 0, which is the
default, disables parallel query execution.
</para>
</listitem>
</itemizedlist>
</sect2>
</sect1>
</chapter>
|