<?xml version="1.0" encoding="UTF-8"?>
<!--
   ****************************************************************************
    MobilityDB Manual
    Copyright(c) MobilityDB Contributors

    This documentation is licensed under a Creative Commons Attribution-Share
    Alike 3.0 License: https://creativecommons.org/licenses/by-sa/3.0/
   ****************************************************************************
-->
<chapter xml:id="ttype_aggregation">
	<title>Temporal Types: Aggregation and Indexing</title>

	<sect1 xml:id="ttype_aggregations">
		<title>Aggregation</title>

		<para>The temporal aggregate functions generalize the traditional aggregate functions. Their semantics is that they compute the value of the function at every instant in the <emphasis>union</emphasis> of the temporal extents of the values to aggregate. In contrast, recall that all other functions manipulating temporal types compute the value of the function at every instant in the <emphasis>intersection</emphasis> of the temporal extents of the arguments.</para>
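
		<para>As a minimal sketch of this difference, the following queries use two hypothetical temporal integers (they are not taken from the example tables) whose temporal extents only partially overlap. The aggregate is expected to be defined over the union of the two extents, that is, from 2001-01-01 to 2001-01-04, while the addition operator is expected to be defined only over their intersection, from 2001-01-02 to 2001-01-03.
			<programlisting language="sql" xml:space="preserve" format="linespecific">
-- Aggregation: the result covers the union of the two extents
WITH Temp(Val) AS (
  SELECT tint '[1@2001-01-01, 1@2001-01-03]' UNION ALL
  SELECT tint '[2@2001-01-02, 2@2001-01-04]' )
SELECT tSum(Val) FROM Temp;
-- Binary operator: the result covers only the intersection of the two extents
SELECT tint '[1@2001-01-01, 1@2001-01-03]' + tint '[2@2001-01-02, 2@2001-01-04]';
</programlisting>
		</para>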

		<para>The temporal aggregate functions are the following ones:</para>
		<itemizedlist>
			<listitem><para>For all temporal types, the function <varname>tCount</varname> generalizes the traditional function <varname>count</varname>. The temporal count can be used to compute at each point in time the number of available objects (for example, the number of cars in an area).</para></listitem>
			<listitem><para>For all temporal types, function <varname>extent</varname> returns a bounding box that encloses a set of temporal values. Depending on the base type, the result of this function can be a <varname>tstzspan</varname>, a <varname>tbox</varname> or an <varname>stbox</varname>.</para></listitem>
			<listitem><para>For the temporal Boolean type, the functions <varname>tAnd</varname> and <varname>tOr</varname> generalize the traditional functions <varname>and</varname> and <varname>or</varname>.</para></listitem>
			<listitem><para>For temporal numeric types, there are two types of temporal aggregate functions. The functions <varname>tMin</varname>, <varname>tMax</varname>, <varname>tSum</varname>, and <varname>tAvg</varname> generalize the traditional functions <varname>min</varname>, <varname>max</varname>, <varname>sum</varname>, and <varname>avg</varname>. Furthermore, the functions <varname>wMin</varname>, <varname>wMax</varname>, <varname>wCount</varname>, <varname>wSum</varname>, and <varname>wAvg</varname> are window (or cumulative) versions of the traditional functions that, given a time interval w, compute the value of the function at an instant t by considering the values during the interval [t-w, t]. All window aggregate functions are available for temporal integers, while for temporal floats only window minimum and maximum are meaningful.</para></listitem>
			<listitem><para>For the temporal text type, the functions <varname>tMin</varname> and <varname>tMax</varname> generalize the traditional functions <varname>min</varname> and <varname>max</varname>.</para></listitem>
			<listitem><para>Finally, for temporal point types, the function <varname>tCentroid</varname> generalizes the function <varname>ST_Centroid</varname> provided by PostGIS. For example, given a set of objects that move together (that is, a convoy or a flock), the temporal centroid produces a temporal point that represents at each instant the geometric center (or the center of mass) of all the moving objects.</para></listitem>
		</itemizedlist>

		<para>In the examples that follow, we suppose that the tables <varname>Department</varname> and <varname>Trips</varname> contain the two tuples introduced in <xref linkend="ttype_examples"/>.</para>
		<itemizedlist>
			<listitem xml:id="tCount">
				<indexterm significance="normal"><primary><varname>tCount</varname></primary></indexterm>
				<para>Temporal count</para>
				<para><varname>tCount(ttype) → {tintSeq,tintSeqSet}</varname></para>
				<programlisting language="sql" xml:space="preserve" format="linespecific">
SELECT tCount(NoEmps) FROM Department;
-- {[1@2001-01-01, 2@2001-02-01, 1@2001-08-01, 1@2001-10-01)}
</programlisting>
			</listitem>

			<listitem xml:id="extent">
				<indexterm significance="normal"><primary><varname>extent</varname></primary></indexterm>
				<para>Bounding box extent</para>
				<para><varname>extent(temp) → {tstzspan,tbox,stbox}</varname></para>
				<programlisting language="sql" xml:space="preserve" format="linespecific">
SELECT extent(noEmps) FROM Department;
-- TBOX XT((4,12),[2001-01-01,2001-10-01])
SELECT extent(Trip) FROM Trips;
-- STBOX XT(((0,0),(3,3)),[2001-01-01 08:00:00+01, 2001-01-01 08:20:00+01))
</programlisting>
			</listitem>

			<listitem xml:id="tAnd">
				<indexterm significance="normal"><primary><varname>tAnd</varname></primary></indexterm>
				<indexterm significance="normal"><primary><varname>tOr</varname></primary></indexterm>
				<para>Temporal and, temporal or</para>
				<para><varname>tOr(tbool) → tbool</varname></para>
				<para><varname>tAnd(tbool) → tbool</varname></para>
				<programlisting language="sql" xml:space="preserve" format="linespecific">
SELECT tAnd(NoEmps #&gt; 6) FROM Department;
-- {[t@2001-01-01, f@2001-04-01, f@2001-10-01)}
SELECT tOr(NoEmps #&gt; 6) FROM Department;
-- {[t@2001-01-01, f@2001-08-01, f@2001-10-01)}
</programlisting>
			</listitem>

			<listitem xml:id="tMin">
				<indexterm significance="normal"><primary><varname>tMin</varname></primary></indexterm>
				<indexterm significance="normal"><primary><varname>tMax</varname></primary></indexterm>
				<indexterm significance="normal"><primary><varname>tSum</varname></primary></indexterm>
				<indexterm significance="normal"><primary><varname>tAvg</varname></primary></indexterm>
				<para>Temporal minimum, maximum, sum, and average</para>
				<para><varname>tMin(ttype) → ttype</varname></para>
				<para><varname>tMax(ttype) → ttype</varname></para>
				<para><varname>tSum(tnumber) → {tnumberSeq,tnumberSeqSet}</varname></para>
				<para><varname>tAvg(tnumber) → {tfloatSeq,tfloatSeqSet}</varname></para>
				<programlisting language="sql" xml:space="preserve" format="linespecific">
SELECT tMin(NoEmps) FROM Department;
-- {[10@2001-01-01, 4@2001-02-01, 6@2001-06-01, 6@2001-10-01)}
SELECT tMax(NoEmps) FROM Department;
-- {[10@2001-01-01, 12@2001-04-01, 6@2001-08-01, 6@2001-10-01)}
SELECT tSum(NoEmps) FROM Department;
/* {[10@2001-01-01, 14@2001-02-01, 16@2001-04-01, 18@2001-06-01, 6@2001-08-01,
   6@2001-10-01)} */
SELECT tAvg(NoEmps) FROM Department;
/* {[10@2001-01-01, 10@2001-02-01), [7@2001-02-01, 7@2001-04-01),
   [8@2001-04-01, 8@2001-06-01), [9@2001-06-01, 9@2001-08-01),
   [6@2001-08-01, 6@2001-10-01)} */
</programlisting>
			</listitem>

			<listitem xml:id="wCount">
				<indexterm significance="normal"><primary><varname>wCount</varname></primary></indexterm>
				<para>Window count</para>
				<para><varname>wCount(tnumber,interval) → {tintSeq,tintSeqSet}</varname></para>
				<programlisting language="sql" xml:space="preserve" format="linespecific">
SELECT wCount(NoEmps, interval '2 days') FROM Department;
/* {[1@2001-01-01, 2@2001-02-01, 3@2001-04-01, 2@2001-04-03, 3@2001-06-01, 2@2001-06-03,
   1@2001-08-03, 1@2001-10-03)} */
</programlisting>
			</listitem>

			<listitem xml:id="wMin">
				<indexterm significance="normal"><primary><varname>wMin</varname></primary></indexterm>
				<indexterm significance="normal"><primary><varname>wMax</varname></primary></indexterm>
				<indexterm significance="normal"><primary><varname>wSum</varname></primary></indexterm>
				<indexterm significance="normal"><primary><varname>wAvg</varname></primary></indexterm>
				<para>Window minimum, maximum, sum, and average</para>
				<para><varname>wMin(tnumber,interval) → {tnumberSeq,tnumberSeqSet}</varname></para>
				<para><varname>wMax(tnumber,interval) → {tnumberSeq,tnumberSeqSet}</varname></para>
				<para><varname>wSum(tint,interval) → {tintSeq,tintSeqSet}</varname></para>
				<para><varname>wAvg(tint,interval) → {tfloatSeq,tfloatSeqSet}</varname></para>
				<programlisting language="sql" xml:space="preserve" format="linespecific">
SELECT wMin(NoEmps, interval '2 days') FROM Department;
-- {[10@2001-01-01, 4@2001-04-01, 6@2001-06-03, 6@2001-10-03)}
SELECT wMax(NoEmps, interval '2 days') FROM Department;
-- {[10@2001-01-01, 12@2001-04-01, 6@2001-08-03, 6@2001-10-03)}
SELECT wSum(NoEmps, interval '2 days') FROM Department;
/* {[10@2001-01-01, 14@2001-02-01, 26@2001-04-01, 16@2001-04-03, 22@2001-06-01,
   18@2001-06-03, 6@2001-08-03, 6@2001-10-03)} */
SELECT round(wAvg(NoEmps, interval '2 days'), 3) FROM Department;
/*  Interp=Step;{[10@2001-01-01, 7@2001-02-01, 8.667@2001-04-01, 8@2001-04-03, 
   7.333@2001-06-01, 9@2001-06-03, 6@2001-08-03, 6@2001-10-03)} */
</programlisting>
			</listitem>

			<listitem xml:id="tCentroid">
				<indexterm significance="normal"><primary><varname>tCentroid</varname></primary></indexterm>
				<para>Temporal centroid</para>
				<para><varname>tCentroid(tgeompoint) → tgeompoint</varname></para>
				<programlisting language="sql" xml:space="preserve" format="linespecific">
SELECT tCentroid(Trip) FROM Trips;
/* {[POINT(0 0)@2001-01-01 08:00:00+00, POINT(1 0)@2001-01-01 08:05:00+00),
   [POINT(0.5 0)@2001-01-01 08:05:00+00, POINT(1.5 0.5)@2001-01-01 08:10:00+00,
   POINT(2 1.5)@2001-01-01 08:15:00+00),
   [POINT(2 2)@2001-01-01 08:15:00+00, POINT(3 3)@2001-01-01 08:20:00+00)} */
</programlisting>
			</listitem>

		</itemizedlist>
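
		<para>Since a window aggregate at an instant t considers the values during the interval [t-w, t], a value that ends at some instant still contributes to the result for a duration w after that instant. This is why the window results above extend to 2001-10-03 for a window of 2 days. The following sketch, which uses two hypothetical temporal integers separated by a gap, contrasts the instantaneous count with the window count. The window count is expected to remain defined one day longer after each value ends.
			<programlisting language="sql" xml:space="preserve" format="linespecific">
-- Hypothetical values: the first one ends on 2001-01-03, the second starts on 2001-01-05
WITH Temp(Val) AS (
  SELECT tint '[1@2001-01-01, 1@2001-01-03]' UNION ALL
  SELECT tint '[2@2001-01-05, 2@2001-01-06]' )
SELECT tCount(Val) AS InstantCount,
  wCount(Val, interval '1 day') AS WindowCount
FROM Temp;
</programlisting>
		</para>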
	</sect1>

	<sect1 xml:id="ttype_indexing">
		<title>Indexing</title>
		<para>GiST and SP-GiST indexes can be created for table columns of temporal types. The GiST index implements an R-tree and the SP-GiST index implements an n-dimensional quad-tree. Examples of index creation are as follows:
			<programlisting language="sql" xml:space="preserve" format="linespecific">
CREATE INDEX Department_NoEmps_Gist_Idx ON Department USING Gist(NoEmps);
CREATE INDEX Trips_Trip_SPGist_Idx ON Trips USING SPGist(Trip);
</programlisting>
		</para>

		<para>The GiST and SP-GiST indexes store the bounding box for the temporal types. As explained in <xref linkend="ttype_p1"/>, these are
			<itemizedlist>
				<listitem>
					<para>the <varname>tstzspan</varname> type for the <varname>tbool</varname> and <varname>ttext</varname> types,</para>
				</listitem>

				<listitem>
					<para>the <varname>tbox</varname> type for the <varname>tint</varname> and <varname>tfloat</varname> types,</para>
				</listitem>

				<listitem>
					<para>the <varname>stbox</varname> type for the <varname>tgeompoint</varname>, <varname>tgeogpoint</varname>, <varname>tgeometry</varname> and <varname>tgeography</varname> types.</para>
				</listitem>
			</itemizedlist>
		</para>

		<para>A GiST or SP-GiST index can accelerate queries involving the following operators (see <xref linkend="ttype_bbox"/> for more information):
			<itemizedlist>
				<listitem>
					<para><varname>&lt;&lt;</varname>, <varname>&amp;&lt;</varname>, <varname>&amp;&gt;</varname>, <varname>&gt;&gt;</varname>, which only consider the value dimension in temporal alphanumeric types,</para>
				</listitem>

				<listitem>
					<para><varname>&lt;&lt;</varname>, <varname>&amp;&lt;</varname>, <varname>&amp;&gt;</varname>, <varname>&gt;&gt;</varname>, <varname>&lt;&lt;|</varname>, <varname>&amp;&lt;|</varname>, <varname>|&amp;&gt;</varname>, <varname>|&gt;&gt;</varname>, <varname>&amp;&lt;/</varname>, <varname>&lt;&lt;/</varname>, <varname>/&gt;&gt;</varname>, and <varname>/&amp;&gt;</varname>, which only consider the spatial dimension in temporal point types,</para>
				</listitem>

				<listitem>
					<para><varname>&amp;&lt;#</varname>, <varname>&lt;&lt;#</varname>, <varname>#&gt;&gt;</varname>, <varname>#&amp;&gt;</varname>, which only consider the time dimension for all temporal types,</para>
				</listitem>

				<listitem>
					<para><varname>&amp;&amp;</varname>, <varname>@&gt;</varname>, <varname>&lt;@</varname>, <varname>~=</varname>, and <varname>|=|</varname>, which consider as many dimensions as are shared by the indexed column and the query argument. These operators work on bounding boxes (that is, <varname>tstzspan</varname>, <varname>tbox</varname>, or <varname>stbox</varname>), not on the entire values.</para>
				</listitem>
			</itemizedlist>
		</para>
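
		<para>As an illustration of these operator groups, the following queries sketch one condition per kind of dimension. The spans and box coordinates are arbitrary values chosen for the example, and the <varname>stbox</varname> literal syntax is assumed to follow the output format shown in the aggregation examples.
			<programlisting language="sql" xml:space="preserve" format="linespecific">
-- Time dimension only: values that are always before the given span
SELECT * FROM Department WHERE NoEmps &lt;&lt;# tstzspan '[2002-01-01, 2003-01-01)';
-- Spatial dimension only: trips that are always strictly to the left of the box
SELECT * FROM Trips WHERE Trip &lt;&lt; stbox 'STBOX X((5,0),(6,1))';
-- Shared dimensions: trips whose bounding box is contained in the given box
SELECT * FROM Trips WHERE Trip &lt;@
  stbox(geometry 'Polygon((0 0,0 5,5 5,5 0,0 0))', tstzspan '[2001-01-01, 2001-01-02)');
</programlisting>
		</para>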

		<para>For example, given the index defined above on the <varname>Department</varname> table and a query that involves a condition with the <varname>&amp;&amp;</varname> (overlaps) operator, if the right argument is a temporal float then both the value and the time dimensions are considered for filtering the tuples of the relation, while if the right argument is a float value, a float span, or a time type, then either the value or the time dimension will be used for filtering the tuples of the relation. Furthermore, a bounding box can be constructed from a value/span and/or a timestamp/period, which can be used for filtering the tuples of the relation. Examples of queries using the index on the <varname>Department</varname> table defined above are given next.
			<programlisting language="sql" xml:space="preserve" format="linespecific">
SELECT * FROM Department WHERE NoEmps &amp;&amp; intspan '[1, 5)';
SELECT * FROM Department WHERE NoEmps &amp;&amp; tstzspan '[2001-04-01, 2001-05-01)';
SELECT * FROM Department WHERE NoEmps &amp;&amp;
  tbox(intspan '[1, 5)', tstzspan '[2001-04-01, 2001-05-01)');
SELECT * FROM Department WHERE NoEmps &amp;&amp;
  tfloat '{[1@2001-01-01, 1@2001-02-01), [5@2001-04-01, 5@2001-05-01)}';
</programlisting>
		</para>
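
		<para>Whether the planner actually chooses the index depends on the size of the table and on the collected statistics. On a very small table such as the example one, a sequential scan is usually cheaper. A possible way to check the plan, using only standard PostgreSQL commands, is sketched below. Disabling sequential scans is meant for experimentation only.
			<programlisting language="sql" xml:space="preserve" format="linespecific">
-- For experimentation only: discourage sequential scans on small tables
SET enable_seqscan = off;
-- Display the plan chosen for one of the queries above
EXPLAIN
SELECT * FROM Department WHERE NoEmps &amp;&amp; intspan '[1, 5)';
-- Restore the default planner setting
RESET enable_seqscan;
</programlisting>
		</para>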

		<para>Similarly, examples of queries using the index on the <varname>Trips</varname> table defined above are given next.
			<programlisting language="sql" xml:space="preserve" format="linespecific">
SELECT * FROM Trips WHERE Trip &amp;&amp; geometry 'Polygon((0 0,0 1,1 1,1 0,0 0))';
SELECT * FROM Trips WHERE Trip &amp;&amp; timestamptz '2001-01-01';
SELECT * FROM Trips WHERE Trip &amp;&amp; tstzspan '[2001-01-01, 2001-01-05)';
SELECT * FROM Trips WHERE Trip &amp;&amp;
  stbox(geometry 'Polygon((0 0,0 1,1 1,1 0,0 0))', tstzspan '[2001-01-01, 2001-01-05]');
SELECT * FROM Trips WHERE Trip &amp;&amp;
  tgeompoint '{[Point(0 0)@2001-01-01, Point(1 1)@2001-01-02, Point(1 1)@2001-01-05)}';
</programlisting>
		</para>

		<para>Finally, B-tree indexes can be created for table columns of all temporal types. For this index type, the only useful operation is equality. There is a B-tree sort ordering defined for values of temporal types, with corresponding <varname>&lt;</varname>, <varname>&lt;=</varname>, <varname>&gt;</varname>, and <varname>&gt;=</varname> operators, but the ordering is rather arbitrary and rarely useful in practice. B-tree support for temporal types is primarily meant to allow sorting internally in queries, rather than creation of actual indexes.</para>
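
		<para>The following sketch illustrates both uses of the B-tree ordering. The index name and the temporal integer used in the equality condition are hypothetical and chosen only for the example.
			<programlisting language="sql" xml:space="preserve" format="linespecific">
-- Sorting and duplicate elimination rely on the B-tree ordering
SELECT DISTINCT NoEmps FROM Department ORDER BY NoEmps;
-- An explicit B-tree index is possible, but it mainly helps equality lookups
CREATE INDEX Department_NoEmps_Btree_Idx ON Department USING btree(NoEmps);
SELECT * FROM Department
WHERE NoEmps = tint '[10@2001-01-01, 12@2001-04-01, 12@2001-06-01)';
</programlisting>
		</para>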

		<para>In order to speed up several of the functions for temporal types, we can add to the <varname>WHERE</varname> clause of queries a bounding box comparison that makes use of the available indexes. For example, this is typically the case for the functions that project the temporal types to the value/spatial and/or time dimensions. The bounding box comparison filters out tuples with the help of an index, as shown in the following query.
			<programlisting language="sql" xml:space="preserve" format="linespecific">
SELECT atTime(T.Trip, tstzspan '[2001-01-01, 2001-01-02)')
FROM Trips T
-- Bounding box index filtering
WHERE T.Trip &amp;&amp; tstzspan '[2001-01-01, 2001-01-02)';
</programlisting>
		</para>

		<para>In the case of temporal points, all spatial relationships with the ever and always semantics (see <xref linkend="tgeo_spatial_rel"/>) automatically include a bounding box comparison that will make use of any indexes that are available on the temporal points. For this reason, the ever or always version of a relationship is typically used for filtering the tuples with the help of an index when computing the corresponding temporal relationship, as shown in the following query.
			<programlisting language="sql" xml:space="preserve" format="linespecific">
SELECT tIntersects(T.Trip, R.Geom)
FROM Trips T, Regions R
-- Bounding box index filtering
WHERE eIntersects(T.Trip, R.Geom);
</programlisting>
		</para>
	</sect1>

	<sect1 xml:id="ttype_statistics">
		<title>Statistics and Selectivity</title>
		<sect2>
			<title>Statistics Collection</title>
			<para>The PostgreSQL planner relies on statistical information about the contents of tables in order to generate the most efficient execution plan for queries. These statistics include a list of some of the most common values in each column and a histogram showing the approximate data distribution in each column. For large tables, a random sample of the table contents is taken, rather than examining every row. This enables large tables to be analyzed in a small amount of time. The statistical information is gathered by the <varname>ANALYZE</varname> command and stored in the <varname>pg_statistic</varname> catalog table. Since different kinds of statistics may be appropriate for different kinds of data, the table only stores very general statistics (such as the fraction of null values) in dedicated columns. Everything else is stored in five “slots”, which are pairs of array columns that store the statistics for a column of an arbitrary type.</para>
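
			<para>As a minimal sketch, the general statistics of a column can be gathered and inspected with standard PostgreSQL commands. The example assumes the <varname>Trips</varname> table used in the previous sections. Unquoted identifiers are folded to lowercase in the catalogs, hence the lowercase names in the condition.
				<programlisting language="sql" xml:space="preserve" format="linespecific">
-- Collect the statistics for all columns of the table
ANALYZE Trips;
-- Inspect the general statistics through the pg_stats view
SELECT attname, null_frac, avg_width, n_distinct
FROM pg_stats
WHERE tablename = 'trips' AND attname = 'trip';
</programlisting>
			</para>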

			<para>The statistics collected for time types and temporal types are based on those collected by PostgreSQL for scalar types and span types. For scalar types, such as <varname>float</varname>, the following statistics are collected:
				<orderedlist numeration="arabic" inheritnum="ignore" continuation="restarts">
					<listitem>
						<para>fraction of null values,</para>
					</listitem>
					<listitem>
						<para>average width, in bytes, of non-null values,</para>
					</listitem>
					<listitem>
						<para>number of different non-null values,</para>
					</listitem>
					<listitem>
						<para>array of most common values and array of their frequencies,</para>
					</listitem>
					<listitem>
						<para>histogram of values, where the most common values are excluded,</para>
					</listitem>
					<listitem>
						<para>correlation between physical and logical row ordering.</para>
					</listitem>
				</orderedlist>
			</para>

			<para>For span types, like <varname>tstzspan</varname>, three additional histograms are collected:
				<orderedlist continuation="continues" numeration="arabic" inheritnum="ignore">
					<listitem>
						<para>length histogram of non-empty spans,</para>
					</listitem>
					<listitem>
						<para>histograms of lower and upper bounds.</para>
					</listitem>
				</orderedlist>
			</para>

			<para>For geometries, in addition to (1)–(3), the following statistics are collected:
				<orderedlist continuation="continues" numeration="arabic" inheritnum="ignore">
					<listitem>
						<para>number of dimensions of the values, N-dimensional bounding box, number of rows in the table, number of rows in the sample, number of non-null values,</para>
					</listitem>
					<listitem>
						<para>N-dimensional histogram that divides the bounding box into a number of cells and keeps the proportion of values that intersects with each cell.</para>
					</listitem>
				</orderedlist>
			</para>

			<para>The statistics collected for columns of the time and span types <varname>tstzset</varname>, <varname>tstzspan</varname>, <varname>tstzspanset</varname>, <varname>intspan</varname>, and <varname>floatspan</varname> replicate those collected by PostgreSQL for the <varname>tstzrange</varname> type. This is straightforward for the span types in MobilityDB, which are more efficient versions of the range types in PostgreSQL. For the <varname>tstzset</varname> and the <varname>tstzspanset</varname> types, a value is converted into its bounding period, then the statistics for the <varname>tstzspan</varname> type are collected.</para>

			<para>The statistics collected for columns of temporal types depend on their subtype and their base type. In addition to statistics (1)–(3) that are collected for all temporal types, statistics are collected for the time and the value dimensions independently. More precisely, the following statistics are collected for the time dimension:
				<itemizedlist>
					<listitem>
						<para>For columns of instant subtype, the statistics (4)–(6) are collected for the timestamps.</para>
					</listitem>

					<listitem>
						<para>For columns of other subtypes, the statistics (7)–(8) are collected for the (bounding box) periods.</para>
					</listitem>
				</itemizedlist>
			</para>

			<para>The following statistics are collected for the value dimension:
				<itemizedlist>
					<listitem>
						<para>For columns of temporal types with step interpolation (that is, <varname>tbool</varname>, <varname>ttext</varname>, or <varname>tint</varname>):
							<itemizedlist>
								<listitem>
									<para>For the instant subtype, the statistics (4)–(6) are collected for the values.</para>
								</listitem>

								<listitem>
									<para>For all other subtypes, the statistics (7)–(8) are collected for the values.</para>
								</listitem>
							</itemizedlist>
						</para>
					</listitem>

					<listitem>
						<para>For columns of the temporal float type (that is, <varname>tfloat</varname>):
							<itemizedlist>
								<listitem>
									<para>For the instant subtype, the statistics (4)–(6) are collected for the values.</para>
								</listitem>
								<listitem>
									<para>For all other subtypes, the statistics (7)–(8) are collected for the (bounding) value spans.</para>
								</listitem>
							</itemizedlist>
						</para>
					</listitem>

					<listitem>
						<para>For columns of temporal point types (that is, <varname>tgeompoint</varname> and <varname>tgeogpoint</varname>) the statistics (9)–(10) are collected for the points.</para>
					</listitem>
				</itemizedlist>
			</para>
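
			<para>To see which slots are filled for a given column, the <varname>pg_statistic</varname> catalog can be queried directly, as sketched below for the <varname>Trips</varname> table. By default, reading this catalog requires superuser privileges. The query only shows the kind code stored in each of the five slots; the meaning of the codes used by MobilityDB is not covered here.
				<programlisting language="sql" xml:space="preserve" format="linespecific">
-- Kind of statistics stored in each of the five slots for column Trip
SELECT s.stakind1, s.stakind2, s.stakind3, s.stakind4, s.stakind5
FROM pg_statistic s
  JOIN pg_attribute a ON a.attrelid = s.starelid AND a.attnum = s.staattnum
WHERE s.starelid = 'trips'::regclass AND a.attname = 'trip';
</programlisting>
			</para>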
		</sect2>

		<sect2>
			<title>Selectivity Estimation</title>

			<para>Boolean operators in PostgreSQL can be associated with two selectivity functions, which compute how likely a value of a given type will match a given criterion. These selectivity functions rely on the statistics collected. There are two types of selectivity functions. The <emphasis>restriction</emphasis> selectivity functions try to estimate the percentage of the rows in a table that satisfy a <varname>WHERE</varname>-clause condition of the form <varname>column OP constant</varname>. On the other hand, the <emphasis>join</emphasis> selectivity functions try to estimate the percentage of the rows in a table that satisfy a <varname>WHERE</varname>-clause condition of the form <varname>table1.column1 OP table2.column2</varname>.</para>
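
			<para>The selectivity functions associated with an operator can be looked up in the standard <varname>pg_operator</varname> catalog, where the columns <varname>oprrest</varname> and <varname>oprjoin</varname> hold the restriction and join selectivity functions. The following sketch lists them for the overlaps operator with a temporal geometry point as left argument. It assumes that the MobilityDB extension is installed, and the function names it returns are not assumed here.
				<programlisting language="sql" xml:space="preserve" format="linespecific">
-- Restriction (oprrest) and join (oprjoin) selectivity functions of an operator
SELECT oprname, oprleft::regtype, oprright::regtype, oprrest, oprjoin
FROM pg_operator
WHERE oprname = '&amp;&amp;' AND oprleft = 'tgeompoint'::regtype;
</programlisting>
			</para>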

			<para>MobilityDB defines 23 classes of Boolean operators (such as <varname>=</varname>, <varname>&lt;</varname>, <varname>&amp;&amp;</varname>, <varname>&lt;&lt;</varname>, etc.), each of which can have as left or right arguments a PostgreSQL type (such as <varname>integer</varname>, <varname>timestamptz</varname>, etc.) or a MobilityDB type (such as <varname>tstzspan</varname>, <varname>tint</varname>, etc.). As a consequence, there is a very high number of operators with different arguments to be considered for the selectivity functions. The approach taken was to group these combinations into classes corresponding to the value and time dimensions. The classes correspond to the type of statistics collected as explained in the previous section.</para>

			<para>MobilityDB estimates both restriction and join selectivity for time, span, and temporal types.</para>
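
			<para>A simple way to assess these estimates, sketched below with standard PostgreSQL commands, is to compare the number of rows estimated by the planner with the number of rows actually returned, both of which are reported by <varname>EXPLAIN ANALYZE</varname>. The queries assume the <varname>Trips</varname> table used in the previous sections, and the statistics should have been collected beforehand with <varname>ANALYZE</varname>.
				<programlisting language="sql" xml:space="preserve" format="linespecific">
-- Restriction selectivity: condition of the form column OP constant
EXPLAIN ANALYZE
SELECT * FROM Trips WHERE Trip &amp;&amp; tstzspan '[2001-01-01, 2001-01-02)';
-- Join selectivity: condition of the form table1.column1 OP table2.column2
EXPLAIN ANALYZE
SELECT * FROM Trips T1, Trips T2 WHERE T1.Trip &amp;&amp; T2.Trip;
</programlisting>
			</para>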
		</sect2>
	</sect1>
</chapter>