File: dox.h

Package: vc 1.4.5-1
/*  This file is part of the Vc library. {{{
Copyright © 2009-2015 Matthias Kretz <kretz@kde.org>

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    * Neither the names of contributing organizations nor the
      names of its contributors may be used to endorse or promote products
      derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

}}}*/

/*!
\mainpage
\image html logo.png

The %Vc library implements portable, zero-overhead C++ types for explicitly data-parallel
programming.

The 1.x releases ship implementations for x86 SIMD instruction sets: SSE, AVX, AVX2, and
the Xeon Phi (MIC). A scalar implementation ensures full portability to any C++11 capable
compiler and target system.

This documentation is structured in three main areas:
1. Several manually written documentation pages in the *Overview Documents*.
   They cover an introduction to SIMD and data-parallelism, portability issues, macros,
   how to set up the build system, and examples / tutorials.
   \li \subpage intro
   \li \subpage portability
   \li \subpage vcmacros
   \li \subpage featuremacros
   \li \subpage buildsystem
   \li \subpage examples

2. The *API Reference* section contains a manually structured access to the API
   documentation generated from the %Vc sources.
   \li \ref Vectors
   \li \ref Masks
   \li \ref SimdArray
   \li \ref Simdize
   \li \ref Math
   \li \ref Utilities
   \li \ref Containers

3. The *Indexes* section contains automatically generated indexes to the same API
   documentation.

\page intro Introduction

Recent generations of CPUs, and GPUs in particular, require data-parallel codes for full efficiency.
Data parallelism requires that the same sequence of operations is applied to different input data.
CPUs and GPUs can thus reduce the necessary hardware for instruction decoding and scheduling in favor of more arithmetic and logic units, which execute the same instructions synchronously.
On CPU architectures this is implemented via SIMD registers and instructions.
A single SIMD register can store N values and a single SIMD instruction can execute N operations on those values.
On GPU architectures N threads run in perfect sync, fed by a single instruction decoder/scheduler.
Each thread has local memory and a given index to calculate the offsets in memory for loads and stores.

Current C++ compilers can do automatic transformation of scalar codes to SIMD instructions (auto-vectorization).
However, the compiler must reconstruct an intrinsic property of the algorithm that was lost when the developer wrote a purely scalar implementation in C++.
Consequently, C++ compilers cannot vectorize any given code to its most efficient data-parallel variant.
Especially larger data-parallel loops, spanning over multiple functions or even translation units, will often not be transformed into efficient SIMD code.

The %Vc library provides the missing link.
Its types enable explicitly stating data-parallel operations on multiple values.
The parallelism is therefore added via the type system.
Competing approaches state the parallelism via new control structures and consequently new semantics inside the body of these control structures.

If you are new to vectorization, please read the following part and make sure you
understand it:
- The term *vector* used for data-parallel programming is not about the vectors you
  studied in math classes.
- Do not confuse *vector* with containers that also go by the same name. SIMD vectors
  actually do implement some aspect of a container, but they are closer to a fixed-sized
  `std::array` than to a dynamically resizable `std::vector`.
- The *vector* type in %Vc is defined by the target hardware as a group of values with a
  fixed number of entries (\VSize{T}).
  Typically one Vc::Vector object then fits into a SIMD register on the target system.
  Such a SIMD register consequently stores \VSize{T} scalar values; in contrast to a
  general purpose register, which stores only one scalar value.
  This value \VSize{T} is thus an unchangeable property of the hardware and not a variable
  in the %Vc API.
  You can access the \VSize{T} value via the static Vc::Vector::size() function.
  Since this function is a constant expression you can also use it for template arguments.
- Note that some hardware may use different vector register widths for different data
  types.
  For example, AVX has instructions for 256-bit floating point registers, but only 128-bit
  integer instructions, which is why the integral Vc::Vector types use the SSE
  implementation for AVX target systems.
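
The point about `size()` being a constant expression can be sketched without %Vc at all.
The type `PackSketch` below is a hypothetical stand-in (not part of the %Vc API) whose width is
fixed to a pretend 128-bit register; a real Vc::Vector takes its width from the target hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a SIMD vector type; NOT part of Vc.
// A real Vc::Vector<T> gets its width from the target hardware instead.
template <typename T, std::size_t N>
struct PackSketch {
    static constexpr std::size_t size() { return N; }
    T data[N];
};

// Assumption for this sketch: 128-bit registers, i.e. four floats per vector.
using float_pack = PackSketch<float, 16 / sizeof(float)>;

// size() is a constant expression, so it can be used as a template argument ...
using lane_buffer = std::array<float, float_pack::size()>;

// ... and it can drive loops without hardcoding the number of entries.
inline float sum_lanes(const float_pack &v)
{
    float sum = 0.f;
    for (std::size_t i = 0; i < float_pack::size(); ++i) {
        sum += v.data[i];
    }
    return sum;
}
```

Code written this way keeps working when the width changes, which is what happens when the same
source is compiled for a different target.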

\par Example 1:

You can modify a function to use vector types and thus implement a horizontal vectorization.
If the adjective "scalar" confuses you in this context: mathematically, the function performs a
vector-to-vector transformation, but the computer computes it with \em scalar instructions,
i.e. one value per operand.
The original scalar function could look like this:
\code
void normalize(float &x, float &y, float &z)
{
  const float d = std::sqrt(x * x + y * y + z * z);
  x /= d;
  y /= d;
  z /= d;
}
\endcode
To vectorize the \c normalize function with %Vc, the types must be substituted by their %Vc counterparts and math functions
must use the %Vc implementation (which is, by default, also imported into the \c std namespace):
\code
using Vc::float_v;

void normalize(float_v &x, float_v &y, float_v &z)
{
  const float_v d = Vc::sqrt(x * x + y * y + z * z);
  x /= d;
  y /= d;
  z /= d;
}
\endcode
The latter function is able to normalize four 3D vectors when compiled for SSE in the same
time the former function normalizes one 3D vector.
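
To make the "four 3D vectors at once" statement concrete, the following sketch spells out the
same algorithm lane by lane. The type `float4_sketch` is a hypothetical 4-wide stand-in for what
Vc::float_v is on SSE; it is not the %Vc type and uses only the standard library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical 4-wide stand-in for Vc::float_v on SSE; NOT the Vc type.
struct float4_sketch {
    static constexpr std::size_t Size = 4;
    float lane[Size];
};

// Same algorithm as normalize(), written out per lane. Lane i of x, y, and z
// together holds the i-th 3D vector. A real SIMD unit performs each arithmetic
// step for all four lanes at once instead of looping.
inline void normalize_sketch(float4_sketch &x, float4_sketch &y, float4_sketch &z)
{
    for (std::size_t i = 0; i < float4_sketch::Size; ++i) {
        const float d = std::sqrt(x.lane[i] * x.lane[i] + y.lane[i] * y.lane[i] +
                                  z.lane[i] * z.lane[i]);
        x.lane[i] /= d;
        y.lane[i] /= d;
        z.lane[i] /= d;
    }
}
```

Note the data layout this implies: all x components are stored together, as are all y and all z
components.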

For completeness, note that you can optimize the division in the normalize function further:
\code
  // either: reciprocal of the length at full precision
  const float_v d_inv = float_v::One() / Vc::sqrt(x * x + y * y + z * z);
  // or: approximate reciprocal square root (less accurate, but faster)
  // const float_v d_inv = Vc::rsqrt(x * x + y * y + z * z);
\endcode
Then you can multiply \c x, \c y, and \c z with \c d_inv, which is considerably faster than three
divisions.

As you can probably see, the new challenge with %Vc is the use of good data-structures which
support horizontal vectorization. Depending on your problem at hand this may become the main
focus of design (it does not have to be, though).

\section intro_alignment Alignment

\subsection intro_alignment_background What is Alignment

If you do not know what alignment is and why it is important, read on; otherwise skip to \ref
intro_alignment_tools. Normally the alignment of data is an implementation detail left to the
compiler. Until C++11, the language did not even have any (official) means to query or modify
alignment.

Most data types require more than one Byte for storage. Thus, even most atomic data types span
several locations in memory. E.g. if you have a pointer to \c float, the address stored in this
pointer just determines the first of four Bytes of the \c float. Naively, one could think that
any address (which belongs to the process) can be used to store such a float. While this is true
for some architectures, some architectures may terminate the process when a misaligned pointer is
dereferenced. The natural alignment for atomic data types typically is the same as their size.
Thus the address of a \c float object should always be a multiple of 4 Bytes.
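
Since C++11 you can query and raise alignment with `alignof` and `alignas`. The following
minimal sketch checks the "multiple of 4 Bytes" rule for a \c float and shows an over-aligned
type. The helper `is_aligned` and the type `Float4Aligned` are names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if the address p is a multiple of `alignment` Bytes.
inline bool is_aligned(const void *p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// alignof (C++11) queries the alignment requirement of a type;
// alignas (C++11) raises it for a specific type or object.
struct alignas(16) Float4Aligned {
    float data[4];
};
```

For example, `is_aligned(&f, alignof(float))` holds for every \c float object `f` the compiler
places itself.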

Alignment becomes more important for SIMD data types.
1. There are different instructions to load/store aligned and unaligned vectors. The unaligned
load/stores recently were greatly improved in x86 CPUs. Still, the rule of thumb
says that aligned loads/stores are faster.
2. Access to an unaligned vector with an instruction that expects an aligned vector crashes the
application. Once you write vectorized code you might want to make it a habit to check crashes
for unaligned addresses.
3. Memory allocation on the heap will return addresses aligned to some system specific alignment
rule. E.g. Linux 32bit aligns on 8 Bytes, while Linux 64bit aligns on 16 Bytes. Both alignments
are not strict enough for AVX vectors. Worse, if you develop on Linux 64bit with SSE you won't
notice any problems until you switch to a 32bit build or AVX.
4. Placement on the stack is determined at compile time and requires the compiler to know the
alignment restrictions of the type.
5. The size of a cache line is just two or four times larger than the SIMD types (if not equal).
Thus, if you load several vectors consecutively from memory every fourth, second, or even every
load will have to be read from two different cache lines. This is called a cache line split. Such
splits lead to degraded performance, which becomes very noticeable for memory intensive code.
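
The "every fourth, second, or even every load" statement from the last item can be verified with
a little address arithmetic. The sketch below assumes a cache line size of 64 Bytes (typical for
x86, but an assumption here); the function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// True if an access of `bytes` Bytes starting at byte address `addr`
// touches two cache lines. Requires bytes <= line_size.
inline bool splits_cache_line(std::size_t addr, std::size_t bytes,
                              std::size_t line_size = 64)
{
    assert(bytes > 0 && bytes <= line_size);
    return addr / line_size != (addr + bytes - 1) / line_size;
}

// Counts the splits when loading `count` consecutive vectors of `bytes`
// Bytes each, starting at byte address `start`.
inline std::size_t count_splits(std::size_t start, std::size_t bytes,
                                std::size_t count, std::size_t line_size = 64)
{
    std::size_t splits = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (splits_cache_line(start + i * bytes, bytes, line_size)) {
            ++splits;
        }
    }
    return splits;
}
```

With these assumptions, eight misaligned 16-Byte loads produce two splits, eight misaligned
32-Byte loads produce four, and every misaligned 64-Byte load is a split. Aligned loads never
split.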

\subsection intro_alignment_tools Tools

%Vc provides several classes and functions to get alignment right.
\li Vc::VectorAlignment is a compile time constant that equals the largest alignment restriction
                  (in Bytes) for the selected target architecture.
\li Vc::AlignedBase, Vc::VectorAlignedBase, and Vc::MemoryAlignedBase implement the alignment
                  restrictions needed for aligned vector loads and stores. They set the
                  alignment attribute and reimplement the \c new and \c delete operators,
                  returning correctly aligned pointers to the heap.
\li Vc::malloc and Vc::free are meant as replacements for \c malloc and \c free. They can be used
                  to allocate any type of memory with an abstract alignment restriction: \ref
                  Vc::MallocAlignment. Note that (like \c malloc) the memory is only allocated
                  and not initialized. If you allocate memory for a type that has a constructor,
                  use the placement new syntax to initialize the memory.
\li Vc::Allocator is an STL compatible allocator class that behaves as specified in the C++
                  specification, implementing the optional support for over-aligned types.
                  Therefore, memory addresses returned from this allocator will always be
                  aligned to at least the constraints attached to the type \c T. STL containers
                  will already default to Vc::Allocator for Vc::Vector<T>. For all other
                  composite types you want to use, you can take the \ref Vc_DECLARE_ALLOCATOR
                  convenience macro to set it as the default.
\li Vc::Memory, Vc::Memory<V, Size, 0u>, Vc::Memory<V, 0u, 0u>
                  The three different variants of the memory class can be used like a more
                  convenient C-array. It supports two-dimensional statically sized arrays and
                  one-dimensional statically and dynamically sized arrays. The memory can be
                  accessed easily via aligned vectors, but also via unaligned vectors or
                  gathers/scatters.
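
The "allocate first, then initialize with placement new" pattern mentioned for Vc::malloc can be
sketched with the standard library alone. The helpers below are hypothetical stand-ins and not
how %Vc implements its allocation functions; they only illustrate why over-aligned heap memory
needs extra care and why the returned memory still has to be constructed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical helper, NOT the Vc API: over-allocate with std::malloc, round the
// address up to `alignment` (a power of two, at least sizeof(void *)), and store
// the original pointer directly in front of the aligned block.
inline void *aligned_malloc_sketch(std::size_t bytes, std::size_t alignment)
{
    assert(alignment >= sizeof(void *) && (alignment & (alignment - 1)) == 0);
    void *raw = std::malloc(bytes + alignment + sizeof(void *));
    if (raw == nullptr) {
        return nullptr;
    }
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void *);
    addr = (addr + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);
    void **aligned = reinterpret_cast<void **>(addr);
    aligned[-1] = raw; // remembered for aligned_free_sketch
    return aligned;
}

// Frees memory obtained from aligned_malloc_sketch.
inline void aligned_free_sketch(void *p)
{
    if (p != nullptr) {
        std::free(static_cast<void **>(p)[-1]);
    }
}

// The memory is uninitialized, so objects are created with placement new.
template <typename T> T *construct_n(void *memory, std::size_t n)
{
    T *first = static_cast<T *>(memory);
    for (std::size_t i = 0; i < n; ++i) {
        new (first + i) T();
    }
    return first;
}

// Example type with a constructor.
struct Particle {
    Particle() : x(0.f), y(0.f), z(0.f) {}
    float x, y, z;
};
```

Types with non-trivial destructors additionally need an explicit destructor call for each
object before the memory is freed.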



\page portability Portability Issues

One of the major goals of %Vc is to ease development of portable code, while achieving the highest
possible performance, which requires target-architecture-specific instructions. This is possible
through having just a single type use different implementations of the same API depending on the
target architecture. Many of the details of the target architecture are often dependent on the
compiler flags that were used. Also there can be subtle differences between the implementations
that could lead to problems. This page aims to document all issues you might need to know about.

\par Compiler Flags

\li \e GCC: The compiler should be called with the `-march=\<target\>` flag. Take a look at the GCC
manpage to find all possibilities for `\<target\>`.
If no SIMD instructions are enabled via compiler flags, %Vc must fall back
to the scalar implementation.
\li \e Clang: The same as for GCC applies.
\li \e ICC: The same as for GCC applies (at least on Linux).
\li \e MSVC: The compiler supports, among others, the `/arch:AVX`, `/arch:AVX2` and `/arch:AVX512` flags.
Without such a flag, at least SSE2 is enabled.

\par Where does the final executable run?

You must be aware of the fact that a binary that is built for a given SIMD hardware may not run
on a processor that does not have these instructions. The executable will work fine as long as no
such instruction is actually executed and only crash at the place where such an instruction is
used. Thus it is better to check at application start whether the compiled in SIMD hardware is
really supported on the executing CPU. This can be determined with the
currentImplementationSupported function.

If you want to distribute a binary that runs correctly on many different systems you either must
restrict it to the least common denominator (which often is SSE2), or you must compile the code
several times, with the different target architecture compiler options. A simple way to combine
the resulting executables would be via a wrapping script/executable that determines the correct
executable to use. A more sophisticated option is the use of the ifunc attribute GCC provides.
Other compilers might provide similar functionality.

\par Guarantees

It is guaranteed that:
\li \code int_v::Size == uint_v::Size == float_v::Size \endcode
\li \code short_v::Size == ushort_v::Size \endcode

\par Important Differences between Implementations

\li Obviously the number of entries in a vector depends on the target architecture.
\li Hardware that does not support 16-Bit integer vectors can implement the short_v and ushort_v
API via 32-Bit integer vectors. Thus, some of the overflow behavior might be slightly different,
and truncation will only happen when the vector is stored to memory.
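
The overflow difference from the last item can be illustrated with plain integers. The sketch
below models a 16-bit entry once as a native 16-bit value and once as a value carried in a
32-bit lane; it is an illustration of the effect only and makes no claim about which %Vc
implementation behaves which way.

```cpp
#include <cassert>
#include <cstdint>

// Native 16-bit lane: the sum wraps around immediately.
inline bool exceeds_native16(std::uint16_t a, std::uint16_t b, std::uint16_t limit)
{
    const std::uint16_t sum = static_cast<std::uint16_t>(a + b); // truncated here
    return sum > limit;
}

// 16-bit value carried in a 32-bit lane: no truncation until the store.
inline bool exceeds_widened(std::uint16_t a, std::uint16_t b, std::uint16_t limit)
{
    const std::uint32_t sum = std::uint32_t(a) + std::uint32_t(b); // keeps bit 16
    return sum > limit;
}

// Storing to memory truncates, so both variants write the same value.
inline std::uint16_t store16(std::uint32_t lane)
{
    return static_cast<std::uint16_t>(lane);
}
```

An intermediate comparison can therefore give different answers, while the value that ends up
in memory is the same.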

\section portability_compilerquirks Compiler Quirks

Since SIMD is not part of the C/C++ language standards %Vc abstracts more or less standardized
compiler extensions. Sadly, not every issue can be transparently abstracted.
Therefore this will be the place where differences are documented:
\li MSVC is incapable of parameter passing by value, if the type has alignment restrictions. The
consequence is that all %Vc vector types and any type derived from Vc::VectorAlignedBase cannot be
used as function parameters, unless a pointer is used (this includes reference and
const-reference). So \code
void foo(Vc::float_v) {}\endcode does not compile, while \code
void foo(Vc::float_v &) {}
void foo(const Vc::float_v &) {}
void foo(Vc::float_v *) {}
\endcode all work.
Normally you should prefer passing by value since a sane compiler will then pass the data in a
register and does not have to store/load the data to/from the stack. %Vc defines \c
Vc_PASSING_VECTOR_BY_VALUE_IS_BROKEN for such cases. Also the %Vc vector types contain a composite
typedef \c AsArg which resolves to either const-ref or const-by-value. Thus, you can always use
\code void foo(Vc::float_v::AsArg) {}\endcode.
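
How such an argument typedef can be built is sketched below. This is not the %Vc
implementation: `vec_sketch`, `vec_arg`, and the macro `SKETCH_PASSING_BY_VALUE_IS_BROKEN` are
hypothetical names that only mirror the roles of the vector type, \c AsArg, and
\c Vc_PASSING_VECTOR_BY_VALUE_IS_BROKEN.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for an over-aligned vector type; NOT a Vc type.
struct alignas(16) vec_sketch {
    float lane[4];
};

// Hypothetical switch mirroring the role of Vc_PASSING_VECTOR_BY_VALUE_IS_BROKEN.
#ifdef SKETCH_PASSING_BY_VALUE_IS_BROKEN
constexpr bool by_value_is_broken = true;
#else
constexpr bool by_value_is_broken = false;
#endif

// Resolves to const-ref where by-value passing is broken, else to const-by-value.
using vec_arg = std::conditional<by_value_is_broken, const vec_sketch &,
                                 const vec_sketch>::type;

// The function signature is written once and is valid in both configurations.
inline float first_lane(vec_arg v) { return v.lane[0]; }
```

Callers do not change either way; only the parameter passing convention does.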


\page vcmacros Pre-defined Macros

The %Vc library defines a few macros that you may rely on in your code:

\section vc_impl Implementation Identification

One or more of the following macros will be defined:
\li \ref Vc_IMPL_Scalar
\li \ref Vc_IMPL_SSE
\li \ref Vc_IMPL_SSE2
\li \ref Vc_IMPL_SSE3
\li \ref Vc_IMPL_SSSE3
\li \ref Vc_IMPL_SSE4_1
\li \ref Vc_IMPL_SSE4_2
\li \ref Vc_IMPL_AVX
\li \ref Vc_IMPL_AVX2

You can use these macros to enable target-specific implementations.
In general, it is better to rely on function overloading or template mechanisms, though.
By default, code compiled against the %Vc headers will use the instruction set that the compiler advertises as available.
For example, compiling with "g++ -mssse3" chooses the SSE implementation (Vc::VectorAbi::Sse) with instructions from SSE, SSE2, SSE3 and SSSE3.
After you include a %Vc header, the macros listed above are available; you can (but normally should not) use them to determine the implementation %Vc uses.

\section vc_size Vector/Mask Sizes

The macros \ref Vc_DOUBLE_V_SIZE, \ref Vc_FLOAT_V_SIZE, \ref Vc_INT_V_SIZE, \ref Vc_UINT_V_SIZE, \ref Vc_SHORT_V_SIZE, and \ref Vc_USHORT_V_SIZE make the default vector width accessible in the preprocessor.
In most cases you should prefer the Vector::size() function, though.
Since this function is \c constexpr you can use it for compile-time decisions (e.g. as template argument).

\section vc_compiler Compiler Identification (and related)

- \ref Vc_GCC
- \ref Vc_CLANG
- \ref Vc_APPLECLANG
- \ref Vc_ICC
- \ref Vc_MSVC
- \ref Vc_PASSING_VECTOR_BY_VALUE_IS_BROKEN

\section vc_version Version Macros

- \ref Vc_VERSION_STRING
- \ref Vc_VERSION_NUMBER
- \ref Vc_VERSION_CHECK

\section vc_boilerplate Boilerplate Code Generation

- \ref Vc_SIMDIZE_INTERFACE


\page featuremacros Feature Macros

You can define the following macros to enable/disable specific features:

\section set_vc_impl Vc_IMPL

If you want to force compilation against a specific implementation of the vector classes you can set the macro Vc_IMPL to either
\c Scalar, \c SSE, \c SSE2, \c SSE3, \c SSSE3, \c SSE4_1, \c SSE4_2, \c AVX, \c AVX2, or \c MIC.
Additionally, you may (should) append \c +XOP, \c +FMA4, \c +FMA, \c +SSE4a, \c +F16C, \c +BMI2, and/or \c +POPCNT.
For example, `-D Vc_IMPL=SSE+XOP+FMA4` tells the Vc library to use the best SSE instructions available for the target (according to the information provided by the compiler) and additionally use XOP and FMA4 instructions (this might be a good choice for some AMD processors, which support AVX but may perform slightly better if only SSE widths are used).
Setting \c Vc_IMPL to \c SSE forces the SSE instruction set, but lets the headers figure out the exact SSE revision to use, or, if that fails, uses SSE4.1.

If you do not specify \c Vc_IMPL the %Vc headers determine the implementation from compiler-specific pre-defined macros (which in turn are determined from compiler flags that determine the target micro-architecture, such as \c -mavx2).
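
The kind of compiler-specific pre-defined macros meant here can be inspected directly. The
sketch below uses the macros GCC and Clang define for x86 targets (`__SSE2__`, `__AVX__`,
`__AVX2__`, ...); the function name is made up, and the selection logic is a simplified
illustration, not the logic of the %Vc headers.

```cpp
#include <cassert>

// Maps the macros GCC and Clang pre-define for the enabled instruction sets to
// a name. The order matters: a compiler targeting AVX2 also defines __AVX__
// and __SSE2__, so the most capable instruction set is tested first.
inline const char *detected_simd_level()
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4_2";
#elif defined(__SSE4_1__)
    return "SSE4_1";
#elif defined(__SSSE3__)
    return "SSSE3";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "Scalar";
#endif
}
```

Compiling the same file with different target flags changes the returned name, because the
decision is made entirely at compile time.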

\section Vc_NO_STD_FUNCTIONS

If this macro is defined, the %Vc math functions are not imported into the \c std namespace.
They are still available in the %Vc namespace and through [ADL](http://en.cppreference.com/w/cpp/language/adl).

\section Vc_NO_VERSION_CHECK

Define this macro to disable the safety check for the libVc version.
The check adds a small test to every object file, which runs at startup, i.e. before
the main function.

\section Vc_CHECK_ALIGNMENT

If this macro is defined %Vc will assert correct alignment for all objects that require correct alignment.
This can be very useful to debug crashes resulting from misaligned memory accesses.
This check will introduce a significant overhead.

\section Vc_ENABLE_FLOAT_BIT_OPERATORS

Define this macro to enable bitwise operators (&, |, ^, ~) on floating-point vectors. Since these
operators are not provided for the builtin floating-point types, the default is to not provide
them for SIMD vector types either.
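
What a bitwise operator on floating-point data does can be shown on a single scalar. The sketch
below assumes a 32-bit IEEE 754 \c float; the function names are made up, and the code does not
use %Vc.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Applies a bitwise AND to the representation of a float. std::memcpy is the
// portable way to reinterpret the bits without violating aliasing rules.
inline float bit_and(float value, std::uint32_t mask)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    bits &= mask;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Clearing the sign bit (bit 31 in IEEE 754 binary32) yields the absolute value.
inline float abs_via_bits(float value)
{
    return bit_and(value, 0x7fffffffu);
}
```

Tricks like this are the typical use case for bitwise operators on floating-point vectors.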



\page buildsystem Build System

%Vc uses CMake as its buildsystem.
It also makes much of its own CMake logic available to other projects that use CMake and %Vc.
Here's an (incomplete) list of features you can get from the CMake scripts provided with %Vc:
\li check for a required %Vc version
\li locate libVc and %Vc includes
\li compiler flags to work around %Vc related quirks/bugs in specific compilers
\li compiler flags to enable/disable SIMD instruction sets (defaults to full support for the host system)

\section buildsystem_variables CMake Variables

To make use of these features simply copy the FindVc.cmake as installed by %Vc to your project.
Add \code find_package(Vc [version] [REQUIRED]) \endcode to your CMakeLists.txt. After that you
can use the following variables:
\li \e Vc_FOUND: tells whether the package was found
\li \e Vc_INCLUDE_DIR: you must add this to your include directories for the targets that you
want to compile against %Vc: \code include_directories(${Vc_INCLUDE_DIR}) \endcode
\li \e Vc_DEFINITIONS: recommended preprocessor definitions. You can use them via \c add_definitions.
\li \e Vc_COMPILE_FLAGS: recommended compiler flags. You can use them via \c add_compile_options.
\li \e Vc_ARCHITECTURE_FLAGS: recommended compiler flags for a selected target
microarchitecture. You can use them via the \c COMPILE_OPTIONS property or via \c
add_compile_options.
\li \e Vc_ALL_FLAGS: a list combining the above three variables. Use it to conveniently
set all required compiler flags in one place (e.g. via \c add_compile_options).

The following variables might be of interest, too:
\li \e Vc_SSE_INTRINSICS_BROKEN
\li \e Vc_AVX_INTRINSICS_BROKEN
\li \e Vc_XOP_INTRINSICS_BROKEN
\li \e Vc_FMA4_INTRINSICS_BROKEN

\section buildsystem_macros CMake Macros

The macro vc_compile_for_all_implementations is provided to help with compiling a given source
file multiple times with all different possible SIMD targets for the given architecture.
Example:
\verbatim
vc_compile_for_all_implementations(objs src/trigonometric.cpp
  FLAGS -DSOME_FLAG
  EXCLUDE Scalar SSE2)
\endverbatim
You can specify an arbitrary number of additional compiler flags after the FLAGS argument. These
flags will be used for all compiler calls. After an optional EXCLUDE argument you can specify targets
that you want to exclude. After an optional ONLY argument you can specify targets that you want
to compile for. (So either you exclude some, or you explicitly list the targets you want.)

Often it suffices to have SSE2 or SSE3 as the least common denominator and provide SSE4_1 and
AVX. Here is the currently complete list of possible targets the macro will compile for:
\li Scalar
\li SSE2
\li SSE3
\li SSSE3
\li SSE4_1
\li SSE4_2
\li SSE3+SSE4a
\li SSE+XOP+FMA4
\li AVX
\li AVX+XOP+FMA4
\li AVX2+FMA+BMI2

\section buildsystem_other Using Vc without CMake

If your project does not use CMake all you need to do is the following:
\li Find the header file "Vc/Vc" and add its path to your include paths.
\li Find the library libVc and link to it.
\li Ensure you use the right compiler flags to enable the relevant SIMD instructions.


\defgroup Vectors Vectors
\defgroup Masks Masks
\defgroup SimdArray SIMD Array
\defgroup Simdize simdize<T>
\defgroup Math Math
\defgroup Utilities Utilities
\defgroup Containers Containers

\addtogroup Vectors

The vector classes abstract the SIMD registers and their corresponding instructions into types that
feel very familiar to C++ developers.

Note that the documented types Vc::float_v, Vc::double_v, Vc::int_v, Vc::uint_v, Vc::short_v,
and Vc::ushort_v are actually \c typedefs of the \c Vc::Vector<T> class:
\code
namespace Vc {
  template<typename T> class Vector;
  typedef Vector<double> double_v;
  typedef Vector<float> float_v;
  // ...
}
\endcode

\par Some general information on using the vector classes:

Generally you can always mix scalar values with vectors as %Vc will automatically broadcast the
scalar to a vector and then execute a vector operation. But, in order to ensure that implicit
type conversions only happen as defined by the C standard, there is only a very strict implicit
scalar to vector constructor:
\code
int_v a = 1;     // good:             int_v(int)
uint_v b = 1u;   // good:             uint_v(unsigned int)
uint_v c = 1;    // does not compile: uint_v(int)
float_v d = 1;   // does not compile: float_v(int)
float_v e = 1.;  // does not compile: float_v(double)
float_v f = 1.f; // good:             float_v(float)
\endcode
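
One way to obtain such a strict constructor is a constructor template that only accepts an
exact type match. The class `strict_pack` below is a hypothetical single-value stand-in that
reproduces the outcomes listed above; it is not the %Vc implementation, whose actual rules may
be more fine-grained.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in, NOT Vc's implementation: the broadcast constructor is a
// template that only participates in overload resolution for an exact type match,
// so the usual implicit arithmetic conversions never apply.
template <typename T> class strict_pack
{
public:
    template <typename U,
              typename = typename std::enable_if<std::is_same<U, T>::value>::type>
    strict_pack(U scalar) : value_(scalar)
    {
    }

    T first() const { return value_; }

private:
    T value_; // a real vector would broadcast the scalar to every entry
};

// The same outcomes as in the list above, checked at compile time.
static_assert(std::is_convertible<int, strict_pack<int>>::value, "int to int");
static_assert(!std::is_convertible<int, strict_pack<unsigned>>::value, "int to unsigned");
static_assert(!std::is_convertible<int, strict_pack<float>>::value, "int to float");
static_assert(!std::is_convertible<double, strict_pack<float>>::value, "double to float");
static_assert(std::is_convertible<float, strict_pack<float>>::value, "float to float");
```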

The following ways of initializing a vector are not allowed:
\code
int_v v(3, 2, 8, 0); // constructor does not exist because it is not portable
int_v v;
v[0] = 3; v[1] = 2; v[2] = 8; v[3] = 0; // do not hardcode the number of entries!
// You cannot know whether somebody will compile with the Vc Scalar implementation, where int_v::Size == 1
\endcode

Instead, if really necessary you can do:
\code
Vc::int_v v;
for (int i = 0; i < int_v::Size; ++i) {
  v[i] = f(i);
}
// which is equivalent to:
v.fill(f);
// or:
v = int_v::IndexesFromZero().apply(f);
\endcode



\addtogroup Masks

Mask classes are abstractions for the results of vector comparisons. The actual implementation
differs depending on the SIMD instruction set. On SSE they contain a full 128-bit datatype while
on a different architecture they might be bit-fields.
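
Conceptually, a mask holds one boolean per vector entry and is typically used to restrict an
assignment to the selected entries. The sketch below models this with `std::array`; the types
and functions are hypothetical stand-ins with an assumed width of four and say nothing about
the actual representation inside %Vc.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t sketch_width = 4; // hypothetical vector width
using pack_sketch = std::array<float, sketch_width>;
using mask_sketch = std::array<bool, sketch_width>;

// A comparison yields one boolean per entry instead of a single bool.
inline mask_sketch less_than(const pack_sketch &a, float threshold)
{
    mask_sketch m{};
    for (std::size_t i = 0; i < sketch_width; ++i) {
        m[i] = a[i] < threshold;
    }
    return m;
}

// Masked assignment: only entries whose mask element is set are overwritten.
inline void assign_where(pack_sketch &a, const mask_sketch &m, float value)
{
    for (std::size_t i = 0; i < sketch_width; ++i) {
        if (m[i]) {
            a[i] = value;
        }
    }
}
```

Combining both functions replaces a per-element `if`: for instance, all negative entries of a
vector can be set to zero without branching on individual values.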


\addtogroup Utilities

Additional classes, macros, and functions that help to work more easily with the main vector
types.


\addtogroup Containers

For some problems, standard (or third-party) containers can be used.
Simply use a `value_type` of `Vc::Vector<T>`.
However, this requires:
\li You actually have control over the data structures and can design/modify them for easy
    vectorization usage.
\li The access patterns are non-random, because random access to individual `value_type`
    \em elements is going to be a pain (two subscripts: first into the container, then
    into the `Vc::Vector`).

Therefore, for some problems you need to work with containers over elements of
non-`Vector` type (e.g. of type `double` or a `struct`).
%Vc provides some help:
\li Vc::vector
\li Vc::array
\li Vc::span
\li Vc::Common::Memory (discouraged)
\li Vc::Common::InterleavedMemoryWrapper
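
The "two subscripts" problem of a standard container with a vector `value_type` looks as
follows. In this sketch `chunk_sketch` is a hypothetical stand-in for `Vc::Vector<double>` with
an assumed width of four; the helper functions are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t chunk_lanes = 4;                 // hypothetical vector width
using chunk_sketch = std::array<double, chunk_lanes>;  // stand-in for Vc::Vector<double>

// Random access to scalar element i needs two subscripts:
// first into the container, then into the vector-like chunk.
inline double &element(std::vector<chunk_sketch> &data, std::size_t i)
{
    return data[i / chunk_lanes][i % chunk_lanes];
}

// Number of chunks needed for n scalars; the last chunk may be padded.
inline std::size_t chunks_for(std::size_t n)
{
    return (n + chunk_lanes - 1) / chunk_lanes;
}
```

Containers over scalar elements avoid this index arithmetic in user code, which is the
motivation for the helpers listed above.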


\addtogroup Math

Functions that implement math functions. Take care that some of the implementations will return
results with less precision than what the FPU calculates.


\addtogroup SimdArray

This set of class templates and associated functions and operators enables
data-parallel algorithms and data structures requiring a user-defined number of elements
(fixed at compile time, in contrast to \c std::valarray where the number of elements is
 only determined at run time).
The main motivation for a user-defined number of elements is the need for type conversion
and thus a guaranteed equal number of elements in data-parallel vectors for e.g. \c float
and \c int.
A typical pattern looks like this:
\code
using floatv = Vc::float_v;
using doublev = Vc::SimdArray<double, floatv::size()>;
using intv = Vc::SimdArray<int, floatv::size()>;
using uintv = Vc::SimdArray<unsigned int, floatv::size()>;
\endcode

The second motivation for a user-defined number of elements is that many vertical
vectorizations require a fixed number of elements (i.e. number known at development time
and not chosen at compile time).
The implementation can then choose how to support this number most efficiently with the
available hardware resources.
Consider, for example, a need for processing 12 values in parallel.
On x86 with AVX, the implementation could build such a type from one AVX and one SSE register.
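
The decomposition in this example can be sketched as a greedy split over the available register
widths. The function below is purely illustrative: it assumes \c float entries (8 per AVX
register, 4 per SSE register) and is not the algorithm %Vc uses to build a Vc::SimdArray.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedily splits `count` entries into the largest available register widths.
// `widths` must be sorted in descending order and end with 1 (scalar fallback).
inline std::vector<std::size_t> split_into_registers(std::size_t count,
                                                     const std::vector<std::size_t> &widths)
{
    assert(!widths.empty() && widths.back() == 1);
    std::vector<std::size_t> parts;
    for (std::size_t w : widths) {
        while (count >= w) {
            parts.push_back(w);
            count -= w;
        }
    }
    return parts;
}
```

With the widths `{8, 4, 1}`, twelve entries split into one part of 8 and one part of 4, which
matches the "one AVX and one SSE register" description.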

In contrast to \c std::array the types behave like the Vc::Vector types, implementing the same
operators and functions.
The semantics with regard to implicit conversions differ slightly:
The Vc::Vector conversion rules are safer with regard to source compatibility.
The Vc::SimdArray conversion rules are less strict and could potentially lead to portability issues.
Therefore, it is best to stick to the pattern of type aliases shown above.
*/

/**
 * \brief Vector Classes Namespace
 *
 * All functions and types of %Vc are defined inside the %Vc namespace.
 *
 * \internal
 * Internal types and functions should be defined either in the Vc::Detail namespace or in
 * a `Vc::<Impl>` namespace such as Vc::SSE.
 */
namespace Vc
{
#define INDEX_TYPE Vc::SimdArray<int, size()>
#define VECTOR_TYPE Vc::Vector<T>
#define ENTRY_TYPE T
#define MASK_TYPE Vc::Mask<T>
#define EXPONENT_TYPE Vc::SimdArray<int, size()>

#include "dox-math.h"

#undef INDEX_TYPE
#undef VECTOR_TYPE
#undef ENTRY_TYPE
#undef MASK_TYPE
#undef EXPONENT_TYPE

/**
 * \name SIMD Support Feature Macros
 * \ingroup Utilities
 */
//@{
///\addtogroup Utilities
//@{
/**
 * This macro is defined if the current translation unit is compiled with XOP instruction support.
 */
#define Vc_IMPL_XOP
/**
 * This macro is defined if the current translation unit is compiled with FMA4 instruction support.
 */
#define Vc_IMPL_FMA4
/**
 * This macro is defined if the current translation unit is compiled with F16C instruction support.
 */
#define Vc_IMPL_F16C
/**
 * This macro is defined if the current translation unit is compiled with POPCNT instruction support.
 */
#define Vc_IMPL_POPCNT
/**
 * This macro is defined if the current translation unit is compiled with SSE4a instruction support.
 */
#define Vc_IMPL_SSE4a
/**
 * This macro is defined if the current translation unit is compiled without any SIMD support.
 */
#define Vc_IMPL_Scalar
/**
 * This macro is defined if the current translation unit is compiled with any version of SSE (but not
 * AVX).
 */
#define Vc_IMPL_SSE
/**
 * This macro is defined if the current translation unit is compiled with SSE2 instruction support
 * (excluding SSE3 and up).
 */
#define Vc_IMPL_SSE2
/**
 * This macro is defined if the current translation unit is compiled with SSE3 instruction support (excluding SSSE3 and up).
 */
#define Vc_IMPL_SSE3
/**
 * This macro is defined if the current translation unit is compiled with SSSE3 instruction support (excluding SSE4.1 and up).
 */
#define Vc_IMPL_SSSE3
/**
 * This macro is defined if the current translation unit is compiled with SSE4.1 instruction support (excluding SSE4.2 and up).
 */
#define Vc_IMPL_SSE4_1
/**
 * This macro is defined if the current translation unit is compiled with SSE4.2 instruction support (excluding AVX and up).
 */
#define Vc_IMPL_SSE4_2
/**
 * This macro is defined if the current translation unit is compiled with AVX instruction support (excluding AVX2 and up).
 */
#define Vc_IMPL_AVX
/**
 * This macro is defined if the current translation unit is compiled with AVX2 instruction support.
 */
#define Vc_IMPL_AVX2
//@}
//@}

/**
 * \name SIMD Vector Size Macros
 * \ingroup Utilities
 */
//@{
/**
 * \ingroup Utilities
 * An integer (for use with the preprocessor) that gives the number of entries in a double_v.
 */
#define Vc_DOUBLE_V_SIZE
/**
 * \ingroup Utilities
 * An integer (for use with the preprocessor) that gives the number of entries in a float_v.
 */
#define Vc_FLOAT_V_SIZE
/**
 * \ingroup Utilities
 * An integer (for use with the preprocessor) that gives the number of entries in an int_v.
 */
#define Vc_INT_V_SIZE
/**
 * \ingroup Utilities
 * An integer (for use with the preprocessor) that gives the number of entries in a uint_v.
 */
#define Vc_UINT_V_SIZE
/**
 * \ingroup Utilities
 * An integer (for use with the preprocessor) that gives the number of entries in a short_v.
 */
#define Vc_SHORT_V_SIZE
/**
 * \ingroup Utilities
 * An integer (for use with the preprocessor) that gives the number of entries in a ushort_v.
 */
#define Vc_USHORT_V_SIZE
//@}

} // namespace Vc

// vim: ft=doxygen