1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375
|
.. _model:
Model
=====
The data model design is key in a Lager application. The good news
is: changing the design and shape of the data-model in a
value-oriented world tends to be easy. However, for this to be the
case, it is important that we follow some principles strictly.
.. _value-semantics:
.. _independence-principle:
Value semantics
---------------
In the previous section we emphasised that the data-model **must be a
value type**. But what is a `value type`_ really?
In a purely `functional programming`_ language, like Haskell,
everything you name is a *value*. This is, all entities are like
mathematical objects: they are immutable and have some definition of
equivalence.
In a multi-paradigm language like C++, you have both *value types* and
*reference types*. Intances of value types define its meaning through
the values that they represent, that is why we say, they have *value
semantics*. Each instance of a value type has its own copy of a
representation of a value, on which we can operate independently.
Most of the time, we do not care about the identity of an instance,
but about the content, the value, which can be *equal* between
instances. *Reference types*, however, reference other mutable
objects. The identity of these objects is crucial. We can observe
changes to the referred objects through the references. This means
that we have to be particularly careful when dealing with reference
types.
The canonical reference types in C++ are references (``&``) and
pointers (``*``). There are many other reference types. For example,
iterators of a container are references that point into the
container. If we change the container, we can observe the changes
through the iterator. Sometimes even, the iterator may become invalid.
In C++, instances of value types are not necessarily immutable, only
the *value* they represent is. For example, I can mutate a
`std::vector`_ through its ``push_back()`` method. Or I can modify a
``struct`` by assigning to its members. Instead, *they are value
types because each instance of the type is independent and is a
attributed meaning through the values it represent.* This **principle
of independence** is highlighted by this code snippet:
.. code-block:: c++
auto a = std::vector<int>{};
auto b = a;
foo(b);
assert(a.empty());
Here we have an instance of an empty ``std::vector`` named ``a`` and
copy it into another instance ``b``. Then we call ``foo()`` and
finally make an assertion about the state of ``a``. We do not need to
know anything about what ``foo()`` does to reason about the state of
``a``. It could have taken ``b`` by reference and changed it, maybe
even from another thread---it does not matter. In any case, we can be
sure that our assertion about ``a`` holds, because ``b`` is an
independent copy of it, so we can not observe any change in ``b``
through ``a``. This frees our mind, allowing us to reason locally
about the state of the program. This is fundamental to write correct
concurrent programs: we can make sure each thread works on their local
copy of the data to ensure the absence of race conditions.
.. _value type: https://en.wikipedia.org/wiki/Value_type_and_reference_type
.. _std::vector: https://en.cppreference.com/w/cpp/container/vector
.. _functional programming: https://en.wikipedia.org/wiki/Functional_programming
.. admonition:: Spotting reference types
In order to avoid them, it is important to be able to clearly
identify *reference types*. They are sneaky. For example,
``std::vector<T>`` is a value type only as long as ``T`` is also a
value type. Same applies to ``struct``, it is not a value as soon
as a member is not.
Also, there are some C++ types that behave a bit like values, they
seem to be copyable at least---but their copies reference external
entities that mutate, and these changes can be observed through all
these shallow copies! Here are some examples of such types that
shall be avoided in our data-model:
* A **pointer**. A pointer has a value, which is the memory location
of an object. But we can also use this pointer value to change
and observe changes in this memory location.
* An **iterator** into a container, as mentioned above.
* A **file handle**. The file changes independently and we can use
the handle to observe or make changes. The file can even change
outside of our program, completely outside of our control.
* Most **resource handles** actually. Ids of OpenGL buffers, Posix
process handles, etc. Lots of these things look like an ``int``,
but they are not, they are a reference to a mutable entity.
* **Function objects** that capture references, particularly lambdas
with ``&``-captures.
**Are pointers always bad?**
Not always. For instance, the *implementation* of ``std::vector``
uses pointers internally. But it defines ``operator=`` to ensure
the independence principle. Also, Sean Parent shows in his famous
talk `"Inheritance is the base class of evil"`_ that given a value
type ``T``, a forest of ``std::shared_ptr<const T>`` is
effectively a value type. However, the ``immer::box`` type from
the Immer_ library serves the same purpose with safer and more
convenient interface. In the :ref:`performance` section we will
cover the power immutability in more detail.
In general, the application level data model must never use
pointers. It can however use generic container types that
encapsulate the pointers and properly deal with the intricacies of
defining ``operator=`` as to ensure the independence principle.
.. _"Inheritance is the base class of evil": https://www.youtube.com/watch?v=bIhUE5uUFOA
.. _immer: https://github.com/arximboldi/immer
.. admonition:: Further reading
:class: note
Value-semantic thinking is a vast topic. Here are some resources
that can help you adopt this design mindset:
* The book `Elements of Programming`_, by `Alexander Stepanov`_, author of
the STL and great thinker on the intersection between programming
and math.
* The talk `Better Code: Data Structures`_ by `Sean Parent`_,
author of the Adobe Source Libraries, famous for popularizing
`type erasure`_ for value based runtime polymorphism.
* The talk `The most valuable values`_, by `Juan Pedro Bolívar
Puente`_, author of this library.
In the functional programming realm these ideas are taken much
further:
* The talk `The value of values`_, and the supporting essay `Values
and Change`_, by `Rich Hickey`_, author of the programming
language Clojure.
* The talk `Denotational Design: from meanings to programs`_, where
`Conal Elliott`_, inventor of `functional reactive programming`_
among many other other things, talks about applying mathematical
thinking to the design of programs.
* The book `Functional Programming in C++`_ by `Ivan Čukić`_, which
shows how C++ not only supports value semantics, but the
functional programming paradigm as a whole.
.. _elements of programming: http://elementsofprogramming.com
.. _alexander stepanov: https://en.wikipedia.org/wiki/Alexander_Stepanov
.. _better code\: data structures: https://www.youtube.com/watch?v=sWgDk-o-6ZE
.. _sean parent: https://sean-parent.stlab.cc/
.. _the most valuable values: https://www.youtube.com/watch?v=_oBx_NbLghY
.. _the value of values: https://www.youtube.com/watch?v=-6BsiVyC1kM
.. _values and change: https://clojure.org/about/state
.. _juan pedro bolívar puente: http://sinusoid.al
.. _denotational design\: from meanings to programs: https://www.youtube.com/watch?v=bmKYiUOEo2A
.. _functional programming in c++: https://www.manning.com/books/functional-programming-in-c-plus-plus
.. _Ivan Čukić: https://cukic.co/
.. _conal elliott: http://conal.net/
.. _functional reactive programming: https://en.wikipedia.org/wiki/Functional_reactive_programming
.. _type erasure: https://www.youtube.com/watch?v=QGcVXgEVMJg
.. _rich hickey: https://twitter.com/richhickey
.. _identity:
Identity
--------
.. image:: _static/identity.png
:align: center
When writing the model as value types, we soon encounter the problem
of dealing with **identity**. Consider an interactive application
shows a moving person. This person *changes*, it moves around. Our
model would be a *snapshot* of the *state* of this person. But
clearly, the *state* of the *person* is different than the person
itself:
* The **same** person can be in different states, this is, these state
values
are ``!=``.
* Two **different** people can be in the same state, this is, their
state values are ``==``.
In Object Oriented programming, we normally *identify* a language
object with the entity it represents, in this case, the person. The
*identity* of the thing is the memory location of the storage for its
state. This means that we need to use mutation to deal with the state,
that there is only one state per entity at time, that time only
progresses in one direction, that change is an implicit construction,
that entities can not be dealt with concurrently. Identity becomes an
implicit and flaky construction.
However, in real life, we deal with identity in an explicit way. That is
why people have *names* or *passport numbers*. These are special
values, **identity values**, that help us identify people. Identity as
such serves a double purpose, solving the forementioned state/identity
problems:
* *Identity* allows us to recognise *different states* as belonging to
*the same entity*. For example, when you show up at different
institutional offices, you show your id card to show that these
are the same person.
* *Identity* allows us to *differentiate* to a specific entity, that
might have otherwise similar states. In a room full of people, you
can call someone by their full name to refer to a particular person
univocally, distinguishing them from the group.
Considering this duality, when your program deals with changing
entities, you will have to think about the domain of entities as a
whole, and give each of those entities a different, explicit, identity
value. `Universally Unique Identifiers`_ are a powerful tool to
identify entities not only in the running programm, but also across
files and machines. Often though, context will allow us to have more
lightweight identity values. In some cases, even its index in a
vector might suffice.
.. _universally unique identifiers: https://en.wikipedia.org/wiki/Universally_unique_identifier
.. admonition:: References in a value world
Consider this data-structure which implements agenda of people with
friends in a referential way:
.. code-block:: c++
struct person
{
std::string name;
std::string phone_number;
std::vector<std::weak_ptr<person>> friends;
};
struct agenda
{
std::unordered_set<std::shared_ptr<person>> people;
};
This is not a valid model to use in Lager, because ``person`` and
``agenda`` are reference semantic types. Not only is the
identification of memory objects with entities problematic from a
conceptual programming sense: there is some extra friction, like
having to allocate each person in a separate memory block (instead
of having a flat ``std::vector``), and then dealing with the
lifetime of those blocks with ``shared_ptr``, ``weak_ptr``, and so
on.
How do we do references with out pointers then? We use
explicit identity values:
.. code-block:: c++
:emphasize-lines: 1,7,12
using person_id = std::string;
struct person
{
std::string name;
std::string phone_number;
std::vector<person_id> friends;
};
struct agenda
{
std::unordered_map<person_id, person> people;
};
Now we have decoupled the *identity* (``person_id``) from the
*state* (``person``). Whenever we want to know the state for a
given person, we can access it through the ``people`` map in the
agenda. Whenever we want to refer to a person, like in the list of
friends, we use a ``person_id``. We can now have multiple copies
of the whole agenda, and compare how a particular person changes as
we manipulate it. People are not tied to their representation in
memory anymore, so we can be more playful with the data-structures
and apply :ref:`data oriented design <data-oriented-design>` to
reach better cache locality and and overall performance!
.. _normalization:
Normalization
-------------
After applying the principle of explicit identity to your program, you
might realise this insight: *the data-model of the application starts
to look like a data-base!*
And you are correct: the model of our application is an in-memory
data-base, and the Lager store, combined with reducers and actions,
provide a reproducible, logic aware, `event sourced`_ data-store.
This means that you can start applying data-base design wisdom to your
application, which has been accrued over decades by academics and
practitioners. In particular, you may find interesting the notion of
`database normalization`_, both the Redux documentation and the
Data-Oriented Design book do indeed talk about it:
* The `Normalizing State Shape`_ section from the Redux documentation
discusses normalization in the context the unidirectional
data-flow architecture.
* The `Relational Databases`_ section, from the *Data-Oriented Design*
book by Richard Fabian, discusses normalization of the in memory
model of C++ programs, with special focus on performance.
.. _event sourced: https://martinfowler.com/eaaDev/EventSourcing.html
.. _database normalization: https://en.wikipedia.org/wiki/Database_normalization
.. _normalizing state shape: https://redux.js.org/recipes/structuring-reducers/normalizing-state-shape
.. _relational databases: http://www.dataorienteddesign.com/dodbook/node3.html
.. _performance:
Performance
-----------
One strong concern when applying value-semantics for the data-model of
big applications is performance. We encourage passing by value around
freely, and storing copies of the model values as needed without much
concern. This often rises skepticism from experienced developers with
an eye for optimization. For big data-models, isn't that gonna be
slow, and even explode the memory usage? Not necessarily.
In C++, we normally associate value-semantics, and in particular the
:ref:`independence principle<independence-principle>`, with deep
copying. For standard containers like ``std::vector``, it is the case
that whenever we pass by value, a new memory object is created where
the whole representation of the value is copied into. This is however
not a consequence of value-semantics, but a consequence of mutability!
If the object that stores the value can mutate arbitrarily, when
passed by value, all of its contents must be copied to ensure that the
new object does not change when the original changes. However, if the
original object is in some way **immutable**, the immutable parts of
the representation can be internally shared accross all the "copies"
of the value. This property is called *structural sharing* and makes
value-semantics very efficient!
In the field of `persistent data-structures`_ we can find many
examples of containers designed with the *structural sharing* in mind.
Today, we have good implementations of some of these data-structures
in C++:
* Immer_, **immutable data-structures for C++**.
* `Postmodern immutable data-structures`_, CppCon'18 talk about Immer.
* `Persistence for the masses`_, ICFP'17 paper on immutable
data-structures in C++.
.. _immer: https://github.com/arximboldi/immer
.. _postmodern immutable data-structures: https://www.youtube.com/watch?v=sPhpelUfu8Q
.. _persistence for the masses: https://public.sinusoid.es/misc/immer/immer-icfp17.pdf
.. _persistent data-structures: https://en.wikipedia.org/wiki/Persistent_data_structure
Also, in modern C++ one may often avoid copies altogether by
leveraging `copy ellision`_ and `move semantics`_. It is important to
familiarize oneself with these concepts.
.. _copy ellision: https://en.cppreference.com/w/cpp/language/copy_elision
.. _move semantics: https://stackoverflow.com/questions/3106110/what-is-move-semantics
In practice, when combining value-oriented design with immutable
data-structures, you will find that not only is performance not a
problem, but that **your programs are faster**! This is due to the
fact that our data-model becomes more compact, with less pointer
chasing and better cache locality, and with flatter call stacks and no
traversal of forests of listeners, signals and slots.
|