See also PERFORMANCE in the libmypaint repository
for performance notes on the brush engine side.
INPROGRESS: Multithreaded compositing
=====================================
https://gitorious.org/mypaint/jonnors-clone/commits/compositing-mt
Replaces the manual iteration over tiles in Python, calling per-tile functions to composite the document,
with an entry point into C++ which takes a set of tiles and layers and does the compositing
in a multi-threaded way. This allows compositing to scale with the number of processors.
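
A minimal sketch of what such an entry point could look like, assuming a fix15
pixel format and hypothetical composite_tile()/LayerTile names (none of these
are taken from the actual branch):

    #include <cstdint>
    #include <vector>

    static const int TILE_SIZE = 64;

    struct LayerTile {
        const uint16_t *src;   // 64x64 premultiplied RGBA, fix15 (1<<15 == 1.0)
        uint32_t opacity;      // layer opacity, also fix15
    };

    // Composite one destination tile: src-over for every layer, bottom to top.
    static void composite_tile(uint16_t *dst, const std::vector<LayerTile> &stack)
    {
        for (const LayerTile &lt : stack) {
            for (int i = 0; i < TILE_SIZE * TILE_SIZE * 4; i += 4) {
                uint32_t a = (uint32_t)lt.src[i + 3] * lt.opacity >> 15;
                for (int c = 0; c < 4; c++) {
                    uint32_t s = (uint32_t)lt.src[i + c] * lt.opacity >> 15;
                    dst[i + c] = s + ((uint32_t)dst[i + c] * ((1 << 15) - a) >> 15);
                }
            }
        }
    }

    // The C++ entry point: Python hands over all dirty tiles in one call.
    void composite_tiles(uint16_t **dst, const std::vector<LayerTile> *stacks, int n)
    {
        #pragma omp parallel for schedule(dynamic)   // tiles are independent
        for (int i = 0; i < n; i++)
            composite_tile(dst[i], stacks[i]);
    }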
FIXME/TODO
------------
* The paint_rotated test is currently failing
* During the gui tests, there is some corruption in the display
* PNG saving has a major performance regression. Move most of it to C++
* Check and ideally remove the last places where tiles are accessed individually.
Grep for "blit_tile_into", "get_tiles", "get_tile_numpy", "tile_request"
INPROGRESS: C/C++ tile store
============================
https://gitorious.org/mypaint/jonnors-clone/commits/tilestore-cpp
Benchmarks show that the major hotspot is now _get_tile_numpy() in tiledsurface.py.
To fully eliminate it, we need to stop calling up into Python and do everything needed for
brush rendering in C/C++.
However, a Python API like the current one should still exist, for compatibility and for things
that are not worth doing in C/C++ performance-wise.
Results so far show that this, combined with multithreaded compositing, gives a very good speedup
on a dual-core system (175%).
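
A rough sketch of what the store could look like on the C++ side, using a hash
map keyed on tile coordinates (all names here are illustrative, not taken from
the branch); the existing Python API would then become a thin wrapper over
get_tile():

    #include <cstdint>
    #include <unordered_map>

    static const int TILE_SIZE = 64;

    struct TileIndex {
        int x, y;
        bool operator==(const TileIndex &o) const { return x == o.x && y == o.y; }
    };

    struct TileIndexHash {
        size_t operator()(const TileIndex &t) const {
            return std::hash<int64_t>()(((int64_t)t.x << 32) ^ (uint32_t)t.y);
        }
    };

    // Owns the pixel buffers. Brush rendering reads and writes tiles through
    // this store, so the hot loop never calls back up into Python.
    class TileStore {
    public:
        // Return the tile buffer, allocating a blank tile on first access.
        uint16_t *get_tile(int x, int y) {
            auto it = tiles_.find(TileIndex{x, y});
            if (it == tiles_.end()) {
                uint16_t *buf = new uint16_t[TILE_SIZE * TILE_SIZE * 4]();  // zeroed
                it = tiles_.emplace(TileIndex{x, y}, buf).first;
            }
            return it->second;
        }
        ~TileStore() { for (auto &kv : tiles_) delete[] kv.second; }
    private:
        std::unordered_map<TileIndex, uint16_t *, TileIndexHash> tiles_;
    };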
FIXME/TODO
-----------
* Does not work for non-zero mipmap levels (zoom!=100%)
* Fix several failing gui tests
* Move the store into MyPaintTiledSurface, as a default implementation of a tile backend,
  and let it keep the buffers in the same map that is used to hold the OperationQueues
IDEA: Fully on-demand rendering
===============================
Instead of rendering strokes down to the surface in response to stroke_to(), and compositing the document
in response to draw_cb(), trigger it all from draw_cb().
This isolates all intense graphical computation into one area, giving a clearer target for optimization.
* Make mypaint_surface_end_atomic() just return the invalidated area, not actually compute it
* Introduce a surf_process_bla() or similar for processing. Should it take/return rectangles? Tiles?
* Call this on each layer/surface when rendering the document, to compute the tiles for the requested region
* Probably need an API to reset/clear invalidations on the surface
Ideally there will only be one OpenMP parallel-for section.
So maybe there should be a mypaint_tiled_surface_compute_tile()?
Ideally the runtime of one process_region() invocation would be bounded. How do we ensure a suitable time?
Limiting how many tiles are done in one go will probably go a long way.
We could also put a limit on the number of ops popped from the queue.
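
A hedged sketch of such a budgeted processing call; the QueuedTile type, the
budgets, and process_region() itself are assumptions for illustration:

    #include <deque>
    #include <vector>

    struct DrawDabOp { float x, y, radius; /* ... */ };

    struct QueuedTile {
        std::deque<DrawDabOp> ops;                 // pending operations
        void apply(const DrawDabOp &) { /* render one dab into this tile */ }
    };

    // Process pending ops for the tiles covering the requested region, but stop
    // once either budget runs out, so a single invocation stays bounded in time.
    // Returns true if work remains and the caller should schedule another round
    // (e.g. from an idle handler).
    bool process_region(std::vector<QueuedTile *> &tiles, int max_tiles, int max_ops)
    {
        int tiles_done = 0, ops_done = 0;
        bool more_work = false;
        for (QueuedTile *tile : tiles) {
            while (!tile->ops.empty() && tiles_done < max_tiles && ops_done < max_ops) {
                tile->apply(tile->ops.front());
                tile->ops.pop_front();
                ops_done++;
            }
            if (!tile->ops.empty())
                more_work = true;                  // budget ran out mid-tile
            else
                tiles_done++;
        }
        return more_work;
    }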
Problem: how to make dabs/strokes show up ordered by input time when the brush engine is overloaded?
Maybe the operation queue should have a sequence number per op, so that one can process the queues on
all the tiles up to a given "time" and use that as a synchronization point. From the user's POV we don't
need dab-level granularity though, just stroke_to() level.
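
One possible shape for that, assuming each op is stamped with a global sequence
number at enqueue time (illustrative, not an existing API):

    #include <cstdint>
    #include <deque>

    struct Op {
        uint64_t seq;      // global sequence number, stamped at enqueue time
        /* dab parameters ... */
    };

    // Process a tile's queue only up to (and including) sync_seq. Once this has
    // run on all tiles, everything up to that input time is realized, giving a
    // consistent display even while later ops are still pending.
    void process_until(std::deque<Op> &queue, uint64_t sync_seq)
    {
        while (!queue.empty() && queue.front().seq <= sync_seq) {
            // apply_op(queue.front());   // render the dab into the tile
            queue.pop_front();
        }
    }

For stroke_to()-level granularity, the sync point would simply be stamped once
per stroke_to() call rather than once per dab.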
IDEA: mipmapped rendering
==========================
Currently we always realize brush strokes on level 0 of the mipmap pyramid, even if the user
never sees the document at that detail level. This means that computation cost scales
with the size of the document area affected by a stroke. If one instead only realized what
is needed to display the current viewport, the work scales only with viewport size,
which is essentially constant.
Requires significant parts of the multithreaded compositing and on-demand rendering work to be done first.
We need to be able to store pending operations on the various mipmap levels.
In the first iterations, one could continue to:
* Always realize tiles on higher mipmap levels that are invalidated by changes on a lower level.
  The performance penalty is bounded to 50% due to the geometric series.
* Always calculate changes that are outside of the viewport on the same level.
  The performance penalty is currently bounded because no(?) user actions in MyPaint can cause changes outside the viewport.
  But the architecture should probably be such that it will just work even when this is turned off.
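
A sketch of that first-iteration strategy, where a dab queued on one level is
immediately also queued on all higher levels with halved coordinates and radius
(names are hypothetical):

    struct DabOp { float x, y, radius, opacity; };

    static const int MAX_MIPMAP_LEVEL = 4;   // example value

    // Queue a dab on level 0 and propagate it up the pyramid. Each level halves
    // x, y and radius, so a dab touches ~1/4 of the pixels of the level below;
    // the total extra work is 1/4 + 1/16 + ... (the geometric series above).
    void queue_dab_all_levels(DabOp op, void (*enqueue)(int level, const DabOp &))
    {
        for (int level = 0; level <= MAX_MIPMAP_LEVEL; level++) {
            enqueue(level, op);
            op.x /= 2; op.y /= 2; op.radius /= 2;
        }
    }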
Challenges:
* Performance when changing viewport. Need to realize a lot of things which were deferred fast.
* Implement progressive level-of-detail (LOD)
  When zooming in, one could first show a scaled version of the (realized) level one was currently on,
  then realize the changes on the lower level and transition between the two by blending them.
* Implement opportunistic background processing
  When the user is not doing much and there is spare computing power, always work to realize other parts
  of the mipmap pyramid. It is unclear what the priorities should be, so the prioritization code should
  be isolated and easy to change (see the sketch after this list). Things that must be done immediately
  could just have priority >= NOW_BLOCKING.
  Theoretically one could even prioritize predictively, based on user actions and/or aggregated statistics.
  A possible fitness test would be to minimize the time needed to realize the viewport the user moves to.
* Only proactively realize down to what the user has set as the normal/start resolution
  Realizing on a more detailed mipmap/zoom level than this could be an explicit operation. It could also
  be handled through the priority mechanism.
* New bottlenecks: performance of persistence to disk, and memory usage on large documents
  As performance/responsiveness of the brush/rendering engine is no longer a limiting factor for
  working on larger documents, other issues will become more obvious. At some point, regularly saving the
  document as raster data will become overly tedious (it's already bad), and keeping it all in memory will be
  challenging. Moving the canonical document format (on disk and in memory) to a journal could eliminate this.
  One would then treat the mipmap pyramid as a tile cache when persisted, and possibly the same when in memory.
* Keeping operation queues from growing without bounds
  At 64 px tile size, and ~50 bytes per DrawDab op, the queue can hold roughly 1000 ops before it reaches
  the size of the tile's pixel buffer itself. Depending on the brush used, this could be filled in seconds.
  Being able to instead refer to a revision in the document journal and calculate ops on demand would make
  this problem go away. Compacting the document journal is a different problem.
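
As mentioned under opportunistic background processing above, a sketch of how
the prioritization could be kept isolated; NOW_BLOCKING, the task shape, and
the policy itself are all assumptions:

    #include <queue>

    static const int NOW_BLOCKING = 1000;   // at/above this: run before returning to the user

    struct RealizeTask {
        int priority;
        int level, tx, ty;                  // which mipmap tile to realize
        bool operator<(const RealizeTask &o) const { return priority < o.priority; }
    };

    // The entire policy lives in one small function, so it is easy to replace,
    // e.g. with predictive prioritization based on aggregated statistics.
    static int compute_priority(int level, bool in_viewport)
    {
        if (in_viewport)
            return NOW_BLOCKING;            // needed for the current display
        return 100 - level;                 // otherwise prefer coarse levels, say
    }

    // Drain everything that blocks the display now; leave the rest for idle time.
    static void flush_blocking(std::priority_queue<RealizeTask> &work)
    {
        while (!work.empty() && work.top().priority >= NOW_BLOCKING) {
            // realize_tile(work.top().level, work.top().tx, work.top().ty);
            work.pop();
        }
    }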
Implementation
--------------
* Store operation queues on all levels in the mipmap pyramid, not just on 0
* When brush dabs are drawn, just store them in operation queue(s)
* Defer realization/processing of tiles until they are requested explicitly.
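
Put together, tile realization could then look roughly like this (queue_for(),
pixels_for() and apply_op() are hypothetical storage helpers):

    #include <cstdint>
    #include <queue>

    struct DabOp { float x, y, radius; };
    typedef std::queue<DabOp> TileQueue;

    // Hypothetical helpers: look up a tile's op queue / pixel buffer.
    TileQueue &queue_for(int level, int tx, int ty);
    uint16_t *pixels_for(int level, int tx, int ty);
    void apply_op(int level, int tx, int ty, const DabOp &op);

    // Drawing only appends to the queues (cheap); the expensive rendering
    // happens lazily, on the first explicit request for the tile.
    const uint16_t *get_tile_for_display(int level, int tx, int ty)
    {
        TileQueue &q = queue_for(level, tx, ty);
        while (!q.empty()) {
            apply_op(level, tx, ty, q.front());
            q.pop();
        }
        return pixels_for(level, tx, ty);
    }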
IDEA: GPU-based processing
===========================
There would be three major aspects of this:
* GPU-powered viewport transformations
  Fairly straightforward: use a GLSL shader to transform the GPU texture for display.
  Can be introduced incrementally, just processing on the CPU and uploading textures on demand.
* GPU-powered layer compositing
  Should be fairly straightforward: use a GLSL shader.
  Possible problem: how to handle a variable number of layers?
  GLSL cannot iterate a variable number of times in a for loop.
  Workaround: limit the maximum number of layers (per pass?) and break early (see the sketch after this list).
  Can be introduced incrementally and combined with the CPU-based brush engine. However,
  the increased amount of data to migrate to the GPU may cause performance to go down.
* GPU-powered brush engine. See libmypaint's PERFORMANCE too.
  GEGL, with its GPU-side tile store and OpenCL processing, would be worth a look
  before starting any serious work in this area.
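
For the variable-layer-count problem above, a fragment shader along these lines
(shown embedded as a string in the C++ host code, as shader sources usually are;
every name here is illustrative) demonstrates the fixed-bound loop with an early
break:

    // GLSL: composite up to MAX_LAYERS layers from a texture array. The loop
    // bound is a compile-time constant as GLSL requires; we break early once
    // the real layer count (a uniform) is reached. More than MAX_LAYERS layers
    // would need multiple passes.
    static const char *composite_frag_src = R"glsl(
        #version 330
        #define MAX_LAYERS 16
        uniform sampler2DArray layers;
        uniform float opacity[MAX_LAYERS];
        uniform int layer_count;
        in vec2 uv;
        out vec4 color;
        void main() {
            vec4 acc = vec4(0.0);
            for (int i = 0; i < MAX_LAYERS; i++) {
                if (i >= layer_count) break;
                vec4 src = texture(layers, vec3(uv, float(i))) * opacity[i];
                acc = src + acc * (1.0 - src.a);  // premultiplied src-over
            }
            color = acc;
        }
    )glsl";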
TODO: Improved tests
====================
* Implement a GUI benchmark/mode that tracks total processing pipeline time/latency.
  Parameterize it on brush+settings, with the ability to calculate average and maximum latencies;
  maybe also plot the distribution, or measure against deadlines.
* Implement a test which determines throughput in megapixels/second (a sketch follows below).
  Use this for ballpark estimates of how much performance potential is left on the CPU, and
  how much we could gain from alternative technology like GPUs.
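
A sketch of the throughput measurement; render_frame() is a stand-in for
whatever part of the pipeline is being measured:

    #include <chrono>

    // Hypothetical pipeline under test: renders/composites one full frame.
    void render_frame(int width, int height);

    double measure_throughput_mpps(int width, int height, int frames)
    {
        using clock = std::chrono::steady_clock;
        auto t0 = clock::now();
        for (int i = 0; i < frames; i++)
            render_frame(width, height);
        std::chrono::duration<double> dt = clock::now() - t0;
        double megapixels = (double)width * height * frames / 1e6;
        return megapixels / dt.count();   // megapixels per second
    }

    // e.g. printf("%.1f MP/s\n", measure_throughput_mpps(1920, 1080, 100));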