-------- Mailing List Message ------------------------------------
Subject: [Yoshimi-devel] PADSynth background rebuild ("padthread")
Date: Mon, 31 Jan 2022 04:20:52 +0100
From: Ichthyostega <prg@ichthyostega.de>
To: Yohimi-developers <yoshimi-devel@lists.sourceforge.net>


At 15.12.21 at 13:22 Will Godfrey wrote:
> This now has an auto apply feature fairly well implemented.
> However it will crash on extreme wavetable sizes (I don't know why yet).


Hello Yoshimi developers,

As we all know, concurrent programming can be surprisingly tricky,
even for seemingly simple stuff -- and so this new feature kept us
busy for quite some time, until reaching a state now where it works
without crashes and sound glitches, and thus appears to be "feasible".
Some issues (most notably XRuns) need to be sorted out yet.

You can see the current state of this experimental feature in my Github

https://github.com/Ichthyostega/yoshimi.git

Branch: padthread

/Unfortunately this is a huge changeset, going deep down the rabbit's hole/


===========================================================================
What do we hope to achieve?

The PADSynth is based on a exceptionally fine-grained spectrum distribution
and uses a huge Fast-Fourier-Transform operation to generate a likewise large
yet perfectly looped wavetable. The generated sound is conceptionally equivalent
to results produced by "granular synthesis".

Rendering this huge spectrum is a compute intensive task, and can
easily take up several seconds. During that time, event processing in
Yoshimi is blocked -- which leads to the idea to perform wavetable
re-building as a background operation and load the results when ready.

At 18.12.21 12:12 Will Godfrey wrote:
> Ideally I'd want as seamless behaviour as possible.
> Looking ahead there are three scenarios that I'd ideally like to see.
>
> 1 (manual) User moves control; nothing happens until setting 'apply'.
> Nothing else is interrupted apart from the discontinuity of the actual swap.
>
> 2 (partially automatic) System tracks control changes applying them as they
> arrive. If they come too fast the build can be interrupted so only the last
> one is completed. Again, just a discontinuity at the swap time.
>
> 3 (fully automatic) As 2, but instead of a swap, maintain both original and
> new sample set while morphing between the two, then delete the old ones,
> but keep the relatively small 'framework' - so a form of double buffer.
> Morph time could be made user variable, with the proviso that further
> harmonics changes would be ignored during this time.
>
> 3 is really icing on the cake, and if it can be done would be something to
  shout about

To translate this feature description into a programming task....

(1) we want to move the expensive rebuilding of wavetables
    into a background thread, so the event handling thread
    is no longer blocked.

(2) we want to ensure the following conditions
    * whenever the PadSynth-Parameters are "dirty", a rebuild should happen
    * only after that rebuild is really complete, the swap-in should happen

(3) we want to prevent redundant rebuilds from happening at the same time.

Addition to (1): under some conditions (CLI Scripts) we still want to block
the calling thread until the actual build is complete, in order to ensure
predictable state.


===========================================================================
Challenges

The Yoshimi code base can be described as rather cohesive and tangled.
Many parts are written in some "I know what I am about to do so get out
of my way" style, leading to code that is hard to understand and maintain,
and easy to break. Notable raw buffers of various size are allocated and
then passed through dozens and dozens of functions, at the end to be
processed somewhere by an algorithm which just "magically" seems to
know how to deal with that data, and often behaves quite different
based on some implied condition detected from magic markers.

Moreover Yoshimi uses effectively global yet mutable state even where
this wouldn't be necessary, and this state is often manipulated from
a totally remote code location by grabbing into the innards of another
seemingly unrelated facility. There is often no notion of ownership or
hierarchy, parts are mutually dependent and have to be bootstrapped
and initialised in a very specific order.

Thus, to extract some functionality and perform it in a different and
effectively non-deterministic order, we're bound to trace down and
understand lots of details meticulously to identify which parts
can be rearranged or need to be disentangled.

A further complication arises from constraints imposed by the lib FFTW3,
which Yoshimi relies on to implement the Fast Fourier Transform operation.
This library in itself is very elaborate and flexible and meanwhile has
been adapted to allow concurrent and re-entrant calculations, albeit
with very strictly delineated prerequisites -- which the existing usage
in Yoshimi did not need to fulfil, since up to now it operated on the
assumption of a single deterministic computation path.

The necessary changes were especially related to the feature of a "FFT
computation plan". At start, Lib FFTW3 requires the user to pick the
appropriate feature set and invocation scheme. Some users e.g. want to use
complex numbers and multidimensional functions, while others (like Yoshimi)
just need real valued functions and prefer to work with "sine" and "cosine"
coefficients in the Spectrum to represent the phase of a spectral line.
Actually, libFFTW3 would even be able to perform timing measurements and
persist or load a FFT plan optimised for the specific setup and hardware --
an advanced feature Yoshimi does not exploit. Unfortunately this definition
of FFT plans turned out to be not threadsafe -- and Yoshimi sometimes happened
to re-build those FFT plans during normal operation, especially after GUI
interactions, thereby relying on the ability of libFFTW3 to detect and re-use
similar plan definitions behind the scenes (and this caching seems to be one
of the reasons why the setup of such FFT plans interferes with other concurrent
memory management operations.

To overcome these difficulties, we had to overturn and rearrange all memory
management related to spectrum and waveform data -- to get reliable control
over the actual allocations and change the point in time when allocations are
workable. So the FFT plans are now prepared at first usage and shared by all
further calculations, while spectrum data is now arranged in memory right from
start in the very specific order required by the Fast Fourier algorithm, and
with appropriate alignment to allow for SIMD optimisation. Thus the transform
calculation can now be invoked directly on the working data within OscilGen or
the PADnoteParameters, instead of allocating a shared data block and copying
and rearranging the spectrum coefficients for each invocation (as it was done
as of yet). To carry out this tricky refactoring safely, we relied on the help
by the compiler: Spectrum and Waveform data became encapsulated into a data holder
object (based on a single-ownership smart-pointer); various function signatures
within OscilGen and SynthEngine have been converted step by step from using raw
and unbounded float* to accepting these new data holder types.


===========================================================================
Implementation of PADSynth background builds

Whenever a new instrument involving PADSynth Kit-Items is loaded, and also when
the user hits the "Apply" button in the PAD editor, or by the new »auto-Apply«
feature detecting relevant parameter changes, a background build is triggered.
Further changes during ongoing builds will cause these to start over afresh --
however in the case of »auto-Apply« with a short delay to integrate several
change messages caused from dragging the sliders in the GUI.

The data storage for the PADSynth wavetables was likewise encapsulated into
a new data holder type "PADtables", which can be moved only (single ownership).
This result data will be handed over from the background thread to the Synth
thread with the help of a C++ std::future, while the rebuild-trigger is
coordinated through a std::atomic variable. For the background tasks a rather
simplistic scheduler has been added, to start a limited number of background
threads, based on the number of available CPU cores, as reported by the C++
runtime system. Incoming build tasks are enqueued and picked up by those
worker threads. Since these operations never interfere directly with the
Synth, we can keep matters straight and use a simple Mutex for protection.

Within the SynthEngine thread, at the begin of every buffer cycle when
calculating sound for PADSynth notes, the readiness state of the future
is probed (non blocking), to swap in the new PADtables when actually ready.

All of this state handling logic has been embodied into a new component
"FutureBuild", defined in Misc/BuildScheduler.h|cpp. Each PADnoteParameters
instance now holds a PADtables instance and a FutureBuild instance, and
delegates to the latter for all requests pertaining wavetable builds.
This FutureBuild state manager has been written in a way to remain agnostic
both of the actual data type to transport (which is the PADtables) and the
actual scheduler backend implementation to use, allowing to tweak and evolve
those parts independently as we see fit.

===========================================================================
Integration and Extensions: Cross-Fade and Random Walk

[30.4.2022: added these explanations]

While "a transition by cross fade" might be deemed simple at first sight,
it turned out as rather tricky on close investigation, because cross-fading
is an ongoing task and need to be interwoven with the actual sound computation
on the inner processing loop. It would be a simple addition indeed within a
processing architecture based on processing tasks and a scheduler -- Yoshimi
however takes the opposite ascent with a single top-down compute-buffer call,
handling any variations by pre-coded forking in the computation path. Moreover,
the concept of a "note" was shaped rather accidentally and then extended by
copy-n-paste to SUBnote and PADnote after the fact. And so, especially for
PADSynth, there is no room between the triggering of a note instance within
Part.cpp, and the actual low-level wavetable based sample computation.

Duplicating this sample computation code into a cross-fading version was not
deemed acceptable, and directly hooking the cross fade into the computation
was not even considered (for obvious performance reasons). Which leaves the
option of abstracting out the actual computation as a wavetable interpolator
component, which can then be wired either directly, or combined with a fade.
The details of this refactoring however turn out to be quite involved, since
a set of wavetables is maintained for each PADSynth kit-item in its own
PADnoteParameters object, which in turn can be shared by multiple note
instances, which additionally can also be legato or portamento notes.

The "check point" for integrating a newly built set of wavetables is at the
begin of the buffer computation cycle for some note, and this might happen
at any point and for any note right in the middle of the overall calculation
call. At this point, a new XFadeManager component was added, to take hold of
the old wavetable, mark an ongoing cross-fade and keep track of all users
through reference counting. This could have been implemented by just using
a std::shared_ptr -- but this idea was rejected, since shared_ptr uses
atomic locks for coordination, which might add a considerable overhead
within the inner processing loop, and thus would necessitate extended
investigation and timing measurements to be safe -- on the other hand,
the actual SynthEngine code is known to run entirely single threaded,
and to fit in with that image, even the note-initialisation of PADnotes
has now be moved over into the Synth thread to forego any necessitation
of thread synchronisation (beyond the FutureBuild used for handing in
the new wavetables).

In listening tests it turned out that a simple linear crossfade is not sufficient
at this point, since the waveform of old and new wavetables typically show very
low correlation (due to the randomised phases). Thus an equal-power mix seems
adequate. Moreover, even such a mix would still be noticeable as a "manipulation",
and for that reason, a typical S-Fade edit curve was devised, and combined with
segment-wise linear interpolation, so to compute the expensive square root for
the equal power mix only once per block.

And finally, after having built all this scaffolding, it became simple to add
a user-visible new feature on top, which hopefully expands the musical viability
of PADSynth: it is now possible to re-trigger this background build process
periodically, and even to perform a classical random walk on some parameters,
to break the subtle trait of "sameness", which, after playing for some time,
arises from the fixed wavetables. Even while the sound superficially might
seem random, in fact the patterns repeat after some time; but rebuilding
a new set of wavetables will completely re-shuffle all phases and thereby
randomise the patterning in the sound.