:orphan:

.. _ProgramStructureAndCompilationModel:

.. highlight:: none

Swift Program Structure and Compilation Model
=============================================

.. warning:: This is a very early design document discussing the features of
  a Swift build model and modules system. It should not be taken as a plan of
  record.

Commentary
----------

The C spec only describes things up to translation unit granularity: no
discussion of file system layout, build system, linking, runtime concepts of
code (dynamic libraries, executables, plugins), dependence between parts of a
program, Versioning + SDKs, human factors like management units, etc. It leaves
all of this up to implementors to sort out, and we got what the unix world
defined in the 60's and 70's with some minor stuff that could be shoehorned into
the old unix toolchain model without too much trouble. C also doesn't help with
resources (images etc), has a miserable incremental compilation model and many,
many, other issues.

Swift should strive to make trivial programs really simple. Hello world should
just be something like::

  print("hello world")

while also acknowledging and strongly supporting the real world demands and
requirements that library implementors (hey, that's us!)  face every day. In
particular, note how the language elements (described below) correspond directly
to the business and management reality of the world:

**Ownership Domain / Top Level Component**: corresponds to a product that is
shipped as a unit (Mac OS/X, iWork, Xcode), is a collection of frameworks/dylibs
and resources. Only acyclic dependencies between different domains are
allowed. There is some correlation in concept here to "umbrella headers" or
"dyld shared cache" though it isn't exact.

**Namespace**: Organizational structure within a domain, similar to C++ or
Java. Programmers can use or abuse them however they wish.

**Subcomponent**: corresponds to an individual team or management unit, is one
dylib + optional resources. All contributing source files and resources live in
one directory (with optional subdirs), and have a single "project file". Can
contribute to multiple namespaces. The division of a domain into components is
an implementation detail, not something externally visible as API. Can have
cyclic dependencies between other components. Components roughly correspond to
"xcode project" or "B&I project" granularity at Apple. Can rebuild a "debug
version" of a subcomponent and drop it into an app without rebuilding the entire
world.

**Source File**: Organizational unit within a component.

In the trivial hello world example, the source file gets implicitly dropped into
a default component (since it doesn't have a component declaration). The default
component has settings that correspond to an executable. As the app grows and
wants to start using sub-libraries, the author would have to know about
components. This ensures a simple model for new people, because they don't need
to know anything about components until they want to define a library and stable
APIs.

We'll also eventually build tools to do things like:

* Inspect and maintain dependence graphs between components and subcomponents.

* Diff API [semantically, not "by symbol" like 'nm'] across versions of products

* Provide code migration tools, like "rewrite rules" to update clients that use
  obsoleted and removed API.

* Pure swift apps won't be able to use SPI (they just won't build), but mixed
  swift/C apps could (through the C parts, similar to using things like "extern
  int _Z3fooi(int)" to access C++ mangled symbols from C today). It will be
  straight-forward to write a binary verifier that cross references the nm
  output with the manifest file of the components it legitimately depends on.

* Lots of other cool stuff I'm sure.

Anyway, those are the high-level thoughts and motivation; this is what I'm
proposing:

Program structure
-----------------

Programs and frameworks in swift consist of declarations (functions, variables,
types) that are (optionally) defined in possibly nested namespaces, which are
nested in a component, which is (optionally) split into
subcomponents. Components can also have associated resources like images and
plists, as well as code written in C/C++/ObjC.

A "**Top Level Component**" (also referred to as "an ownership domain") is a
unit of code that is owned by a single organization and is updated (shipped to
customers) as a whole. Examples of different top-level components are products
like the swift standard libraries, Mac OS/X, iOS, Xcode, iWork, and even small
things like a theoretical third-party Perforce plugin to Xcode.

Components are explicitly declared, and these declarations can include:

* whether the component should be built into a dylib or executable, or is a
  subcomponent.

* the version of the component (which is used for "availability macros" etc)

* an explicit list of dependencies on other top-level components (whose
  dependence graph is required to be acyclic) optionally with specific versions:
  "I depend on swift standard libs 1.4 or later"

* a list of subcomponents that contribute to the component: "mac os consists of
  appkit, coredata, ..."

* a list of resource files and other stuff that makes up the framework

* A list of subdirectories to get source files out of (see filesystem layout
  below) if the component is more than one directory full of code.

* A list of any .c/.m/.cpp files that are built and linked into the component,
  along with build flags etc.
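
As a rough illustration, the contents of a component declaration listed above
could be modeled as plain data. This document does not define a declaration
syntax, so every type and field name below (``ComponentDeclaration``,
``ProductKind``, ``ComponentVersion``, and so on) is hypothetical.

.. code-block:: swift

   // Hypothetical model of a component declaration. None of these names
   // come from a real Swift API.
   enum ProductKind {
       case dylib
       case executable
       case subcomponent
   }

   struct ComponentVersion: Comparable {
       var major: Int
       var minor: Int

       static func < (lhs: ComponentVersion, rhs: ComponentVersion) -> Bool {
           return (lhs.major, lhs.minor) < (rhs.major, rhs.minor)
       }
   }

   struct Dependency {
       var component: String
       var minimumVersion: ComponentVersion?   // nil means "any version"

       // "I depend on swift standard libs 1.4 or later"
       func isSatisfied(by available: ComponentVersion) -> Bool {
           guard let minimum = minimumVersion else { return true }
           return available >= minimum
       }
   }

   struct ComponentDeclaration {
       var name: String
       var kind: ProductKind
       var version: ComponentVersion
       var dependencies: [Dependency] = []
       var subcomponents: [String] = []
       var resources: [String] = []
       var sourceSubdirectories: [String] = []
       var foreignSources: [String] = []       // .c/.m/.cpp files
   }

   let stdlib = Dependency(
       component: "swift",
       minimumVersion: ComponentVersion(major: 1, minor: 4))

   let mylib = ComponentDeclaration(
       name: "mylib",
       kind: .dylib,
       version: ComponentVersion(major: 1, minor: 0),
       dependencies: [stdlib],
       resources: ["e.png", "UserManual.html"],
       sourceSubdirectories: ["subdir"])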

Top-Level Components define the top level of the namespace stack. This means
everything in the swift libraries is "swift.array.xyz", everything in MacOS/X
is "macosx.whatever". Thus you can't have naming conflicts across components.

**Namespaces** are for organization within a component, and are left up to the
developer to handle however they want. They will work similarly to C++
namespaces and aren't described in detail here. For example, you could have a
macosx.coredata namespace that coredata drops all its stuff into.

Components can optionally be broken into a set of "**Subcomponents**", which are
organizational units within a top-level component. Subcomponents exist to
support extremely large components that have multiple different teams
contributing to a single large product. Subcomponents are purely an
implementation detail of top-level components and have no runtime,
naming/namespace, or other externally visible artifacts that persist once the
entire domain is built. If version 1.0 of a domain is shipped, version 1.1 can
completely reshuffle the internal subcomponent organization without affecting
its published API or anything else a client can see.

Subcomponents are explicitly declared, and these declarations can include:

* The component they belong to.

* The set of other (optionally versioned) top-level components they depend on.

* The set of subcomponents (within the current top-level component) that this
  subcomponent depends on. This dependence is an acyclic dependence: "core data
  depends on foundation".

* A list of declarations they use within the current top-level component that
  aren't provided by the subcomponents they explicitly depend on. This is used
  to handle cyclic dependencies across subcomponents within an ownership domain:
  for example: "libsystem depends on libcompiler_rt", however, "libcompiler_rt
  depends on 'func abort();' in libsystem". This preserves the acyclic
  compilation order across components.

* A list of subdirectories to get source files out of (see filesystem layout
  below) if the component is more than one directory full of code.

* A list of any .c/.m/.cpp files that are linked into the component, with build
  flags.

**Source Files** and **Resources** make up a component. Swift source files can
include:

* The component they belong to.

* Import declarations that affect their local scope lookups (similar to java
  import statements)

* A set of declarations of variables, functions, types etc.

* C and other language files are just another kind of resource to be built.

**Declarations** of variables, functions and types are the meat of the program,
and populate source files. Declarations can be scoped to be externally exported
from the component (aka API), internal to the component (aka SPI), local to a
subcomponent (aka "visibility hidden", the default), or local to the file (aka
static). Top-level components also have a simple runtime representation which is
used to ensure that reflection only returns API and decls within the current
ownership domain: "Apps can't get at iOS SPI".
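
A minimal sketch of the reflection rule, assuming each declaration records its
owning top-level component and one of the four scopes described above. The
names ``Visibility``, ``Decl`` and ``reflectable`` are made up for this
example and are not part of any real runtime.

.. code-block:: swift

   // Hypothetical names; the four cases mirror the scopes described above.
   enum Visibility {
       case file            // local to the file ("static")
       case subcomponent    // local to a subcomponent (the default)
       case component       // internal to the component (SPI)
       case exported        // exported from the component (API)
   }

   struct Decl {
       var name: String
       var domain: String   // owning top-level component
       var visibility: Visibility = .subcomponent
   }

   // Reflection returns API from any domain, plus every declaration that
   // lives in the caller's own ownership domain.
   func reflectable(_ decls: [Decl], from callerDomain: String) -> [Decl] {
       return decls.filter {
           $0.visibility == .exported || $0.domain == callerDomain
       }
   }

With this rule, an app in the "myapp" domain sees exported iOS declarations
and its own helpers, but not declarations that are SPI inside "ios".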

**Executable expressions** can also be included at file scope (outside other
declarations). This global code is run at startup time (same as static
constructors), eliminating the need for "main". This initialization code is
correctly run bottom-up in the explicit dependence graph. Order of
initialization between multiple cyclically dependent files within a single
component is not defined (and perhaps we can make it be an outright error).

File system layout and compiler UI
----------------------------------

The filesystem layout of a component is a directory with at least one .swift
file in it that has the same name as the directory. A common case is that the
component is a single directory with a bunch of .swift files and resources in
it. The "large component" case can break up its source files and resources into
subdirectories.

Here is the minimal hello world example written as a proper app::

  myapp/
    myapp.swift

You'd compile it like this::

  $ swift myapp
  myapp compiled successfully!

or::

  $ cd myapp
  $ swift
  myapp compiled successfully!

and it would produce this filesystem layout::

  myapp/
    myapp.swift
    products/
      myapp
      myapp.manifest
    buildcache/
      <stuff>

Here is a moderately complicated example of a library::

  mylib/
    mylib.swift
    a.swift
    b.swift
    UserManual.html
    subdir/
      c.swift
      d.swift
      e.png

mylib.swift tells the compiler about your sub directories, resources, how to
process them, where to put them, etc. After compiling it you'd keep your source
files and get::

  mylib/
    products/
      mylib.dylib
      mylib.manifest
      e.png
      docs/
        UserManual.html
    buildcache/
      <more stuff>

The Swift compiler command line is very simple: "swift mylib" is enough for most
uses. For more complex use cases we'll support specifying paths to search for
components (similar to clang -F or -L) etc. We'll also support a "clean" command
that nukes buildcache/ and products/.

The buildcache directory holds object files, dependence information and other
stuff needed for incremental [re]builds within the component. The generated
manifest file is used by the compiler when a client lib/app imports mylib (it
contains type information for all the stuff exported from mylib) but also at
runtime by the runtime library (e.g. for reflection). It needs to be a
fast-to-read but extensible format.

What the build system does, how it works
----------------------------------------

Assuming that we're starting with an empty build cache, the build system starts
by parsing the mylib.swift file (the main file for the directory). This file
contains the component declaration. If this is a subcomponent, the subcomponent
declares which super-component it is in (in which case, the super-component info
is loaded). In either case, the compiler verifies that all of the depended-on
components are built; if not, it goes off and recursively builds them before
handling this one: the component dependence graph is acyclic, and cycles are
diagnosed here.
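
The recursive "build what I depend on first, and diagnose cycles" step could
look roughly like the depth-first traversal below. This is only a sketch: the
dependence graph is passed in as a dictionary rather than read from component
declarations, and ``buildOrder`` and ``BuildOrderError`` are hypothetical
names.

.. code-block:: swift

   // Hypothetical sketch: a depth-first traversal that yields a bottom-up
   // build order and reports the cycle if the graph is not acyclic.
   enum BuildOrderError: Error, Equatable {
       case cycle([String])
   }

   func buildOrder(for root: String,
                   dependencies: [String: [String]]) throws -> [String] {
       var order: [String] = []
       var finished = Set<String>()
       var stack: [String] = []      // components currently being visited

       func visit(_ name: String) throws {
           if finished.contains(name) { return }
           if let start = stack.firstIndex(of: name) {
               // We came back to a component that is still in progress.
               throw BuildOrderError.cycle(Array(stack[start...]) + [name])
           }
           stack.append(name)
           for dependency in dependencies[name] ?? [] {
               try visit(dependency)
           }
           stack.removeLast()
           finished.insert(name)
           order.append(name)        // every dependency is already in `order`
       }

       try visit(root)
       return order
   }

The returned list is the order in which components would be built, so a
component always appears after everything it depends on.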

If this directory is a subcomponent (as opposed to a top-level component), the
subcomponent declaration has already been read. If this subcomponent depends on
any other components that are not up-to-date, those are recursively
rebuilt. Explicit subcomponent dependencies are acyclic and cycles are diagnosed
here. Now all depended-on top-level components and subcomponents are built.

Now the compiler parses each swift file into an AST. We'll keep the swift
grammar carefully factored to keep types and values distinct, so it is possible
to parse (but not fully typecheck) the files without first reading "all the
headers they depend on". This is important because we want to allow arbitrary
type and value cyclic dependencies between files in a component. As each file is
parsed, the compiler resolves as many intra-file references as it can, and ends
up with a list of (namespace qualified) types and values that are imported by
the file that are not satisfied by other components. This is the list of things
the file requires that some other files in the component provide.

Now that the compiler has the full set of dependence information between files
in a component, it processes the files in strongly connected component (SCC)
order, processing one SCC of dependent files at a time. Given the entire SCC it is
able to resolve values and types across the files (without needing prototypes)
and complete type checking. Assuming type checking is successful (no errors) it
generates code for each file in the SCC, emits a .o file for them, and emits
some extra metadata to accelerate incremental builds. If there are .c files in
the component, they are compiled to .o files now (they are also described in the
component declaration).
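
To make the SCC ordering concrete, here is a sketch using Tarjan's algorithm,
which is one standard way to compute SCCs (this document does not say which
algorithm the compiler would use). The input maps each file to the files that
provide declarations it uses; the function name and the graph representation
are assumptions for the example.

.. code-block:: swift

   // Tarjan's algorithm over a hypothetical file-dependence graph.
   // `requires[f]` lists the files that provide declarations used by f.
   // SCCs are returned providers-first, so each one can be type checked
   // after everything it depends on.
   func stronglyConnectedComponents(of requires: [String: [String]]) -> [[String]] {
       var nextIndex = 0
       var index: [String: Int] = [:]
       var lowlink: [String: Int] = [:]
       var stack: [String] = []
       var onStack = Set<String>()
       var result: [[String]] = []

       func connect(_ file: String) {
           index[file] = nextIndex
           lowlink[file] = nextIndex
           nextIndex += 1
           stack.append(file)
           onStack.insert(file)

           for provider in requires[file] ?? [] {
               if index[provider] == nil {
                   connect(provider)
                   lowlink[file] = min(lowlink[file]!, lowlink[provider]!)
               } else if onStack.contains(provider) {
                   lowlink[file] = min(lowlink[file]!, index[provider]!)
               }
           }

           // `file` is the root of an SCC: pop the whole SCC off the stack.
           if lowlink[file] == index[file] {
               var component: [String] = []
               while let member = stack.popLast() {
                   onStack.remove(member)
                   component.append(member)
                   if member == file { break }
               }
               result.append(component.sorted())
           }
       }

       // Sorted only to make the output deterministic.
       for file in requires.keys.sorted() where index[file] == nil {
           connect(file)
       }
       return result
   }

Files that refer to each other cyclically end up in the same SCC and are
type checked together; everything else is processed one file at a time.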

Once all of the source files are compiled into .o files, they are linked into a
final linked image (dylib or executable). At this point, a couple of other
random things are done: 1) metadata is checked to ensure that any explicitly
declared cyclic dependencies match the given and actual prototype. 2) resources
are copied or processed into the product directory. 3) the explicit dependence
graph is verified, extraneous edges are warned about, missing edges are errors.
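
The third check is essentially two set differences. A minimal sketch, assuming
the compiler can produce the set of components a build actually used
(``verifyDependencies`` and ``DependencyReport`` are hypothetical names):

.. code-block:: swift

   // Hypothetical check of declared versus actually used dependencies.
   struct DependencyReport: Equatable {
       var warnings: [String] = []   // declared but never used
       var errors: [String] = []     // used but never declared
   }

   func verifyDependencies(declared: Set<String>,
                           used: Set<String>) -> DependencyReport {
       var report = DependencyReport()
       for name in declared.subtracting(used).sorted() {
           report.warnings.append("extraneous dependency on '\(name)'")
       }
       for name in used.subtracting(declared).sorted() {
           report.errors.append("missing dependency on '\(name)'")
       }
       return report
   }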

In terms of implementation, this should be relatively straight-forward, and is
carefully layered to be memory efficient (e.g. only processing an SCC at a time
instead of an entire component) as well as highly parallel for multicore
machines. For incremental builds, we will have a huge win because the
fine-grained dependence information between .o files is tracked and we know
exactly what dependencies to rebuild if anything changes. The build cache will
accelerate most of this, which will eventually be a hybrid on-disk/in-memory
data structure.
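
As a simplified picture of "knowing exactly what to rebuild", the sketch below
inverts a dependence graph and walks it from the changed files. It works at
whole-file granularity, which is coarser than the fine-grained tracking
described above, and ``filesToRebuild`` is a hypothetical name.

.. code-block:: swift

   // Hypothetical sketch: given which files each file depends on, find
   // everything that must be rebuilt after a set of files changes.
   func filesToRebuild(changed: Set<String>,
                       dependsOn: [String: Set<String>]) -> Set<String> {
       // Invert the graph: for each file, which files depend on it?
       var dependents: [String: Set<String>] = [:]
       for (file, providers) in dependsOn {
           for provider in providers {
               dependents[provider, default: []].insert(file)
           }
       }

       // Propagate "dirty" to transitive dependents with a worklist.
       var dirty = changed
       var worklist = Array(changed)
       while let file = worklist.popLast() {
           for dependent in dependents[file] ?? [] where !dirty.contains(dependent) {
               dirty.insert(dependent)
               worklist.append(dependent)
           }
       }
       return dirty
   }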

The build system should be scalable enough for B&I to eventually do a "swift
macos" and have it do a full incremental (and parallel) build of something the
scale of Mac OS. Actually implementing this will obviously be a big project that
can happen as the installed base of swift code grows.

SDKs
----

The manifest file generated as a build product describes (among other things)
the full list of decls exported by the top-level component (which includes their
type information, not just symbol names). This manifest file is used when a
client builds against the component to type check the client and ensure that its
references are resolved.

Because we have the version number as well as the full interface to the
component available in a consumable format, we can build an SDK generation
tool. This tool would take manifest files for a set of releases (e.g. iOS 4.0,
4.0.1, 4.0.2, 4.1, 4.1.1, 4.2) and build a single SDK manifest which would have
a mapping from symbol+type -> version list that indicates which versions a
given symbol is available in. This means that framework authors don't have to
worry about availability macros etc, it just naturally falls out of the system.
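
A sketch of what the SDK generation step might do, assuming each release's
manifest has already been parsed into a set of exported declarations. The
manifest format is not specified here, so ``ExportedDecl`` and
``buildSDKManifest`` are hypothetical stand-ins.

.. code-block:: swift

   // Hypothetical SDK generation: merge per-release manifests into a
   // mapping from symbol+type to the releases that export it.
   struct ExportedDecl: Hashable {
       var symbol: String
       var typeSignature: String
   }

   func buildSDKManifest(
       releases: [(version: String, exports: Set<ExportedDecl>)]
   ) -> [ExportedDecl: [String]] {
       var availability: [ExportedDecl: [String]] = [:]
       // Releases are assumed to be passed in oldest-first.
       for release in releases {
           for decl in release.exports {
               availability[decl, default: []].append(release.version)
           }
       }
       return availability
   }

Because the key includes the type, a symbol whose signature changes between
releases shows up as two separate entries with disjoint version lists.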

This tool can also produce warnings/errors about cases where API is in version N
but removed in version N+1, or when some declaration has an invalid change
(e.g. an argument added or something else "fragile").  Blue sky idea: We could
conceivably extend it so that the SDK manifest file contains rewrite rules for
obsolete APIs that the compiler could automatically apply to upgrade users'
source code.

Future optimization opportunities
---------------------------------

The system has been carefully designed to allow fast builds at -O0 (including
keeping cached dependence information and the compiler around in memory "across
builds"), allowing a very incremental compilation model and allowing carefully
limited/understood cyclic dependencies across components. However, we also care
about really fast runtime performance (better than our current system), and we
should be able to get that as well.

There are several different possibilities to look at in the future:

1. Components are a natural unit to do "link time" optimization. Since the
   entire thing is shipped as a unit, we know that it is safe to inline
   functions and analyze side effects within the bounds of the component. The
   current LTO model should scale to the component level, but we'd need new
   (more scalable/parallel and memory efficient) approaches to optimize across
   the entire mac os product. Processing components bottom-up within a large
   component allows efficient context sensitive (and summary-based) analyses,
   like mod/ref, interprocedural constant prop, inlining, and nocapture
   propagation. I expect nocapture to be specifically important to get stuff on
   the stack instead of causing them to get promoted to the heap all the time.

2. The dyld shared cache can be seen as an optimization across components within
   the mac os top-level component. Though it has the capability to include third
   party and other dylibs, in practice it is rooted from a few key apps, so it
   doesn't get "everything" in macos and it isn't used for other stuff (like
   xcode). The proposed (but never implemented) "per-app shared cache" is a
   straight-forward extension if this were based on optimizing across
   components.

3. There are a bunch of optimizations to take advantage of known fragility
   levels for devirtualization, inlining, and other stuff that I'm not going to
   describe here. Generalization of DaveZ's positive/negative ivar/vtable idea.

4. The low level tools are already factored to be mostly object file format
   independent. There is no reason that we need to keep using actual macho .o
   files if it turns out to be inconvenient. We obviously must keep around macho
   executables and dylibs.