File: gsoc-ideas-old-html

package info (click to toggle)
checker-framework-java 3.0.1%2Bds2-3
links: PTS, VCS
area: main
in suites: bullseye
size: 22,736 kB
sloc: java: 145,286; xml: 785; sh: 456; makefile: 401; perl: 26
file content (117 lines) | stat: -rw-r--r-- 4,413 bytes
<h2 id="Avoiding_exponential_blowup_when_processing_DAGs">Avoiding exponential blowup when processing DAGs</h2>

<!-- John Field of Google is interested in this. -->

<p>
Google's <a href="https://bazel.build/">Bazel</a> open-source project is a
publicly-released version of their build system, Blaze.  Blaze builds every
line of source code that is written by any Google programmer &mdash; all of
that source code appears in a single repository!  Therefore, Bazel/Blaze needs to
be fast.  Bazel represents all of the source code and its dependencies as a
large DAG (a
<a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">directed
acyclic graph</a>).  It needs to manipulate these DAGs efficiently.
</p>

<p>
One of the biggest problems that the Bazel developers face is exponential
blowup of DAG sizes and therefore of run time.  Periodically, one of the
Bazel developers makes such a mistake, and Bazel becomes unusable until
they can diagnose and fix the problem.
</p>

<p>
Here are two different ways to view the problem.
</p>

<ol>
  <li>
    In a DAG, multiple nodes may have the same child.  Traversing the DAG
    naively would visit the child multiple times &mdash; in the worst case,
    exponentially many times.  It is necessary to avoid doing so.
  </li>
  <li>
    Bazel contains a function that takes a DAG as input and generates a DAG
    as output (the output is an object graph).  The Bazel developers want to
    ensure that the size of the output DAG is O(|input DAG|). The input DAG is
    processed bottom-up (it ensures that each input node is visited once) and
    Bazel stores the results of intermediate computations that construct the
    output DAG with nodes of the input DAG. The key thing the Bazel developers
    want to avoid is copying intermediate subgraphs that have unbounded size.
  </li>
</ol>

<p>
More concretely, there is only one Java type for all DAGs, and there is a
method <code>flatten()</code>.  It's a mistake to call
<code>flatten()</code> on certain DAGs, because doing so may cause
exponential blowup.
</p>

<!--
</p>

<p>
A pluggable type system can subdivide the Java type, using this type hierarchy:
</p>

<p>
  @PossbilyPropagatable
          |
  @NotPropagatable
</p>

<p>
A call to flatten() is permitted only if the receiver is known to be
@NotPropagatable.
</p>

<p>
This should be very easy to code up and evaluate, once we have more precise
definitions.
-->

<p>
The goal of this project would be to better understand the problem with
Bazel, to formalize it, and to create a program analysis that solves
the problem.  You would evaluate your work by running it on the Bazel
codebase to discover latent problems, or by providing it to the Bazel
developers to run each time they propose a code change.  The Bazel
team is interested in collaborating by evaluating a tool.
</p>


<h2 id="case-study-index-out-of-bounds">Array indexing (index-out-of-bounds errors)</h2>

<p>
An index-out-of-bounds error occurs when a programmer provides an illegal
index to an array or list, as in <code>a[i]</code> or <code>a.get(i)</code>
where <code>i</code> is less than 0 or greater than the length of
<code>a</code>.  In languages like C, this is disastrous:  buffers
overflows lead to about 1/6 of all security vulnerabilities.  In languages
like Java, the result is &ldquo;merely&rdquo; that the program crashes.  In
both languages, it is desirable to prevent programmers from making this
error and to prevent users from suffering the bad consequences.
</p>

<p>
This project will be a substantial case study with
the <a href="https://checkerframework.org/manual/#index-checker">Index
Checker</a>.  The first goal is to identify its merits and limitations.
Does it scale up to big,
interesting programs?  Are there common, important code patterns that it
fails to handle?  Does it produce too many false positive warnings?  Does
it place too heavy a burden on the user, either in terms of annotations or
in terms of complexity of the error messages?  Worst of all, are there
unknown unsoundnesses in the tool?
The second goal is to improve its precision enough to make it usable by
real-world programmers.
</p>

<p>
A <a href="#index-checker-mutable-length">related project</a> is to extend
the Index Checker to handle
mutable collections such as
<code>List</code>s, where the <code>remove()</code> method makes sound,
precise analysis very tricky.
</p>