.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

=======================
Reviewing contributions
=======================

Principles
==========

Arrow is a foundational project that will need to evolve over many years
or even decades, while serving potentially millions of users.  We believe
that being meticulous when reviewing brings greater rewards to the project
than being lenient and aiming for quick merges.

Code reviews like this lead to better quality code, more people who are
engaged with and understand the code being changed, and a generally
healthier project with more room to grow and accommodate emerging needs.


Guidelines
==========

Meta
----

**Use your own judgement**.  These guidelines are not hard rules.
Committers are expected to have sufficient expertise on their work
areas to be able to adjust their approach based on any concerns they have.

These guidelines are not listed in a particular order and are not intended
to be used as a checklist.

Finally, these guidelines are not exhaustive.

Scope and completeness
----------------------

* Our general policy is to not introduce regressions or merge PRs that require
  follow-ons to function correctly (though exceptions to this can be made).
  Making necessary changes after a merge is more costly not only for the
  contributor and the reviewer, but also for other developers who may be
  confused if they hit problems introduced by a merged PR.

* Which changes are in scope for a PR, and which should be deferred to a
  follow-up issue, should be determined in collaboration between the authors
  and the reviewers.

* When a large piece of functionality is being contributed and it seems
  desirable to integrate it piecewise, favour functional cohesion when
  deciding how to divide changes (for example, if a filesystem implementation
  is being contributed, a first PR may contribute directory metadata
  operations, a second PR file reading facilities and a third PR file writing
  facilities).

Public API design
-----------------

* Public APIs should nudge users towards the most desirable constructs.
  In other words, if there is a "best" way to do something, it should
  ideally also be the most easily discoverable and the most concise to type.
  For example, safe APIs should be featured more prominently than
  unsafe APIs that may crash or silently produce erroneous results on
  invalid input.

* Public APIs should ideally tend to produce readable code.  One example
  is when multiple options are expected to be added over time: it is better
  to try to organize options logically rather than juxtapose them all in
  a function's signature (see for example the CSV reading APIs in
  :ref:`C++ <cpp-api-csv>` and :ref:`Python <py-api-csv>`).

* Naming is important.  Try to ask yourself if code calling the new API
  would be understandable without having to read the API docs.
  Vague naming should be avoided; inaccurate naming is even worse as it
  can mislead the reader and lead to buggy user code.

* Be mindful of terminology.  Every project has (explicitly or tacitly) set
  conventions about how to name important concepts; steering away from those
  conventions increases the cognitive workload both for contributors and
  users of the project.  Conversely, reusing a well-known term for a different
  purpose than usual can also increase the cognitive workload and make
  developers' lives more difficult.

* If you are unsure whether an API is the right one for the task at hand,
  it is advisable to mark it experimental, such that users know that it
  may be changed over time, while contributors are less wary of bringing
  code-breaking improvements.  However, experimental APIs should not be
  used as an excuse for eschewing basic API design principles.
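
As a rough illustration of grouping options logically rather than juxtaposing
them in a signature, here is a minimal sketch.  All names (``ParseOptions``,
``ReadOptions``, ``read_rows``) are hypothetical and chosen for this example
only; they are not the Arrow CSV API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseOptions:
    """Hypothetical options controlling how a line is split into fields."""
    delimiter: str = ","


@dataclass(frozen=True)
class ReadOptions:
    """Hypothetical options controlling which lines are read."""
    skip_rows: int = 0


def read_rows(text, read_options=ReadOptions(), parse_options=ParseOptions()):
    """Split `text` into rows of fields (naive: quoting is not handled)."""
    lines = text.splitlines()[read_options.skip_rows:]
    return [line.split(parse_options.delimiter) for line in lines]


# The call site names each group of options, so it stays readable as
# more options are added to either group over time.
rows = read_rows(
    "x;y\n1;2\n3;4",
    read_options=ReadOptions(skip_rows=1),
    parse_options=ParseOptions(delimiter=";"),
)
```

New options can then be added to one of the option classes without changing
the function's signature.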

Robustness
----------

* Arrow is a set of open source libraries that will be used in a very wide
  array of contexts (including fiddling with deliberately artificial data
  at a Jupyter interpreter prompt).  If you are writing a public API, make
  sure that it won't crash or produce undefined behaviour on unusual (but
  valid) inputs.

* When a non-trivial algorithm is implemented, defensive coding can
  be useful to catch potential problems (such as debug-only assertions, if
  the language allows them).

* APIs ingesting potentially untrusted data, such as on-disk file formats,
  should try to avoid crashing or producing silent bugs when invalid or
  corrupt data is fed to them.  This can require a lot of care that is
  out of the scope of regular code reviews (such as setting up
  :ref:`fuzz testing <cpp-fuzzing>`), but basic checks can still be
  suggested at the code review stage.

* When calling foreign APIs, especially system functions or APIs dealing with
  input / output, do check for errors and propagate them (if the language
  does not propagate errors automatically, such as C++).
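
The kind of basic check that can be suggested at review time might look like
the following sketch.  The record layout (a little-endian ``uint32`` length
followed by a payload) and the function name are assumptions made up for this
example.

```python
import struct


def read_length_prefixed(buf: bytes) -> bytes:
    """Return the payload of a hypothetical [uint32 length][payload] record."""
    if len(buf) < 4:
        raise ValueError("buffer too short for length prefix")
    (length,) = struct.unpack_from("<I", buf, 0)
    available = len(buf) - 4
    # Never trust a length read from the data itself: check it first.
    if length > available:
        raise ValueError(
            f"declared length {length} exceeds available {available} bytes"
        )
    payload = buf[4:4 + length]
    # Debug-only invariant: assert statements are stripped under `python -O`.
    assert len(payload) == length
    return payload
```

Corrupt input then produces an explicit error instead of a silently truncated
payload.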

Performance
-----------

* Think about performance, but do not obsess over it.  Algorithmic complexity
  is important if input size may be "large" (the meaning of large depends
  on the context: use your own expertise to decide!).  Micro-optimizations
  improving performance by 20% or more on performance-sensitive functionality
  may be useful as well; lesser micro-optimizations may not be worth the
  time spent on them, especially if they lead to more complicated code.

* If performance is important, measure it.  Do not satisfy yourself with
  guesses and intuitions (which may be founded on incorrect assumptions
  about the compiler or the hardware).

  .. seealso:: :ref:`Benchmarking Arrow <benchmarks>`

* Avoid trying to trick the compiler/interpreter/runtime by writing
  the code in a certain way, unless it's really important.  These tricks
  are generally brittle and dependent on platform details that may become
  obsolete, and they can make code harder to maintain (a common question
  that can block contributors is "what should I do about this weird hack
  that my changes would like to remove?").

* Avoiding rough edges or degenerate behaviour (such as memory blowups when
  a size estimate is inaccurately large) may be more important than trying to
  improve the common case by a small amount.
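
A minimal sketch of measuring rather than guessing, using only the Python
standard library, is shown here.  The two functions being compared and the
``best_time`` helper are hypothetical examples; actual timings depend on the
machine and are deliberately not shown.

```python
import timeit


def sum_squares_loop(n):
    """Sum i*i for i in range(n) with an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_formula(n):
    """Closed form for 0^2 + 1^2 + ... + (n-1)^2."""
    return (n - 1) * n * (2 * n - 1) // 6


def best_time(func, arg, repeat=5, number=100):
    """Return the best per-call time in seconds over `repeat` runs."""
    timings = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    # The minimum is the run least disturbed by other system activity.
    return min(timings) / number


if __name__ == "__main__":
    for func in (sum_squares_loop, sum_squares_formula):
        print(func.__name__, best_time(func, 10_000))
```

Checking that both variants return the same result before comparing their
timings avoids "optimizing" into a wrong answer.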

Documentation
-------------

These guidelines should ideally apply to both prose documentation and
in-code docstrings.

* Look for ambiguous or uninformative wording.  For example, *"It is an
  error if ..."* is less informative than either *"An error is raised if ..."*
  or *"Behaviour is undefined if ..."* (the first phrasing doesn't tell the
  reader what actually *happens* on such an error).

* When reviewing documentation changes (or prose snippets, in general),
  be mindful about spelling, grammar, expression, and concision.  Clear
  communication is essential for effective collaboration with people
  from a wide range of backgrounds, and contributes to better documentation.

* Some contributors do not have English as a native language (and perhaps
  neither do you).  It is advised to help them and/or ask for external help
  if needed.

* Cross-linking increases the global value of documentation.  Sphinx especially
  has great cross-linking capabilities (including topic references, glossary
  terms, API references), so be sure to make use of them!
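
As a small illustration of the wording advice above, the docstring in this
sketch states what actually happens on invalid input.  The function
``element_at`` is a hypothetical example, not an Arrow API.

```python
def element_at(values, index):
    """Return the element of ``values`` at position ``index``.

    Raises
    ------
    IndexError
        If ``index`` is negative or not less than ``len(values)``.
    """
    if not 0 <= index < len(values):
        raise IndexError(
            f"index {index} out of bounds for length {len(values)}"
        )
    return values[index]
```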

Testing
-------

* When adding an API, all nominal cases should have test cases.  Does a function
  allow null values? Then null values should be tested (alongside non-null
  values, of course). Does a function allow different input types? Then each
  supported type should be tested.

* If some aspect of a functionality is delicate (either by definition or
  as an implementation detail), it should be tested.

* Corner cases should be exercised, especially in low-level implementation
  languages such as C++.  Examples: empty arrays, zero-chunk arrays, arrays
  with only nulls, etc.

* Stress tests can be useful, for example to uncover synchronization bugs
  if non-trivial parallelization is being added, or to validate a
  computation against a slow and straightforward reference implementation.

* A mitigating concern, however, is the overall cost of running the test
  suite.  Continuous Integration (CI) runtimes can be painfully long and
  we should be wary of increasing them too much.  Sometimes it is
  worthwhile to fine-tune testing parameters to balance the usefulness
  of tests against the cost of running them (especially where stress tests
  are involved, since they tend to imply execution over large datasets).
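
The corner-case and reference-implementation points above can be sketched as
follows.  The functions ``null_count`` and ``null_count_reference`` are
hypothetical stand-ins operating on plain Python lists of chunks, not Arrow
arrays, and the iteration count is an arbitrary choice kept small so that the
test stays cheap to run.

```python
import random


def null_count(chunks):
    """Count None entries across a list of chunks (each chunk is a list)."""
    return sum(chunk.count(None) for chunk in chunks)


def null_count_reference(chunks):
    """Slow, obviously correct reference used only for testing."""
    count = 0
    for chunk in chunks:
        for value in chunk:
            if value is None:
                count += 1
    return count


def test_corner_cases():
    assert null_count([]) == 0                # zero chunks
    assert null_count([[]]) == 0              # one empty chunk
    assert null_count([[None, None]]) == 2    # only nulls
    assert null_count([[1, None], [], [None]]) == 2


def test_against_reference(iterations=200, seed=0):
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(iterations):
        chunks = [
            [rng.choice([None, 0, 1]) for _ in range(rng.randint(0, 8))]
            for _ in range(rng.randint(0, 4))
        ]
        assert null_count(chunks) == null_count_reference(chunks)
```

The ``iterations`` parameter is the kind of knob that can be tuned to balance
test usefulness against CI runtime.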


Social aspects
==============

* Reviewing is a conversation between the contributor and the reviewer.
  Avoid letting questions or comments remain unanswered for too long
  ("too long" is of course very subjective, but two weeks can be a reasonable
  heuristic).  If you cannot allocate time soon, do say it explicitly.
  If you don't have the answer to a question, do say it explicitly.
  Saying *"I don't have time immediately but I will come back later,
  feel free to ping if I seem to have forgotten"* or *"Sorry, I am out of
  my depth here"* is always better than saying nothing and leaving the
  other person wondering.

* If you know someone who has the competence to help on a blocking issue
  and past experience suggests they may be willing to do so, feel free to
  add them to the discussion (for example by gently pinging their GitHub
  handle).

* If the contributor has stopped giving feedback or updating their PR,
  perhaps they're not interested anymore, but perhaps also they're stuck
  on some issue and feel unable to push their contribution any further.
  Don't hesitate to ask (*"I see this PR hasn't seen any updates recently,
  are you stuck on something? Do you need any help?"*).

* If the contribution is genuinely desirable and the contributor is not making
  any progress, it is also possible to take it up.  Out of politeness,
  it is however better to ask the contributor first.

* Some contributors are looking for a quick fix to a specific problem and
  don't want to spend too much time on it.  Others, on the contrary, are eager
  to learn and improve their contribution to make it conform to the
  project's standards.  Contributors of the latter kind are especially
  valuable as they may become long-term contributors or even committers
  to the project.

* Some contributors may respond *"I will fix it later, can we merge anyway?"*
  when a problem is pointed out to them.  Unfortunately, whether the fix will
  really be contributed soon in a later PR is difficult to predict or enforce.
  If the contributor has previously demonstrated that they are reliable,
  it may be acceptable to do as suggested.  Otherwise, it is better to
  decline the suggestion.

* If a PR is generally ready for merge apart from trivial or uncontroversial
  concerns, the reviewer may decide to push changes themselves to the
  PR instead of asking the contributor to make the changes.

* Ideally, contributing code should be a rewarding process.  Of course,
  it will not always be, but we should strive to reduce contributor
  frustration while keeping the above issues in mind.

* Like any communication, code reviews are governed by the Apache
  `Code of Conduct <https://www.apache.org/foundation/policies/conduct.html>`_.
  This applies to both reviewers and contributors.


Labelling
=========

While reviewing PRs, we should try to identify whether the corresponding issue
needs to be marked with one or both of the following issue labels:

* **Critical Fix**: The change fixes either: (a) a security vulnerability;
  (b) a bug that causes incorrect or invalid data to be produced;
  or (c) a bug that causes a crash (while the API contract is upheld).
  This is intended to mark fixes to issues that may affect users without their
  knowledge. For this reason, fixes for bugs that cause errors don't count, since
  those bugs are usually obvious. Bugs that cause crashes are considered critical
  because they are a possible vector of Denial-of-Service attacks.
* **Breaking Change**: The change breaks backwards compatibility in a public API.
  For changes in C++, this does not include changes that simply break ABI
  compatibility, except for the few places where we do guarantee ABI
  compatibility (such as the C Data Interface). Experimental APIs are *not*
  exempt from this; they are just more likely to be associated with this tag.

Breaking changes and critical fixes are separate: breaking changes alter the
API contract, while critical fixes make the implementation align with the
existing API contract. For example, fixing a bug that caused a Parquet reader
to skip rows containing the number 42 is a critical fix but not a breaking change,
since the row skipping wasn't behavior a reasonable user would rely on.

These labels are used in the release to highlight changes that users ought to be
aware of when they consider upgrading library versions. Breaking changes help
identify reasons why users may wish to wait to upgrade until they have time to
adapt their code, while critical fixes highlight the risk in *not* upgrading.

In addition, we use the following labels to indicate priority:

* **Priority: Blocker**: Indicates the changes **must** be merged before the next
  release can happen. This includes fixes to test or packaging failures that
  would prevent the release from passing final packaging or verification.
* **Priority: Critical**: Indicates issues that are high priority. This is a
  superset of issues marked "Critical Fix", as it also contains certain fixes
  to issues causing errors and crashes.

Collaborators
=============

The collaborator role grants users the triage role so that they can help with
triaging issues. It allows them to label and assign issues.

A user can ask to become a collaborator or can be proposed by another community member
when they have been collaborating in the project for a period of time. Collaborations
include, but are not limited to: creating PRs, answering questions, creating
issues and reviewing PRs.

To propose someone as a collaborator, create a PR adding the user to
the collaborators list in `.asf.yaml <https://github.com/apache/arrow/blob/main/.asf.yaml>`_.
Committers can then review the user's past collaborations and approve the PR.

Collaborators who have been inactive for a period of time can be removed from the
list of collaborators.