.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. _contrib-overview:

*********************
Contributing Overview
*********************

.. _git-conventions:

Local git conventions
=====================

If you are tracking the Arrow source repository locally, here is a
checklist for using ``git``:

* Work off of your **personal fork** of ``apache/arrow`` and submit pull requests
  "upstream".
* Keep your fork's **main branch synced** with ``upstream/main``.
* **Develop on branches**, rather than your own "main" branch.
* It does not matter what you call your branch. Some people like to use the GitHub
  issue number as the branch name; others use descriptive names.
* **Sync your branch** with ``upstream/main`` **regularly**, as many commits are
  merged to main every day.
* It is recommended to use ``git rebase`` rather than ``git merge``.
* In case there are conflicts, and your local commit history has multiple commits,
  you may simplify the conflict resolution process by **squashing your local commits
  into a single commit**. Preserving the commit history isn't as important because
  when your feature branch is merged upstream, a squash happens automatically.

  .. dropdown:: How to squash local commits?
    :animate: fade-in-slide-down
    :class-container: sd-shadow-none

    Abort the rebase with:

    .. code:: console

       $ git rebase --abort

    Following which, the local commits can be squashed interactively by running:

    .. code:: console

       $ git rebase --interactive ORIG_HEAD~n

    Here ``n`` is the number of commits on your local branch.  After the squash,
    you can retry the rebase, and this time conflict resolution should be relatively
    straightforward.

    Once you have an updated local copy, you can push to your remote repo.  Because your
    remote repo still holds the old history, you will need to force push.  In most cases,
    use ``--force-with-lease``:

    .. code:: console

       $ git push --force-with-lease origin branch

    The option ``--force-with-lease`` will fail if the remote has commits that are not available
    locally, for example if additional commits have been made by a colleague.  By using
    ``--force-with-lease`` instead of ``--force``, you ensure those commits are not overwritten
    and can fetch those changes if desired.

  .. dropdown:: Setting rebase to be default
    :animate: fade-in-slide-down
    :class-container: sd-shadow-none

    If you set the following in your repo's ``.git/config``, the ``--rebase`` option can be
    omitted from the ``git pull`` command, as it is implied by default.

    .. code:: console

       [pull]
             rebase = true


.. _pull-request-and-review:

Pull request and review
=======================

When contributing a patch, use this list as a checklist for the Apache Arrow workflow:

* Submit the patch as a **GitHub pull request** against the **main branch**.
* So that your pull request syncs with the GitHub issue, **prefix your pull request
  title with the GitHub issue id** (for example,
  `GH-14866: [C++] Remove internal GroupBy implementation <https://github.com/apache/arrow/pull/14867>`_).
* Give the pull request a **clear, brief description**: when the pull request is
  merged, this will be retained in the extended commit message.
* Make sure that your code **passes the unit tests**. You can find instructions on how
  to run the unit tests for each Arrow component in its respective README file.
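
The title convention in the checklist above can be checked mechanically. The
following is a minimal sketch in Python, assuming titles take the form
``GH-<issue id>: [<Component>] <summary>`` as in the example title. The regular
expression and the function name ``parse_pr_title`` are illustrative assumptions,
not part of Arrow's tooling, and the project may accept other title forms.

```python
import re

# Hypothetical pattern inferred from the example title in the checklist;
# the project's own tooling may accept other forms.
TITLE_RE = re.compile(
    r"^GH-(?P<issue>\d+): (?P<components>(?:\[[^\]]+\])+) (?P<summary>\S.*)$"
)


def parse_pr_title(title):
    """Return (issue_id, components, summary), or None if the title does not match."""
    match = TITLE_RE.match(title)
    if match is None:
        return None
    # Split "[C++][Python]" into ["C++", "Python"].
    components = re.findall(r"\[([^\]]+)\]", match.group("components"))
    return int(match.group("issue")), components, match.group("summary")


print(parse_pr_title("GH-14866: [C++] Remove internal GroupBy implementation"))
# (14866, ['C++'], 'Remove internal GroupBy implementation')
print(parse_pr_title("Remove internal GroupBy implementation"))
# None
```

A check like this could run locally before opening a pull request; it only
validates the shape of the title, not whether the issue exists.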

Core developers and others with a stake in the part of the project your change
affects will review, request changes, and hopefully indicate their approval
in the end. To make the review process smooth for everyone, try to:

* **Break your work into small, single-purpose patches if possible.**

  It’s much harder to merge in a large change with a lot of disjoint features,
  and particularly if you're new to the project, smaller changes are much easier
  for maintainers to accept.

* **Add new unit tests for your code.**
* **Follow the style guides** for the part(s) of the project you're modifying.

  Some languages (C++ and Python, for example) run a lint check in
  continuous integration. For all languages, see their respective developer
  documentation and READMEs for style guidance.

* Try to make it look as if the codebase has a single author,
  and emulate any conventions you see, whether or not they are officially
  documented or checked.

When tests are passing and the pull request has been approved by the interested
parties, a `committer <https://arrow.apache.org/committers/>`_
will merge the pull request. This is done with a
**command-line utility that does a squash merge**.

.. dropdown:: Details on squash merge
  :animate: fade-in-slide-down
  :class-container: sd-shadow-none

  A pull request is merged with a squash merge so that all of your commits will be
  registered as a single commit to the main branch; this simplifies the
  connection between GitHub issues and commits, makes it easier to bisect
  history to identify where changes were introduced, and helps us be able to
  cherry-pick individual patches onto a maintenance branch.

  Your pull request will appear in the GitHub interface to have been "merged".
  In the commit message of that commit, the merge tool adds the pull request
  description, a link back to the pull request, and attribution to the contributor
  and any co-authors.

.. Section on Experimental repositories:

.. include:: experimental_repos.rst

.. _specific-features:

Guidance for specific features
==============================

From time to time the community has discussions on specific types of features
and improvements that they expect to support.  This section outlines decisions
that have been made in this regard.

Endianness
++++++++++

The Arrow format allows setting endianness.  Due to the popularity of
little-endian architectures, most implementations assume little endian by
default. There has been some effort to support big-endian platforms as well.
Based on a `mailing-list discussion
<https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3cCAK7Z5T--HHhr9Dy43PYhD6m-XoU4qoGwQVLwZsG-kOxXjPTyZA@mail.gmail.com%3e>`__,
the requirements for a new platform are:

1. A robust (non-flaky, returning results in a reasonable time) Continuous
   Integration setup.
2. Benchmarks for performance critical parts of the code to demonstrate
   no regression.

Furthermore, for big-endian support, there are two levels that an
implementation can support:

1. Native endianness (all Arrow communication happens with processes of the
   same endianness).  This includes ancillary functionality such as reading
   and writing various file formats, such as Parquet.
2. Cross endian support (implementations will do byte reordering when
   appropriate for :ref:`IPC <format-ipc>` and :ref:`Flight <flight-rpc>`
   messages).
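
To make the second level concrete, here is a minimal sketch in Python, using
only the standard library, of the byte reordering a cross-endian reader would
apply to a buffer of fixed-width values. The helper name ``swap_int32_buffer``
is hypothetical and handles only 32-bit integers; it is not how any Arrow
implementation is written, and a real reader must cover every Arrow type.

```python
import struct
import sys


def swap_int32_buffer(buf):
    """Reverse the byte order of every 4-byte value in ``buf``."""
    if len(buf) % 4 != 0:
        raise ValueError("buffer length must be a multiple of 4")
    count = len(buf) // 4
    # Read the values as big-endian and write them back as little-endian,
    # which reverses the bytes within each 4-byte element.
    values = struct.unpack(f">{count}i", buf)
    return struct.pack(f"<{count}i", *values)


values = [1, 2, -3]
big_endian = struct.pack(">3i", *values)  # as written by a big-endian producer
swapped = swap_int32_buffer(big_endian)

print(sys.byteorder)                  # byte order of the current host
print(struct.unpack("<3i", swapped))  # (1, 2, -3)
```

An implementation that supports only native endianness skips this step
entirely, which is why the first level is lower in complexity and risk.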

The decision on what level to support is based on maintainers' preferences for
complexity and technical risk.  In general, all implementations should be open
to native endianness support (provided the CI and performance requirements
are met).  Cross endianness support is a question for individual maintainers.

The current implementations aiming for cross endian support are:

1. C++

Implementations that do not intend to implement cross endian support:

1. Java

For other libraries, gather consensus in a mailing-list discussion before
submitting PRs.