File: group-migration-issues.md

# Editions: Group Migration Issues

**Authors**: [@mkruskal-google](https://github.com/mkruskal-google)

## Summary

Address some unexpected issues in delimited encoding in edition 2023 before its
OSS release.

## Background

Joshua Humphries reported some well-timed
[issues](https://github.com/protocolbuffers/protobuf/issues/16239) discovered
while experimenting with our early release of Edition 2023. He discovered that
our new message encoding feature piggybacked a bit too much on the old group
logic, and actually ended up being virtually useless in general.

None of our testing or migrations caught this because they were heavily focused
on *preserving* old behavior (which is the primary goal of edition 2023).
Delimited messages structured exactly like proto2 groups (e.g. message and field
in the same scope with matching names) continued to work exactly as before,
making it seem like everything was fine.

All of this is especially problematic in light of *Submessages: In Pursuit of a
More Perfect Encoding* (not available externally yet), which intends to migrate the
ecosystem to use delimited encoding everywhere. Releasing a semi-broken feature
as a migration tool to eliminate a deprecated syntax is one thing, but trying to
push the ecosystem to it is especially bad.

## Overview

The problems here stem from the fact that before edition 2023, the field and
type names of group fields were guaranteed to always be unique and intuitive.
Proto2 splits groups into a synthetic nested message with a type name equivalent
to the group specification (required to be capitalized), and a field name that's
fully lowercased. For example,

```
optional group MyGroup = 1 { ... }
```

would become:

```
message MyGroup { ... }
optional MyGroup mygroup = 1;
```

The casing here is very important, since the transformation is irreversible. We
can't recover the group name from the field name in general, only if the group
is a single word.
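As a rough illustration of why the mapping cannot be inverted, here is a minimal Python sketch. The helper `group_field_name` is hypothetical and only assumes the rule described above, that the field name is the fully lowercased group name.

```python
def group_field_name(group_name: str) -> str:
    """Field name proto2 derives from a group name: fully lowercased."""
    return group_name.lower()


# Distinct group spellings collapse onto one field name, so the original
# capitalization cannot be recovered from the field name alone.
candidates = ["MyGroup", "Mygroup", "MYGroup"]
field_names = {group_field_name(name) for name in candidates}

# Re-capitalizing only restores single-word names such as "Foo".
recovered_multi_word = group_field_name("MyGroup").capitalize()
recovered_single_word = group_field_name("Foo").capitalize()
```

Here `recovered_single_word` matches the original group name, while `recovered_multi_word` does not, because the interior capital is lost.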

The problem under edition 2023 is that we've removed the generation of
synchronized synthetic messages from the language. Users now explicitly define
messages, and any message field can be marked `DELIMITED`. This means that
anyone assuming that the type and field name are synchronized could now be
broken.

### Codegen

While using the field name for generated APIs required less special-casing in
the generators, the field name ends up producing slightly-less-readable APIs for
multi-word camelcased groups. The result is that we see a fairly random-seeming
mix in different generators. Using protoc-explorer (not available externally),
we find the following:

<table>
  <tr>
   <td><strong>Language</strong>
   </td>
   <td><strong>Generated APIs</strong>
   </td>
   <td><strong>Example proto2 getter</strong>
   </td>
  </tr>
  <tr>
   <td>C++
   </td>
   <td>field
   </td>
   <td><code>MyGroup mygroup()</code>
   </td>
  </tr>
  <tr>
   <td>Java (all)
   </td>
   <td>message
   </td>
   <td><code>MyGroup getMyGroup()</code>
   </td>
  </tr>
  <tr>
   <td>Python
   </td>
   <td>field
   </td>
   <td><code>mygroup</code>
   </td>
  </tr>
  <tr>
   <td>Go (all)
   </td>
   <td>field
   </td>
   <td><code>GetMygroup() *Foo_MyGroup</code>
   </td>
  </tr>
  <tr>
   <td>Dart V1
   </td>
   <td>field/message*
   </td>
   <td><code>get mygroup</code>
   </td>
  </tr>
  <tr>
   <td>upb **
   </td>
   <td>field
   </td>
   <td><code>Foo_mygroup()</code>
   </td>
  </tr>
  <tr>
   <td>Objective-c
   </td>
   <td>message
   </td>
   <td><code>MyGroup* myGroup</code>
   </td>
  </tr>
  <tr>
   <td>Swift
   </td>
   <td>message
   </td>
   <td><code>MyGroup myGroup</code>
   </td>
  </tr>
  <tr>
   <td>C#
   </td>
   <td>field/message*
   </td>
   <td><code>MyGroup Mygroup</code>
   </td>
  </tr>
</table>

\* This codegen difference was [caught](cl/611144002) during the implementation
and intentionally "fixed" in Edition 2023. \
\*\* This includes all upb-based runtimes as well (e.g. Ruby, Rust, etc.) \
† Extensions use field

In the Dart V1 implementation, we decided to intentionally introduce a behavior
change on editions upgrades. It was determined that this only affected a handful
of protos in google3, and could probably be manually fixed as-needed. Java's
handling changes the story significantly, since over 50% of protos in google3
produce generated Java code. Objective-C is also noteworthy since we open-source
it, and Swift because it's widely used in OSS and we don't own it.

While the editions upgrade is still non-breaking, it means that the generated
APIs could have very surprising spellings and may not be unique. For example,
using the same type for two delimited fields in the same containing message will
create two sets of generated APIs with the same name in some languages!
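The collision can be sketched in a few lines of Python. The `Field` model and both naming helpers are hypothetical simplifications, not the real generator logic; they only assume that a message-based generator derives the accessor from the type name and a field-based one derives it from the field name.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str       # field name as written in the .proto file
    type_name: str  # short name of the field's message type


def message_based_getter(field: Field) -> str:
    """Hypothetical Java-style getter derived from the message name."""
    return "get" + field.type_name


def field_based_getter(field: Field) -> str:
    """Hypothetical getter derived from the field name (snake_case to CamelCase)."""
    return "get" + "".join(part.capitalize() for part in field.name.split("_"))


# Two delimited fields sharing one message type, which edition 2023 allows.
fields = [Field("first", "Payload"), Field("second", "Payload")]

by_message = Counter(message_based_getter(f) for f in fields)
by_field = Counter(field_based_getter(f) for f in fields)
```

Under these assumptions the message-based scheme emits the same accessor name twice, while the field-based scheme keeps the two fields distinct.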

### Text Format

Our "official"
[draft specification](https://protobuf.dev/reference/protobuf/textformat-spec/)
of text-format explicitly states that group messages are encoded by the *message
name*, rather than the lowercased field name. A group `MyGroup` will be
serialized as:

```
MyGroup {
  ...
}
```

In C++, we always serialize the message name and have special handling to only
accept the message name in parsing. We also have conformance tests locking down
the positive path here (i.e. using the message name round-trip). The negative
path (i.e. failing to accept the field name) doesn't have a conformance test,
but C++/Java/Python all agree and there's no known implementation that behaves
differently.

To make things even stranger, for *extensions* (group fields extending other
messages), we always use the field name for groups. So as far as group
extensions are concerned, there's no problem for editions.

There are a few problems with non-extension group fields in editions:

*   Refactoring the message name will change any text-format output
*   New delimited fields will have unexpected text-format output that *could*
    conflict with other fields
*   Text parsers will expect the message name, which is surprising and could be
    impossible to specify uniquely
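The first of these problems can be sketched as follows. The `Field` model and `legacy_text_name` are hypothetical; they assume only the behavior described in this section (non-extension delimited fields print the message name, everything else prints the field name) and ignore details such as how extensions are actually spelled in text-format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Field:
    name: str
    type_name: str
    delimited: bool = True
    is_extension: bool = False


def legacy_text_name(field: Field) -> str:
    """Name printed in text-format under the pre-fix rules described above."""
    if field.delimited and not field.is_extension:
        return field.type_name  # groups print their message name
    return field.name           # everything else prints the field name


original = Field(name="settings", type_name="Config")
# The message is refactored; the field declaration is untouched.
renamed = replace(original, type_name="ConfigV2")
```

Comparing `legacy_text_name(original)` with `legacy_text_name(renamed)` shows the printed key changing even though the field itself was never renamed.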

## Recommendation

Clearly the end-state we want is for the field name to be used in all generated
APIs, and for text-format serialization/parsing. The only questions are: how do
we get there and can/should we do it in time for the 2023 release in 27.0 next
month?

We propose a combination of the alternatives listed below.
[Smooth Extension](#smooth-extension) seems like the best short-term path
forward to unblock the delimited migration. It *mostly* solves the problem and
doesn't require any new features. The necessary changes for this approach have
already been prepared, along with new conformance tests to lock down the
behavior changes.

[Global Feature](#global-feature) is a good long-term mitigation for tech debt
we're leaving behind with *Smooth Extension*. Ultimately we would like to remove
any labeling of fields by their type, and editions provides a good mechanism to
do this. Alternatively, we could implement [aliases](#aliases) and use that to
unify this old behavior and avoid a new feature. Either of these options will be
the next step after the release of 2023, with aliases being preferred as long as
the timing works out.

If we hit any unexpected delays, Nerf Delimited Encoding in 2023 (not available
externally) is the quickest path forward to unblock the release of edition 2023.
It has a lot of downsides though, and will block any migration towards delimited
encoding until edition 2024 has started rolling out.

## Alternatives

### Smooth Extension {#smooth-extension}

Instead of trying to change the existing behavior, we could expand the current
spec to try to cover both proto2 and editions. We would define a "group-like"
concept, which applies to all fields which:

*   Have `DELIMITED` encoding
*   Have a type corresponding to a nested message directly under its containing
    message
*   Have a name corresponding to its lowercased type name.

Note that proto2 groups will *always* be "group-like."
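A minimal sketch of the three conditions as a predicate, written in Python against a simplified, hypothetical descriptor model (`Message`, `Field`, and `is_group_like` are illustrative names, not the real descriptor API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    name: str
    parent: Optional["Message"] = None  # enclosing message, if nested


@dataclass(frozen=True)
class Field:
    name: str
    message_type: Message
    containing: Message
    delimited: bool


def is_group_like(field: Field) -> bool:
    """Apply the three conditions listed above."""
    return (
        field.delimited
        and field.message_type.parent == field.containing
        and field.name == field.message_type.name.lower()
    )


foo = Message("Foo")
my_group = Message("MyGroup", parent=foo)
other = Message("Other")  # top-level, not nested under Foo
```

With this model, `Field("mygroup", my_group, foo, True)` is group-like, while a field named `my_group`, a non-delimited field, or a field whose type is the top-level `other` message is not.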

For any group-like field we will use the old proto2 semantics, whatever they are
today. Otherwise, we will treat them as regular fields for both codegen and
text-format. This means that *most* new cases of delimited encoding will have
the desired behavior, while *all* old groups will continue to function. The main
exception here is that users will see the unexpected proto2 behavior if they
have message/field names that *happen* to match.

While the old behavior will result in some unexpected capitalization when it's
hit, it's mostly safe. Because of the second and third conditions above (and the
fact that we disallow duplicate field names), we can guarantee that in both
codegen and text encoding there will never be any conflicting symbols. There can
never be two delimited fields of the same type using the old behavior, and no
other messages or fields will exist with either spelling.

Additionally, we will update the text parsers to accept **both** the old
message-based spelling and the new field-based spelling for group-like fields.
This will at least prevent parsing failures if users hit this unexpected change
in behavior.
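The parser side of this could look roughly like the following Python sketch. `accepted_text_names` is a hypothetical helper, and the `group_like` flag is assumed to come from a check like the one defined earlier in this section.

```python
from typing import Set


def accepted_text_names(field_name: str, message_name: str, group_like: bool) -> Set[str]:
    """Spellings a text parser would accept for one field under this proposal."""
    names = {field_name}  # the field-based spelling is always accepted
    if group_like:
        names.add(message_name)  # legacy message-based spelling
    return names


legacy_group = accepted_text_names("mygroup", "MyGroup", group_like=True)
new_delimited = accepted_text_names("settings", "Config", group_like=False)
```

Only group-like fields gain the second spelling, so new delimited fields cannot pick up a name that collides with another field.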

#### Pros

*   Fully supports old proto2 behavior
*   Treats most new editions fields correctly
*   Doesn't allow for any of the problematic cases we see today
*   By updating the parsers to accept both, we have a migration path to change
    the "wire"-format
*   Decoupled from editions launch (since it's a non-breaking change w/o a
    feature)

#### Cons

*   Requires coordinated changes in every editions-compatible runtime (and many
    generators)
*   Keeps the old proto2 behavior around indefinitely, with no path to remove it
*   Plants a surprising edge case for users if they happen to name their
    message/fields a certain way

### Global Feature {#global-feature}

The simplest answer here is to introduce a new global message feature
`legacy_group_handling` to control all the changes we'd like. This will only be
applicable to group-like fields (see
[Smooth Extension](#smooth-extension)). With this feature enabled,
these fields will always use their message name for text-format. Each
non-conformant language could also use this feature to gate the codegen rules.
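As a sketch of how such a gate might behave for text-format, assuming the feature is exposed as a simple boolean (the function and its parameters are hypothetical, not an existing API):

```python
def text_format_name(
    field_name: str,
    message_name: str,
    group_like: bool,
    legacy_group_handling: bool,
) -> str:
    """Pick the text-format spelling for a field under the proposed feature."""
    if group_like and legacy_group_handling:
        return message_name  # old proto2 behavior, opted into via the feature
    return field_name        # desired end state for every other field


with_feature = text_format_name("mygroup", "MyGroup", True, True)
without_feature = text_format_name("mygroup", "MyGroup", True, False)
```

The feature has no effect on fields that are not group-like, which matches the restriction stated above.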

#### Pros

*   Simple boolean to gate all the behavior changes
*   Doesn't require adding language features to a bunch of languages that don't
    have them yet
*   Uses editions to ratchet down the bad behavior

#### Cons

*   It's a little late in the game to be introducing new features to 2023
    (go/edition-lifetimes)
*   Requires coordinated changes in every editions-compatible runtime (and many
    generators)
*   The migration story for users is unclear. Overriding the value of this
    feature is both a "wire"-breaking and API-breaking change they may not be
    able to do easily.
*   With the feature set, users will still see all of the problems we have today

### Feature Suite

An extension of [Global Feature](#global-feature) would be to
split the codegen changes out into separate per-language features.

#### Pros

*   Simple booleans to gate all the distinct behavior changes
*   Uses editions to ratchet down the bad behavior
*   Better migration story for users, since it separates API and "wire" breaking
    changes

#### Cons

*   Requires a whole slew of new language features, which typically have a
    difficult first-time setup
*   Requires coordinated changes in every editions-compatible runtime (and many
    generators)
*   Increases the complexity of edition 2023 significantly
*   With the features set, users will still see all of the problems we have
    today

### Nerf Delimited Encoding in 2023

A quick fix to avoid releasing a bad feature would be to simply ban the case
where the message and field names don't match. Adding this validation to protoc
would cover the majority of cases, although we might want additional checks in
every language that supports dynamic messages.
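A validation pass of this kind might look like the Python sketch below. It is a hypothetical stand-in for the protoc check, operating on plain `(field name, message name)` pairs for delimited fields rather than on real descriptors, and the error wording is invented for illustration.

```python
from typing import List, Tuple


def validate_delimited_fields(fields: List[Tuple[str, str]]) -> List[str]:
    """Return one error per delimited field whose name does not match its type."""
    errors = []
    for field_name, message_name in fields:
        if field_name != message_name.lower():
            errors.append(
                f"delimited field '{field_name}' must be named "
                f"'{message_name.lower()}' to match message '{message_name}'"
            )
    return errors


errors = validate_delimited_fields([
    ("mygroup", "MyGroup"),  # shaped like a proto2 group: allowed
    ("settings", "Config"),  # new-style delimited field: rejected
])
```

Only the second entry produces an error, which illustrates why this option leaves `DELIMITED` no more expressive than proto2 groups.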

This is a good fallback option if we can't implement anything better before 27.0
is released. It allows us to release editions in a reasonable state, where we
can fix these issues and release a more functional `DELIMITED` feature in 2024.

#### Pros

*   Unblocks editions rollout
*   Easy and safe to implement
*   Avoids rushed implementation of a proper fix
*   Avoids runtime issues with text format
*   Avoids unexpected build breakages post-editions (e.g. renaming the nested
    message)

#### Cons

*   We'd still be releasing a really bad feature. Instead of opening up new
    possibilities, it's just "like groups but worse"
*   We couldn't fix this in 2023 without potential version skew from third party
    plugins. We'd likely have to wait until edition 2024
*   Might require coordinated changes in a lot of runtimes
*   Doesn't unblock our effort to roll out delimited

### Rename Fields in Editions

While it might be tempting to leverage the edition 2023 upgrade as a place we
can just rename the group field (e.g. rename `mygroup` to `my_group`), that
doesn't actually work. Because so many runtimes already use the *field name* in
generated APIs, they would break under this transformation.

#### Pros

*   Works really well for text-format and some languages

#### Cons

*   Turns 2023 upgrade into a breaking change for many languages

### Aliases {#aliases}

We've discussed aliases a lot mostly in the context of `Any`, but they would be
useful for any encoding scheme that locks down field/message names. If we had a
fully implemented alias system in place, it would be the perfect mitigation
here. Unfortunately, we don't have one yet, and the timeline here is probably
too tight to implement one.

#### Pros

*   Fixes all of the problems mentioned above
*   Allows us to specify the old behavior using the proto language, which allows
    it to be handled by Prototiller

#### Cons

*   We want this to be a real fully thought-out feature, not a hack rushed into
    a tight timeline

### Do Nothing

Doing nothing doesn't actually break anyone, but it is embarrassing.

#### Pros

*   Easy to do

#### Cons

*   Releases a horrible feature full of foot-guns in our first edition
*   Doesn't unblock our effort to roll out delimited