File: java-lite-for-editions.md

# Java Lite For Editions

**Author:** [@zhangskz](https://github.com/zhangskz)

**Approved:** 2023-05-26

## Background

The "Lite" implementation for Java utilizes a custom format for embedding
descriptors motivated by critical code-size and performance requirements for
Android.

The code generator for Java Lite encodes a descriptor-like info string which is
stored into `RawMessageInfo`. This is decoded into `MessageSchema` which serves
as the descriptor-like schema for Java Lite for parsing and serialization.

The current implementation makes significant use of an `is_proto3` bit in the
encoding, which is problematic for editions. Note that any parser changes to the
format would also need to maintain backwards compatibility, due to our
guarantees for parsers to remain backwards compatible within a major version.

## Overview

Fortunately, we already have corresponding bits for most
[Editions Zero Features](edition-zero-features.md) in the corresponding
`MessageInfo` field entry encoding.

We will move the remaining syntax usages that read `is_proto3` over to these
bits. Several other syntax usages need to be made editions-compatible by
merging implementations.

As new editions features are added that must be represented in `MessageInfo`, we
will eventually need to revamp `MessageInfo` encoding to support these changes.
However, this should be avoidable for Editions Zero.

## Recommendation

### Encoding: Add Is Edition Bit

`RawMessageInfo` should be augmented with an additional `is_edition` bit in
flags' unused bits.

\[0]: flags, flags & 0x1 = is proto2?, flags & 0x2 = is message?, flags &
**0x4 = is edition?**

The decoded `ProtoSyntax` should add a corresponding Editions option based on
this bit.

```
public enum ProtoSyntax {
  PROTO2,
  PROTO3,
  EDITIONS;
}
```

For now, there is no need to explicitly encode the raw editions string or
feature options. These resolved features will be encoded directly in their
corresponding field entries.
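
The flag layout above can be illustrated with a minimal Java sketch. The class name `SyntaxFlagsSketch`, the constant names, and the choice to check the edition bit before the proto2 bit are assumptions made for illustration; the actual decoding in the runtime may differ.

```java
// Illustrative sketch only: names and bit precedence are assumptions,
// not the runtime's actual implementation.
final class SyntaxFlagsSketch {
  enum ProtoSyntax { PROTO2, PROTO3, EDITIONS }

  static final int IS_PROTO2_BIT = 0x1;   // flags & 0x1 = is proto2?
  static final int IS_EDITION_BIT = 0x4;  // flags & 0x4 = is edition? (proposed)

  /** Maps the flags word at index 0 of the info string to a syntax value. */
  static ProtoSyntax decodeSyntax(int flags) {
    if ((flags & IS_EDITION_BIT) != 0) {
      return ProtoSyntax.EDITIONS;
    }
    return (flags & IS_PROTO2_BIT) != 0 ? ProtoSyntax.PROTO2 : ProtoSyntax.PROTO3;
  }
}
```

Because older gencode never sets `0x4`, a decoder written this way would keep reading existing proto2 and proto3 info strings unchanged.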

### Encoding: Editions Zero Features

Field entries in `RawMessageInfo` already encode bits corresponding to most
***resolved*** Editions Zero features in `GetExperimentalJavaFieldType`. This is
decoded in `fieldTypeWithExtraBits` by reading the corresponding bits.

<table>
  <tr>
   <td><strong>Edition Zero Feature</strong>
   </td>
   <td><strong>Existing Encoding </strong>
   </td>
   <td><strong>Changes</strong>
   </td>
  </tr>
  <tr>
   <td>features.field_presence
   </td>
   <td> <code>kHasHasBit (0x1000)</code>
   </td>
   <td>Keep as-is.
   </td>
  </tr>
  <tr>
   <td>java.legacy_closed_enum
   </td>
   <td><code>kMapWithProto2EnumValue (0x800)</code>
   </td>
   <td>Replace with <code>kLegacyEnumIsClosedBit</code>
<p>
This will now be set for all enum fields, instead of just enum map values.
<p>
We will still need to check syntax in the interim in case of gencode.
   </td>
  </tr>
  <tr>
   <td><em>features.enum_type</em>
   </td>
   <td><em><code>EnumLiteGenerator</code> writes <code>UNRECOGNIZED(-1)</code> value for open enums in gencode.</em>
<p>
<em>This is not encoded in MessageInfo since this is an enum feature.</em>
   </td>
   <td><em>This is not needed in Editions Zero since enum closedness in Java Lite's runtime is dictated per-field by java.legacy_closed_enum (<a href="edition-zero-feature-enum-field-closedness.md">Edition Zero Feature: Enum Field Closedness</a>), but it should be used once the Java non-conformance is fixed.</em>
<p>
<em>Note, this is implicitly encoded in kLegacyEnumIsClosedBit if java.legacy_closed_enum is unset since the corresponding FieldDescriptor helper should fall back on the EnumDescriptor.</em>
   </td>
  </tr>
  <tr>
   <td>features.repeated_field_encoding
   </td>
   <td><code>GetExperimentalJavaFieldTypeForPacked</code>
   </td>
   <td>Keep as-is.
   </td>
  </tr>
  <tr>
   <td>features.string_field_validation
   </td>
   <td><code>kUtf8CheckBit (0x200)</code>
   </td>
   <td>Keep as-is.
<p>
HINT does not apply to Java and will have the same behavior as MANDATORY or NONE
   </td>
  </tr>
  <tr>
   <td>features.message_encoding
   </td>
   <td>Not present.
   </td>
   <td>Encode as type group.
<p>
See below.
   </td>
  </tr>
</table>

Several places already use these bits properly, but there are a few syntax
usages in the decoding that should be replaced by checking the corresponding
feature bit.
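
As a rough illustration of reading these feature bits instead of syntax, the following Java sketch wraps each bit in a named helper. The bit values come from the table above; the class name, the helper names, and the assumption that the low byte of the entry holds the field type are hypothetical.

```java
// Illustrative sketch only: helper names and the field-type mask are assumptions.
final class FieldFeatureBitsSketch {
  static final int FIELD_TYPE_MASK = 0xFF;             // assumed: low byte is the field type
  static final int UTF8_CHECK_BIT = 0x200;             // features.string_field_validation
  static final int LEGACY_ENUM_IS_CLOSED_BIT = 0x800;  // java.legacy_closed_enum
  static final int HAS_HAS_BIT = 0x1000;               // features.field_presence

  /** Strips the feature bits and returns only the field type number. */
  static int fieldType(int fieldTypeWithExtraBits) {
    return fieldTypeWithExtraBits & FIELD_TYPE_MASK;
  }

  static boolean hasExplicitPresence(int fieldTypeWithExtraBits) {
    return (fieldTypeWithExtraBits & HAS_HAS_BIT) != 0;
  }

  static boolean isEnforceUtf8(int fieldTypeWithExtraBits) {
    return (fieldTypeWithExtraBits & UTF8_CHECK_BIT) != 0;
  }

  static boolean isLegacyEnumClosed(int fieldTypeWithExtraBits) {
    return (fieldTypeWithExtraBits & LEGACY_ENUM_IS_CLOSED_BIT) != 0;
  }
}
```

Under these assumptions, a string field with UTF-8 checking and a has-bit would carry the value `8 | 0x200 | 0x1000`, and none of the helpers would need to consult `is_proto3`.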

There are several unused bits that we could use for future field-level features
before breaking the encoding format, but we should not need these for editions
zero.

The results of the `is_proto3` and feature bits only seem to be used within
protobuf, and don't seem to be publicly exposed.

#### features.message_encoding

In the compiler, message fields with `features.message_encoding = DELIMITED`
should be treated as a group *before* encoding message info.

This means that `GetExperimentalJavaFieldTypeForSingular` should encode the
field's type as `GROUP` (17) instead of its actual type `MESSAGE` (9), e.g.

```
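int GetExperimentalJavaFieldTypeForSingular(const FieldDescriptor* field) {
  int result = field->type();
  // DELIMITED message fields are encoded as groups.
  if (result == FieldDescriptor::TYPE_MESSAGE && field->isDelimited()) {
    return 17;  // GROUP
  }
  // ... existing mapping for all other field types is unchanged ...
}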
```

`ImmutableMessageFieldLiteGenerator::GenerateFieldInfo` calls this when
generating the message field's field info.

The nested message's `MessageInfo` encoding does not need to be changed as this
is already identical for group and message.

Since each message field will be handled separately, this means that the
post-editions proto file below

```
// foo.proto
edition = "tbd";

message Foo {
  message Bar {
    int32 x = 1;
    repeated int32 y = 2;
  }
  Bar bar = 1 [features.message_encoding = DELIMITED];
  Bar baz = 2; // not DELIMITED

}
```

will be encoded and treated by `MessageSchema` like its pre-editions equivalent
below.

```
syntax = "proto2";

message Foo {
  optional group Bar = 1 {
    optional int32 x = 1;
    repeated int32 y = 2;
  }
  optional Bar baz = 2; // not DELIMITED
}
```

We recommend this approach because it minimizes changes to the encoding and to
how groups are treated.
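
To make the difference concrete, here is a small Java sketch of how the two fields of `Foo` above would differ on the wire: a group is framed by start-group and end-group tags, while a regular message field is length-prefixed. The class and method names are hypothetical, and the sketch assumes field numbers below 16 and values and payloads small enough to fit in single-byte varints.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only: not the Java Lite runtime's serializer.
final class DelimitedWireSketch {
  static final int WIRETYPE_VARINT = 0;
  static final int WIRETYPE_LENGTH_DELIMITED = 2;
  static final int WIRETYPE_START_GROUP = 3;
  static final int WIRETYPE_END_GROUP = 4;

  /** Builds a one-byte tag; valid only for field numbers 1 through 15. */
  static int tag(int fieldNumber, int wireType) {
    return (fieldNumber << 3) | wireType;
  }

  /** Serializes a Bar with only x set; x must be in [0, 127]. */
  static byte[] barPayload(int x) {
    return new byte[] {(byte) tag(1, WIRETYPE_VARINT), (byte) x};
  }

  /** Regular message field: tag, payload length, payload. */
  static byte[] asLengthPrefixed(int fieldNumber, byte[] payload) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(tag(fieldNumber, WIRETYPE_LENGTH_DELIMITED));
    out.write(payload.length);  // single-byte varint, so the length must be < 128
    out.write(payload, 0, payload.length);
    return out.toByteArray();
  }

  /** DELIMITED (group) field: start-group tag, payload, end-group tag. */
  static byte[] asDelimited(int fieldNumber, byte[] payload) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(tag(fieldNumber, WIRETYPE_START_GROUP));
    out.write(payload, 0, payload.length);
    out.write(tag(fieldNumber, WIRETYPE_END_GROUP));
    return out.toByteArray();
  }
}
```

With `x = 1`, `asDelimited(1, barPayload(1))` produces the bytes `0B 08 01 0C` for `bar`, while `asLengthPrefixed(2, barPayload(1))` produces `12 02 08 01` for `baz`. The nested payload is identical in both cases; only the framing differs.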

In a future breaking change, we could consider renaming `FieldType.GROUP` to
`FieldType.MESSAGE_DELIMITED` while preserving the same number and encoding for
clarity. For now, we will leave the naming for this enum as-is.

##### Alternative: Add kIsMessageEncodingDelimitedBit

Alternatively, we could encode `features.message_encoding = DELIMITED` as-is as
type `MESSAGE`. The `MessageInfo` encoding would encode these as a normal
message field, using an unused (0x1100) bit as `kIsMessageEncodingDelimitedBit`.

This could be used to indicate that the message should be parsed/serialized from
the wire-format as if it were a group. This would need to be passed along to
`MessageSchema` which would then handle treating Messages with this bit set as
groups e.g. in `case Message`.

This is less ideal, since it would require handling this in multiple places.

### Unify non-feature syntax usages

There are several places that branch on syntax into separate proto2/proto3
codepaths. These generally duplicate a lot of code and should be unified into a
single syntax-agnostic code path branching on the relevant feature bits.

This code tends to be pretty opaque, so we should document this with comments or
add helpers (e.g. `isEnforceUtf8`) to indicate what feature bits are used as we
make changes here.

<table>
  <tr>
   <td><code>ManifestSchemaFactory.newSchema()</code>
   </td>
   <td>MessageInfo -> Schema
   </td>
   <td>Allow extensions for editions.
   </td>
  </tr>
  <tr>
   <td><code>MessageSchema.getSerializedSize()</code>
   </td>
   <td>Message -> Serialized Size
   </td>
   <td>Unify getSerializedSizeProto2/3
   </td>
  </tr>
  <tr>
   <td><code>MessageSchema.writeTo()</code>
   </td>
   <td>Serialize Message
   </td>
   <td>Unify writeFieldsInAscendingOrderProto2/3
   </td>
  </tr>
  <tr>
   <td><code>MessageSchema.mergeFrom()</code>
   </td>
   <td>Parse Message
   </td>
   <td>Unify parseProto2/3Message
   </td>
  </tr>
  <tr>
   <td><code>DescriptorMessageInfoFactory.convert()</code>
   </td>
   <td>Descriptor -> MessageInfo
   </td>
   <td>Unify convertProto2/3
   </td>
  </tr>
</table>

There is a lot of dead code in Java Lite so several syntax usages can also be
deleted or merged where possible.
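
As one example of what a unified code path could look like, the sketch below decides whether an `int32` field should be serialized by branching on the presence bit rather than on syntax. The class and method names are hypothetical, and the real `MessageSchema` logic covers many more field types and cases.

```java
// Illustrative sketch only: not the actual MessageSchema implementation.
final class UnifiedPresenceSketch {
  static final int HAS_HAS_BIT = 0x1000;  // kHasHasBit from the field entry encoding

  /**
   * Returns whether an int32 field should be written, without consulting syntax.
   * hasBitSet is only meaningful when the field entry carries HAS_HAS_BIT.
   */
  static boolean shouldSerializeInt32(
      int fieldTypeWithExtraBits, boolean hasBitSet, int value) {
    if ((fieldTypeWithExtraBits & HAS_HAS_BIT) != 0) {
      // Explicit presence: serialize whenever the field was set, even to 0.
      return hasBitSet;
    }
    // Implicit presence: serialize only non-default values.
    return value != 0;
  }
}
```

A single helper like this would serve proto2, proto3, and editions messages alike, since each field's resolved presence is already carried in its entry bits.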

## Alternatives

### Alternative 1: Introduce New Backwards-compatible MessageInfo Encoding

Add a new backwards-compatible `MessageInfo` encoding for editions.

The `is_edition` bit could toggle the encoding format being used, where
`is_edition == true` indicates the new encoding format but `is_edition == false`
indicates the old encoding.

This would allow us to encode additional information that the current encoding
format does not currently have available bits to support, such as the editions
string or additional features.

For example, the current encoding format only has a fixed number of available
field entry bits where we could encode new feature bits. We will need to
introduce a new encoding format once we exceed these, or if we want to encode
features at the message level.

In a future major version bump when support for proto2/3 is officially dropped,
we could drop support for the previous encoding format.

The recommendation is to revisit alternative 1 along with alternative 2
post-Editions zero as we need to support additional feature bits.

#### Pros

*   Future-proof for future editions and features

#### Cons

*   Blocks editions zero on more complex encoding changes that won't be used
    yet.
*   Requires more invasive updates to all MessageInfo decodings

### Alternative 2: Move to MiniDescriptor encoding

We could switch Java Lite to use the MiniDescriptor encoding specification.

Like Java Lite, this encoding seems to be optimized to be lightweight and with
minimal descriptor information.

MiniDescriptors do not encode proto2/proto3 syntax currently, which makes them
mostly editions-compatible. MiniDescriptors encode FieldModifier/MessageModifier
bits that correspond to some editions zero features, similarly to the Java Lite
field feature bits, and can be augmented to support additional features.

Supposedly, this encoding format *should* support an arbitrary number of
modifier bits, but this needs to be double-checked to verify there isn't a
similar hard limit to the number of features.

It is unclear whether this is sufficiently optimized for Android's needs and how
compatible this would be with Java Lite's Schemas.

The recommendation is to revisit alternative 2 along with alternative 1
post-Editions zero as we need to support additional feature bits.

#### Pros

*   Unify implementations for lower long-term maintenance cost

*   MiniDescriptor encoding will eventually need to be updated for editions
    anyway.

#### Cons

*   Blocks editions zero on more complex encoding changes that aren't necessary.

*   Requires even more invasive updates to all MessageInfo decodings

*   Probably requires major version bumps to break compatibility

*   Unknown code size/schema compatibility constraints that would need to be
    explored.

*   There are a few possible changes to MiniDescriptors on the table that we
    should wait to settle before bringing on additional implementations.

### Alternative 3: Do Nothing

Doing nothing is always an alternative.

#### Pros

*   No work

#### Cons

*   Editions is blocked since Java Lite protos are stuck in the past