File: 0004-predicate-regex-support.md

# `Predicate` `Regex` Support

* Proposal: SF-0004
* Author(s): [Jeremy Schonfeld](https://github.com/jmschonfeld)
* Review Manager: [Charles Hu](https://github.com/iCharlesHu)
* Status: **Accepted**
* Implementation: [apple/swift-foundation#380](https://github.com/apple/swift-foundation/pull/380)

## Introduction/Motivation

`NSPredicate` supports complex string pattern matching via regular expressions. For example, you could write an `NSPredicate` such as `NSPredicate(format: "zipcode MATCHES %@", "\\d{5}(-\\d{4})?")` to match records where the `zipcode` string property is a valid US postal code. The new Swift `Predicate` type supports basic string matching using functions/operators such as `==`, `contains`, `localizedStandardContains`, `localizedCompare`, and `caseInsensitiveCompare`. However, `Predicate` does not currently support complex pattern matching operations such as regular expression matching.

## Proposed solution and example

To continue toward feature parity with `NSPredicate` and ease `Predicate` adoption for developers with complex matching logic, we'd like to add regex support to `Predicate`. We propose adding new APIs that allow developers to use the Swift `Regex` type within a `Predicate`. For example:

```swift
let regex = Regex {
	Anchor.startOfSubject
	Repeat(.digit, count: 5)
	Optionally {
		"-"
		Repeat(.digit, count: 4)
	}
	Anchor.endOfSubject
}

let predicate = #Predicate<Address> {
	$0.zipcode.contains(regex)
}

// - OR -

let predicate = #Predicate<Address> {
	$0.zipcode.contains(/^\d{5}(-\d{4})?$/)
}
```
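
As a rough illustration of how such a predicate might be used once built, the sketch below evaluates it against two sample values. The `Address` struct and the `makeZipPredicate` helper are hypothetical names introduced only for this example, and the code assumes a toolchain whose Foundation already includes the regex support proposed here. The regex is constructed at runtime from a string so that the example does not depend on bare-slash literal settings.

```swift
import Foundation

// Hypothetical model type, used only for this illustration.
struct Address {
    var zipcode: String
}

// Hypothetical helper: builds a predicate from a runtime-constructed regex
// equivalent to the literal /^\d{5}(-\d{4})?$/ shown above.
func makeZipPredicate() throws -> Predicate<Address> {
    let regex = try Regex(#"^\d{5}(-\d{4})?$"#)
    return #Predicate<Address> {
        $0.zipcode.contains(regex)
    }
}

let zipPredicate = try makeZipPredicate()

// `evaluate` throws, so each call is marked with `try`.
let acceptsZipPlusFour = try zipPredicate.evaluate(Address(zipcode: "94103-1234"))
let acceptsShortCode = try zipPredicate.evaluate(Address(zipcode: "9410"))

print(acceptsZipPlusFour, acceptsShortCode)
```

Because the regex is anchored at both ends, only values that are entirely a five-digit or ZIP+4 code should satisfy the predicate.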

## Detailed design

We propose adding the following APIs to support using regular expressions within predicates:

```swift
extension PredicateExpressions {
	@available(FoundationPreview 0.4, *)
	public struct StringContainsRegex<
		Subject : PredicateExpression,
		Regex : PredicateExpression
	> : PredicateExpression, CustomStringConvertible
	where
		Subject.Output : BidirectionalCollection,
		Subject.Output.SubSequence == Substring,
		Regex.Output : RegexComponent
	{
		public typealias Output = Bool
		
		public let subject: Subject
		public let regex: Regex
		
		public init(subject: Subject, regex: Regex)
	}
	
	@available(FoundationPreview 0.4, *)
	public static func build_contains<Subject, Regex>(_ subject: Subject, _ regex: Regex) -> StringContainsRegex<Subject, Regex>
}

@available(FoundationPreview 0.4, *)
extension PredicateExpressions.StringContainsRegex : Sendable where Subject : Sendable, Regex : Sendable {}

@available(FoundationPreview 0.4, *)
extension PredicateExpressions.StringContainsRegex : Codable where Subject : Codable, Regex : Codable {}

@available(FoundationPreview 0.4, *)
extension PredicateExpressions.StringContainsRegex : StandardPredicateExpression where Subject : StandardPredicateExpression, Regex : StandardPredicateExpression {}
```

Additionally, we will add the following APIs to support storing a predicate-supported regex constant value:

```swift
extension PredicateExpressions {
	@available(FoundationPreview 0.4, *)
	public struct PredicateRegex : Sendable, Codable, RegexComponent, CustomStringConvertible {
		var regex: Regex<AnyRegexOutput> { get }
		var stringRepresentation: String { get }
		
		public init?(_ component: some RegexComponent)
	}
	
	@available(FoundationPreview 0.4, *)
	public static func build_Arg(_ component: some RegexComponent) -> Value<PredicateRegex>
}
```

This `PredicateRegex` type will be the `Codable & Sendable` storage for an underlying `RegexComponent`. Rather than storing the `RegexComponent` (which is not `Codable & Sendable`) directly in a `PredicateExpressions.Value`, this `build_Arg` overload allows us to store it inside of our wrapper type. We cannot catch all cases of unsupported regular expressions at compile time, so the `build_Arg` overload will `fatalError` at runtime when the developer has constructed a non-representable regex. The `PredicateRegex` initializer is failable, which allows developers performing manual predicate construction to decide how to handle non-representable regular expressions. We support all regular expressions that can be transformed to a textual representation; unsupported expressions include those built with capture transform closures or custom parsers.
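
The failable initializer can be sketched in use as follows. This is a minimal example under stated assumptions: `UnsupportedRegexError` and `makeStoredRegex` are hypothetical names that are not part of Foundation, and the code assumes a Foundation version that ships `PredicateExpressions.PredicateRegex` as declared above. It converts the `nil` result into a thrown error, which is one possible policy for code that constructs predicates manually.

```swift
import Foundation
import RegexBuilder

// Hypothetical error type for this sketch; not part of Foundation.
struct UnsupportedRegexError: Error {}

// Hypothetical helper: wraps a regex for manual predicate construction,
// throwing instead of trapping when the regex has no textual representation.
func makeStoredRegex(
    _ component: some RegexComponent
) throws -> PredicateExpressions.PredicateRegex {
    guard let stored = PredicateExpressions.PredicateRegex(component) else {
        throw UnsupportedRegexError()
    }
    return stored
}

// A purely textual regex, which the proposal describes as supported.
let textual = try Regex(#"^\d{5}(-\d{4})?$"#)

// A capture with a transform closure, which the proposal lists as unsupported.
let withTransform = Capture {
    OneOrMore(.digit)
} transform: { Int($0)! }

let textualIsSupported = (try? makeStoredRegex(textual)) != nil
let transformIsRejected = (try? makeStoredRegex(withTransform)) == nil

print(textualIsSupported, transformIsRejected)
```

Code that uses the `#Predicate` macro does not get this choice, since the macro goes through the `build_Arg` overload and traps on an unsupported regex.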

_Note: The syntax returned by the `stringRepresentation` property will follow the Swift regex literal syntax as defined by [SE-0355](https://github.com/apple/swift-evolution/blob/main/proposals/0355-regex-syntax-run-time-construction.md#syntax) which is a syntactic "superset" of a set of popular regular expression engines._

## Source compatibility

The proposed changes are additive and there is no impact expected on existing source.

## Implications on adoption

The new API has an availability of FoundationPreview 0.4 or later.

## Alternatives considered

### Separate `Codable & Sendable` `Regex` type

Currently, all regular expressions are represented by the `Regex` type (and/or `RegexComponent` protocol), which are neither `Codable` nor `Sendable`. We could introduce a separate type/protocol that has a `Codable & Sendable` requirement (as well as a requirement to convert to a textual representation / to be introspectable). However, this would require a considerable amount of new, duplicated API and would raise a number of questions about which type hierarchy a given regex construction should produce. Given the amount of effort and the uncertainty around whether we can establish a fully statically-checked approach, we've decided the best option is to validate whether a regular expression is supported at runtime. We expect the overwhelming majority of expressions used in predicates to be supported, which should minimize the impact on the developer experience. If we decide to create a new `Regex` type in the future that matches these requirements, we can add new APIs to `Predicate` to support this type as well.

### Supporting a whole match in addition to `contains`

This API has no dedicated whole-match operation: to determine whether a string fully matches a regular expression, developers must add start/end anchors (`^`/`$`) to the regex passed to `contains`. As it stands today, there is no API that returns a `Bool` value indicating whether a string has a whole match. We could instead choose to support the existing `wholeMatch` API, which returns a `Regex.Match?`, for example:

```swift
let predicate = #Predicate<Address> {
	$0.zipcode.wholeMatch(of: /\d{5}(-\d{4})?/) != nil
}
```

However, this API would be rather difficult for developers to discover and use. The `Regex.Match` type is neither `Codable` nor `Sendable`, so developers would only be able to compare against a `nil` value. It may also be tempting for developers to access various properties on `Regex.Match` such as `output`, `range`, or its captures, which would not be supported in any SwiftData query or `NSPredicate` conversion. For these reasons, I've only proposed support for the `contains` function, to which developers can add start/end anchors in order to accomplish the behavior of a whole match. If a `Bool`-returning whole match function were added to the standard library in the future, we could choose to add support for that in addition to the existing support for the `contains` function.
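
The difference between an anchored `contains` and an unanchored one can be seen on plain strings, outside of any predicate. The sketch below uses only standard library regex APIs; the sample strings and constant names are made up for illustration.

```swift
// The same pattern, with and without start/end anchors.
let unanchored = try Regex(#"\d{5}(-\d{4})?"#)
let anchored = try Regex(#"^\d{5}(-\d{4})?$"#)

let exact = "12345-6789"
let embedded = "zip: 12345, USA"

// Without anchors, `contains` succeeds on any matching substring.
let exactContainsUnanchored = exact.contains(unanchored)        // true
let embeddedContainsUnanchored = embedded.contains(unanchored)  // true

// With anchors, `contains` only succeeds when the whole string matches.
let exactContainsAnchored = exact.contains(anchored)            // true
let embeddedContainsAnchored = embedded.contains(anchored)      // false

// `wholeMatch(of:)` on the unanchored regex agrees with the anchored `contains`.
let exactWholeMatch = exact.wholeMatch(of: unanchored) != nil        // true
let embeddedWholeMatch = embedded.wholeMatch(of: unanchored) != nil  // false

print(exactContainsAnchored == exactWholeMatch,
      embeddedContainsAnchored == embeddedWholeMatch)
```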

### Alternatives to `fatalError` in the new `build_Arg` overload

In the new `build_Arg` overload for regex constants, we will `fatalError` if provided a regex that is not supported by `Predicate` (see details above). Throwing an `Error` is not an option here because `Predicate` construction is non-throwing and thus `build_Arg` must also be non-throwing. A possible alternative to this `fatalError` would be to allow any regex to be included in a predicate, but `throw` during evaluation of the predicate. However, this approach has a handful of drawbacks, detailed below:

1. Non-supported regex components may not be `Sendable`, and including them could allow non-`Sendable` information to be stored within a `Predicate`. While we could take care to check for support before ever using or exposing the value, this approach would be prone to violating the `Sendable` contract that Swift concurrency relies on.
2. `Predicate` evaluation may take place far from where the `Predicate` was constructed (potentially even in a different library or process). While throwing an error may be more resilient than a `fatalError`, increasing the distance between where the mistake (the invalid regex) was made and where it is reported makes the issue harder to debug, and the library/process that encounters the failure may not be the one best placed to address it.
3. Doing so would also add more hoops to jump through for predicate inspection. With the current proposal, every regular expression that a `Predicate` can contain is always able to produce a `String` of its contents for inspection/usage. Pushing this error to evaluation time would also require pushing it to each predicate conversion routine, which may also be unfavorable.

For these reasons, I've chosen the approach of a `fatalError` during construction to call out the developer error in the best way we're able.