File: Tokenizer.pm

use strict;
use warnings;
package SQL::SplitStatement::Tokenizer;


use Exporter;

our @ISA = qw(Exporter);

our @EXPORT_OK= qw(tokenize_sql);

our $VERSION = '1.00023';

my $re= qr{
    (
        (?:--|\#)[\ \t\S]*      # single line comments
        |
        (?:<>|<=>|>=|<=|==|=|!=|!|<<|>>|<|>|\|\||\||&&|&|-|\+|\*(?!/)|/(?!\*)|\%|~|\^|\?)
                                # operators and tests
        |
        [\[\]\(\)\{\},;.]       # punctuation: brackets, parentheses, braces, comma, semicolon, dot
        |
        \'\'(?!\')              # empty single quoted string
        |
        \"\"(?!\"")             # empty double quoted string
        |
        "(?>(?:(?>[^"\\]+)|""|\\.)*)+"
                                # anything inside double quotes, ungreedy
        |
        `(?>(?:(?>[^`\\]+)|``|\\.)*)+`
                                # anything inside backticks quotes, ungreedy
        |
        '(?>(?:(?>[^'\\]+)|''|\\.)*)+'
                                # anything inside single quotes, ungreedy.
        |
        /\*[\ \t\r\n\S]*?\*/      # C style comments
        |
        (?:[\w:@]+(?:\.(?:\w+|\*)?)*)
                                # words, standard named placeholders, db.table.*, db.*
        |
        (?: \$_\$ | \$\d+ | \${1,2} )
                                # dollar expressions - eg $_$ $3 $$
        |
        \n                      # newline
        |
        [\t\ ]+                 # any kind of white spaces
    )
}smx;

sub tokenize_sql {
    my ( $query, $remove_white_tokens )= @_;

    # Every alternative in $re is wrapped in a single capturing group, so a
    # global match returns the query as a flat list of tokens.
    my @query= $query =~ m{$re}smxg;

    if ($remove_white_tokens) {
        # Optionally drop tokens made up only of whitespace.
        @query= grep( !/^[\s\n\r]*$/, @query );
    }

    return wantarray ? @query : \@query;
}

1;

=pod

=head1 NAME

SQL::SplitStatement::Tokenizer - A simple SQL tokenizer.

=head1 SYNOPSIS

 use SQL::SplitStatement::Tokenizer qw(tokenize_sql);

 my $query= q{SELECT 1 + 1};
 my @tokens= tokenize_sql($query);

 # @tokens now contains ('SELECT', ' ', '1', ' ', '+', ' ', '1')

=head1 DESCRIPTION

SQL::SplitStatement::Tokenizer is a simple tokenizer for SQL queries. It does
not claim to be a parser or query verifier. It just creates sane tokens from a
valid SQL query.

It supports SQL with comments like:

 -- This query is used to insert a message into
 -- logs table
 INSERT INTO log (application, message) VALUES (?, ?)
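
For instance, tokenizing a commented query (a minimal sketch; the token list
shown is what the comment rule is expected to produce, not captured module
output):

 my @tokens = tokenize_sql(
     "-- logs table\nINSERT INTO log (application) VALUES (?)", 1
 );
 # Each comment line is expected to come back as a single token:
 # ('-- logs table', 'INSERT', 'INTO', 'log', '(', 'application',
 #  ')', 'VALUES', '(', '?', ')')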

It also supports the C<''>, C<""> and C<\'> escaping methods, so tokenizing
queries like the one below should not be a problem:

 INSERT INTO log (application, message)
 VALUES ('myapp', 'Hey, this is a ''single quoted string''!')
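
Fed to C<tokenize_sql>, the quoted literal above is expected to come back as a
single token, doubled quotes included (a minimal sketch, not captured module
output):

 my @tokens = tokenize_sql(
     q{VALUES ('Hey, this is a ''single quoted string''!')}, 1
 );
 # Expected: ('VALUES', '(',
 #            "'Hey, this is a ''single quoted string''!'", ')')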

=head1 API

=over 4

=item tokenize_sql

    use SQL::SplitStatement::Tokenizer qw(tokenize_sql);

    my @tokens = tokenize_sql($query);
    my $tokens = tokenize_sql($query);

    $tokens = tokenize_sql( $query, $remove_white_tokens );

C<tokenize_sql> can be imported to the current namespace on request. It
receives a SQL query and returns an array of tokens if called in list context,
or an arrayref if called in scalar context.

If C<$remove_white_tokens> is true, whitespace-only tokens will be removed from
the result, as shown in the example below.

=back
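
For example, to drop the whitespace tokens (a minimal sketch; the exact token
list is what the regex is expected to yield, not captured module output):

    my @tokens = tokenize_sql( 'SELECT 1 + 1', 1 );
    # Expected: ('SELECT', '1', '+', '1')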

=head1 ACKNOWLEDGEMENTS

=over 4

=item *

Igor Sutton Lopes, for writing SQL::Tokenizer, from which this module was
forked.

=item *

Evan Harris, for implementing Shell comment style and SQL operators.

=item *

Charlie Hills, for spotting a lot of important issues I hadn't thought of.

=item *

Jonas Kramer, for fixing MySQL quoted strings and treating the dot as a
punctuation character correctly.

=item *

Emanuele Zeppieri, for asking to fix SQL::Tokenizer to support dollar signs as
well.

=item *

Nigel Metheringham, for extending the dollar sign support.

=item *

Devin Withers, for making it not choke on CR+LF in comments.

=item *

Luc Lanthier, for simplifying the regex and making it not choke on backslashes.

=back

=head1 AUTHOR

Copyright (c) 2007, 2008, 2009, 2010, 2011 Igor Sutton Lopes "<IZUT@cpan.org>". All rights
reserved.

Copyright (c) 2021 Veesh Goldman "<veesh@cpan.org>"

This module is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.

=cut