File: LoadLines.pm

package info (click to toggle)
libfile-loadlines-perl 1.047-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 204 kB
  • sloc: perl: 545; makefile: 2
file content (399 lines) | stat: -rw-r--r-- 10,898 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
#! perl

package File::LoadLines;

use warnings;
use strict;
use Exporter qw(import);
our @EXPORT = qw( loadlines );
our @EXPORT_OK = qw( loadblob );
use Encode;
use Carp;
use utf8;

=head1 NAME

File::LoadLines - Load lines from files and network 

=cut

our $VERSION = '1.047';

=head1 SYNOPSIS

    use File::LoadLines;
    my @lines = loadlines("mydata.txt");

    use File::LoadLines qw(loadblob);
    my $img = loadblob("https://img.shields.io/badge/Language-Perl-blue");

=head1 DESCRIPTION

File::LoadLines provides an easy way to load the contents of a text
file into an array of lines. It is intended for small to moderate size files
like config files that are often produced by weird tools (and users).

It will transparently fetch data from the network if the provided file
name is a URL.

File::LoadLines automatically handles ASCII, Latin-1 and UTF-8 text.
When the file has a BOM, it handles UTF-8, UTF-16 LE and BE, and
UTF-32 LE and BE.

Recognized line terminators are NL (Unix, Linux), CRLF (DOS, Windows)
and CR (Mac)

Function loadblob(), exported on depand, fetches the content and
returns it without processing, equivalent to File::Slurp and ilk.

=head1 EXPORT

By default the function loadlines() is exported.

=head1 FUNCTIONS

=head2 loadlines

    @lines = loadlines("mydata.txt");
    @lines = loadlines("mydata.txt", $options);

The file is opened, read, decoded and split into lines
that are returned in the result array. Line terminators are removed.

In scalar context, returns an array reference.

The first argument may be the name of a file, an opened file handle,
or a reference to a string that contains the data.
The name of a file on disk may start with C<"file://">, this is ignored.
If the name starts with C<"http:"> or C<"https:"> the data will be
retrieved using LWP.
L<Data URLs|https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs> like C<"data:text/plain;base64,SGVsbG8sIFdvcmxkIQ=="> are
also supported.

The second argument can be used to influence the behaviour.
It is a hash reference of option settings.

Note that loadlines() is a I<slurper>, it reads the whole file into
memory and, for splitting, requires temporarily memory for twice the
size of the file.

=over

=item split

Enabled by default.

The data is split into lines and returned as an array (in list
context) or as an array reference (in scalar context).

If set to zero, the data is not split into lines but returned as a
single string.

=item chomp

Enabled by default.

Line terminators are removed from the resultant lines.

If set to zero, the line terminators are not removed.

=item encoding

If specified, loadlines() will use this encoding to decode the file
data if it cannot automatically detect the encoding.

If you pass an options hash, File::LoadLines will set C<encoding> to
the encoding it detected and used for this file data.

=item blob

If specified, the data read is not touched but returned exactly as read.

C<blob> overrules C<split> and C<chomp>.

=item fail

If specified, it should be either C<"hard"> or C<"soft">.

If C<"hard">, read errors are signalled using croak exceptions.
This is the default.

If set to C<"soft">, loadlines() will return an empty result and set
the error message in the options hash with key C<"error">.

=back

=cut

sub loadlines {
    my ( $filename, $options ) = @_;
    croak("Missing filename.\n") unless defined $filename;
    croak("Invalid options.\n")  if (defined $options && (ref($options) ne "HASH"));

    $options->{blob}  //= 0;
    $options->{split} //= !$options->{blob};
    $options->{chomp} //= !$options->{blob};
    $options->{fail}  //= "hard";

    my $data;			# slurped file data
    my $encoded;		# already encoded

    # Gather data from the input.
    if ( ref($filename) ) {
	if ( ref($filename) eq 'GLOB' || ref($filename) eq 'IO::File' ) {
	    binmode( $filename, ':raw' );
	    $data = do { local $/; <$filename> };
	    $filename = "__GLOB__";
	}
	else {
	    $data = $$filename;
	    $filename = "__STRING__";
	    $encoded++;
	}
    }
    elsif ( $filename eq '-' ) {
	$filename = "__STDIN__";
	binmode( STDIN, ':raw' );
	$data = do { local $/; <STDIN> };
    }
    elsif ( $filename =~ /^https?:/ ) {
	require LWP::UserAgent;
	my $ua = LWP::UserAgent->new( timeout => 20 );
	my $res = $ua->get($filename);
	if ( $res->is_success ) {
	    $data = $res->decoded_content;
	}
	elsif ( $options->{fail} eq "soft" ) {
	    $options->{error} = $res->status_line;
	    return;
	}
	else {
	    croak("$filename: ", $res->status_line);
	}
    }
    elsif ( $filename =~ /^data:/ ) {
	unless ( $filename =~ m! ^ data:
				 (?<mediatype> .*? )
				 ,
				 (?<data>      .*  ) $
			  !sx ) {
	    if ( $options->{fail} eq "soft" ) {
		$options->{error} = "Malformed inline data";
		return;
	    }
	    else {
		croak("Malformed inline data");
	    }
	}
	$data = $+{data};
	$filename = "__DATA__";
	my $mediatype = $+{mediatype};
	my $enc = "";
	if ( $mediatype && $mediatype =~ /^(.*);base64$/ ) {
	    $mediatype = $1;
	    $enc = "base64";
	}
	$options->{mediatype} = $mediatype if $mediatype;
	if ( ! $enc ) {
	    # URL encoded.
	    $data = $+{data};
	    $data =~ s/\%([0-9a-f][0-9a-f])/chr(hex($1))/ige;
	}
	else {
	    # Base64.
	    require MIME::Base64;
	    $data = MIME::Base64::decode($data);
	}
	if ( $mediatype && $mediatype =~ /;charset=([^;]*)/ ) {
	    $data = decode( $1, $data );
	    $options->{encoding} = $1;
	    $encoded++;
	}
    }
    else {
	my $name = $filename;
	$name =~ s;^file://;;;
	$filename = decode_utf8($name);
	# On MS Windows, non-latin (wide) filenames need special treatment.
	if ( $filename ne $name && $^O =~ /mswin/i ) {
	    require Win32API::File;
	    my $fn = encode('UTF-16LE', "$filename").chr(0).chr(0);
	    my $fh = Win32API::File::CreateFileW
	      ( $fn, Win32API::File::FILE_READ_DATA(), 0, [],
		Win32API::File::OPEN_EXISTING(), 0, []);
	    croak("$filename: $^E (Win32)\n") if $^E;
	    unless ( Win32API::File::OsFHandleOpen( 'FILE', $fh, "r") ) {
		$options->{error} = "$!", return if $options->{fail} eq "soft";
		croak("$filename: $!\n");
	    }
	    binmode FILE => ':raw';
	    $data = do { local $/; readline(\*FILE) };
	    # warn("$filename³: len=", length($data), "\n");
	    close(FILE);
	}
	else {
	    my $f;
	    unless ( open( $f, '<:raw', $filename ) ) {
		$options->{error} = "$!", return
		  if $options->{fail} eq "soft";
		croak("$name: $!\n");
	    }
	    $data = do { local $/; <$f> };
	}
    }
    $options->{_filesource} = $filename if $options;

    my $name = encode_utf8($filename);
    if ( $options->{blob} ) {
	# Do not touch.
	$options->{encoding} = 'Blob';
    }
    elsif ( $encoded ) {
	# Nothing to do, already dealt with.
	$options->{encoding} //= 'Perl';
    }

    # Detect Byte Order Mark.
    elsif ( $data =~ /^\xEF\xBB\xBF/ ) {
	warn("$name is UTF-8 (BOM)\n") if $options->{debug};
	$options->{encoding} = 'UTF-8';
	$data = decode( "UTF-8", substr($data, 3) );
    }
    elsif ( $data =~ /^\xFE\xFF/ ) {
	warn("$name is UTF-16BE (BOM)\n") if $options->{debug};
	$options->{encoding} = 'UTF-16BE';
	$data = decode( "UTF-16BE", substr($data, 2) );
    }
    elsif ( $data =~ /^\xFF\xFE\x00\x00/ ) {
	warn("$name is UTF-32LE (BOM)\n") if $options->{debug};
	$options->{encoding} = 'UTF-32LE';
	$data = decode( "UTF-32LE", substr($data, 4) );
    }
    elsif ( $data =~ /^\xFF\xFE/ ) {
	warn("$name is UTF-16LE (BOM)\n") if $options->{debug};
	$options->{encoding} = 'UTF-16LE';
	$data = decode( "UTF-16LE", substr($data, 2) );
    }
    elsif ( $data =~ /^\x00\x00\xFE\xFF/ ) {
	warn("$name is UTF-32BE (BOM)\n") if $options->{debug};
	$options->{encoding} = 'UTF-32BE';
	$data = decode( "UTF-32BE", substr($data, 4) );
    }

    # No BOM, did user specify an encoding?
    elsif ( $options->{encoding} ) {
	warn("$name is ", $options->{encoding}, " (fallback)\n")
	  if $options->{debug};
	$data = decode( $options->{encoding}, $data, 1 );
    }

    # Try UTF8, fallback to ISO-8895.1.
    else {
	my $d = eval { decode( "UTF-8", $data, 1 ) };
	if ( $@ ) {
	    warn("$name is ISO-8859.1 (assumed)\n") if $options->{debug};
	    $options->{encoding} = 'ISO-8859-1';
	    $data = decode( "iso-8859-1", $data );
	}
	elsif ( $d !~ /[^[:ascii:]]/ ) {
	    warn("$name is ASCII (detected)\n") if $options->{debug};
	    $options->{encoding} = 'ASCII';
	    $data = $d;
	}
	else {
	    warn("$name is UTF-8 (detected)\n") if $options->{debug};
	    $options->{encoding} = 'UTF-8';
	    $data = $d;
	}
    }

    # This can be used to add line continuation or comment stripping.
    if ( $options->{strip} ) {
	$data =~ s/$options->{strip}//g;
    }

    return $data unless $options->{split};

    # Split in lines;
    my @lines;
    if ( $options->{chomp} ) {
	# Unless empty, make sure there is a final newline.
	$data .= "\n" if $data =~ /.(?!\r\n|\n|\r)\z/;
	# We need to maintain trailing newlines.
	push( @lines, $1 ) while $data =~ /(.*?)(?:\r\n|\n|\r)/g;
    }
    else {
	push( @lines, $1 ) while $data =~ /(.*?(?:\r\n|\n|\r))/g;
	# In case the last line has no terminator.
	push( @lines, $1 ) if $data =~ /(?:\r\n|\n|\r)([^\r\n]+)\z/;
    }
    undef $data;
    return wantarray ? @lines : \@lines;
}

=head2 loadblob

    use File::LoadLines qw(loadblob);
    $rawdata = loadblob("raw.dat");
    $rawdata = loadblob("raw.dat", $options);

This is equivalent to calling loadlines() with C<< blob=>1 >> in the options.

=cut

sub loadblob {
    my ( $filename, $options ) = @_;
    croak("Missing filename.\n") unless defined $filename;
    croak("Invalid options.\n")
      if defined($options) && ref($options) ne "HASH";
    $options //= {};
    $options->{blob} = 1;
    loadlines( $filename, $options );
}

=head1 SEE ALSO

There are currently no other modules that handle BOM detection and
line splitting.

I have a faint hope that future versions of Perl and Raku will deal
with this transparently, but I fear the worst.

=head1 HINTS

When you have raw file data (e.g. from a zip), you can use loadlines()
to decode and unpack:

    open( my $data, '<', \$contents );
    $lines = loadlines( $data, $options );

There is no hard requirement on LWP. If you want to use transparent
fetching of data over the network please make sure LWP::UserAgent is
available.

=head1 AUTHOR

Johan Vromans, C<< <JV at cpan.org> >>

=head1 SUPPORT AND DOCUMENTATION

Development of this module takes place on GitHub:
https://github.com/sciurius/perl-File-LoadLines.

You can find documentation for this module with the perldoc command.

    perldoc File::LoadLines

Please report any bugs or feature requests using the issue tracker on
GitHub.

=head1 COPYRIGHT & LICENSE

Copyright 2018,2020,2024 Johan Vromans, all rights reserved.

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=cut

1; # End of File::LoadLines