Coverage for /pythoncovmergedfiles/medio/medio/usr/local/lib/python3.8/site-packages/numpy/lib/format.py: 12%
272 statements
coverage.py v7.4.0, created at 2024-01-03 07:57 +0000
"""
Binary serialization

NPY format
==========

A simple format for saving numpy arrays to disk with the full
information about them.

The ``.npy`` format is the standard binary file format in NumPy for
persisting a *single* arbitrary NumPy array on disk. The format stores all
of the shape and dtype information necessary to reconstruct the array
correctly even on another machine with a different architecture.
The format is designed to be as simple as possible while achieving
its limited goals.

The ``.npz`` format is the standard format for persisting *multiple* NumPy
arrays on disk. A ``.npz`` file is a zip file containing multiple ``.npy``
files, one for each array.

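As an illustration of the two formats, the following sketch round-trips one array through ``.npy`` and two arrays through ``.npz``, using in-memory buffers instead of files. The array contents and the names ``a`` and ``b`` are assumptions chosen for the example.

```python
import io
import zipfile

import numpy as np

a = np.arange(6, dtype=np.int32).reshape(2, 3)
b = np.linspace(0.0, 1.0, 5)

# A single array goes into the .npy format.
npy_buf = io.BytesIO()
np.save(npy_buf, a)
npy_buf.seek(0)
a_loaded = np.load(npy_buf)

# Several arrays go into a .npz archive, which is an ordinary zip file
# holding one .npy member per array.
npz_buf = io.BytesIO()
np.savez(npz_buf, a=a, b=b)
npz_buf.seek(0)
member_names = sorted(zipfile.ZipFile(npz_buf).namelist())

npz_buf.seek(0)
with np.load(npz_buf) as archive:
    b_loaded = archive["b"]
```

Listing the archive members with ``zipfile`` shows that each keyword passed to ``numpy.savez`` becomes one ``.npy`` entry.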

Capabilities
------------

- Can represent all NumPy arrays including nested record arrays and
  object arrays.

- Represents the data in its native binary form.

- Supports Fortran-contiguous arrays directly.

- Stores all of the necessary information to reconstruct the array
  including shape and dtype on a machine of a different
  architecture. Both little-endian and big-endian arrays are
  supported, and a file with little-endian numbers will yield
  a little-endian array on any machine reading the file. The
  types are described in terms of their actual sizes. For example,
  if a machine with a 64-bit C "long int" writes out an array with
  "long ints", a reading machine with 32-bit C "long ints" will yield
  an array with 64-bit integers.

- Is straightforward to reverse engineer. Datasets often live longer than
  the programs that created them. A competent developer should be
  able to create a solution in their preferred programming language to
  read most ``.npy`` files that they have been given without much
  documentation.

- Allows memory-mapping of the data. See `open_memmap`.

- Can be read from a filelike stream object instead of an actual file.

- Stores object arrays, i.e. arrays containing elements that are arbitrary
  Python objects. Files with object arrays cannot be memory-mapped, but
  they can still be read and written to disk.

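The memory-mapping capability can be sketched as follows. This is a minimal example, assuming a scratch file in a temporary directory; the path, shape and fill value are hypothetical choices, not part of the format.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "example.npy")

# Create a new file and fill it through the memory map.
writer = open_memmap(path, mode="w+", dtype=np.float64, shape=(4, 3))
writer[:] = 1.5
writer.flush()
del writer  # release the map before re-opening the file

# Re-open read-only: dtype and shape are taken from the file header.
reader = open_memmap(path, mode="r")
reader_shape = reader.shape
reader_sum = float(reader.sum())
```

When an existing file is opened, ``dtype`` and ``shape`` are not passed because the header already records them.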

Limitations
-----------

- Arbitrary subclasses of numpy.ndarray are not completely preserved.
  Subclasses will be accepted for writing, but only the array data will
  be written out. A regular numpy.ndarray object will be created
  upon reading the file.

.. warning::

  Due to limitations in the interpretation of structured dtypes, dtypes
  with fields with empty names will have the names replaced by 'f0', 'f1',
  etc. Such arrays will not round-trip through the format entirely
  accurately. The data is intact; only the field names will differ. We are
  working on a fix for this. This fix will not require a change in the
  file format. The arrays with such structures can still be saved and
  restored, and the correct dtype may be restored by using the
  ``loadedarray.view(correct_dtype)`` method.

File extensions
---------------

We recommend using the ``.npy`` and ``.npz`` extensions for files saved
in this format. This is by no means a requirement; applications may wish
to use these file formats but use an extension specific to the
application. In the absence of an obvious alternative, however,
we suggest using ``.npy`` and ``.npz``.

Version numbering
-----------------

The version numbering of these formats is independent of NumPy version
numbering. If the format is upgraded, the code in `numpy.lib.format` will
still be able to read and write Version 1.0 files.

Format Version 1.0
------------------

The first 6 bytes are a magic string: exactly ``\\x93NUMPY``.

The next 1 byte is an unsigned byte: the major version number of the file
format, e.g. ``\\x01``.

The next 1 byte is an unsigned byte: the minor version number of the file
format, e.g. ``\\x00``. Note: the version of the file format is not tied
to the version of the numpy package.

The next 2 bytes form a little-endian unsigned short int: the length of
the header data HEADER_LEN.

The next HEADER_LEN bytes form the header data describing the array's
format. It is an ASCII string which contains a Python literal expression
of a dictionary. It is terminated by a newline (``\\n``) and padded with
spaces (``\\x20``) to make the total of
``len(magic string) + 2 + len(length) + HEADER_LEN`` be evenly divisible
by 64 for alignment purposes.

The dictionary contains three keys:

    "descr" : dtype.descr
      An object that can be passed as an argument to the `numpy.dtype`
      constructor to create the array's dtype.
    "fortran_order" : bool
      Whether the array data is Fortran-contiguous or not. Since
      Fortran-contiguous arrays are a common form of non-C-contiguity,
      we allow them to be written directly to disk for efficiency.
    "shape" : tuple of int
      The shape of the array.

For repeatability and readability, the dictionary keys are sorted in
alphabetic order. This is for convenience only. A writer SHOULD implement
this if possible. A reader MUST NOT depend on this.

Following the header comes the array data. If the dtype contains Python
objects (i.e. ``dtype.hasobject is True``), then the data is a Python
pickle of the array. Otherwise the data is the contiguous (either C-
or Fortran-, depending on ``fortran_order``) bytes of the array.
Consumers can figure out the number of bytes by multiplying the number
of elements given by the shape (noting that ``shape=()`` means there is
1 element) by ``dtype.itemsize``.

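The layout described in this section can be sketched with the standard library alone. The helpers ``build_npy_v1`` and ``parse_npy_v1`` below are hypothetical names introduced for illustration; they are not part of this module, and the sketch handles only a flat little-endian ``'<i4'`` array.

```python
import ast
import struct

MAGIC = bytes([0x93]) + b"NUMPY"  # the six-byte magic string


def build_npy_v1(header_text, payload):
    # Hypothetical helper: wrap a header literal and raw data as a 1.0 file.
    body = header_text.encode("latin1")
    prefix_len = len(MAGIC) + 2 + 2  # magic + two version bytes + length field
    # Space padding so that prefix + header (with its newline) is a
    # multiple of 64 bytes.
    pad = -(prefix_len + len(body) + 1) % 64
    header = body + b" " * pad + bytes([0x0A])
    length_field = struct.pack("<H", len(header))
    return MAGIC + bytes([1, 0]) + length_field + header + payload


def parse_npy_v1(blob):
    # Hypothetical helper: return (version, header dict, data offset).
    if blob[:6] != MAGIC:
        raise ValueError("magic string mismatch")
    version = (blob[6], blob[7])
    (header_len,) = struct.unpack("<H", blob[8:10])
    header = ast.literal_eval(blob[10:10 + header_len].decode("latin1"))
    return version, header, 10 + header_len


blob = build_npy_v1(
    "{'descr': '<i4', 'fortran_order': False, 'shape': (3,), }",
    struct.pack("<3i", 1, 2, 3),
)
version, header, offset = parse_npy_v1(blob)

# Number of data bytes = number of elements * itemsize (4 for '<i4').
count = 1
for dim in header["shape"]:
    count *= dim
values = struct.unpack("<%di" % count, blob[offset:offset + count * 4])
```

A general reader would also need to interpret ``descr`` and ``fortran_order``; the sketch only shows how the fixed prefix, the header length and the padding fit together.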

Format Version 2.0
------------------

The version 1.0 format only allowed the array header to have a total size of
65535 bytes. This can be exceeded by structured arrays with a large number of
columns. The version 2.0 format extends the header size to 4 GiB.
`numpy.save` will automatically save in 2.0 format if the data requires it,
else it will always use the more compatible 1.0 format.

The description of the fourth element of the header therefore has become:
"The next 4 bytes form a little-endian unsigned int: the length of the header
data HEADER_LEN."

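The difference between the two length fields can be illustrated with ``struct``. The helper ``pack_header_len`` is a hypothetical name used only for this sketch of the "try 1.0 first, fall back to 2.0" idea.

```python
import struct


def pack_header_len(header_len):
    # Try the 2-byte field of version 1.0 first; if the value does not fit,
    # fall back to the 4-byte field of version 2.0.
    try:
        return (1, 0), struct.pack("<H", header_len)
    except struct.error:
        return (2, 0), struct.pack("<I", header_len)


small_version, small_field = pack_header_len(128)
large_version, large_field = pack_header_len(70000)
```

A header of 70000 bytes exceeds the 65535-byte limit of the unsigned short, so only the second call ends up with the wider field.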

Format Version 3.0
------------------

This version replaces the ASCII string (which in practice was latin1) with
a utf8-encoded string, so supports structured types with any unicode field
names.

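Why a field name can force the utf8 encoding is sketched below. ``pick_header_encoding`` is a hypothetical helper, and the two field names are assumptions chosen for the example.

```python
def pick_header_encoding(header_text):
    # Hypothetical helper: prefer latin1 (versions 1.0 and 2.0) and fall
    # back to utf8 (version 3.0) when the text cannot be encoded.
    try:
        return "latin1", header_text.encode("latin1")
    except UnicodeEncodeError:
        return "utf8", header_text.encode("utf8")


ascii_name = "temperature"
greek_name = chr(0x03B1) + "_coeff"  # starts with GREEK SMALL LETTER ALPHA

ascii_choice = pick_header_encoding(ascii_name)[0]
greek_choice = pick_header_encoding(greek_name)[0]
```

Only names outside the latin1 range need the newer format, which is why older versions remain usable for most files.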

Notes
-----
The ``.npy`` format, including motivation for creating it and a comparison of
alternatives, is described in the
:doc:`"npy-format" NEP <neps:nep-0001-npy-format>`, however details have
evolved with time and this document is more current.

"""
import numpy
import warnings
from numpy.lib.utils import safe_eval
from numpy.compat import (
    isfileobj, os_fspath, pickle
    )


__all__ = []


EXPECTED_KEYS = {'descr', 'fortran_order', 'shape'}
MAGIC_PREFIX = b'\x93NUMPY'
MAGIC_LEN = len(MAGIC_PREFIX) + 2
ARRAY_ALIGN = 64  # plausible values are powers of 2 between 16 and 4096
BUFFER_SIZE = 2**18  # size of buffer for reading npz files in bytes
# allow growth within the address space of a 64 bit machine along one axis
GROWTH_AXIS_MAX_DIGITS = 21  # = len(str(8*2**64-1)) hypothetical int1 dtype

# difference between version 1.0 and 2.0 is a 4 byte (I) header length
# instead of 2 bytes (H) allowing storage of large structured arrays
_header_size_info = {
    (1, 0): ('<H', 'latin1'),
    (2, 0): ('<I', 'latin1'),
    (3, 0): ('<I', 'utf8'),
}

# Python's literal_eval is not actually safe for large inputs, since parsing
# may become slow or even cause interpreter crashes.
# This is an arbitrary, low limit which should make it safe in practice.
_MAX_HEADER_SIZE = 10000

def _check_version(version):
    if version not in [(1, 0), (2, 0), (3, 0), None]:
        msg = "we only support format version (1,0), (2,0), and (3,0), not %s"
        raise ValueError(msg % (version,))

def magic(major, minor):
    """ Return the magic string for the given file format version.

    Parameters
    ----------
    major : int in [0, 255]
    minor : int in [0, 255]

    Returns
    -------
    magic : bytes

    Raises
    ------
    ValueError if the version cannot be formatted.
    """
    if major < 0 or major > 255:
        raise ValueError("major version must be 0 <= major < 256")
    if minor < 0 or minor > 255:
        raise ValueError("minor version must be 0 <= minor < 256")
    return MAGIC_PREFIX + bytes([major, minor])

def read_magic(fp):
    """ Read the magic string to get the version of the file format.

    Parameters
    ----------
    fp : filelike object

    Returns
    -------
    major : int
    minor : int
    """
    magic_str = _read_bytes(fp, MAGIC_LEN, "magic string")
    if magic_str[:-2] != MAGIC_PREFIX:
        msg = "the magic string is not correct; expected %r, got %r"
        raise ValueError(msg % (MAGIC_PREFIX, magic_str[:-2]))
    major, minor = magic_str[-2:]
    return major, minor

def _has_metadata(dt):
    if dt.metadata is not None:
        return True
    elif dt.names is not None:
        return any(_has_metadata(dt[k]) for k in dt.names)
    elif dt.subdtype is not None:
        return _has_metadata(dt.base)
    else:
        return False

def dtype_to_descr(dtype):
    """
    Get a serializable descriptor from the dtype.

    The .descr attribute of a dtype object cannot be round-tripped through
    the dtype() constructor. Simple types, like dtype('float32'), have
    a descr which looks like a record array with one field with '' as
    a name. The dtype() constructor interprets this as a request to give
    a default name. Instead, we construct a descriptor that can be passed
    to dtype().

    Parameters
    ----------
    dtype : dtype
        The dtype of the array that will be written to disk.

    Returns
    -------
    descr : object
        An object that can be passed to `numpy.dtype()` in order to
        replicate the input dtype.

    """
    if _has_metadata(dtype):
        warnings.warn("metadata on a dtype may be saved or ignored, but will "
                      "raise if saved when read. Use another form of storage.",
                      UserWarning, stacklevel=2)
    if dtype.names is not None:
        # This is a record array. The .descr is fine.  XXX: parts of the
        # record array with an empty name, like padding bytes, still get
        # fiddled with. This needs to be fixed in the C implementation of
        # dtype().
        return dtype.descr
    else:
        return dtype.str

def descr_to_dtype(descr):
    """
    Returns a dtype based off the given description.

    This is essentially the reverse of `dtype_to_descr()`. It will remove
    the valueless padding fields created by simple fields, e.g.
    dtype('float32'), and then convert the description to its corresponding
    dtype.

    Parameters
    ----------
    descr : object
        The object retrieved by dtype.descr. Can be passed to
        `numpy.dtype()` in order to replicate the input dtype.

    Returns
    -------
    dtype : dtype
        The dtype constructed by the description.

    """
    if isinstance(descr, str):
        # No padding removal needed
        return numpy.dtype(descr)
    elif isinstance(descr, tuple):
        # subtype, will always have a shape descr[1]
        dt = descr_to_dtype(descr[0])
        return numpy.dtype((dt, descr[1]))

    titles = []
    names = []
    formats = []
    offsets = []
    offset = 0
    for field in descr:
        if len(field) == 2:
            name, descr_str = field
            dt = descr_to_dtype(descr_str)
        else:
            name, descr_str, shape = field
            dt = numpy.dtype((descr_to_dtype(descr_str), shape))

        # Ignore padding bytes, which will be void bytes with '' as name.
        # (Once support for blank names is removed, only "if name == ''"
        # is needed.)
        is_pad = (name == '' and dt.type is numpy.void and dt.names is None)
        if not is_pad:
            title, name = name if isinstance(name, tuple) else (None, name)
            titles.append(title)
            names.append(name)
            formats.append(dt)
            offsets.append(offset)
        offset += dt.itemsize

    return numpy.dtype({'names': names, 'formats': formats, 'titles': titles,
                        'offsets': offsets, 'itemsize': offset})

def header_data_from_array_1_0(array):
    """ Get the dictionary of header metadata from a numpy.ndarray.

    Parameters
    ----------
    array : numpy.ndarray

    Returns
    -------
    d : dict
        This has the appropriate entries for writing its string representation
        to the header of the file.
    """
    d = {'shape': array.shape}
    if array.flags.c_contiguous:
        d['fortran_order'] = False
    elif array.flags.f_contiguous:
        d['fortran_order'] = True
    else:
        # Totally non-contiguous data. We will have to make it C-contiguous
        # before writing. Note that we need to test for C_CONTIGUOUS first
        # because a 1-D array is both C_CONTIGUOUS and F_CONTIGUOUS.
        d['fortran_order'] = False

    d['descr'] = dtype_to_descr(array.dtype)
    return d


def _wrap_header(header, version):
    """
    Takes a stringified header, and attaches the prefix and padding to it
    """
    import struct
    assert version is not None
    fmt, encoding = _header_size_info[version]
    header = header.encode(encoding)
    hlen = len(header) + 1
    padlen = ARRAY_ALIGN - ((MAGIC_LEN + struct.calcsize(fmt) + hlen) % ARRAY_ALIGN)
    try:
        header_prefix = magic(*version) + struct.pack(fmt, hlen + padlen)
    except struct.error:
        msg = "Header length {} too big for version={}".format(hlen, version)
        raise ValueError(msg) from None

    # Pad the header with spaces and a final newline such that the magic
    # string, the header-length field and the header are aligned on an
    # ARRAY_ALIGN byte boundary.  This supports memory mapping of dtypes
    # aligned up to ARRAY_ALIGN on systems like Linux where mmap()
    # offset must be page-aligned (i.e. the beginning of the file).
    return header_prefix + header + b' '*padlen + b'\n'


def _wrap_header_guess_version(header):
    """
    Like `_wrap_header`, but chooses an appropriate version given the contents
    """
    try:
        return _wrap_header(header, (1, 0))
    except ValueError:
        pass

    try:
        ret = _wrap_header(header, (2, 0))
    except UnicodeEncodeError:
        pass
    else:
        # Note the trailing space: the two literals are concatenated.
        warnings.warn("Stored array in format 2.0. It can only be "
                      "read by NumPy >= 1.9", UserWarning, stacklevel=2)
        return ret

    header = _wrap_header(header, (3, 0))
    warnings.warn("Stored array in format 3.0. It can only be "
                  "read by NumPy >= 1.17", UserWarning, stacklevel=2)
    return header


def _write_array_header(fp, d, version=None):
    """ Write the header for an array.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string representation
        to the header of the file.
    version : tuple or None
        None means use oldest that works. Providing an explicit version will
        raise a ValueError if the format does not allow saving this data.
        Default: None
    """
    header = ["{"]
    for key, value in sorted(d.items()):
        # Need to use repr here, since we eval these when reading
        header.append("'%s': %s, " % (key, repr(value)))
    header.append("}")
    header = "".join(header)

    # Add some spare space so that the array header can be modified in-place
    # when changing the array size, e.g. when growing it by appending data at
    # the end.
    shape = d['shape']
    header += " " * ((GROWTH_AXIS_MAX_DIGITS - len(repr(
        shape[-1 if d['fortran_order'] else 0]
    ))) if len(shape) > 0 else 0)

    if version is None:
        header = _wrap_header_guess_version(header)
    else:
        header = _wrap_header(header, version)
    fp.write(header)

def write_array_header_1_0(fp, d):
    """ Write the header for an array using the 1.0 format.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (1, 0))


def write_array_header_2_0(fp, d):
    """ Write the header for an array using the 2.0 format.
        The 2.0 format allows storing very large structured arrays.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (2, 0))

def read_array_header_1_0(fp, max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array header from a filelike object using the 1.0 file format
    version.

    This will leave the file object located just after the header.

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.
    max_header_size : int, optional
        Maximum allowed size of the header.  Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:meth:`ast.literal_eval()` for details.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        True if the array data is stored in Fortran (column-major) order,
        False if it is stored in C (row-major) order.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    return _read_array_header(
            fp, version=(1, 0), max_header_size=max_header_size)

def read_array_header_2_0(fp, max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array header from a filelike object using the 2.0 file format
    version.

    This will leave the file object located just after the header.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.
    max_header_size : int, optional
        Maximum allowed size of the header.  Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:meth:`ast.literal_eval()` for details.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        True if the array data is stored in Fortran (column-major) order,
        False if it is stored in C (row-major) order.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    return _read_array_header(
            fp, version=(2, 0), max_header_size=max_header_size)


def _filter_header(s):
    """Clean up 'L' in npz header ints.

    Cleans up the 'L' in strings representing integers. Needed to allow npz
    headers produced in Python2 to be read in Python3.

    Parameters
    ----------
    s : string
        Npy file header.

    Returns
    -------
    header : str
        Cleaned up header.

    """
    import tokenize
    from io import StringIO

    tokens = []
    last_token_was_number = False
    for token in tokenize.generate_tokens(StringIO(s).readline):
        token_type = token[0]
        token_string = token[1]
        if (last_token_was_number and
                token_type == tokenize.NAME and
                token_string == "L"):
            continue
        else:
            tokens.append(token)
        last_token_was_number = (token_type == tokenize.NUMBER)
    return tokenize.untokenize(tokens)


def _read_array_header(fp, version, max_header_size=_MAX_HEADER_SIZE):
    """
    see read_array_header_1_0
    """
    # Read an unsigned, little-endian integer (2 or 4 bytes, depending on
    # the version) which has the length of the header.
    import struct
    hinfo = _header_size_info.get(version)
    if hinfo is None:
        raise ValueError("Invalid version {!r}".format(version))
    hlength_type, encoding = hinfo

    hlength_str = _read_bytes(fp, struct.calcsize(hlength_type), "array header length")
    header_length = struct.unpack(hlength_type, hlength_str)[0]
    header = _read_bytes(fp, header_length, "array header")
    header = header.decode(encoding)
    if len(header) > max_header_size:
        raise ValueError(
            f"Header info length ({len(header)}) is large and may not be safe "
            "to load securely.\n"
            "To allow loading, adjust `max_header_size` or fully trust "
            "the `.npy` file using `allow_pickle=True`.\n"
            "For safety against large resource use or crashes, sandboxing "
            "may be necessary.")

    # The header is a pretty-printed string representation of a literal
    # Python dictionary with trailing newlines padded to an ARRAY_ALIGN byte
    # boundary. The keys are strings.
    #   "shape" : tuple of int
    #   "fortran_order" : bool
    #   "descr" : dtype.descr
    # Versions (2, 0) and (1, 0) could have been created by a Python 2
    # implementation before header filtering was implemented.
    if version <= (2, 0):
        header = _filter_header(header)
    try:
        d = safe_eval(header)
    except SyntaxError as e:
        msg = "Cannot parse header: {!r}"
        raise ValueError(msg.format(header)) from e
    if not isinstance(d, dict):
        msg = "Header is not a dictionary: {!r}"
        raise ValueError(msg.format(d))

    if EXPECTED_KEYS != d.keys():
        keys = sorted(d.keys())
        msg = "Header does not contain the correct keys: {!r}"
        raise ValueError(msg.format(keys))

    # Sanity-check the values.
    if (not isinstance(d['shape'], tuple) or
            not all(isinstance(x, int) for x in d['shape'])):
        msg = "shape is not valid: {!r}"
        raise ValueError(msg.format(d['shape']))
    if not isinstance(d['fortran_order'], bool):
        msg = "fortran_order is not a valid bool: {!r}"
        raise ValueError(msg.format(d['fortran_order']))
    try:
        dtype = descr_to_dtype(d['descr'])
    except TypeError as e:
        msg = "descr is not a valid dtype descriptor: {!r}"
        raise ValueError(msg.format(d['descr'])) from e

    return d['shape'], d['fortran_order'], dtype

def write_array(fp, array, version=None, allow_pickle=True, pickle_kwargs=None):
    """
    Write an array to an NPY file, including a header.

    If the array is neither C-contiguous nor Fortran-contiguous AND the
    file_like object is not a real file object, this function will have to
    copy data in memory.

    Parameters
    ----------
    fp : file_like object
        An open, writable file object, or similar object with a
        ``.write()`` method.
    array : ndarray
        The array to write to disk.
    version : (int, int) or None, optional
        The version number of the format. None means use the oldest
        supported version that is able to store the data.  Default: None
    allow_pickle : bool, optional
        Whether to allow writing pickled data. Default: True
    pickle_kwargs : dict, optional
        Additional keyword arguments to pass to pickle.dump, excluding
        'protocol'. These are only useful when pickling objects in object
        arrays on Python 3 to Python 2 compatible format.

    Raises
    ------
    ValueError
        If the array cannot be persisted. This includes the case of
        allow_pickle=False and array being an object array.
    Various other errors
        If the array contains Python objects as part of its dtype, the
        process of pickling them may raise various errors if the objects
        are not picklable.

    """
    _check_version(version)
    _write_array_header(fp, header_data_from_array_1_0(array), version)

    if array.itemsize == 0:
        buffersize = 0
    else:
        # Set buffer size to 16 MiB to hide the Python loop overhead.
        buffersize = max(16 * 1024 ** 2 // array.itemsize, 1)

    if array.dtype.hasobject:
        # We contain Python objects so we cannot write out the data
        # directly.  Instead, we will pickle it out
        if not allow_pickle:
            raise ValueError("Object arrays cannot be saved when "
                             "allow_pickle=False")
        if pickle_kwargs is None:
            pickle_kwargs = {}
        pickle.dump(array, fp, protocol=3, **pickle_kwargs)
    elif array.flags.f_contiguous and not array.flags.c_contiguous:
        if isfileobj(fp):
            array.T.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='F'):
                fp.write(chunk.tobytes('C'))
    else:
        if isfileobj(fp):
            array.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='C'):
                fp.write(chunk.tobytes('C'))


def read_array(fp, allow_pickle=False, pickle_kwargs=None, *,
               max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array from an NPY file.

    Parameters
    ----------
    fp : file_like object
        If this is not a real file object, then this may take extra memory
        and time.
    allow_pickle : bool, optional
        Whether to allow loading pickled data. Default: False

        .. versionchanged:: 1.16.3
            Made default False in response to CVE-2019-6446.

    pickle_kwargs : dict
        Additional keyword arguments to pass to pickle.load. These are only
        useful when loading object arrays saved on Python 2 when using
        Python 3.
    max_header_size : int, optional
        Maximum allowed size of the header.  Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:meth:`ast.literal_eval()` for details.
        This option is ignored when `allow_pickle` is passed.  In that case
        the file is by definition trusted and the limit is unnecessary.

    Returns
    -------
    array : ndarray
        The array from the data on disk.

    Raises
    ------
    ValueError
        If the data is invalid, or allow_pickle=False and the file contains
        an object array.

    """
    if allow_pickle:
        # Effectively ignore max_header_size, since `allow_pickle` indicates
        # that the input is fully trusted.
        max_header_size = 2**64

    version = read_magic(fp)
    _check_version(version)
    shape, fortran_order, dtype = _read_array_header(
            fp, version, max_header_size=max_header_size)
    if len(shape) == 0:
        count = 1
    else:
        count = numpy.multiply.reduce(shape, dtype=numpy.int64)

    # Now read the actual data.
    if dtype.hasobject:
        # The array contained Python objects. We need to unpickle the data.
        if not allow_pickle:
            raise ValueError("Object arrays cannot be loaded when "
                             "allow_pickle=False")
        if pickle_kwargs is None:
            pickle_kwargs = {}
        try:
            array = pickle.load(fp, **pickle_kwargs)
        except UnicodeError as err:
            # Friendlier error message
            raise UnicodeError("Unpickling a python object failed: %r\n"
                               "You may need to pass the encoding= option "
                               "to numpy.load" % (err,)) from err
    else:
        if isfileobj(fp):
            # We can use the fast fromfile() function.
            array = numpy.fromfile(fp, dtype=dtype, count=count)
        else:
            # This is not a real file. We have to read it the
            # memory-intensive way.
            # crc32 module fails on reads greater than 2 ** 32 bytes,
            # breaking large reads from gzip streams. Chunk reads to
            # BUFFER_SIZE bytes to avoid issue and reduce memory overhead
            # of the read. In non-chunked case count < max_read_count, so
            # only one read is performed.

            # Use np.ndarray instead of np.empty since the latter does
            # not correctly instantiate zero-width string dtypes; see
            # https://github.com/numpy/numpy/pull/6430
            array = numpy.ndarray(count, dtype=dtype)

            if dtype.itemsize > 0:
                # If dtype.itemsize == 0 then there's nothing more to read
                max_read_count = BUFFER_SIZE // min(BUFFER_SIZE, dtype.itemsize)

                for i in range(0, count, max_read_count):
                    read_count = min(max_read_count, count - i)
                    read_size = int(read_count * dtype.itemsize)
                    data = _read_bytes(fp, read_size, "array data")
                    array[i:i+read_count] = numpy.frombuffer(data, dtype=dtype,
                                                             count=read_count)

        if fortran_order:
            array.shape = shape[::-1]
            array = array.transpose()
        else:
            array.shape = shape

    return array


def open_memmap(filename, mode='r+', dtype=None, shape=None,
                fortran_order=False, version=None, *,
                max_header_size=_MAX_HEADER_SIZE):
    """
    Open a .npy file as a memory-mapped array.

    This may be used to read an existing file or create a new one.

    Parameters
    ----------
    filename : str or path-like
        The name of the file on disk.  This may *not* be a file-like
        object.
    mode : str, optional
        The mode in which to open the file; the default is 'r+'.  In
        addition to the standard file modes, 'c' is also accepted to mean
        "copy on write."  See `memmap` for the available mode strings.
    dtype : data-type, optional
        The data type of the array if we are creating a new file in "write"
        mode, if not, `dtype` is ignored.  The default value is None, which
        results in a data-type of `float64`.
    shape : tuple of int
        The shape of the array if we are creating a new file in "write"
        mode, in which case this parameter is required.  Otherwise, this
        parameter is ignored and is thus optional.
    fortran_order : bool, optional
        Whether the array should be Fortran-contiguous (True) or
        C-contiguous (False, the default) if we are creating a new file in
        "write" mode.
    version : tuple of int (major, minor) or None
        If the mode is a "write" mode, then this is the version of the file
        format used to create the file.  None means use the oldest
        supported version that is able to store the data.  Default: None
    max_header_size : int, optional
        Maximum allowed size of the header.  Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:meth:`ast.literal_eval()` for details.

    Returns
    -------
    marray : memmap
        The memory-mapped array.

    Raises
    ------
    ValueError
        If the data or the mode is invalid.
    OSError
        If the file is not found or cannot be opened correctly.

    See Also
    --------
    numpy.memmap

    """
    if isfileobj(filename):
        raise ValueError("Filename must be a string or a path-like object."
                         "  Memmap cannot use existing file handles.")

    if 'w' in mode:
        # We are creating the file, not reading it.
        # Check if we ought to create the file.
        _check_version(version)
        # Ensure that the given dtype is an authentic dtype object rather
        # than just something that can be interpreted as a dtype object.
        dtype = numpy.dtype(dtype)
        if dtype.hasobject:
            msg = "Array can't be memory-mapped: Python objects in dtype."
            raise ValueError(msg)
        d = dict(
            descr=dtype_to_descr(dtype),
            fortran_order=fortran_order,
            shape=shape,
        )
        # If we got here, then it should be safe to create the file.
        with open(os_fspath(filename), mode+'b') as fp:
            _write_array_header(fp, d, version)
            offset = fp.tell()
    else:
        # Read the header of the file first.
        with open(os_fspath(filename), 'rb') as fp:
            version = read_magic(fp)
            _check_version(version)

            shape, fortran_order, dtype = _read_array_header(
                    fp, version, max_header_size=max_header_size)
            if dtype.hasobject:
                msg = "Array can't be memory-mapped: Python objects in dtype."
                raise ValueError(msg)
            offset = fp.tell()

    if fortran_order:
        order = 'F'
    else:
        order = 'C'

    # We need to change a write-only mode to a read-write mode since we've
    # already written data to the file.
    if mode == 'w+':
        mode = 'r+'

    marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
        mode=mode, offset=offset)

    return marray


def _read_bytes(fp, size, error_template="ran out of data"):
    """
    Read from file-like object until size bytes are read.
    Raises ValueError if EOF is encountered before size bytes are read.
    Non-blocking objects only supported if they derive from io objects.

    Required as e.g. ZipExtFile in python 2.6 can return less data than
    requested.
    """
    data = bytes()
    while True:
        # io files (default in python3) return None or raise on
        # would-block, python2 file will truncate, probably nothing can be
        # done about that.  note that regular files can't be non-blocking
        try:
            r = fp.read(size - len(data))
            data += r
            if len(r) == 0 or len(data) == size:
                break
        except BlockingIOError:
            pass
    if len(data) != size:
        msg = "EOF: reading %s, expected %d bytes got %d"
        raise ValueError(msg % (error_template, size, len(data)))
    else:
        return data