Coverage for /pythoncovmergedfiles/medio/medio/usr/local/lib/python3.8/site-packages/numpy/lib/format.py: 12%

278 statements  

« prev     ^ index     » next       coverage.py v7.0.5, created at 2023-01-17 06:27 +0000

1""" 

2Binary serialization 

3 

4NPY format 

5========== 

6 

7A simple format for saving numpy arrays to disk with the full 

8information about them. 

9 

10The ``.npy`` format is the standard binary file format in NumPy for 

11persisting a *single* arbitrary NumPy array on disk. The format stores all 

12of the shape and dtype information necessary to reconstruct the array 

13correctly even on another machine with a different architecture. 

14The format is designed to be as simple as possible while achieving 

15its limited goals. 

16 

17The ``.npz`` format is the standard format for persisting *multiple* NumPy 

18arrays on disk. A ``.npz`` file is a zip file containing multiple ``.npy`` 

19files, one for each array. 

20 

21Capabilities 

22------------ 

23 

24- Can represent all NumPy arrays including nested record arrays and 

25 object arrays. 

26 

27- Represents the data in its native binary form. 

28 

29- Supports Fortran-contiguous arrays directly. 

30 

31- Stores all of the necessary information to reconstruct the array 

32 including shape and dtype on a machine of a different 

33 architecture. Both little-endian and big-endian arrays are 

34 supported, and a file with little-endian numbers will yield 

35 a little-endian array on any machine reading the file. The 

36 types are described in terms of their actual sizes. For example, 

37 if a machine with a 64-bit C "long int" writes out an array with 

38 "long ints", a reading machine with 32-bit C "long ints" will yield 

39 an array with 64-bit integers. 

40 

41- Is straightforward to reverse engineer. Datasets often live longer than 

42 the programs that created them. A competent developer should be 

43 able to create a solution in their preferred programming language to 

44 read most ``.npy`` files that they have been given without much 

45 documentation. 

46 

47- Allows memory-mapping of the data. See `open_memmap`. 

48 

49- Can be read from a filelike stream object instead of an actual file. 

50 

51- Stores object arrays, i.e. arrays containing elements that are arbitrary 

52 Python objects. Files with object arrays are not to be mmapable, but 

53 can be read and written to disk. 

54 

55Limitations 

56----------- 

57 

58- Arbitrary subclasses of numpy.ndarray are not completely preserved. 

59 Subclasses will be accepted for writing, but only the array data will 

60 be written out. A regular numpy.ndarray object will be created 

61 upon reading the file. 

62 

63.. warning:: 

64 

65 Due to limitations in the interpretation of structured dtypes, dtypes 

66 with fields with empty names will have the names replaced by 'f0', 'f1', 

67 etc. Such arrays will not round-trip through the format entirely 

68 accurately. The data is intact; only the field names will differ. We are 

69 working on a fix for this. This fix will not require a change in the 

70 file format. The arrays with such structures can still be saved and 

71 restored, and the correct dtype may be restored by using the 

72 ``loadedarray.view(correct_dtype)`` method. 

73 

74File extensions 

75--------------- 

76 

77We recommend using the ``.npy`` and ``.npz`` extensions for files saved 

78in this format. This is by no means a requirement; applications may wish 

79to use these file formats but use an extension specific to the 

80application. In the absence of an obvious alternative, however, 

81we suggest using ``.npy`` and ``.npz``. 

82 

83Version numbering 

84----------------- 

85 

86The version numbering of these formats is independent of NumPy version 

87numbering. If the format is upgraded, the code in `numpy.io` will still 

88be able to read and write Version 1.0 files. 

89 

90Format Version 1.0 

91------------------ 

92 

93The first 6 bytes are a magic string: exactly ``\\x93NUMPY``. 

94 

95The next 1 byte is an unsigned byte: the major version number of the file 

96format, e.g. ``\\x01``. 

97 

98The next 1 byte is an unsigned byte: the minor version number of the file 

99format, e.g. ``\\x00``. Note: the version of the file format is not tied 

100to the version of the numpy package. 

101 

102The next 2 bytes form a little-endian unsigned short int: the length of 

103the header data HEADER_LEN. 

104 

105The next HEADER_LEN bytes form the header data describing the array's 

106format. It is an ASCII string which contains a Python literal expression 

107of a dictionary. It is terminated by a newline (``\\n``) and padded with 

108spaces (``\\x20``) to make the total of 

109``len(magic string) + 2 + len(length) + HEADER_LEN`` be evenly divisible 

110by 64 for alignment purposes. 

111 

112The dictionary contains three keys: 

113 

114 "descr" : dtype.descr 

115 An object that can be passed as an argument to the `numpy.dtype` 

116 constructor to create the array's dtype. 

117 "fortran_order" : bool 

118 Whether the array data is Fortran-contiguous or not. Since 

119 Fortran-contiguous arrays are a common form of non-C-contiguity, 

120 we allow them to be written directly to disk for efficiency. 

121 "shape" : tuple of int 

122 The shape of the array. 

123 

124For repeatability and readability, the dictionary keys are sorted in 

125alphabetic order. This is for convenience only. A writer SHOULD implement 

126this if possible. A reader MUST NOT depend on this. 

127 

128Following the header comes the array data. If the dtype contains Python 

129objects (i.e. ``dtype.hasobject is True``), then the data is a Python 

130pickle of the array. Otherwise the data is the contiguous (either C- 

131or Fortran-, depending on ``fortran_order``) bytes of the array. 

132Consumers can figure out the number of bytes by multiplying the number 

133of elements given by the shape (noting that ``shape=()`` means there is 

1341 element) by ``dtype.itemsize``. 

135 

136Format Version 2.0 

137------------------ 

138 

139The version 1.0 format only allowed the array header to have a total size of 

14065535 bytes. This can be exceeded by structured arrays with a large number of 

141columns. The version 2.0 format extends the header size to 4 GiB. 

142`numpy.save` will automatically save in 2.0 format if the data requires it, 

143else it will always use the more compatible 1.0 format. 

144 

145The description of the fourth element of the header therefore has become: 

146"The next 4 bytes form a little-endian unsigned int: the length of the header 

147data HEADER_LEN." 

148 

149Format Version 3.0 

150------------------ 

151 

152This version replaces the ASCII string (which in practice was latin1) with 

153a utf8-encoded string, so supports structured types with any unicode field 

154names. 

155 

156Notes 

157----- 

158The ``.npy`` format, including motivation for creating it and a comparison of 

159alternatives, is described in the 

160:doc:`"npy-format" NEP <neps:nep-0001-npy-format>`, however details have 

161evolved with time and this document is more current. 

162 

163""" 

164import numpy 

165import warnings 

166from numpy.lib.utils import safe_eval 

167from numpy.compat import ( 

168 isfileobj, os_fspath, pickle 

169 ) 

170 

171 

172__all__ = [] 

173 

174 

175EXPECTED_KEYS = {'descr', 'fortran_order', 'shape'} 

176MAGIC_PREFIX = b'\x93NUMPY' 

177MAGIC_LEN = len(MAGIC_PREFIX) + 2 

178ARRAY_ALIGN = 64 # plausible values are powers of 2 between 16 and 4096 

179BUFFER_SIZE = 2**18 # size of buffer for reading npz files in bytes 

180# allow growth within the address space of a 64 bit machine along one axis 

181GROWTH_AXIS_MAX_DIGITS = 21 # = len(str(8*2**64-1)) hypothetical int1 dtype 

182 

183# difference between version 1.0 and 2.0 is a 4 byte (I) header length 

184# instead of 2 bytes (H) allowing storage of large structured arrays 

185_header_size_info = { 

186 (1, 0): ('<H', 'latin1'), 

187 (2, 0): ('<I', 'latin1'), 

188 (3, 0): ('<I', 'utf8'), 

189} 

190 

191# Python's literal_eval is not actually safe for large inputs, since parsing 

192# may become slow or even cause interpreter crashes. 

193# This is an arbitrary, low limit which should make it safe in practice. 

194_MAX_HEADER_SIZE = 10000 

195 

196def _check_version(version): 

197 if version not in [(1, 0), (2, 0), (3, 0), None]: 

198 msg = "we only support format version (1,0), (2,0), and (3,0), not %s" 

199 raise ValueError(msg % (version,)) 

200 

201def magic(major, minor): 

202 """ Return the magic string for the given file format version. 

203 

204 Parameters 

205 ---------- 

206 major : int in [0, 255] 

207 minor : int in [0, 255] 

208 

209 Returns 

210 ------- 

211 magic : str 

212 

213 Raises 

214 ------ 

215 ValueError if the version cannot be formatted. 

216 """ 

217 if major < 0 or major > 255: 

218 raise ValueError("major version must be 0 <= major < 256") 

219 if minor < 0 or minor > 255: 

220 raise ValueError("minor version must be 0 <= minor < 256") 

221 return MAGIC_PREFIX + bytes([major, minor]) 

222 

223def read_magic(fp): 

224 """ Read the magic string to get the version of the file format. 

225 

226 Parameters 

227 ---------- 

228 fp : filelike object 

229 

230 Returns 

231 ------- 

232 major : int 

233 minor : int 

234 """ 

235 magic_str = _read_bytes(fp, MAGIC_LEN, "magic string") 

236 if magic_str[:-2] != MAGIC_PREFIX: 

237 msg = "the magic string is not correct; expected %r, got %r" 

238 raise ValueError(msg % (MAGIC_PREFIX, magic_str[:-2])) 

239 major, minor = magic_str[-2:] 

240 return major, minor 

241 

242def _has_metadata(dt): 

243 if dt.metadata is not None: 

244 return True 

245 elif dt.names is not None: 

246 return any(_has_metadata(dt[k]) for k in dt.names) 

247 elif dt.subdtype is not None: 

248 return _has_metadata(dt.base) 

249 else: 

250 return False 

251 

252def dtype_to_descr(dtype): 

253 """ 

254 Get a serializable descriptor from the dtype. 

255 

256 The .descr attribute of a dtype object cannot be round-tripped through 

257 the dtype() constructor. Simple types, like dtype('float32'), have 

258 a descr which looks like a record array with one field with '' as 

259 a name. The dtype() constructor interprets this as a request to give 

260 a default name. Instead, we construct descriptor that can be passed to 

261 dtype(). 

262 

263 Parameters 

264 ---------- 

265 dtype : dtype 

266 The dtype of the array that will be written to disk. 

267 

268 Returns 

269 ------- 

270 descr : object 

271 An object that can be passed to `numpy.dtype()` in order to 

272 replicate the input dtype. 

273 

274 """ 

275 if _has_metadata(dtype): 

276 warnings.warn("metadata on a dtype may be saved or ignored, but will " 

277 "raise if saved when read. Use another form of storage.", 

278 UserWarning, stacklevel=2) 

279 if dtype.names is not None: 

280 # This is a record array. The .descr is fine. XXX: parts of the 

281 # record array with an empty name, like padding bytes, still get 

282 # fiddled with. This needs to be fixed in the C implementation of 

283 # dtype(). 

284 return dtype.descr 

285 else: 

286 return dtype.str 

287 

288def descr_to_dtype(descr): 

289 """ 

290 Returns a dtype based off the given description. 

291 

292 This is essentially the reverse of `dtype_to_descr()`. It will remove 

293 the valueless padding fields created by, i.e. simple fields like 

294 dtype('float32'), and then convert the description to its corresponding 

295 dtype. 

296 

297 Parameters 

298 ---------- 

299 descr : object 

300 The object retrieved by dtype.descr. Can be passed to 

301 `numpy.dtype()` in order to replicate the input dtype. 

302 

303 Returns 

304 ------- 

305 dtype : dtype 

306 The dtype constructed by the description. 

307 

308 """ 

309 if isinstance(descr, str): 

310 # No padding removal needed 

311 return numpy.dtype(descr) 

312 elif isinstance(descr, tuple): 

313 # subtype, will always have a shape descr[1] 

314 dt = descr_to_dtype(descr[0]) 

315 return numpy.dtype((dt, descr[1])) 

316 

317 titles = [] 

318 names = [] 

319 formats = [] 

320 offsets = [] 

321 offset = 0 

322 for field in descr: 

323 if len(field) == 2: 

324 name, descr_str = field 

325 dt = descr_to_dtype(descr_str) 

326 else: 

327 name, descr_str, shape = field 

328 dt = numpy.dtype((descr_to_dtype(descr_str), shape)) 

329 

330 # Ignore padding bytes, which will be void bytes with '' as name 

331 # Once support for blank names is removed, only "if name == ''" needed) 

332 is_pad = (name == '' and dt.type is numpy.void and dt.names is None) 

333 if not is_pad: 

334 title, name = name if isinstance(name, tuple) else (None, name) 

335 titles.append(title) 

336 names.append(name) 

337 formats.append(dt) 

338 offsets.append(offset) 

339 offset += dt.itemsize 

340 

341 return numpy.dtype({'names': names, 'formats': formats, 'titles': titles, 

342 'offsets': offsets, 'itemsize': offset}) 

343 

344def header_data_from_array_1_0(array): 

345 """ Get the dictionary of header metadata from a numpy.ndarray. 

346 

347 Parameters 

348 ---------- 

349 array : numpy.ndarray 

350 

351 Returns 

352 ------- 

353 d : dict 

354 This has the appropriate entries for writing its string representation 

355 to the header of the file. 

356 """ 

357 d = {'shape': array.shape} 

358 if array.flags.c_contiguous: 

359 d['fortran_order'] = False 

360 elif array.flags.f_contiguous: 

361 d['fortran_order'] = True 

362 else: 

363 # Totally non-contiguous data. We will have to make it C-contiguous 

364 # before writing. Note that we need to test for C_CONTIGUOUS first 

365 # because a 1-D array is both C_CONTIGUOUS and F_CONTIGUOUS. 

366 d['fortran_order'] = False 

367 

368 d['descr'] = dtype_to_descr(array.dtype) 

369 return d 

370 

371 

372def _wrap_header(header, version): 

373 """ 

374 Takes a stringified header, and attaches the prefix and padding to it 

375 """ 

376 import struct 

377 assert version is not None 

378 fmt, encoding = _header_size_info[version] 

379 header = header.encode(encoding) 

380 hlen = len(header) + 1 

381 padlen = ARRAY_ALIGN - ((MAGIC_LEN + struct.calcsize(fmt) + hlen) % ARRAY_ALIGN) 

382 try: 

383 header_prefix = magic(*version) + struct.pack(fmt, hlen + padlen) 

384 except struct.error: 

385 msg = "Header length {} too big for version={}".format(hlen, version) 

386 raise ValueError(msg) from None 

387 

388 # Pad the header with spaces and a final newline such that the magic 

389 # string, the header-length short and the header are aligned on a 

390 # ARRAY_ALIGN byte boundary. This supports memory mapping of dtypes 

391 # aligned up to ARRAY_ALIGN on systems like Linux where mmap() 

392 # offset must be page-aligned (i.e. the beginning of the file). 

393 return header_prefix + header + b' '*padlen + b'\n' 

394 

395 

396def _wrap_header_guess_version(header): 

397 """ 

398 Like `_wrap_header`, but chooses an appropriate version given the contents 

399 """ 

400 try: 

401 return _wrap_header(header, (1, 0)) 

402 except ValueError: 

403 pass 

404 

405 try: 

406 ret = _wrap_header(header, (2, 0)) 

407 except UnicodeEncodeError: 

408 pass 

409 else: 

410 warnings.warn("Stored array in format 2.0. It can only be" 

411 "read by NumPy >= 1.9", UserWarning, stacklevel=2) 

412 return ret 

413 

414 header = _wrap_header(header, (3, 0)) 

415 warnings.warn("Stored array in format 3.0. It can only be " 

416 "read by NumPy >= 1.17", UserWarning, stacklevel=2) 

417 return header 

418 

419 

420def _write_array_header(fp, d, version=None): 

421 """ Write the header for an array and returns the version used 

422 

423 Parameters 

424 ---------- 

425 fp : filelike object 

426 d : dict 

427 This has the appropriate entries for writing its string representation 

428 to the header of the file. 

429 version : tuple or None 

430 None means use oldest that works. Providing an explicit version will 

431 raise a ValueError if the format does not allow saving this data. 

432 Default: None 

433 """ 

434 header = ["{"] 

435 for key, value in sorted(d.items()): 

436 # Need to use repr here, since we eval these when reading 

437 header.append("'%s': %s, " % (key, repr(value))) 

438 header.append("}") 

439 header = "".join(header) 

440 

441 # Add some spare space so that the array header can be modified in-place 

442 # when changing the array size, e.g. when growing it by appending data at 

443 # the end.  

444 shape = d['shape'] 

445 header += " " * ((GROWTH_AXIS_MAX_DIGITS - len(repr( 

446 shape[-1 if d['fortran_order'] else 0] 

447 ))) if len(shape) > 0 else 0) 

448 

449 if version is None: 

450 header = _wrap_header_guess_version(header) 

451 else: 

452 header = _wrap_header(header, version) 

453 fp.write(header) 

454 

455def write_array_header_1_0(fp, d): 

456 """ Write the header for an array using the 1.0 format. 

457 

458 Parameters 

459 ---------- 

460 fp : filelike object 

461 d : dict 

462 This has the appropriate entries for writing its string 

463 representation to the header of the file. 

464 """ 

465 _write_array_header(fp, d, (1, 0)) 

466 

467 

468def write_array_header_2_0(fp, d): 

469 """ Write the header for an array using the 2.0 format. 

470 The 2.0 format allows storing very large structured arrays. 

471 

472 .. versionadded:: 1.9.0 

473 

474 Parameters 

475 ---------- 

476 fp : filelike object 

477 d : dict 

478 This has the appropriate entries for writing its string 

479 representation to the header of the file. 

480 """ 

481 _write_array_header(fp, d, (2, 0)) 

482 

483def read_array_header_1_0(fp, max_header_size=_MAX_HEADER_SIZE): 

484 """ 

485 Read an array header from a filelike object using the 1.0 file format 

486 version. 

487 

488 This will leave the file object located just after the header. 

489 

490 Parameters 

491 ---------- 

492 fp : filelike object 

493 A file object or something with a `.read()` method like a file. 

494 

495 Returns 

496 ------- 

497 shape : tuple of int 

498 The shape of the array. 

499 fortran_order : bool 

500 The array data will be written out directly if it is either 

501 C-contiguous or Fortran-contiguous. Otherwise, it will be made 

502 contiguous before writing it out. 

503 dtype : dtype 

504 The dtype of the file's data. 

505 max_header_size : int, optional 

506 Maximum allowed size of the header. Large headers may not be safe 

507 to load securely and thus require explicitly passing a larger value. 

508 See :py:meth:`ast.literal_eval()` for details. 

509 

510 Raises 

511 ------ 

512 ValueError 

513 If the data is invalid. 

514 

515 """ 

516 return _read_array_header( 

517 fp, version=(1, 0), max_header_size=max_header_size) 

518 

519def read_array_header_2_0(fp, max_header_size=_MAX_HEADER_SIZE): 

520 """ 

521 Read an array header from a filelike object using the 2.0 file format 

522 version. 

523 

524 This will leave the file object located just after the header. 

525 

526 .. versionadded:: 1.9.0 

527 

528 Parameters 

529 ---------- 

530 fp : filelike object 

531 A file object or something with a `.read()` method like a file. 

532 max_header_size : int, optional 

533 Maximum allowed size of the header. Large headers may not be safe 

534 to load securely and thus require explicitly passing a larger value. 

535 See :py:meth:`ast.literal_eval()` for details. 

536 

537 Returns 

538 ------- 

539 shape : tuple of int 

540 The shape of the array. 

541 fortran_order : bool 

542 The array data will be written out directly if it is either 

543 C-contiguous or Fortran-contiguous. Otherwise, it will be made 

544 contiguous before writing it out. 

545 dtype : dtype 

546 The dtype of the file's data. 

547 

548 Raises 

549 ------ 

550 ValueError 

551 If the data is invalid. 

552 

553 """ 

554 return _read_array_header( 

555 fp, version=(2, 0), max_header_size=max_header_size) 

556 

557 

558def _filter_header(s): 

559 """Clean up 'L' in npz header ints. 

560 

561 Cleans up the 'L' in strings representing integers. Needed to allow npz 

562 headers produced in Python2 to be read in Python3. 

563 

564 Parameters 

565 ---------- 

566 s : string 

567 Npy file header. 

568 

569 Returns 

570 ------- 

571 header : str 

572 Cleaned up header. 

573 

574 """ 

575 import tokenize 

576 from io import StringIO 

577 

578 tokens = [] 

579 last_token_was_number = False 

580 for token in tokenize.generate_tokens(StringIO(s).readline): 

581 token_type = token[0] 

582 token_string = token[1] 

583 if (last_token_was_number and 

584 token_type == tokenize.NAME and 

585 token_string == "L"): 

586 continue 

587 else: 

588 tokens.append(token) 

589 last_token_was_number = (token_type == tokenize.NUMBER) 

590 return tokenize.untokenize(tokens) 

591 

592 

593def _read_array_header(fp, version, max_header_size=_MAX_HEADER_SIZE): 

594 """ 

595 see read_array_header_1_0 

596 """ 

597 # Read an unsigned, little-endian short int which has the length of the 

598 # header. 

599 import struct 

600 hinfo = _header_size_info.get(version) 

601 if hinfo is None: 

602 raise ValueError("Invalid version {!r}".format(version)) 

603 hlength_type, encoding = hinfo 

604 

605 hlength_str = _read_bytes(fp, struct.calcsize(hlength_type), "array header length") 

606 header_length = struct.unpack(hlength_type, hlength_str)[0] 

607 header = _read_bytes(fp, header_length, "array header") 

608 header = header.decode(encoding) 

609 if len(header) > max_header_size: 

610 raise ValueError( 

611 f"Header info length ({len(header)}) is large and may not be safe " 

612 "to load securely.\n" 

613 "To allow loading, adjust `max_header_size` or fully trust " 

614 "the `.npy` file using `allow_pickle=True`.\n" 

615 "For safety against large resource use or crashes, sandboxing " 

616 "may be necessary.") 

617 

618 # The header is a pretty-printed string representation of a literal 

619 # Python dictionary with trailing newlines padded to a ARRAY_ALIGN byte 

620 # boundary. The keys are strings. 

621 # "shape" : tuple of int 

622 # "fortran_order" : bool 

623 # "descr" : dtype.descr 

624 # Versions (2, 0) and (1, 0) could have been created by a Python 2 

625 # implementation before header filtering was implemented. 

626 # 

627 # For performance reasons, we try without _filter_header first though 

628 try: 

629 d = safe_eval(header) 

630 except SyntaxError as e: 

631 if version <= (2, 0): 

632 header = _filter_header(header) 

633 try: 

634 d = safe_eval(header) 

635 except SyntaxError as e2: 

636 msg = "Cannot parse header: {!r}" 

637 raise ValueError(msg.format(header)) from e2 

638 else: 

639 warnings.warn( 

640 "Reading `.npy` or `.npz` file required additional " 

641 "header parsing as it was created on Python 2. Save the " 

642 "file again to speed up loading and avoid this warning.", 

643 UserWarning, stacklevel=4) 

644 else: 

645 msg = "Cannot parse header: {!r}" 

646 raise ValueError(msg.format(header)) from e 

647 if not isinstance(d, dict): 

648 msg = "Header is not a dictionary: {!r}" 

649 raise ValueError(msg.format(d)) 

650 

651 if EXPECTED_KEYS != d.keys(): 

652 keys = sorted(d.keys()) 

653 msg = "Header does not contain the correct keys: {!r}" 

654 raise ValueError(msg.format(keys)) 

655 

656 # Sanity-check the values. 

657 if (not isinstance(d['shape'], tuple) or 

658 not all(isinstance(x, int) for x in d['shape'])): 

659 msg = "shape is not valid: {!r}" 

660 raise ValueError(msg.format(d['shape'])) 

661 if not isinstance(d['fortran_order'], bool): 

662 msg = "fortran_order is not a valid bool: {!r}" 

663 raise ValueError(msg.format(d['fortran_order'])) 

664 try: 

665 dtype = descr_to_dtype(d['descr']) 

666 except TypeError as e: 

667 msg = "descr is not a valid dtype descriptor: {!r}" 

668 raise ValueError(msg.format(d['descr'])) from e 

669 

670 return d['shape'], d['fortran_order'], dtype 

671 

672def write_array(fp, array, version=None, allow_pickle=True, pickle_kwargs=None): 

673 """ 

674 Write an array to an NPY file, including a header. 

675 

676 If the array is neither C-contiguous nor Fortran-contiguous AND the 

677 file_like object is not a real file object, this function will have to 

678 copy data in memory. 

679 

680 Parameters 

681 ---------- 

682 fp : file_like object 

683 An open, writable file object, or similar object with a 

684 ``.write()`` method. 

685 array : ndarray 

686 The array to write to disk. 

687 version : (int, int) or None, optional 

688 The version number of the format. None means use the oldest 

689 supported version that is able to store the data. Default: None 

690 allow_pickle : bool, optional 

691 Whether to allow writing pickled data. Default: True 

692 pickle_kwargs : dict, optional 

693 Additional keyword arguments to pass to pickle.dump, excluding 

694 'protocol'. These are only useful when pickling objects in object 

695 arrays on Python 3 to Python 2 compatible format. 

696 

697 Raises 

698 ------ 

699 ValueError 

700 If the array cannot be persisted. This includes the case of 

701 allow_pickle=False and array being an object array. 

702 Various other errors 

703 If the array contains Python objects as part of its dtype, the 

704 process of pickling them may raise various errors if the objects 

705 are not picklable. 

706 

707 """ 

708 _check_version(version) 

709 _write_array_header(fp, header_data_from_array_1_0(array), version) 

710 

711 if array.itemsize == 0: 

712 buffersize = 0 

713 else: 

714 # Set buffer size to 16 MiB to hide the Python loop overhead. 

715 buffersize = max(16 * 1024 ** 2 // array.itemsize, 1) 

716 

717 if array.dtype.hasobject: 

718 # We contain Python objects so we cannot write out the data 

719 # directly. Instead, we will pickle it out 

720 if not allow_pickle: 

721 raise ValueError("Object arrays cannot be saved when " 

722 "allow_pickle=False") 

723 if pickle_kwargs is None: 

724 pickle_kwargs = {} 

725 pickle.dump(array, fp, protocol=3, **pickle_kwargs) 

726 elif array.flags.f_contiguous and not array.flags.c_contiguous: 

727 if isfileobj(fp): 

728 array.T.tofile(fp) 

729 else: 

730 for chunk in numpy.nditer( 

731 array, flags=['external_loop', 'buffered', 'zerosize_ok'], 

732 buffersize=buffersize, order='F'): 

733 fp.write(chunk.tobytes('C')) 

734 else: 

735 if isfileobj(fp): 

736 array.tofile(fp) 

737 else: 

738 for chunk in numpy.nditer( 

739 array, flags=['external_loop', 'buffered', 'zerosize_ok'], 

740 buffersize=buffersize, order='C'): 

741 fp.write(chunk.tobytes('C')) 

742 

743 

744def read_array(fp, allow_pickle=False, pickle_kwargs=None, *, 

745 max_header_size=_MAX_HEADER_SIZE): 

746 """ 

747 Read an array from an NPY file. 

748 

749 Parameters 

750 ---------- 

751 fp : file_like object 

752 If this is not a real file object, then this may take extra memory 

753 and time. 

754 allow_pickle : bool, optional 

755 Whether to allow writing pickled data. Default: False 

756 

757 .. versionchanged:: 1.16.3 

758 Made default False in response to CVE-2019-6446. 

759 

760 pickle_kwargs : dict 

761 Additional keyword arguments to pass to pickle.load. These are only 

762 useful when loading object arrays saved on Python 2 when using 

763 Python 3. 

764 max_header_size : int, optional 

765 Maximum allowed size of the header. Large headers may not be safe 

766 to load securely and thus require explicitly passing a larger value. 

767 See :py:meth:`ast.literal_eval()` for details. 

768 This option is ignored when `allow_pickle` is passed. In that case 

769 the file is by definition trusted and the limit is unnecessary. 

770 

771 Returns 

772 ------- 

773 array : ndarray 

774 The array from the data on disk. 

775 

776 Raises 

777 ------ 

778 ValueError 

779 If the data is invalid, or allow_pickle=False and the file contains 

780 an object array. 

781 

782 """ 

783 if allow_pickle: 

784 # Effectively ignore max_header_size, since `allow_pickle` indicates 

785 # that the input is fully trusted. 

786 max_header_size = 2**64 

787 

788 version = read_magic(fp) 

789 _check_version(version) 

790 shape, fortran_order, dtype = _read_array_header( 

791 fp, version, max_header_size=max_header_size) 

792 if len(shape) == 0: 

793 count = 1 

794 else: 

795 count = numpy.multiply.reduce(shape, dtype=numpy.int64) 

796 

797 # Now read the actual data. 

798 if dtype.hasobject: 

799 # The array contained Python objects. We need to unpickle the data. 

800 if not allow_pickle: 

801 raise ValueError("Object arrays cannot be loaded when " 

802 "allow_pickle=False") 

803 if pickle_kwargs is None: 

804 pickle_kwargs = {} 

805 try: 

806 array = pickle.load(fp, **pickle_kwargs) 

807 except UnicodeError as err: 

808 # Friendlier error message 

809 raise UnicodeError("Unpickling a python object failed: %r\n" 

810 "You may need to pass the encoding= option " 

811 "to numpy.load" % (err,)) from err 

812 else: 

813 if isfileobj(fp): 

814 # We can use the fast fromfile() function. 

815 array = numpy.fromfile(fp, dtype=dtype, count=count) 

816 else: 

817 # This is not a real file. We have to read it the 

818 # memory-intensive way. 

819 # crc32 module fails on reads greater than 2 ** 32 bytes, 

820 # breaking large reads from gzip streams. Chunk reads to 

821 # BUFFER_SIZE bytes to avoid issue and reduce memory overhead 

822 # of the read. In non-chunked case count < max_read_count, so 

823 # only one read is performed. 

824 

825 # Use np.ndarray instead of np.empty since the latter does 

826 # not correctly instantiate zero-width string dtypes; see 

827 # https://github.com/numpy/numpy/pull/6430 

828 array = numpy.ndarray(count, dtype=dtype) 

829 

830 if dtype.itemsize > 0: 

831 # If dtype.itemsize == 0 then there's nothing more to read 

832 max_read_count = BUFFER_SIZE // min(BUFFER_SIZE, dtype.itemsize) 

833 

834 for i in range(0, count, max_read_count): 

835 read_count = min(max_read_count, count - i) 

836 read_size = int(read_count * dtype.itemsize) 

837 data = _read_bytes(fp, read_size, "array data") 

838 array[i:i+read_count] = numpy.frombuffer(data, dtype=dtype, 

839 count=read_count) 

840 

841 if fortran_order: 

842 array.shape = shape[::-1] 

843 array = array.transpose() 

844 else: 

845 array.shape = shape 

846 

847 return array 

848 

849 

850def open_memmap(filename, mode='r+', dtype=None, shape=None, 

851 fortran_order=False, version=None, *, 

852 max_header_size=_MAX_HEADER_SIZE): 

853 """ 

854 Open a .npy file as a memory-mapped array. 

855 

856 This may be used to read an existing file or create a new one. 

857 

858 Parameters 

859 ---------- 

860 filename : str or path-like 

861 The name of the file on disk. This may *not* be a file-like 

862 object. 

863 mode : str, optional 

864 The mode in which to open the file; the default is 'r+'. In 

865 addition to the standard file modes, 'c' is also accepted to mean 

866 "copy on write." See `memmap` for the available mode strings. 

867 dtype : data-type, optional 

868 The data type of the array if we are creating a new file in "write" 

869 mode, if not, `dtype` is ignored. The default value is None, which 

870 results in a data-type of `float64`. 

871 shape : tuple of int 

872 The shape of the array if we are creating a new file in "write" 

873 mode, in which case this parameter is required. Otherwise, this 

874 parameter is ignored and is thus optional. 

875 fortran_order : bool, optional 

876 Whether the array should be Fortran-contiguous (True) or 

877 C-contiguous (False, the default) if we are creating a new file in 

878 "write" mode. 

879 version : tuple of int (major, minor) or None 

880 If the mode is a "write" mode, then this is the version of the file 

881 format used to create the file. None means use the oldest 

882 supported version that is able to store the data. Default: None 

883 max_header_size : int, optional 

884 Maximum allowed size of the header. Large headers may not be safe 

885 to load securely and thus require explicitly passing a larger value. 

886 See :py:meth:`ast.literal_eval()` for details. 

887 

888 Returns 

889 ------- 

890 marray : memmap 

891 The memory-mapped array. 

892 

893 Raises 

894 ------ 

895 ValueError 

896 If the data or the mode is invalid. 

897 OSError 

898 If the file is not found or cannot be opened correctly. 

899 

900 See Also 

901 -------- 

902 numpy.memmap 

903 

904 """ 

905 if isfileobj(filename): 

906 raise ValueError("Filename must be a string or a path-like object." 

907 " Memmap cannot use existing file handles.") 

908 

909 if 'w' in mode: 

910 # We are creating the file, not reading it. 

911 # Check if we ought to create the file. 

912 _check_version(version) 

913 # Ensure that the given dtype is an authentic dtype object rather 

914 # than just something that can be interpreted as a dtype object. 

915 dtype = numpy.dtype(dtype) 

916 if dtype.hasobject: 

917 msg = "Array can't be memory-mapped: Python objects in dtype." 

918 raise ValueError(msg) 

919 d = dict( 

920 descr=dtype_to_descr(dtype), 

921 fortran_order=fortran_order, 

922 shape=shape, 

923 ) 

924 # If we got here, then it should be safe to create the file. 

925 with open(os_fspath(filename), mode+'b') as fp: 

926 _write_array_header(fp, d, version) 

927 offset = fp.tell() 

928 else: 

929 # Read the header of the file first. 

930 with open(os_fspath(filename), 'rb') as fp: 

931 version = read_magic(fp) 

932 _check_version(version) 

933 

934 shape, fortran_order, dtype = _read_array_header( 

935 fp, version, max_header_size=max_header_size) 

936 if dtype.hasobject: 

937 msg = "Array can't be memory-mapped: Python objects in dtype." 

938 raise ValueError(msg) 

939 offset = fp.tell() 

940 

941 if fortran_order: 

942 order = 'F' 

943 else: 

944 order = 'C' 

945 

946 # We need to change a write-only mode to a read-write mode since we've 

947 # already written data to the file. 

948 if mode == 'w+': 

949 mode = 'r+' 

950 

951 marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order, 

952 mode=mode, offset=offset) 

953 

954 return marray 

955 

956 

957def _read_bytes(fp, size, error_template="ran out of data"): 

958 """ 

959 Read from file-like object until size bytes are read. 

960 Raises ValueError if not EOF is encountered before size bytes are read. 

961 Non-blocking objects only supported if they derive from io objects. 

962 

963 Required as e.g. ZipExtFile in python 2.6 can return less data than 

964 requested. 

965 """ 

966 data = bytes() 

967 while True: 

968 # io files (default in python3) return None or raise on 

969 # would-block, python2 file will truncate, probably nothing can be 

970 # done about that. note that regular files can't be non-blocking 

971 try: 

972 r = fp.read(size - len(data)) 

973 data += r 

974 if len(r) == 0 or len(data) == size: 

975 break 

976 except BlockingIOError: 

977 pass 

978 if len(data) != size: 

979 msg = "EOF: reading %s, expected %d bytes got %d" 

980 raise ValueError(msg % (error_template, size, len(data))) 

981 else: 

982 return data