Coverage for /pythoncovmergedfiles/medio/medio/usr/local/lib/python3.9/dist-packages/numpy/lib/format.py: 12%


1""" 

2Binary serialization 

3 

4NPY format 

5========== 

6 

7A simple format for saving numpy arrays to disk with the full 

8information about them. 

9 

10The ``.npy`` format is the standard binary file format in NumPy for 

11persisting a *single* arbitrary NumPy array on disk. The format stores all 

12of the shape and dtype information necessary to reconstruct the array 

13correctly even on another machine with a different architecture. 

14The format is designed to be as simple as possible while achieving 

15its limited goals. 

16 

17The ``.npz`` format is the standard format for persisting *multiple* NumPy 

18arrays on disk. A ``.npz`` file is a zip file containing multiple ``.npy`` 

19files, one for each array. 

20 
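
For illustration, the members of an ``.npz`` can be inspected with the
standard library's ``zipfile`` alone (a sketch; the array name ``a`` is
arbitrary)::

    import io
    import zipfile

    import numpy as np

    buf = io.BytesIO()
    np.savez(buf, a=np.arange(3))
    buf.seek(0)
    print(zipfile.ZipFile(buf).namelist())  # ['a.npy']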

Capabilities
------------

- Can represent all NumPy arrays including nested record arrays and
  object arrays.

- Represents the data in its native binary form.

- Supports Fortran-contiguous arrays directly.

- Stores all of the necessary information to reconstruct the array
  including shape and dtype on a machine of a different
  architecture. Both little-endian and big-endian arrays are
  supported, and a file with little-endian numbers will yield
  a little-endian array on any machine reading the file. The
  types are described in terms of their actual sizes. For example,
  if a machine with a 64-bit C "long int" writes out an array with
  "long ints", a reading machine with 32-bit C "long ints" will yield
  an array with 64-bit integers.

- Is straightforward to reverse engineer. Datasets often live longer than
  the programs that created them. A competent developer should be
  able to create a solution in their preferred programming language to
  read most ``.npy`` files that they have been given without much
  documentation.

- Allows memory-mapping of the data. See `open_memmap`.

- Can be read from a filelike stream object instead of an actual file.

- Stores object arrays, i.e. arrays containing elements that are arbitrary
  Python objects. Files with object arrays are not mmapable, but
  can be read and written to disk.

Limitations
-----------

- Arbitrary subclasses of numpy.ndarray are not completely preserved.
  Subclasses will be accepted for writing, but only the array data will
  be written out. A regular numpy.ndarray object will be created
  upon reading the file.

.. warning::

    Due to limitations in the interpretation of structured dtypes, dtypes
    with fields with empty names will have the names replaced by 'f0', 'f1',
    etc. Such arrays will not round-trip through the format entirely
    accurately. The data is intact; only the field names will differ. We are
    working on a fix for this. This fix will not require a change in the
    file format. The arrays with such structures can still be saved and
    restored, and the correct dtype may be restored by using the
    ``loadedarray.view(correct_dtype)`` method.

File extensions
---------------

We recommend using the ``.npy`` and ``.npz`` extensions for files saved
in this format. This is by no means a requirement; applications may wish
to use these file formats but use an extension specific to the
application. In the absence of an obvious alternative, however,
we suggest using ``.npy`` and ``.npz``.

Version numbering
-----------------

The version numbering of these formats is independent of NumPy version
numbering. If the format is upgraded, the code in `numpy.lib.format` will
still be able to read and write Version 1.0 files.

Format Version 1.0
------------------

The first 6 bytes are a magic string: exactly ``\\x93NUMPY``.

The next 1 byte is an unsigned byte: the major version number of the file
format, e.g. ``\\x01``.

The next 1 byte is an unsigned byte: the minor version number of the file
format, e.g. ``\\x00``. Note: the version of the file format is not tied
to the version of the numpy package.

The next 2 bytes form a little-endian unsigned short int: the length of
the header data HEADER_LEN.

The next HEADER_LEN bytes form the header data describing the array's
format. It is an ASCII string which contains a Python literal expression
of a dictionary. It is terminated by a newline (``\\n``) and padded with
spaces (``\\x20``) to make the total of
``len(magic string) + 2 + len(length) + HEADER_LEN`` be evenly divisible
by 64 for alignment purposes.

The dictionary contains three keys:

    "descr" : dtype.descr
        An object that can be passed as an argument to the `numpy.dtype`
        constructor to create the array's dtype.
    "fortran_order" : bool
        Whether the array data is Fortran-contiguous or not. Since
        Fortran-contiguous arrays are a common form of non-C-contiguity,
        we allow them to be written directly to disk for efficiency.
    "shape" : tuple of int
        The shape of the array.

For repeatability and readability, the dictionary keys are sorted in
alphabetic order. This is for convenience only. A writer SHOULD implement
this if possible. A reader MUST NOT depend on this.

Following the header comes the array data. If the dtype contains Python
objects (i.e. ``dtype.hasobject is True``), then the data is a Python
pickle of the array. Otherwise the data is the contiguous (either C-
or Fortran-, depending on ``fortran_order``) bytes of the array.
Consumers can figure out the number of bytes by multiplying the number
of elements given by the shape (noting that ``shape=()`` means there is
1 element) by ``dtype.itemsize``.
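
As an illustration of the layout above, a minimal reader sketch for
version 1.0 files that are trusted and contain no Python objects (error
handling omitted) could look like::

    import ast
    import struct

    import numpy as np

    def read_npy_v1(fp):
        assert fp.read(6) == b'\\x93NUMPY'   # magic string
        major, minor = fp.read(2)            # version bytes
        assert (major, minor) == (1, 0)
        header_len, = struct.unpack('<H', fp.read(2))
        header = ast.literal_eval(fp.read(header_len).decode('latin1'))
        dtype = np.dtype(header['descr'])
        order = 'F' if header['fortran_order'] else 'C'
        count = int(np.prod(header['shape'], dtype=np.int64))
        data = np.frombuffer(fp.read(count * dtype.itemsize), dtype=dtype)
        return data.reshape(header['shape'], order=order)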

Format Version 2.0
------------------

The version 1.0 format only allowed the array header to have a total size of
65535 bytes. This can be exceeded by structured arrays with a large number of
columns. The version 2.0 format extends the header size to 4 GiB.
`numpy.save` will automatically save in 2.0 format if the data requires it,
else it will always use the more compatible 1.0 format.

The description of the fourth element of the header therefore has become:
"The next 4 bytes form a little-endian unsigned int: the length of the header
data HEADER_LEN."

Format Version 3.0
------------------

This version replaces the ASCII string (which in practice was latin1) with
a utf8-encoded string, so it supports structured types with any unicode
field names.

Notes
-----
The ``.npy`` format, including the motivation for creating it and a
comparison of alternatives, is described in the
:doc:`"npy-format" NEP <neps:nep-0001-npy-format>`; however, details have
evolved with time and this document is more current.

"""

import numpy
import warnings
from numpy.lib.utils import safe_eval, drop_metadata
from numpy.compat import (
    isfileobj, os_fspath, pickle
    )


__all__ = []


EXPECTED_KEYS = {'descr', 'fortran_order', 'shape'}
MAGIC_PREFIX = b'\x93NUMPY'
MAGIC_LEN = len(MAGIC_PREFIX) + 2
ARRAY_ALIGN = 64  # plausible values are powers of 2 between 16 and 4096
BUFFER_SIZE = 2**18  # size of buffer for reading npz files in bytes
# allow growth within the address space of a 64 bit machine along one axis
GROWTH_AXIS_MAX_DIGITS = 21  # = len(str(8*2**64-1)) hypothetical int1 dtype

# difference between version 1.0 and 2.0 is a 4 byte (I) header length
# instead of 2 bytes (H) allowing storage of large structured arrays
_header_size_info = {
    (1, 0): ('<H', 'latin1'),
    (2, 0): ('<I', 'latin1'),
    (3, 0): ('<I', 'utf8'),
}

# Python's literal_eval is not actually safe for large inputs, since parsing
# may become slow or even cause interpreter crashes.
# This is an arbitrary, low limit which should make it safe in practice.
_MAX_HEADER_SIZE = 10000

def _check_version(version):
    if version not in [(1, 0), (2, 0), (3, 0), None]:
        msg = "we only support format version (1,0), (2,0), and (3,0), not %s"
        raise ValueError(msg % (version,))

def magic(major, minor):
    """ Return the magic string for the given file format version.

    Parameters
    ----------
    major : int in [0, 255]
    minor : int in [0, 255]

    Returns
    -------
    magic : bytes

    Raises
    ------
    ValueError if the version cannot be formatted.
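
    Examples
    --------
    A quick check of the version (1, 0) prefix (escapes written as in the
    module docstring):

    >>> magic(1, 0)
    b'\\x93NUMPY\\x01\\x00'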

    """
    if major < 0 or major > 255:
        raise ValueError("major version must be 0 <= major < 256")
    if minor < 0 or minor > 255:
        raise ValueError("minor version must be 0 <= minor < 256")
    return MAGIC_PREFIX + bytes([major, minor])

def read_magic(fp):
    """ Read the magic string to get the version of the file format.

    Parameters
    ----------
    fp : filelike object

    Returns
    -------
    major : int
    minor : int
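
    Examples
    --------
    A round trip through an in-memory stream:

    >>> from io import BytesIO
    >>> read_magic(BytesIO(magic(1, 0)))
    (1, 0)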

    """
    magic_str = _read_bytes(fp, MAGIC_LEN, "magic string")
    if magic_str[:-2] != MAGIC_PREFIX:
        msg = "the magic string is not correct; expected %r, got %r"
        raise ValueError(msg % (MAGIC_PREFIX, magic_str[:-2]))
    major, minor = magic_str[-2:]
    return major, minor


def dtype_to_descr(dtype):
    """
    Get a serializable descriptor from the dtype.

    The .descr attribute of a dtype object cannot be round-tripped through
    the dtype() constructor. Simple types, like dtype('float32'), have
    a descr which looks like a record array with one field with '' as
    a name. The dtype() constructor interprets this as a request to give
    a default name. Instead, we construct a descriptor that can be passed to
    dtype().

    Parameters
    ----------
    dtype : dtype
        The dtype of the array that will be written to disk.

    Returns
    -------
    descr : object
        An object that can be passed to `numpy.dtype()` in order to
        replicate the input dtype.

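    Examples
    --------
    Simple and structured dtypes for illustration:

    >>> import numpy as np
    >>> dtype_to_descr(np.dtype('<f8'))
    '<f8'
    >>> dtype_to_descr(np.dtype([('x', '<i4'), ('y', '<f8')]))
    [('x', '<i4'), ('y', '<f8')]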

    """
    # NOTE: drop_metadata may not return the right dtype e.g. for user
    # dtypes. In that case our code below would fail the same, though.
    new_dtype = drop_metadata(dtype)
    if new_dtype is not dtype:
        warnings.warn("metadata on a dtype is not saved to an npy/npz. "
                      "Use another format (such as pickle) to store it.",
                      UserWarning, stacklevel=2)
    if dtype.names is not None:
        # This is a record array. The .descr is fine. XXX: parts of the
        # record array with an empty name, like padding bytes, still get
        # fiddled with. This needs to be fixed in the C implementation of
        # dtype().
        return dtype.descr
    else:
        return dtype.str

def descr_to_dtype(descr):
    """
    Returns a dtype based off the given description.

    This is essentially the reverse of `dtype_to_descr()`. It will remove
    the valueless padding fields that appear in a ``.descr`` (entries with
    an empty name and a void type) and then convert the description to its
    corresponding dtype.

    Parameters
    ----------
    descr : object
        The object retrieved by dtype.descr. Can be passed to
        `numpy.dtype()` in order to replicate the input dtype.

    Returns
    -------
    dtype : dtype
        The dtype constructed by the description.

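    Examples
    --------
    Illustrative round trips (see `dtype_to_descr` for the forward
    direction):

    >>> descr_to_dtype('<f8')
    dtype('float64')
    >>> descr_to_dtype([('x', '<i4'), ('y', '<f8')])
    dtype([('x', '<i4'), ('y', '<f8')])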

    """
    if isinstance(descr, str):
        # No padding removal needed
        return numpy.dtype(descr)
    elif isinstance(descr, tuple):
        # subtype, will always have a shape descr[1]
        dt = descr_to_dtype(descr[0])
        return numpy.dtype((dt, descr[1]))

    titles = []
    names = []
    formats = []
    offsets = []
    offset = 0
    for field in descr:
        if len(field) == 2:
            name, descr_str = field
            dt = descr_to_dtype(descr_str)
        else:
            name, descr_str, shape = field
            dt = numpy.dtype((descr_to_dtype(descr_str), shape))

        # Ignore padding bytes, which will be void bytes with '' as name
        # Once support for blank names is removed, only "if name == ''" needed
        is_pad = (name == '' and dt.type is numpy.void and dt.names is None)
        if not is_pad:
            title, name = name if isinstance(name, tuple) else (None, name)
            titles.append(title)
            names.append(name)
            formats.append(dt)
            offsets.append(offset)
        offset += dt.itemsize

    return numpy.dtype({'names': names, 'formats': formats, 'titles': titles,
                        'offsets': offsets, 'itemsize': offset})

def header_data_from_array_1_0(array):
    """ Get the dictionary of header metadata from a numpy.ndarray.

    Parameters
    ----------
    array : numpy.ndarray

    Returns
    -------
    d : dict
        This has the appropriate entries for writing its string representation
        to the header of the file.
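
    Examples
    --------
    For a small C-contiguous array (``'<f8'`` requested explicitly so the
    result does not depend on the writing machine):

    >>> import numpy as np
    >>> header_data_from_array_1_0(np.zeros((3, 4), dtype='<f8'))
    {'shape': (3, 4), 'fortran_order': False, 'descr': '<f8'}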

    """
    d = {'shape': array.shape}
    if array.flags.c_contiguous:
        d['fortran_order'] = False
    elif array.flags.f_contiguous:
        d['fortran_order'] = True
    else:
        # Totally non-contiguous data. We will have to make it C-contiguous
        # before writing. Note that we need to test for C_CONTIGUOUS first
        # because a 1-D array is both C_CONTIGUOUS and F_CONTIGUOUS.
        d['fortran_order'] = False

    d['descr'] = dtype_to_descr(array.dtype)
    return d


def _wrap_header(header, version):
    """
    Takes a stringified header, and attaches the prefix and padding to it
    """
    import struct
    assert version is not None
    fmt, encoding = _header_size_info[version]
    header = header.encode(encoding)
    hlen = len(header) + 1
    padlen = ARRAY_ALIGN - ((MAGIC_LEN + struct.calcsize(fmt) + hlen) % ARRAY_ALIGN)
    try:
        header_prefix = magic(*version) + struct.pack(fmt, hlen + padlen)
    except struct.error:
        msg = "Header length {} too big for version={}".format(hlen, version)
        raise ValueError(msg) from None

    # Pad the header with spaces and a final newline such that the magic
    # string, the header-length short and the header are aligned on an
    # ARRAY_ALIGN byte boundary. This supports memory mapping of dtypes
    # aligned up to ARRAY_ALIGN on systems like Linux where mmap()
    # offset must be page-aligned (i.e. the beginning of the file).
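    # As a worked example (illustration only): under version (1, 0),
    # MAGIC_LEN is 8 and the length field is 2 bytes, so a 108-byte encoded
    # header (hlen = 109 including the newline) gets
    # padlen = 64 - ((8 + 2 + 109) % 64) = 9, i.e. 128 bytes in total.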

    return header_prefix + header + b' '*padlen + b'\n'


def _wrap_header_guess_version(header):
    """
    Like `_wrap_header`, but chooses an appropriate version given the contents
    """
    try:
        return _wrap_header(header, (1, 0))
    except ValueError:
        pass

    try:
        ret = _wrap_header(header, (2, 0))
    except UnicodeEncodeError:
        pass
    else:
        warnings.warn("Stored array in format 2.0. It can only be "
                      "read by NumPy >= 1.9", UserWarning, stacklevel=2)
        return ret

    header = _wrap_header(header, (3, 0))
    warnings.warn("Stored array in format 3.0. It can only be "
                  "read by NumPy >= 1.17", UserWarning, stacklevel=2)
    return header


def _write_array_header(fp, d, version=None):
    """ Write the header for an array using the given version.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string representation
        to the header of the file.
    version : tuple or None
        None means use the oldest version that works. Providing an explicit
        version will raise a ValueError if the format does not allow saving
        this data. Default: None
    """
    header = ["{"]
    for key, value in sorted(d.items()):
        # Need to use repr here, since we eval these when reading
        header.append("'%s': %s, " % (key, repr(value)))
    header.append("}")
    header = "".join(header)

    # Add some spare space so that the array header can be modified in-place
    # when changing the array size, e.g. when growing it by appending data at
    # the end.
    shape = d['shape']
    header += " " * ((GROWTH_AXIS_MAX_DIGITS - len(repr(
        shape[-1 if d['fortran_order'] else 0]
    ))) if len(shape) > 0 else 0)

    if version is None:
        header = _wrap_header_guess_version(header)
    else:
        header = _wrap_header(header, version)
    fp.write(header)

def write_array_header_1_0(fp, d):
    """ Write the header for an array using the 1.0 format.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (1, 0))


def write_array_header_2_0(fp, d):
    """ Write the header for an array using the 2.0 format.
        The 2.0 format allows storing very large structured arrays.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (2, 0))

def read_array_header_1_0(fp, max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array header from a filelike object using the 1.0 file format
    version.

    This will leave the file object located just after the header.

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval()` for details.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        The array data will be written out directly if it is either
        C-contiguous or Fortran-contiguous. Otherwise, it will be made
        contiguous before writing it out.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

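    Examples
    --------
    A round-trip sketch through an in-memory stream (``'<i8'`` maps to
    ``dtype('int64')``):

    >>> from io import BytesIO
    >>> fp = BytesIO()
    >>> write_array_header_1_0(
    ...     fp, {'descr': '<i8', 'fortran_order': False, 'shape': (3,)})
    >>> _ = fp.seek(8)  # skip the magic string and the two version bytes
    >>> read_array_header_1_0(fp)
    ((3,), False, dtype('int64'))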

    """
    return _read_array_header(
        fp, version=(1, 0), max_header_size=max_header_size)

def read_array_header_2_0(fp, max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array header from a filelike object using the 2.0 file format
    version.

    This will leave the file object located just after the header.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval()` for details.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        The array data will be written out directly if it is either
        C-contiguous or Fortran-contiguous. Otherwise, it will be made
        contiguous before writing it out.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    return _read_array_header(
        fp, version=(2, 0), max_header_size=max_header_size)


def _filter_header(s):
    """Clean up 'L' in npz header ints.

    Cleans up the 'L' in strings representing integers. Needed to allow npz
    headers produced in Python 2 to be read in Python 3.

    Parameters
    ----------
    s : string
        Npy file header.

    Returns
    -------
    header : str
        Cleaned up header.

    """
    import tokenize
    from io import StringIO

    tokens = []
    last_token_was_number = False
    for token in tokenize.generate_tokens(StringIO(s).readline):
        token_type = token[0]
        token_string = token[1]
        if (last_token_was_number and
                token_type == tokenize.NAME and
                token_string == "L"):
            continue
        else:
            tokens.append(token)
        last_token_was_number = (token_type == tokenize.NUMBER)
    return tokenize.untokenize(tokens)


def _read_array_header(fp, version, max_header_size=_MAX_HEADER_SIZE):
    """
    see read_array_header_1_0
    """
    # Read an unsigned, little-endian short int which has the length of the
    # header.
    import struct
    hinfo = _header_size_info.get(version)
    if hinfo is None:
        raise ValueError("Invalid version {!r}".format(version))
    hlength_type, encoding = hinfo

    hlength_str = _read_bytes(fp, struct.calcsize(hlength_type), "array header length")
    header_length = struct.unpack(hlength_type, hlength_str)[0]
    header = _read_bytes(fp, header_length, "array header")
    header = header.decode(encoding)
    if len(header) > max_header_size:
        raise ValueError(
            f"Header info length ({len(header)}) is large and may not be safe "
            "to load securely.\n"
            "To allow loading, adjust `max_header_size` or fully trust "
            "the `.npy` file using `allow_pickle=True`.\n"
            "For safety against large resource use or crashes, sandboxing "
            "may be necessary.")

    # The header is a pretty-printed string representation of a literal
    # Python dictionary with trailing newlines padded to an ARRAY_ALIGN byte
    # boundary. The keys are strings.
    #   "shape" : tuple of int
    #   "fortran_order" : bool
    #   "descr" : dtype.descr
    # Versions (2, 0) and (1, 0) could have been created by a Python 2
    # implementation before header filtering was implemented.
    #
    # For performance reasons, we try without _filter_header first though
    try:
        d = safe_eval(header)
    except SyntaxError as e:
        if version <= (2, 0):
            header = _filter_header(header)
            try:
                d = safe_eval(header)
            except SyntaxError as e2:
                msg = "Cannot parse header: {!r}"
                raise ValueError(msg.format(header)) from e2
            else:
                warnings.warn(
                    "Reading `.npy` or `.npz` file required additional "
                    "header parsing as it was created on Python 2. Save the "
                    "file again to speed up loading and avoid this warning.",
                    UserWarning, stacklevel=4)
        else:
            msg = "Cannot parse header: {!r}"
            raise ValueError(msg.format(header)) from e
    if not isinstance(d, dict):
        msg = "Header is not a dictionary: {!r}"
        raise ValueError(msg.format(d))

    if EXPECTED_KEYS != d.keys():
        keys = sorted(d.keys())
        msg = "Header does not contain the correct keys: {!r}"
        raise ValueError(msg.format(keys))

    # Sanity-check the values.
    if (not isinstance(d['shape'], tuple) or
            not all(isinstance(x, int) for x in d['shape'])):
        msg = "shape is not valid: {!r}"
        raise ValueError(msg.format(d['shape']))
    if not isinstance(d['fortran_order'], bool):
        msg = "fortran_order is not a valid bool: {!r}"
        raise ValueError(msg.format(d['fortran_order']))
    try:
        dtype = descr_to_dtype(d['descr'])
    except TypeError as e:
        msg = "descr is not a valid dtype descriptor: {!r}"
        raise ValueError(msg.format(d['descr'])) from e

    return d['shape'], d['fortran_order'], dtype

def write_array(fp, array, version=None, allow_pickle=True, pickle_kwargs=None):
    """
    Write an array to an NPY file, including a header.

    If the array is neither C-contiguous nor Fortran-contiguous AND the
    file_like object is not a real file object, this function will have to
    copy data in memory.

    Parameters
    ----------
    fp : file_like object
        An open, writable file object, or similar object with a
        ``.write()`` method.
    array : ndarray
        The array to write to disk.
    version : (int, int) or None, optional
        The version number of the format. None means use the oldest
        supported version that is able to store the data. Default: None
    allow_pickle : bool, optional
        Whether to allow writing pickled data. Default: True
    pickle_kwargs : dict, optional
        Additional keyword arguments to pass to pickle.dump, excluding
        'protocol'. These are only useful when pickling objects in object
        arrays on Python 3 to Python 2 compatible format.

    Raises
    ------
    ValueError
        If the array cannot be persisted. This includes the case of
        allow_pickle=False and array being an object array.
    Various other errors
        If the array contains Python objects as part of its dtype, the
        process of pickling them may raise various errors if the objects
        are not picklable.

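    Examples
    --------
    A round-trip sketch through an in-memory stream (float64 data keeps the
    printed result platform-independent):

    >>> import numpy as np
    >>> from io import BytesIO
    >>> fp = BytesIO()
    >>> write_array(fp, np.arange(3.0))
    >>> _ = fp.seek(0)
    >>> read_array(fp)
    array([0., 1., 2.])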

    """
    _check_version(version)
    _write_array_header(fp, header_data_from_array_1_0(array), version)

    if array.itemsize == 0:
        buffersize = 0
    else:
        # Set buffer size to 16 MiB to hide the Python loop overhead.
        buffersize = max(16 * 1024 ** 2 // array.itemsize, 1)

    if array.dtype.hasobject:
        # We contain Python objects so we cannot write out the data
        # directly. Instead, we will pickle it out
        if not allow_pickle:
            raise ValueError("Object arrays cannot be saved when "
                             "allow_pickle=False")
        if pickle_kwargs is None:
            pickle_kwargs = {}
        pickle.dump(array, fp, protocol=3, **pickle_kwargs)
    elif array.flags.f_contiguous and not array.flags.c_contiguous:
        if isfileobj(fp):
            array.T.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='F'):
                fp.write(chunk.tobytes('C'))
    else:
        if isfileobj(fp):
            array.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='C'):
                fp.write(chunk.tobytes('C'))


def read_array(fp, allow_pickle=False, pickle_kwargs=None, *,
               max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array from an NPY file.

    Parameters
    ----------
    fp : file_like object
        If this is not a real file object, then this may take extra memory
        and time.
    allow_pickle : bool, optional
        Whether to allow reading pickled data. Default: False

        .. versionchanged:: 1.16.3
            Made default False in response to CVE-2019-6446.

    pickle_kwargs : dict
        Additional keyword arguments to pass to pickle.load. These are only
        useful when loading object arrays saved on Python 2 when using
        Python 3.
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval()` for details.
        This option is ignored when `allow_pickle` is passed. In that case
        the file is by definition trusted and the limit is unnecessary.

    Returns
    -------
    array : ndarray
        The array from the data on disk.

    Raises
    ------
    ValueError
        If the data is invalid, or allow_pickle=False and the file contains
        an object array.

    """
    if allow_pickle:
        # Effectively ignore max_header_size, since `allow_pickle` indicates
        # that the input is fully trusted.
        max_header_size = 2**64

    version = read_magic(fp)
    _check_version(version)
    shape, fortran_order, dtype = _read_array_header(
        fp, version, max_header_size=max_header_size)
    if len(shape) == 0:
        count = 1
    else:
        count = numpy.multiply.reduce(shape, dtype=numpy.int64)

    # Now read the actual data.
    if dtype.hasobject:
        # The array contained Python objects. We need to unpickle the data.
        if not allow_pickle:
            raise ValueError("Object arrays cannot be loaded when "
                             "allow_pickle=False")
        if pickle_kwargs is None:
            pickle_kwargs = {}
        try:
            array = pickle.load(fp, **pickle_kwargs)
        except UnicodeError as err:
            # Friendlier error message
            raise UnicodeError("Unpickling a python object failed: %r\n"
                               "You may need to pass the encoding= option "
                               "to numpy.load" % (err,)) from err
    else:
        if isfileobj(fp):
            # We can use the fast fromfile() function.
            array = numpy.fromfile(fp, dtype=dtype, count=count)
        else:
            # This is not a real file. We have to read it the
            # memory-intensive way.
            # crc32 module fails on reads greater than 2 ** 32 bytes,
            # breaking large reads from gzip streams. Chunk reads to
            # BUFFER_SIZE bytes to avoid issue and reduce memory overhead
            # of the read. In non-chunked case count < max_read_count, so
            # only one read is performed.

            # Use np.ndarray instead of np.empty since the latter does
            # not correctly instantiate zero-width string dtypes; see
            # https://github.com/numpy/numpy/pull/6430
            array = numpy.ndarray(count, dtype=dtype)

            if dtype.itemsize > 0:
                # If dtype.itemsize == 0 then there's nothing more to read
                max_read_count = BUFFER_SIZE // min(BUFFER_SIZE, dtype.itemsize)

                for i in range(0, count, max_read_count):
                    read_count = min(max_read_count, count - i)
                    read_size = int(read_count * dtype.itemsize)
                    data = _read_bytes(fp, read_size, "array data")
                    array[i:i+read_count] = numpy.frombuffer(data, dtype=dtype,
                                                             count=read_count)

        if fortran_order:
            array.shape = shape[::-1]
            array = array.transpose()
        else:
            array.shape = shape

    return array


def open_memmap(filename, mode='r+', dtype=None, shape=None,
                fortran_order=False, version=None, *,
                max_header_size=_MAX_HEADER_SIZE):
    """
    Open a .npy file as a memory-mapped array.

    This may be used to read an existing file or create a new one.

    Parameters
    ----------
    filename : str or path-like
        The name of the file on disk. This may *not* be a file-like
        object.
    mode : str, optional
        The mode in which to open the file; the default is 'r+'. In
        addition to the standard file modes, 'c' is also accepted to mean
        "copy on write." See `memmap` for the available mode strings.
    dtype : data-type, optional
        The data type of the array if we are creating a new file in "write"
        mode, if not, `dtype` is ignored. The default value is None, which
        results in a data-type of `float64`.
    shape : tuple of int
        The shape of the array if we are creating a new file in "write"
        mode, in which case this parameter is required. Otherwise, this
        parameter is ignored and is thus optional.
    fortran_order : bool, optional
        Whether the array should be Fortran-contiguous (True) or
        C-contiguous (False, the default) if we are creating a new file in
        "write" mode.
    version : tuple of int (major, minor) or None
        If the mode is a "write" mode, then this is the version of the file
        format used to create the file. None means use the oldest
        supported version that is able to store the data. Default: None
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval()` for details.

    Returns
    -------
    marray : memmap
        The memory-mapped array.

    Raises
    ------
    ValueError
        If the data or the mode is invalid.
    OSError
        If the file is not found or cannot be opened correctly.

    See Also
    --------
    numpy.memmap

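    Examples
    --------
    A minimal sketch; the filename is illustrative and must point to a
    writable location::

        marray = open_memmap('data.npy', mode='w+', dtype='<i4', shape=(3,))
        marray[:] = [1, 2, 3]
        marray.flush()
        reread = open_memmap('data.npy', mode='r')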

    """
    if isfileobj(filename):
        raise ValueError("Filename must be a string or a path-like object."
                         " Memmap cannot use existing file handles.")

    if 'w' in mode:
        # We are creating the file, not reading it.
        # Check if we ought to create the file.
        _check_version(version)
        # Ensure that the given dtype is an authentic dtype object rather
        # than just something that can be interpreted as a dtype object.
        dtype = numpy.dtype(dtype)
        if dtype.hasobject:
            msg = "Array can't be memory-mapped: Python objects in dtype."
            raise ValueError(msg)
        d = dict(
            descr=dtype_to_descr(dtype),
            fortran_order=fortran_order,
            shape=shape,
        )
        # If we got here, then it should be safe to create the file.
        with open(os_fspath(filename), mode+'b') as fp:
            _write_array_header(fp, d, version)
            offset = fp.tell()
    else:
        # Read the header of the file first.
        with open(os_fspath(filename), 'rb') as fp:
            version = read_magic(fp)
            _check_version(version)

            shape, fortran_order, dtype = _read_array_header(
                fp, version, max_header_size=max_header_size)
            if dtype.hasobject:
                msg = "Array can't be memory-mapped: Python objects in dtype."
                raise ValueError(msg)
            offset = fp.tell()

    if fortran_order:
        order = 'F'
    else:
        order = 'C'

    # We need to change a write-only mode to a read-write mode since we've
    # already written data to the file.
    if mode == 'w+':
        mode = 'r+'

    marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
                          mode=mode, offset=offset)

    return marray


def _read_bytes(fp, size, error_template="ran out of data"):
    """
    Read from file-like object until size bytes are read.
    Raises ValueError if EOF is encountered before size bytes are read.
    Non-blocking objects only supported if they derive from io objects.

    Required as e.g. ZipExtFile in Python 2.6 can return less data than
    requested.
    """
    data = bytes()
    while True:
        # io files (default in python3) return None or raise on
        # would-block, python2 file will truncate, probably nothing can be
        # done about that. note that regular files can't be non-blocking
        try:
            r = fp.read(size - len(data))
            data += r
            if len(r) == 0 or len(data) == size:
                break
        except BlockingIOError:
            pass
    if len(data) != size:
        msg = "EOF: reading %s, expected %d bytes got %d"
        raise ValueError(msg % (error_template, size, len(data)))
    else:
        return data