Coverage for /pythoncovmergedfiles/medio/medio/usr/local/lib/python3.8/site-packages/numpy/lib/format.py: 12%


273 statements  

1""" 

2Binary serialization 

3 

4NPY format 

5========== 

6 

7A simple format for saving numpy arrays to disk with the full 

8information about them. 

9 

10The ``.npy`` format is the standard binary file format in NumPy for 

11persisting a *single* arbitrary NumPy array on disk. The format stores all 

12of the shape and dtype information necessary to reconstruct the array 

13correctly even on another machine with a different architecture. 

14The format is designed to be as simple as possible while achieving 

15its limited goals. 

16 

17The ``.npz`` format is the standard format for persisting *multiple* NumPy 

18arrays on disk. A ``.npz`` file is a zip file containing multiple ``.npy`` 

19files, one for each array. 
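
For instance, because an ``.npz`` file is an ordinary zip archive, its
contents can be listed with the standard library alone (a minimal sketch;
``data.npz`` is a hypothetical file written by ``numpy.savez``)::

    import zipfile

    with zipfile.ZipFile('data.npz') as zf:
        print(zf.namelist())  # e.g. ['a.npy', 'b.npy']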

Capabilities
------------

- Can represent all NumPy arrays including nested record arrays and
  object arrays.

- Represents the data in its native binary form.

- Supports Fortran-contiguous arrays directly.

- Stores all of the necessary information to reconstruct the array
  including shape and dtype on a machine of a different
  architecture. Both little-endian and big-endian arrays are
  supported, and a file with little-endian numbers will yield
  a little-endian array on any machine reading the file. The
  types are described in terms of their actual sizes. For example,
  if a machine with a 64-bit C "long int" writes out an array with
  "long ints", a reading machine with 32-bit C "long ints" will yield
  an array with 64-bit integers.

- Is straightforward to reverse engineer. Datasets often live longer than
  the programs that created them. A competent developer should be
  able to create a solution in their preferred programming language to
  read most ``.npy`` files that they have been given without much
  documentation.

- Allows memory-mapping of the data. See `open_memmap`.

- Can be read from a filelike stream object instead of an actual file.

- Stores object arrays, i.e. arrays containing elements that are arbitrary
  Python objects. Files with object arrays are not mmapable, but
  can be read and written to disk.

Limitations
-----------

- Arbitrary subclasses of numpy.ndarray are not completely preserved.
  Subclasses will be accepted for writing, but only the array data will
  be written out. A regular numpy.ndarray object will be created
  upon reading the file.

.. warning::

  Due to limitations in the interpretation of structured dtypes, dtypes
  with fields with empty names will have the names replaced by 'f0', 'f1',
  etc. Such arrays will not round-trip through the format entirely
  accurately. The data is intact; only the field names will differ. We are
  working on a fix for this. This fix will not require a change in the
  file format. The arrays with such structures can still be saved and
  restored, and the correct dtype may be restored by using the
  ``loadedarray.view(correct_dtype)`` method.

File extensions
---------------

We recommend using the ``.npy`` and ``.npz`` extensions for files saved
in this format. This is by no means a requirement; applications may wish
to use these file formats but use an extension specific to the
application. In the absence of an obvious alternative, however,
we suggest using ``.npy`` and ``.npz``.

Version numbering
-----------------

The version numbering of these formats is independent of NumPy version
numbering. If the format is upgraded, the code in `numpy.lib.format` will
still be able to read and write Version 1.0 files.

Format Version 1.0
------------------

The first 6 bytes are a magic string: exactly ``\\x93NUMPY``.

The next 1 byte is an unsigned byte: the major version number of the file
format, e.g. ``\\x01``.

The next 1 byte is an unsigned byte: the minor version number of the file
format, e.g. ``\\x00``. Note: the version of the file format is not tied
to the version of the numpy package.

The next 2 bytes form a little-endian unsigned short int: the length of
the header data HEADER_LEN.

The next HEADER_LEN bytes form the header data describing the array's
format. It is an ASCII string which contains a Python literal expression
of a dictionary. It is terminated by a newline (``\\n``) and padded with
spaces (``\\x20``) to make the total of
``len(magic string) + 2 + len(length) + HEADER_LEN`` be evenly divisible
by 64 for alignment purposes.

The dictionary contains three keys:

    "descr" : dtype.descr
      An object that can be passed as an argument to the `numpy.dtype`
      constructor to create the array's dtype.
    "fortran_order" : bool
      Whether the array data is Fortran-contiguous or not. Since
      Fortran-contiguous arrays are a common form of non-C-contiguity,
      we allow them to be written directly to disk for efficiency.
    "shape" : tuple of int
      The shape of the array.

For repeatability and readability, the dictionary keys are sorted in
alphabetic order. This is for convenience only. A writer SHOULD implement
this if possible. A reader MUST NOT depend on this.

Following the header comes the array data. If the dtype contains Python
objects (i.e. ``dtype.hasobject is True``), then the data is a Python
pickle of the array. Otherwise the data is the contiguous (either C-
or Fortran-, depending on ``fortran_order``) bytes of the array.
Consumers can figure out the number of bytes by multiplying the number
of elements given by the shape (noting that ``shape=()`` means there is
1 element) by ``dtype.itemsize``.
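
As a sketch of how these pieces fit together, a version 1.0 header can be
parsed with nothing but the standard library (``data.npy`` is a
hypothetical file written by `numpy.save`)::

    import ast
    import struct

    with open('data.npy', 'rb') as f:
        assert f.read(6) == b'\\x93NUMPY'  # magic string
        major, minor = f.read(2)           # e.g. (1, 0)
        header_len, = struct.unpack('<H', f.read(2))
        header = ast.literal_eval(f.read(header_len).decode('latin1'))
    # header is now a dict such as
    # {'descr': '<f8', 'fortran_order': False, 'shape': (3, 4)};
    # the raw array data follows immediately after.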

Format Version 2.0
------------------

The version 1.0 format only allowed the array header to have a total size of
65535 bytes. This can be exceeded by structured arrays with a large number of
columns. The version 2.0 format extends the header size to 4 GiB.
`numpy.save` will automatically save in 2.0 format if the data requires it,
else it will always use the more compatible 1.0 format.

The description of the fourth element of the header therefore has become:
"The next 4 bytes form a little-endian unsigned int: the length of the header
data HEADER_LEN."
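
In other words, a reader that supports both versions only needs to vary the
struct format used for the header length (a sketch; ``read_header_len`` is
an illustrative helper, not part of this module)::

    import struct

    def read_header_len(f, major):
        fmt = '<H' if major == 1 else '<I'  # 2 bytes for 1.0, 4 for 2.0+
        return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]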

Format Version 3.0
------------------

This version replaces the ASCII string (which in practice was latin1) with
a utf8-encoded string, so it supports structured types with any unicode
field names.

Notes
-----
The ``.npy`` format, including motivation for creating it and a comparison of
alternatives, is described in the
:doc:`"npy-format" NEP <neps:nep-0001-npy-format>`; however, details have
evolved with time and this document is more current.

"""

import numpy
import warnings
from numpy.lib.utils import safe_eval
from numpy.compat import (
    os_fspath, pickle
    )
from numpy.compat.py3k import _isfileobj


__all__ = []


EXPECTED_KEYS = {'descr', 'fortran_order', 'shape'}
MAGIC_PREFIX = b'\x93NUMPY'
MAGIC_LEN = len(MAGIC_PREFIX) + 2
ARRAY_ALIGN = 64  # plausible values are powers of 2 between 16 and 4096
BUFFER_SIZE = 2**18  # size of buffer for reading npz files in bytes
# allow growth within the address space of a 64 bit machine along one axis
GROWTH_AXIS_MAX_DIGITS = 21  # = len(str(8*2**64-1)) hypothetical int1 dtype

# difference between version 1.0 and 2.0 is a 4 byte (I) header length
# instead of 2 bytes (H) allowing storage of large structured arrays
_header_size_info = {
    (1, 0): ('<H', 'latin1'),
    (2, 0): ('<I', 'latin1'),
    (3, 0): ('<I', 'utf8'),
}

# Python's literal_eval is not actually safe for large inputs, since parsing
# may become slow or even cause interpreter crashes.
# This is an arbitrary, low limit which should make it safe in practice.
_MAX_HEADER_SIZE = 10000

def _check_version(version):
    if version not in [(1, 0), (2, 0), (3, 0), None]:
        msg = "we only support format versions (1,0), (2,0), and (3,0), not %s"
        raise ValueError(msg % (version,))


def magic(major, minor):
    """ Return the magic string for the given file format version.

    Parameters
    ----------
    major : int in [0, 255]
    minor : int in [0, 255]

    Returns
    -------
    magic : str

    Raises
    ------
    ValueError if the version cannot be formatted.
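
    Examples
    --------
    A doctest-style sketch for format version 1.0:

    >>> magic(1, 0)
    b'\\x93NUMPY\\x01\\x00'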

    """
    if major < 0 or major > 255:
        raise ValueError("major version must be 0 <= major < 256")
    if minor < 0 or minor > 255:
        raise ValueError("minor version must be 0 <= minor < 256")
    return MAGIC_PREFIX + bytes([major, minor])

def read_magic(fp):
    """ Read the magic string to get the version of the file format.

    Parameters
    ----------
    fp : filelike object

    Returns
    -------
    major : int
    minor : int
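
    Examples
    --------
    A minimal sketch using an in-memory stream:

    >>> import io
    >>> read_magic(io.BytesIO(b'\\x93NUMPY\\x01\\x00'))
    (1, 0)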

    """
    magic_str = _read_bytes(fp, MAGIC_LEN, "magic string")
    if magic_str[:-2] != MAGIC_PREFIX:
        msg = "the magic string is not correct; expected %r, got %r"
        raise ValueError(msg % (MAGIC_PREFIX, magic_str[:-2]))
    major, minor = magic_str[-2:]
    return major, minor

def _has_metadata(dt):
    if dt.metadata is not None:
        return True
    elif dt.names is not None:
        return any(_has_metadata(dt[k]) for k in dt.names)
    elif dt.subdtype is not None:
        return _has_metadata(dt.base)
    else:
        return False

def dtype_to_descr(dtype):
    """
    Get a serializable descriptor from the dtype.

    The .descr attribute of a dtype object cannot be round-tripped through
    the dtype() constructor. Simple types, like dtype('float32'), have
    a descr which looks like a record array with one field with '' as
    a name. The dtype() constructor interprets this as a request to give
    a default name. Instead, we construct a descriptor that can be passed
    to dtype().

    Parameters
    ----------
    dtype : dtype
        The dtype of the array that will be written to disk.

    Returns
    -------
    descr : object
        An object that can be passed to `numpy.dtype()` in order to
        replicate the input dtype.
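
    Examples
    --------
    A short sketch (the simple-type result assumes a little-endian
    platform):

    >>> import numpy as np
    >>> dtype_to_descr(np.dtype('float64'))
    '<f8'
    >>> dtype_to_descr(np.dtype([('x', '<i4'), ('y', '<f8')]))
    [('x', '<i4'), ('y', '<f8')]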

    """
    if _has_metadata(dtype):
        warnings.warn("metadata on a dtype may be saved or ignored, but will "
                      "raise if saved when read. Use another form of storage.",
                      UserWarning, stacklevel=2)
    if dtype.names is not None:
        # This is a record array. The .descr is fine. XXX: parts of the
        # record array with an empty name, like padding bytes, still get
        # fiddled with. This needs to be fixed in the C implementation of
        # dtype().
        return dtype.descr
    else:
        return dtype.str

def descr_to_dtype(descr):
    """
    Returns a dtype based off the given description.

    This is essentially the reverse of `dtype_to_descr()`. It will remove
    the valueless padding fields created by, for example, simple types like
    dtype('float32'), and then convert the description to its corresponding
    dtype.

    Parameters
    ----------
    descr : object
        The object retrieved by dtype.descr. Can be passed to
        `numpy.dtype()` in order to replicate the input dtype.

    Returns
    -------
    dtype : dtype
        The dtype constructed by the description.
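
    Examples
    --------
    A round-trip sketch:

    >>> import numpy as np
    >>> descr_to_dtype('<f8')
    dtype('float64')
    >>> dt = descr_to_dtype([('x', '<i4'), ('y', '<f8')])
    >>> dt == np.dtype([('x', '<i4'), ('y', '<f8')])
    True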

    """
    if isinstance(descr, str):
        # No padding removal needed
        return numpy.dtype(descr)
    elif isinstance(descr, tuple):
        # subtype, will always have a shape descr[1]
        dt = descr_to_dtype(descr[0])
        return numpy.dtype((dt, descr[1]))

    titles = []
    names = []
    formats = []
    offsets = []
    offset = 0
    for field in descr:
        if len(field) == 2:
            name, descr_str = field
            dt = descr_to_dtype(descr_str)
        else:
            name, descr_str, shape = field
            dt = numpy.dtype((descr_to_dtype(descr_str), shape))

        # Ignore padding bytes, which will be void bytes with '' as name
        # Once support for blank names is removed, only "if name == ''" needed
        is_pad = (name == '' and dt.type is numpy.void and dt.names is None)
        if not is_pad:
            title, name = name if isinstance(name, tuple) else (None, name)
            titles.append(title)
            names.append(name)
            formats.append(dt)
            offsets.append(offset)
        offset += dt.itemsize

    return numpy.dtype({'names': names, 'formats': formats, 'titles': titles,
                        'offsets': offsets, 'itemsize': offset})

def header_data_from_array_1_0(array):
    """ Get the dictionary of header metadata from a numpy.ndarray.

    Parameters
    ----------
    array : numpy.ndarray

    Returns
    -------
    d : dict
        This has the appropriate entries for writing its string representation
        to the header of the file.
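
    Examples
    --------
    A sketch for a small C-ordered array (the descr assumes a 64-bit
    little-endian platform, where `numpy.arange` yields ``'<i8'``):

    >>> import numpy as np
    >>> header_data_from_array_1_0(np.arange(3))
    {'shape': (3,), 'fortran_order': False, 'descr': '<i8'}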

    """
    d = {'shape': array.shape}
    if array.flags.c_contiguous:
        d['fortran_order'] = False
    elif array.flags.f_contiguous:
        d['fortran_order'] = True
    else:
        # Totally non-contiguous data. We will have to make it C-contiguous
        # before writing. Note that we need to test for C_CONTIGUOUS first
        # because a 1-D array is both C_CONTIGUOUS and F_CONTIGUOUS.
        d['fortran_order'] = False

    d['descr'] = dtype_to_descr(array.dtype)
    return d


def _wrap_header(header, version):
    """
    Takes a stringified header, and attaches the prefix and padding to it
    """
    import struct
    assert version is not None
    fmt, encoding = _header_size_info[version]
    header = header.encode(encoding)
    hlen = len(header) + 1
    padlen = ARRAY_ALIGN - ((MAGIC_LEN + struct.calcsize(fmt) + hlen) % ARRAY_ALIGN)
    try:
        header_prefix = magic(*version) + struct.pack(fmt, hlen + padlen)
    except struct.error:
        msg = "Header length {} too big for version={}".format(hlen, version)
        raise ValueError(msg) from None

    # Pad the header with spaces and a final newline such that the magic
    # string, the header-length short and the header are aligned on an
    # ARRAY_ALIGN byte boundary. This supports memory mapping of dtypes
    # aligned up to ARRAY_ALIGN on systems like Linux where mmap()
    # offset must be page-aligned (i.e. the beginning of the file).
    return header_prefix + header + b' '*padlen + b'\n'


def _wrap_header_guess_version(header):
    """
    Like `_wrap_header`, but chooses an appropriate version given the contents
    """
    try:
        return _wrap_header(header, (1, 0))
    except ValueError:
        pass

    try:
        ret = _wrap_header(header, (2, 0))
    except UnicodeEncodeError:
        pass
    else:
        warnings.warn("Stored array in format 2.0. It can only be "
                      "read by NumPy >= 1.9", UserWarning, stacklevel=2)
        return ret

    header = _wrap_header(header, (3, 0))
    warnings.warn("Stored array in format 3.0. It can only be "
                  "read by NumPy >= 1.17", UserWarning, stacklevel=2)
    return header


def _write_array_header(fp, d, version=None):
    """ Write the header for an array to a filelike object.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string representation
        to the header of the file.
    version : tuple or None
        None means use oldest that works. Providing an explicit version will
        raise a ValueError if the format does not allow saving this data.
        Default: None
    """
    header = ["{"]
    for key, value in sorted(d.items()):
        # Need to use repr here, since we eval these when reading
        header.append("'%s': %s, " % (key, repr(value)))
    header.append("}")
    header = "".join(header)

    # Add some spare space so that the array header can be modified in-place
    # when changing the array size, e.g. when growing it by appending data at
    # the end.
    shape = d['shape']
    header += " " * ((GROWTH_AXIS_MAX_DIGITS - len(repr(
        shape[-1 if d['fortran_order'] else 0]
    ))) if len(shape) > 0 else 0)

    if version is None:
        header = _wrap_header_guess_version(header)
    else:
        header = _wrap_header(header, version)
    fp.write(header)

def write_array_header_1_0(fp, d):
    """ Write the header for an array using the 1.0 format.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (1, 0))


def write_array_header_2_0(fp, d):
    """ Write the header for an array using the 2.0 format.
    The 2.0 format allows storing very large structured arrays.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (2, 0))

def read_array_header_1_0(fp, max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array header from a filelike object using the 1.0 file format
    version.

    This will leave the file object located just after the header.

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval` for details.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        The array data will be written out directly if it is either
        C-contiguous or Fortran-contiguous. Otherwise, it will be made
        contiguous before writing it out.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    return _read_array_header(
        fp, version=(1, 0), max_header_size=max_header_size)

def read_array_header_2_0(fp, max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array header from a filelike object using the 2.0 file format
    version.

    This will leave the file object located just after the header.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval` for details.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        The array data will be written out directly if it is either
        C-contiguous or Fortran-contiguous. Otherwise, it will be made
        contiguous before writing it out.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    return _read_array_header(
        fp, version=(2, 0), max_header_size=max_header_size)


def _filter_header(s):
    """Clean up 'L' in npz header ints.

    Cleans up the 'L' in strings representing integers. Needed to allow npz
    headers produced in Python2 to be read in Python3.

    Parameters
    ----------
    s : string
        Npy file header.

    Returns
    -------
    header : str
        Cleaned up header.

    """
    import tokenize
    from io import StringIO

    tokens = []
    last_token_was_number = False
    for token in tokenize.generate_tokens(StringIO(s).readline):
        token_type = token[0]
        token_string = token[1]
        if (last_token_was_number and
                token_type == tokenize.NAME and
                token_string == "L"):
            continue
        else:
            tokens.append(token)
        last_token_was_number = (token_type == tokenize.NUMBER)
    return tokenize.untokenize(tokens)


def _read_array_header(fp, version, max_header_size=_MAX_HEADER_SIZE):
    """
    see read_array_header_1_0
    """
    # Read an unsigned, little-endian short int which has the length of the
    # header.
    import struct
    hinfo = _header_size_info.get(version)
    if hinfo is None:
        raise ValueError("Invalid version {!r}".format(version))
    hlength_type, encoding = hinfo

    hlength_str = _read_bytes(fp, struct.calcsize(hlength_type), "array header length")
    header_length = struct.unpack(hlength_type, hlength_str)[0]
    header = _read_bytes(fp, header_length, "array header")
    header = header.decode(encoding)
    if len(header) > max_header_size:
        raise ValueError(
            f"Header info length ({len(header)}) is large and may not be safe "
            "to load securely.\n"
            "To allow loading, adjust `max_header_size` or fully trust "
            "the `.npy` file using `allow_pickle=True`.\n"
            "For safety against large resource use or crashes, sandboxing "
            "may be necessary.")

    # The header is a pretty-printed string representation of a literal
    # Python dictionary with trailing newlines padded to an ARRAY_ALIGN byte
    # boundary. The keys are strings.
    #   "shape" : tuple of int
    #   "fortran_order" : bool
    #   "descr" : dtype.descr
    # Versions (2, 0) and (1, 0) could have been created by a Python 2
    # implementation before header filtering was implemented.
    if version <= (2, 0):
        header = _filter_header(header)
    try:
        d = safe_eval(header)
    except SyntaxError as e:
        msg = "Cannot parse header: {!r}"
        raise ValueError(msg.format(header)) from e
    if not isinstance(d, dict):
        msg = "Header is not a dictionary: {!r}"
        raise ValueError(msg.format(d))

    if EXPECTED_KEYS != d.keys():
        keys = sorted(d.keys())
        msg = "Header does not contain the correct keys: {!r}"
        raise ValueError(msg.format(keys))

    # Sanity-check the values.
    if (not isinstance(d['shape'], tuple) or
            not all(isinstance(x, int) for x in d['shape'])):
        msg = "shape is not valid: {!r}"
        raise ValueError(msg.format(d['shape']))
    if not isinstance(d['fortran_order'], bool):
        msg = "fortran_order is not a valid bool: {!r}"
        raise ValueError(msg.format(d['fortran_order']))
    try:
        dtype = descr_to_dtype(d['descr'])
    except TypeError as e:
        msg = "descr is not a valid dtype descriptor: {!r}"
        raise ValueError(msg.format(d['descr'])) from e

    return d['shape'], d['fortran_order'], dtype

def write_array(fp, array, version=None, allow_pickle=True, pickle_kwargs=None):
    """
    Write an array to an NPY file, including a header.

    If the array is neither C-contiguous nor Fortran-contiguous AND the
    file_like object is not a real file object, this function will have to
    copy data in memory.

    Parameters
    ----------
    fp : file_like object
        An open, writable file object, or similar object with a
        ``.write()`` method.
    array : ndarray
        The array to write to disk.
    version : (int, int) or None, optional
        The version number of the format. None means use the oldest
        supported version that is able to store the data. Default: None
    allow_pickle : bool, optional
        Whether to allow writing pickled data. Default: True
    pickle_kwargs : dict, optional
        Additional keyword arguments to pass to pickle.dump, excluding
        'protocol'. These are only useful when pickling objects in object
        arrays on Python 3 to Python 2 compatible format.

    Raises
    ------
    ValueError
        If the array cannot be persisted. This includes the case of
        allow_pickle=False and array being an object array.
    Various other errors
        If the array contains Python objects as part of its dtype, the
        process of pickling them may raise various errors if the objects
        are not picklable.
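
    Examples
    --------
    A round-trip sketch through an in-memory buffer:

    >>> import io
    >>> import numpy as np
    >>> buf = io.BytesIO()
    >>> write_array(buf, np.arange(4))
    >>> _ = buf.seek(0)
    >>> read_array(buf)
    array([0, 1, 2, 3])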

    """
    _check_version(version)
    _write_array_header(fp, header_data_from_array_1_0(array), version)

    if array.itemsize == 0:
        buffersize = 0
    else:
        # Set buffer size to 16 MiB to hide the Python loop overhead.
        buffersize = max(16 * 1024 ** 2 // array.itemsize, 1)

    if array.dtype.hasobject:
        # We contain Python objects so we cannot write out the data
        # directly. Instead, we will pickle it out
        if not allow_pickle:
            raise ValueError("Object arrays cannot be saved when "
                             "allow_pickle=False")
        if pickle_kwargs is None:
            pickle_kwargs = {}
        pickle.dump(array, fp, protocol=3, **pickle_kwargs)
    elif array.flags.f_contiguous and not array.flags.c_contiguous:
        if _isfileobj(fp):
            array.T.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='F'):
                fp.write(chunk.tobytes('C'))
    else:
        if _isfileobj(fp):
            array.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='C'):
                fp.write(chunk.tobytes('C'))


def read_array(fp, allow_pickle=False, pickle_kwargs=None, *,
               max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array from an NPY file.

    Parameters
    ----------
    fp : file_like object
        If this is not a real file object, then this may take extra memory
        and time.
    allow_pickle : bool, optional
        Whether to allow reading pickled data. Default: False

        .. versionchanged:: 1.16.3
            Made default False in response to CVE-2019-6446.

    pickle_kwargs : dict
        Additional keyword arguments to pass to pickle.load. These are only
        useful when loading object arrays saved on Python 2 when using
        Python 3.
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval` for details.
        This option is ignored when `allow_pickle` is passed. In that case
        the file is by definition trusted and the limit is unnecessary.

    Returns
    -------
    array : ndarray
        The array from the data on disk.

    Raises
    ------
    ValueError
        If the data is invalid, or allow_pickle=False and the file contains
        an object array.

    """
    if allow_pickle:
        # Effectively ignore max_header_size, since `allow_pickle` indicates
        # that the input is fully trusted.
        max_header_size = 2**64

    version = read_magic(fp)
    _check_version(version)
    shape, fortran_order, dtype = _read_array_header(
        fp, version, max_header_size=max_header_size)
    if len(shape) == 0:
        count = 1
    else:
        count = numpy.multiply.reduce(shape, dtype=numpy.int64)

    # Now read the actual data.
    if dtype.hasobject:
        # The array contained Python objects. We need to unpickle the data.
        if not allow_pickle:
            raise ValueError("Object arrays cannot be loaded when "
                             "allow_pickle=False")
        if pickle_kwargs is None:
            pickle_kwargs = {}
        try:
            array = pickle.load(fp, **pickle_kwargs)
        except UnicodeError as err:
            # Friendlier error message
            raise UnicodeError("Unpickling a python object failed: %r\n"
                               "You may need to pass the encoding= option "
                               "to numpy.load" % (err,)) from err
    else:
        if _isfileobj(fp):
            # We can use the fast fromfile() function.
            array = numpy.fromfile(fp, dtype=dtype, count=count)
        else:
            # This is not a real file. We have to read it the
            # memory-intensive way.
            # crc32 module fails on reads greater than 2 ** 32 bytes,
            # breaking large reads from gzip streams. Chunk reads to
            # BUFFER_SIZE bytes to avoid issue and reduce memory overhead
            # of the read. In non-chunked case count < max_read_count, so
            # only one read is performed.

            # Use np.ndarray instead of np.empty since the latter does
            # not correctly instantiate zero-width string dtypes; see
            # https://github.com/numpy/numpy/pull/6430
            array = numpy.ndarray(count, dtype=dtype)

            if dtype.itemsize > 0:
                # If dtype.itemsize == 0 then there's nothing more to read
                max_read_count = BUFFER_SIZE // min(BUFFER_SIZE, dtype.itemsize)

                for i in range(0, count, max_read_count):
                    read_count = min(max_read_count, count - i)
                    read_size = int(read_count * dtype.itemsize)
                    data = _read_bytes(fp, read_size, "array data")
                    array[i:i+read_count] = numpy.frombuffer(data, dtype=dtype,
                                                             count=read_count)

    if fortran_order:
        # An array stored in Fortran order contains the same bytes as a
        # C-ordered array of the reversed shape, so assign the reversed
        # shape and transpose; the transpose is a view, not a copy.
        array.shape = shape[::-1]
        array = array.transpose()
    else:
        array.shape = shape

    return array


def open_memmap(filename, mode='r+', dtype=None, shape=None,
                fortran_order=False, version=None, *,
                max_header_size=_MAX_HEADER_SIZE):
    """
    Open a .npy file as a memory-mapped array.

    This may be used to read an existing file or create a new one.

    Parameters
    ----------
    filename : str or path-like
        The name of the file on disk. This may *not* be a file-like
        object.
    mode : str, optional
        The mode in which to open the file; the default is 'r+'. In
        addition to the standard file modes, 'c' is also accepted to mean
        "copy on write." See `memmap` for the available mode strings.
    dtype : data-type, optional
        The data type of the array if we are creating a new file in "write"
        mode; if not, `dtype` is ignored. The default value is None, which
        results in a data-type of `float64`.
    shape : tuple of int
        The shape of the array if we are creating a new file in "write"
        mode, in which case this parameter is required. Otherwise, this
        parameter is ignored and is thus optional.
    fortran_order : bool, optional
        Whether the array should be Fortran-contiguous (True) or
        C-contiguous (False, the default) if we are creating a new file in
        "write" mode.
    version : tuple of int (major, minor) or None
        If the mode is a "write" mode, then this is the version of the file
        format used to create the file. None means use the oldest
        supported version that is able to store the data. Default: None
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval` for details.

    Returns
    -------
    marray : memmap
        The memory-mapped array.

    Raises
    ------
    ValueError
        If the data or the mode is invalid.
    OSError
        If the file is not found or cannot be opened correctly.

    See Also
    --------
    numpy.memmap
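
    Examples
    --------
    A create-then-reopen sketch using a temporary file:

    >>> import os, tempfile
    >>> import numpy as np
    >>> path = os.path.join(tempfile.mkdtemp(), 'data.npy')
    >>> m = open_memmap(path, mode='w+', dtype=np.float64, shape=(3,))
    >>> m[:] = [1, 2, 3]
    >>> m.flush()
    >>> np.load(path)
    array([1., 2., 3.])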

    """
    if _isfileobj(filename):
        raise ValueError("Filename must be a string or a path-like object."
                         " Memmap cannot use existing file handles.")

    if 'w' in mode:
        # We are creating the file, not reading it.
        # Check if we ought to create the file.
        _check_version(version)
        # Ensure that the given dtype is an authentic dtype object rather
        # than just something that can be interpreted as a dtype object.
        dtype = numpy.dtype(dtype)
        if dtype.hasobject:
            msg = "Array can't be memory-mapped: Python objects in dtype."
            raise ValueError(msg)
        d = dict(
            descr=dtype_to_descr(dtype),
            fortran_order=fortran_order,
            shape=shape,
        )
        # If we got here, then it should be safe to create the file.
        with open(os_fspath(filename), mode+'b') as fp:
            _write_array_header(fp, d, version)
            offset = fp.tell()
    else:
        # Read the header of the file first.
        with open(os_fspath(filename), 'rb') as fp:
            version = read_magic(fp)
            _check_version(version)

            shape, fortran_order, dtype = _read_array_header(
                fp, version, max_header_size=max_header_size)
            if dtype.hasobject:
                msg = "Array can't be memory-mapped: Python objects in dtype."
                raise ValueError(msg)
            offset = fp.tell()

    if fortran_order:
        order = 'F'
    else:
        order = 'C'

    # We need to change a write-only mode to a read-write mode since we've
    # already written data to the file.
    if mode == 'w+':
        mode = 'r+'

    marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
                          mode=mode, offset=offset)

    return marray


def _read_bytes(fp, size, error_template="ran out of data"):
    """
    Read from file-like object until size bytes are read.
    Raises ValueError if EOF is encountered before size bytes are read.
    Non-blocking objects only supported if they derive from io objects.

    Required as e.g. ZipExtFile in python 2.6 can return less data than
    requested.
    """
    data = bytes()
    while True:
        # io files (default in python3) return None or raise on
        # would-block, python2 file will truncate, probably nothing can be
        # done about that. note that regular files can't be non-blocking
        try:
            r = fp.read(size - len(data))
            data += r
            if len(r) == 0 or len(data) == size:
                break
        except BlockingIOError:
            pass
    if len(data) != size:
        msg = "EOF: reading %s, expected %d bytes got %d"
        raise ValueError(msg % (error_template, size, len(data)))
    else:
        return data