1""" 

2Binary serialization 

3 

4NPY format 

5========== 

6 

7A simple format for saving numpy arrays to disk with the full 

8information about them. 

9 

10The ``.npy`` format is the standard binary file format in NumPy for 

11persisting a *single* arbitrary NumPy array on disk. The format stores all 

12of the shape and dtype information necessary to reconstruct the array 

13correctly even on another machine with a different architecture. 

14The format is designed to be as simple as possible while achieving 

15its limited goals. 

16 

17The ``.npz`` format is the standard format for persisting *multiple* NumPy 

18arrays on disk. A ``.npz`` file is a zip file containing multiple ``.npy`` 

19files, one for each array. 

20 

21Capabilities 

22------------ 

23 

24- Can represent all NumPy arrays including nested record arrays and 

25 object arrays. 

26 

27- Represents the data in its native binary form. 

28 

29- Supports Fortran-contiguous arrays directly. 

30 

31- Stores all of the necessary information to reconstruct the array 

32 including shape and dtype on a machine of a different 

33 architecture. Both little-endian and big-endian arrays are 

34 supported, and a file with little-endian numbers will yield 

35 a little-endian array on any machine reading the file. The 

36 types are described in terms of their actual sizes. For example, 

37 if a machine with a 64-bit C "long int" writes out an array with 

38 "long ints", a reading machine with 32-bit C "long ints" will yield 

39 an array with 64-bit integers. 

40 

41- Is straightforward to reverse engineer. Datasets often live longer than 

42 the programs that created them. A competent developer should be 

43 able to create a solution in their preferred programming language to 

44 read most ``.npy`` files that they have been given without much 

45 documentation. 

46 

47- Allows memory-mapping of the data. See `open_memmap`. 

48 

49- Can be read from a filelike stream object instead of an actual file. 

50 

51- Stores object arrays, i.e. arrays containing elements that are arbitrary 

52 Python objects. Files with object arrays are not to be mmapable, but 

53 can be read and written to disk. 

54 

55Limitations 

56----------- 

57 

58- Arbitrary subclasses of numpy.ndarray are not completely preserved. 

59 Subclasses will be accepted for writing, but only the array data will 

60 be written out. A regular numpy.ndarray object will be created 

61 upon reading the file. 

62 

63.. warning:: 

64 

65 Due to limitations in the interpretation of structured dtypes, dtypes 

66 with fields with empty names will have the names replaced by 'f0', 'f1', 

67 etc. Such arrays will not round-trip through the format entirely 

68 accurately. The data is intact; only the field names will differ. We are 

69 working on a fix for this. This fix will not require a change in the 

70 file format. The arrays with such structures can still be saved and 

71 restored, and the correct dtype may be restored by using the 

72 ``loadedarray.view(correct_dtype)`` method. 

73 

74File extensions 

75--------------- 

76 

77We recommend using the ``.npy`` and ``.npz`` extensions for files saved 

78in this format. This is by no means a requirement; applications may wish 

79to use these file formats but use an extension specific to the 

80application. In the absence of an obvious alternative, however, 

81we suggest using ``.npy`` and ``.npz``. 

82 

83Version numbering 

84----------------- 

85 

86The version numbering of these formats is independent of NumPy version 

87numbering. If the format is upgraded, the code in `numpy.io` will still 

88be able to read and write Version 1.0 files. 

89 

90Format Version 1.0 

91------------------ 

92 

93The first 6 bytes are a magic string: exactly ``\\x93NUMPY``. 

94 

95The next 1 byte is an unsigned byte: the major version number of the file 

96format, e.g. ``\\x01``. 

97 

98The next 1 byte is an unsigned byte: the minor version number of the file 

99format, e.g. ``\\x00``. Note: the version of the file format is not tied 

100to the version of the numpy package. 

101 

102The next 2 bytes form a little-endian unsigned short int: the length of 

103the header data HEADER_LEN. 

104 

105The next HEADER_LEN bytes form the header data describing the array's 

106format. It is an ASCII string which contains a Python literal expression 

107of a dictionary. It is terminated by a newline (``\\n``) and padded with 

108spaces (``\\x20``) to make the total of 

109``len(magic string) + 2 + len(length) + HEADER_LEN`` be evenly divisible 

110by 64 for alignment purposes. 

111 

112The dictionary contains three keys: 

113 

114 "descr" : dtype.descr 

115 An object that can be passed as an argument to the `numpy.dtype` 

116 constructor to create the array's dtype. 

117 "fortran_order" : bool 

118 Whether the array data is Fortran-contiguous or not. Since 

119 Fortran-contiguous arrays are a common form of non-C-contiguity, 

120 we allow them to be written directly to disk for efficiency. 

121 "shape" : tuple of int 

122 The shape of the array. 

123 

124For repeatability and readability, the dictionary keys are sorted in 

125alphabetic order. This is for convenience only. A writer SHOULD implement 

126this if possible. A reader MUST NOT depend on this. 

127 

128Following the header comes the array data. If the dtype contains Python 

129objects (i.e. ``dtype.hasobject is True``), then the data is a Python 

130pickle of the array. Otherwise the data is the contiguous (either C- 

131or Fortran-, depending on ``fortran_order``) bytes of the array. 

132Consumers can figure out the number of bytes by multiplying the number 

133of elements given by the shape (noting that ``shape=()`` means there is 

1341 element) by ``dtype.itemsize``. 

135 
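
As an illustration (a minimal sketch; the exact header contents and
padding can vary between NumPy versions and platforms), the prefix of a
file written by `numpy.save` can be picked apart by hand::

    >>> import io
    >>> import struct
    >>> import numpy as np
    >>> fp = io.BytesIO()
    >>> np.save(fp, np.arange(3))
    >>> _ = fp.seek(0)
    >>> fp.read(6)                       # the magic string
    b'\\x93NUMPY'
    >>> major, minor = fp.read(2)        # file format version
    >>> (major, minor)
    (1, 0)
    >>> header_len = struct.unpack('<H', fp.read(2))[0]
    >>> (6 + 2 + 2 + header_len) % 64    # header block is 64-byte aligned
    0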

Format Version 2.0
------------------

The version 1.0 format only allowed the array header to have a total size
of 65535 bytes. This can be exceeded by structured arrays with a large
number of columns. The version 2.0 format extends the header size to
4 GiB. `numpy.save` will automatically save in 2.0 format if the data
requires it, else it will always use the more compatible 1.0 format.

The description of the fourth element of the header therefore has become:
"The next 4 bytes form a little-endian unsigned int: the length of the
header data HEADER_LEN."

Format Version 3.0
------------------

This version replaces the ASCII string (which in practice was latin1) with
a utf8-encoded string, so it supports structured types with any unicode
field names.

Notes
-----
The ``.npy`` format, including the motivation for creating it and a
comparison with alternatives, is described in the
:doc:`"npy-format" NEP <neps:nep-0001-npy-format>`; however, details have
evolved with time and this document is more current.

"""

import io
import os
import pickle
import warnings

import numpy
from numpy.lib._utils_impl import drop_metadata


__all__ = []

drop_metadata.__module__ = "numpy.lib.format"

EXPECTED_KEYS = {'descr', 'fortran_order', 'shape'}
MAGIC_PREFIX = b'\x93NUMPY'
MAGIC_LEN = len(MAGIC_PREFIX) + 2
ARRAY_ALIGN = 64  # plausible values are powers of 2 between 16 and 4096
BUFFER_SIZE = 2**18  # size of buffer for reading npz files in bytes
# allow growth within the address space of a 64 bit machine along one axis
GROWTH_AXIS_MAX_DIGITS = 21  # = len(str(8*2**64-1)) hypothetical int1 dtype

# difference between version 1.0 and 2.0 is a 4 byte (I) header length
# instead of 2 bytes (H) allowing storage of large structured arrays
_header_size_info = {
    (1, 0): ('<H', 'latin1'),
    (2, 0): ('<I', 'latin1'),
    (3, 0): ('<I', 'utf8'),
}

# Python's literal_eval is not actually safe for large inputs, since parsing
# may become slow or even cause interpreter crashes.
# This is an arbitrary, low limit which should make it safe in practice.
_MAX_HEADER_SIZE = 10000


def _check_version(version):
    if version not in [(1, 0), (2, 0), (3, 0), None]:
        msg = "we only support format version (1,0), (2,0), and (3,0), not %s"
        raise ValueError(msg % (version,))


def magic(major, minor):
    """ Return the magic string for the given file format version.

    Parameters
    ----------
    major : int in [0, 255]
    minor : int in [0, 255]

    Returns
    -------
    magic : bytes

    Raises
    ------
    ValueError if the version cannot be formatted.
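
    Examples
    --------
    A quick, illustrative check of the returned bytes:

    >>> magic(1, 0)
    b'\\x93NUMPY\\x01\\x00'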

218 """ 

219 if major < 0 or major > 255: 

220 raise ValueError("major version must be 0 <= major < 256") 

221 if minor < 0 or minor > 255: 

222 raise ValueError("minor version must be 0 <= minor < 256") 

223 return MAGIC_PREFIX + bytes([major, minor]) 

224 

225def read_magic(fp): 

226 """ Read the magic string to get the version of the file format. 

227 

228 Parameters 

229 ---------- 

230 fp : filelike object 

231 

232 Returns 

233 ------- 

234 major : int 

235 minor : int 
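
    Examples
    --------
    A minimal sketch using an in-memory file (the version reported depends
    on what `numpy.save` chose to write):

    >>> import io
    >>> import numpy as np
    >>> fp = io.BytesIO()
    >>> np.save(fp, np.zeros(3))
    >>> _ = fp.seek(0)
    >>> read_magic(fp)
    (1, 0)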

236 """ 

237 magic_str = _read_bytes(fp, MAGIC_LEN, "magic string") 

238 if magic_str[:-2] != MAGIC_PREFIX: 

239 msg = "the magic string is not correct; expected %r, got %r" 

240 raise ValueError(msg % (MAGIC_PREFIX, magic_str[:-2])) 

241 major, minor = magic_str[-2:] 

242 return major, minor 

243 

244 

245def dtype_to_descr(dtype): 

246 """ 

247 Get a serializable descriptor from the dtype. 

248 

249 The .descr attribute of a dtype object cannot be round-tripped through 

250 the dtype() constructor. Simple types, like dtype('float32'), have 

251 a descr which looks like a record array with one field with '' as 

252 a name. The dtype() constructor interprets this as a request to give 

253 a default name. Instead, we construct descriptor that can be passed to 

254 dtype(). 

255 

256 Parameters 

257 ---------- 

258 dtype : dtype 

259 The dtype of the array that will be written to disk. 

260 

261 Returns 

262 ------- 

263 descr : object 

264 An object that can be passed to `numpy.dtype()` in order to 

265 replicate the input dtype. 

266 
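
    Examples
    --------
    Illustrative outputs (the byte-order character is platform-dependent;
    ``'<'`` is what a little-endian machine produces):

    >>> dtype_to_descr(numpy.dtype('float32'))
    '<f4'
    >>> dtype_to_descr(numpy.dtype([('x', '<i4'), ('y', '<f8')]))
    [('x', '<i4'), ('y', '<f8')]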

267 """ 

268 # NOTE: that drop_metadata may not return the right dtype e.g. for user 

269 # dtypes. In that case our code below would fail the same, though. 

270 new_dtype = drop_metadata(dtype) 

271 if new_dtype is not dtype: 

272 warnings.warn("metadata on a dtype is not saved to an npy/npz. " 

273 "Use another format (such as pickle) to store it.", 

274 UserWarning, stacklevel=2) 

275 dtype = new_dtype 

276 

277 if dtype.names is not None: 

278 # This is a record array. The .descr is fine. XXX: parts of the 

279 # record array with an empty name, like padding bytes, still get 

280 # fiddled with. This needs to be fixed in the C implementation of 

281 # dtype(). 

282 return dtype.descr 

283 elif not type(dtype)._legacy: 

284 # this must be a user-defined dtype since numpy does not yet expose any 

285 # non-legacy dtypes in the public API 

286 # 

287 # non-legacy dtypes don't yet have __array_interface__ 

288 # support. Instead, as a hack, we use pickle to save the array, and lie 

289 # that the dtype is object. When the array is loaded, the descriptor is 

290 # unpickled with the array and the object dtype in the header is 

291 # discarded. 

292 # 

293 # a future NEP should define a way to serialize user-defined 

294 # descriptors and ideally work out the possible security implications 

295 warnings.warn("Custom dtypes are saved as python objects using the " 

296 "pickle protocol. Loading this file requires " 

297 "allow_pickle=True to be set.", 

298 UserWarning, stacklevel=2) 

299 return "|O" 

300 else: 

301 return dtype.str 

302 

303def descr_to_dtype(descr): 

304 """ 

305 Returns a dtype based off the given description. 

306 

307 This is essentially the reverse of `~lib.format.dtype_to_descr`. It will 

308 remove the valueless padding fields created by, i.e. simple fields like 

309 dtype('float32'), and then convert the description to its corresponding 

310 dtype. 

311 

312 Parameters 

313 ---------- 

314 descr : object 

315 The object retrieved by dtype.descr. Can be passed to 

316 `numpy.dtype` in order to replicate the input dtype. 

317 

318 Returns 

319 ------- 

320 dtype : dtype 

321 The dtype constructed by the description. 

322 
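
    Examples
    --------
    A simple illustrative round-trip:

    >>> descr_to_dtype('<f8')
    dtype('float64')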

323 """ 

324 if isinstance(descr, str): 

325 # No padding removal needed 

326 return numpy.dtype(descr) 

327 elif isinstance(descr, tuple): 

328 # subtype, will always have a shape descr[1] 

329 dt = descr_to_dtype(descr[0]) 

330 return numpy.dtype((dt, descr[1])) 

331 

332 titles = [] 

333 names = [] 

334 formats = [] 

335 offsets = [] 

336 offset = 0 

337 for field in descr: 

338 if len(field) == 2: 

339 name, descr_str = field 

340 dt = descr_to_dtype(descr_str) 

341 else: 

342 name, descr_str, shape = field 

343 dt = numpy.dtype((descr_to_dtype(descr_str), shape)) 

344 

345 # Ignore padding bytes, which will be void bytes with '' as name 

346 # Once support for blank names is removed, only "if name == ''" needed) 

347 is_pad = (name == '' and dt.type is numpy.void and dt.names is None) 

348 if not is_pad: 

349 title, name = name if isinstance(name, tuple) else (None, name) 

350 titles.append(title) 

351 names.append(name) 

352 formats.append(dt) 

353 offsets.append(offset) 

354 offset += dt.itemsize 

355 

356 return numpy.dtype({'names': names, 'formats': formats, 'titles': titles, 

357 'offsets': offsets, 'itemsize': offset}) 

358 

359def header_data_from_array_1_0(array): 

360 """ Get the dictionary of header metadata from a numpy.ndarray. 

361 

362 Parameters 

363 ---------- 

364 array : numpy.ndarray 

365 

366 Returns 

367 ------- 

368 d : dict 

369 This has the appropriate entries for writing its string representation 

370 to the header of the file. 
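
    Examples
    --------
    Illustrative output (``'<i8'`` assumes a little-endian platform whose
    default integer is 64-bit):

    >>> header_data_from_array_1_0(numpy.arange(6).reshape(2, 3))
    {'shape': (2, 3), 'fortran_order': False, 'descr': '<i8'}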

371 """ 

372 d = {'shape': array.shape} 

373 if array.flags.c_contiguous: 

374 d['fortran_order'] = False 

375 elif array.flags.f_contiguous: 

376 d['fortran_order'] = True 

377 else: 

378 # Totally non-contiguous data. We will have to make it C-contiguous 

379 # before writing. Note that we need to test for C_CONTIGUOUS first 

380 # because a 1-D array is both C_CONTIGUOUS and F_CONTIGUOUS. 

381 d['fortran_order'] = False 

382 

383 d['descr'] = dtype_to_descr(array.dtype) 

384 return d 

385 

386 

387def _wrap_header(header, version): 

388 """ 

389 Takes a stringified header, and attaches the prefix and padding to it 

390 """ 

391 import struct 

392 assert version is not None 

393 fmt, encoding = _header_size_info[version] 

394 header = header.encode(encoding) 

395 hlen = len(header) + 1 

396 padlen = ARRAY_ALIGN - ((MAGIC_LEN + struct.calcsize(fmt) + hlen) % ARRAY_ALIGN) 

397 try: 

398 header_prefix = magic(*version) + struct.pack(fmt, hlen + padlen) 

399 except struct.error: 

400 msg = "Header length {} too big for version={}".format(hlen, version) 

401 raise ValueError(msg) from None 

402 

403 # Pad the header with spaces and a final newline such that the magic 

404 # string, the header-length short and the header are aligned on a 

405 # ARRAY_ALIGN byte boundary. This supports memory mapping of dtypes 

406 # aligned up to ARRAY_ALIGN on systems like Linux where mmap() 

407 # offset must be page-aligned (i.e. the beginning of the file). 

408 return header_prefix + header + b' '*padlen + b'\n' 

409 

410 

411def _wrap_header_guess_version(header): 

412 """ 

413 Like `_wrap_header`, but chooses an appropriate version given the contents 

414 """ 

415 try: 

416 return _wrap_header(header, (1, 0)) 

417 except ValueError: 

418 pass 

419 

420 try: 

421 ret = _wrap_header(header, (2, 0)) 

422 except UnicodeEncodeError: 

423 pass 

424 else: 

425 warnings.warn("Stored array in format 2.0. It can only be" 

426 "read by NumPy >= 1.9", UserWarning, stacklevel=2) 

427 return ret 

428 

429 header = _wrap_header(header, (3, 0)) 

430 warnings.warn("Stored array in format 3.0. It can only be " 

431 "read by NumPy >= 1.17", UserWarning, stacklevel=2) 

432 return header 

433 

434 

435def _write_array_header(fp, d, version=None): 

436 """ Write the header for an array and returns the version used 

437 

438 Parameters 

439 ---------- 

440 fp : filelike object 

441 d : dict 

442 This has the appropriate entries for writing its string representation 

443 to the header of the file. 

444 version : tuple or None 

445 None means use oldest that works. Providing an explicit version will 

446 raise a ValueError if the format does not allow saving this data. 

447 Default: None 

448 """ 

449 header = ["{"] 

450 for key, value in sorted(d.items()): 

451 # Need to use repr here, since we eval these when reading 

452 header.append("'%s': %s, " % (key, repr(value))) 

453 header.append("}") 

454 header = "".join(header) 

455 

456 # Add some spare space so that the array header can be modified in-place 

457 # when changing the array size, e.g. when growing it by appending data at 

458 # the end. 

459 shape = d['shape'] 

460 header += " " * ((GROWTH_AXIS_MAX_DIGITS - len(repr( 

461 shape[-1 if d['fortran_order'] else 0] 

462 ))) if len(shape) > 0 else 0) 

463 

464 if version is None: 

465 header = _wrap_header_guess_version(header) 

466 else: 

467 header = _wrap_header(header, version) 

468 fp.write(header) 

469 

470def write_array_header_1_0(fp, d): 

471 """ Write the header for an array using the 1.0 format. 

472 

473 Parameters 

474 ---------- 

475 fp : filelike object 

476 d : dict 

477 This has the appropriate entries for writing its string 

478 representation to the header of the file. 

479 """ 

480 _write_array_header(fp, d, (1, 0)) 

481 

482 

483def write_array_header_2_0(fp, d): 

484 """ Write the header for an array using the 2.0 format. 

485 The 2.0 format allows storing very large structured arrays. 

486 

487 Parameters 

488 ---------- 

489 fp : filelike object 

490 d : dict 

491 This has the appropriate entries for writing its string 

492 representation to the header of the file. 

493 """ 

494 _write_array_header(fp, d, (2, 0)) 

495 

496def read_array_header_1_0(fp, max_header_size=_MAX_HEADER_SIZE): 

497 """ 

498 Read an array header from a filelike object using the 1.0 file format 

499 version. 

500 

501 This will leave the file object located just after the header. 

502 

503 Parameters 

504 ---------- 

505 fp : filelike object 

506 A file object or something with a `.read()` method like a file. 

507 

508 Returns 

509 ------- 

510 shape : tuple of int 

511 The shape of the array. 

512 fortran_order : bool 

513 The array data will be written out directly if it is either 

514 C-contiguous or Fortran-contiguous. Otherwise, it will be made 

515 contiguous before writing it out. 

516 dtype : dtype 

517 The dtype of the file's data. 

518 max_header_size : int, optional 

519 Maximum allowed size of the header. Large headers may not be safe 

520 to load securely and thus require explicitly passing a larger value. 

521 See :py:func:`ast.literal_eval()` for details. 

522 

523 Raises 

524 ------ 

525 ValueError 

526 If the data is invalid. 

527 
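
    Examples
    --------
    A minimal sketch using an in-memory file; the stream must already be
    positioned past the magic string and version bytes:

    >>> import io
    >>> fp = io.BytesIO()
    >>> write_array_header_1_0(
    ...     fp, {'descr': '<f8', 'fortran_order': False, 'shape': (2,)})
    >>> _ = fp.seek(MAGIC_LEN)  # skip the magic string and version bytes
    >>> read_array_header_1_0(fp)
    ((2,), False, dtype('float64'))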

528 """ 

529 return _read_array_header( 

530 fp, version=(1, 0), max_header_size=max_header_size) 

531 

532def read_array_header_2_0(fp, max_header_size=_MAX_HEADER_SIZE): 

533 """ 

534 Read an array header from a filelike object using the 2.0 file format 

535 version. 

536 

537 This will leave the file object located just after the header. 

538 

539 Parameters 

540 ---------- 

541 fp : filelike object 

542 A file object or something with a `.read()` method like a file. 

543 max_header_size : int, optional 

544 Maximum allowed size of the header. Large headers may not be safe 

545 to load securely and thus require explicitly passing a larger value. 

546 See :py:func:`ast.literal_eval()` for details. 

547 

548 Returns 

549 ------- 

550 shape : tuple of int 

551 The shape of the array. 

552 fortran_order : bool 

553 The array data will be written out directly if it is either 

554 C-contiguous or Fortran-contiguous. Otherwise, it will be made 

555 contiguous before writing it out. 

556 dtype : dtype 

557 The dtype of the file's data. 

558 

559 Raises 

560 ------ 

561 ValueError 

562 If the data is invalid. 

563 

564 """ 

565 return _read_array_header( 

566 fp, version=(2, 0), max_header_size=max_header_size) 

567 

568 

569def _filter_header(s): 

570 """Clean up 'L' in npz header ints. 

571 

572 Cleans up the 'L' in strings representing integers. Needed to allow npz 

573 headers produced in Python2 to be read in Python3. 

574 

575 Parameters 

576 ---------- 

577 s : string 

578 Npy file header. 

579 

580 Returns 

581 ------- 

582 header : str 

583 Cleaned up header. 

584 

585 """ 

586 import tokenize 

587 from io import StringIO 

588 

589 tokens = [] 

590 last_token_was_number = False 

591 for token in tokenize.generate_tokens(StringIO(s).readline): 

592 token_type = token[0] 

593 token_string = token[1] 

594 if (last_token_was_number and 

595 token_type == tokenize.NAME and 

596 token_string == "L"): 

597 continue 

598 else: 

599 tokens.append(token) 

600 last_token_was_number = (token_type == tokenize.NUMBER) 

601 return tokenize.untokenize(tokens) 

602 

603 

604def _read_array_header(fp, version, max_header_size=_MAX_HEADER_SIZE): 

605 """ 

606 see read_array_header_1_0 

607 """ 

608 # Read an unsigned, little-endian short int which has the length of the 

609 # header. 

610 import ast 

611 import struct 

612 hinfo = _header_size_info.get(version) 

613 if hinfo is None: 

614 raise ValueError("Invalid version {!r}".format(version)) 

615 hlength_type, encoding = hinfo 

616 

617 hlength_str = _read_bytes(fp, struct.calcsize(hlength_type), "array header length") 

618 header_length = struct.unpack(hlength_type, hlength_str)[0] 

619 header = _read_bytes(fp, header_length, "array header") 

620 header = header.decode(encoding) 

621 if len(header) > max_header_size: 

622 raise ValueError( 

623 f"Header info length ({len(header)}) is large and may not be safe " 

624 "to load securely.\n" 

625 "To allow loading, adjust `max_header_size` or fully trust " 

626 "the `.npy` file using `allow_pickle=True`.\n" 

627 "For safety against large resource use or crashes, sandboxing " 

628 "may be necessary.") 

629 

630 # The header is a pretty-printed string representation of a literal 

631 # Python dictionary with trailing newlines padded to a ARRAY_ALIGN byte 

632 # boundary. The keys are strings. 

633 # "shape" : tuple of int 

634 # "fortran_order" : bool 

635 # "descr" : dtype.descr 

636 # Versions (2, 0) and (1, 0) could have been created by a Python 2 

637 # implementation before header filtering was implemented. 

638 # 

639 # For performance reasons, we try without _filter_header first though 

640 try: 

641 d = ast.literal_eval(header) 

642 except SyntaxError as e: 

643 if version <= (2, 0): 

644 header = _filter_header(header) 

645 try: 

646 d = ast.literal_eval(header) 

647 except SyntaxError as e2: 

648 msg = "Cannot parse header: {!r}" 

649 raise ValueError(msg.format(header)) from e2 

650 else: 

651 warnings.warn( 

652 "Reading `.npy` or `.npz` file required additional " 

653 "header parsing as it was created on Python 2. Save the " 

654 "file again to speed up loading and avoid this warning.", 

655 UserWarning, stacklevel=4) 

656 else: 

657 msg = "Cannot parse header: {!r}" 

658 raise ValueError(msg.format(header)) from e 

659 if not isinstance(d, dict): 

660 msg = "Header is not a dictionary: {!r}" 

661 raise ValueError(msg.format(d)) 

662 

663 if EXPECTED_KEYS != d.keys(): 

664 keys = sorted(d.keys()) 

665 msg = "Header does not contain the correct keys: {!r}" 

666 raise ValueError(msg.format(keys)) 

667 

668 # Sanity-check the values. 

669 if (not isinstance(d['shape'], tuple) or 

670 not all(isinstance(x, int) for x in d['shape'])): 

671 msg = "shape is not valid: {!r}" 

672 raise ValueError(msg.format(d['shape'])) 

673 if not isinstance(d['fortran_order'], bool): 

674 msg = "fortran_order is not a valid bool: {!r}" 

675 raise ValueError(msg.format(d['fortran_order'])) 

676 try: 

677 dtype = descr_to_dtype(d['descr']) 

678 except TypeError as e: 

679 msg = "descr is not a valid dtype descriptor: {!r}" 

680 raise ValueError(msg.format(d['descr'])) from e 

681 

682 return d['shape'], d['fortran_order'], dtype 

683 

684def write_array(fp, array, version=None, allow_pickle=True, pickle_kwargs=None): 

685 """ 

686 Write an array to an NPY file, including a header. 

687 

688 If the array is neither C-contiguous nor Fortran-contiguous AND the 

689 file_like object is not a real file object, this function will have to 

690 copy data in memory. 

691 

692 Parameters 

693 ---------- 

694 fp : file_like object 

695 An open, writable file object, or similar object with a 

696 ``.write()`` method. 

697 array : ndarray 

698 The array to write to disk. 

699 version : (int, int) or None, optional 

700 The version number of the format. None means use the oldest 

701 supported version that is able to store the data. Default: None 

702 allow_pickle : bool, optional 

703 Whether to allow writing pickled data. Default: True 

704 pickle_kwargs : dict, optional 

705 Additional keyword arguments to pass to pickle.dump, excluding 

706 'protocol'. These are only useful when pickling objects in object 

707 arrays on Python 3 to Python 2 compatible format. 

708 

709 Raises 

710 ------ 

711 ValueError 

712 If the array cannot be persisted. This includes the case of 

713 allow_pickle=False and array being an object array. 

714 Various other errors 

715 If the array contains Python objects as part of its dtype, the 

716 process of pickling them may raise various errors if the objects 

717 are not picklable. 

718 
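
    Examples
    --------
    A minimal sketch writing to an in-memory file:

    >>> import io
    >>> fp = io.BytesIO()
    >>> write_array(fp, numpy.arange(4))
    >>> fp.getvalue()[:6]
    b'\\x93NUMPY'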

719 """ 

720 _check_version(version) 

721 _write_array_header(fp, header_data_from_array_1_0(array), version) 

722 

723 if array.itemsize == 0: 

724 buffersize = 0 

725 else: 

726 # Set buffer size to 16 MiB to hide the Python loop overhead. 

727 buffersize = max(16 * 1024 ** 2 // array.itemsize, 1) 

728 

729 dtype_class = type(array.dtype) 

730 

731 if array.dtype.hasobject or not dtype_class._legacy: 

732 # We contain Python objects so we cannot write out the data 

733 # directly. Instead, we will pickle it out 

734 if not allow_pickle: 

735 if array.dtype.hasobject: 

736 raise ValueError("Object arrays cannot be saved when " 

737 "allow_pickle=False") 

738 if not dtype_class._legacy: 

739 raise ValueError("User-defined dtypes cannot be saved " 

740 "when allow_pickle=False") 

741 if pickle_kwargs is None: 

742 pickle_kwargs = {} 

743 pickle.dump(array, fp, protocol=4, **pickle_kwargs) 

744 elif array.flags.f_contiguous and not array.flags.c_contiguous: 

745 if isfileobj(fp): 

746 array.T.tofile(fp) 

747 else: 

748 for chunk in numpy.nditer( 

749 array, flags=['external_loop', 'buffered', 'zerosize_ok'], 

750 buffersize=buffersize, order='F'): 

751 fp.write(chunk.tobytes('C')) 

752 else: 

753 if isfileobj(fp): 

754 array.tofile(fp) 

755 else: 

756 for chunk in numpy.nditer( 

757 array, flags=['external_loop', 'buffered', 'zerosize_ok'], 

758 buffersize=buffersize, order='C'): 

759 fp.write(chunk.tobytes('C')) 

760 

761 

762def read_array(fp, allow_pickle=False, pickle_kwargs=None, *, 

763 max_header_size=_MAX_HEADER_SIZE): 

764 """ 

765 Read an array from an NPY file. 

766 

767 Parameters 

768 ---------- 

769 fp : file_like object 

770 If this is not a real file object, then this may take extra memory 

771 and time. 

772 allow_pickle : bool, optional 

773 Whether to allow writing pickled data. Default: False 

774 pickle_kwargs : dict 

775 Additional keyword arguments to pass to pickle.load. These are only 

776 useful when loading object arrays saved on Python 2 when using 

777 Python 3. 

778 max_header_size : int, optional 

779 Maximum allowed size of the header. Large headers may not be safe 

780 to load securely and thus require explicitly passing a larger value. 

781 See :py:func:`ast.literal_eval()` for details. 

782 This option is ignored when `allow_pickle` is passed. In that case 

783 the file is by definition trusted and the limit is unnecessary. 

784 

785 Returns 

786 ------- 

787 array : ndarray 

788 The array from the data on disk. 

789 

790 Raises 

791 ------ 

792 ValueError 

793 If the data is invalid, or allow_pickle=False and the file contains 

794 an object array. 

795 
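
    Examples
    --------
    A minimal sketch round-tripping an array through an in-memory file:

    >>> import io
    >>> fp = io.BytesIO()
    >>> write_array(fp, numpy.array([1.0, 2.0]))
    >>> _ = fp.seek(0)
    >>> read_array(fp)
    array([1., 2.])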

796 """ 

797 if allow_pickle: 

798 # Effectively ignore max_header_size, since `allow_pickle` indicates 

799 # that the input is fully trusted. 

800 max_header_size = 2**64 

801 

802 version = read_magic(fp) 

803 _check_version(version) 

804 shape, fortran_order, dtype = _read_array_header( 

805 fp, version, max_header_size=max_header_size) 

806 if len(shape) == 0: 

807 count = 1 

808 else: 

809 count = numpy.multiply.reduce(shape, dtype=numpy.int64) 

810 

811 # Now read the actual data. 

812 if dtype.hasobject: 

813 # The array contained Python objects. We need to unpickle the data. 

814 if not allow_pickle: 

815 raise ValueError("Object arrays cannot be loaded when " 

816 "allow_pickle=False") 

817 if pickle_kwargs is None: 

818 pickle_kwargs = {} 

819 try: 

820 array = pickle.load(fp, **pickle_kwargs) 

821 except UnicodeError as err: 

822 # Friendlier error message 

823 raise UnicodeError("Unpickling a python object failed: %r\n" 

824 "You may need to pass the encoding= option " 

825 "to numpy.load" % (err,)) from err 

826 else: 

827 if isfileobj(fp): 

828 # We can use the fast fromfile() function. 

829 array = numpy.fromfile(fp, dtype=dtype, count=count) 

830 else: 

831 # This is not a real file. We have to read it the 

832 # memory-intensive way. 

833 # crc32 module fails on reads greater than 2 ** 32 bytes, 

834 # breaking large reads from gzip streams. Chunk reads to 

835 # BUFFER_SIZE bytes to avoid issue and reduce memory overhead 

836 # of the read. In non-chunked case count < max_read_count, so 

837 # only one read is performed. 

838 

839 # Use np.ndarray instead of np.empty since the latter does 

840 # not correctly instantiate zero-width string dtypes; see 

841 # https://github.com/numpy/numpy/pull/6430 

842 array = numpy.ndarray(count, dtype=dtype) 

843 

844 if dtype.itemsize > 0: 

845 # If dtype.itemsize == 0 then there's nothing more to read 

846 max_read_count = BUFFER_SIZE // min(BUFFER_SIZE, dtype.itemsize) 

847 

848 for i in range(0, count, max_read_count): 

849 read_count = min(max_read_count, count - i) 

850 read_size = int(read_count * dtype.itemsize) 

851 data = _read_bytes(fp, read_size, "array data") 

852 array[i:i+read_count] = numpy.frombuffer(data, dtype=dtype, 

853 count=read_count) 

854 

855 if fortran_order: 

856 array.shape = shape[::-1] 

857 array = array.transpose() 

858 else: 

859 array.shape = shape 

860 

861 return array 

862 

863 

864def open_memmap(filename, mode='r+', dtype=None, shape=None, 

865 fortran_order=False, version=None, *, 

866 max_header_size=_MAX_HEADER_SIZE): 

867 """ 

868 Open a .npy file as a memory-mapped array. 

869 

870 This may be used to read an existing file or create a new one. 

871 

872 Parameters 

873 ---------- 

874 filename : str or path-like 

875 The name of the file on disk. This may *not* be a file-like 

876 object. 

877 mode : str, optional 

878 The mode in which to open the file; the default is 'r+'. In 

879 addition to the standard file modes, 'c' is also accepted to mean 

880 "copy on write." See `memmap` for the available mode strings. 

881 dtype : data-type, optional 

882 The data type of the array if we are creating a new file in "write" 

883 mode, if not, `dtype` is ignored. The default value is None, which 

884 results in a data-type of `float64`. 

885 shape : tuple of int 

886 The shape of the array if we are creating a new file in "write" 

887 mode, in which case this parameter is required. Otherwise, this 

888 parameter is ignored and is thus optional. 

889 fortran_order : bool, optional 

890 Whether the array should be Fortran-contiguous (True) or 

891 C-contiguous (False, the default) if we are creating a new file in 

892 "write" mode. 

893 version : tuple of int (major, minor) or None 

894 If the mode is a "write" mode, then this is the version of the file 

895 format used to create the file. None means use the oldest 

896 supported version that is able to store the data. Default: None 

897 max_header_size : int, optional 

898 Maximum allowed size of the header. Large headers may not be safe 

899 to load securely and thus require explicitly passing a larger value. 

900 See :py:func:`ast.literal_eval()` for details. 

901 

902 Returns 

903 ------- 

904 marray : memmap 

905 The memory-mapped array. 

906 

907 Raises 

908 ------ 

909 ValueError 

910 If the data or the mode is invalid. 

911 OSError 

912 If the file is not found or cannot be opened correctly. 

913 

914 See Also 

915 -------- 

916 numpy.memmap 

917 
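
    Examples
    --------
    A minimal sketch creating, writing and re-opening a memory-mapped
    array in a temporary directory (output shown for a little-endian
    machine):

    >>> import os
    >>> import tempfile
    >>> path = os.path.join(tempfile.mkdtemp(), 'a.npy')
    >>> m = open_memmap(path, mode='w+', dtype='<i4', shape=(3,))
    >>> m[:] = [1, 2, 3]
    >>> m.flush()
    >>> open_memmap(path, mode='r')[...]
    memmap([1, 2, 3], dtype=int32)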

918 """ 

919 if isfileobj(filename): 

920 raise ValueError("Filename must be a string or a path-like object." 

921 " Memmap cannot use existing file handles.") 

922 

923 if 'w' in mode: 

924 # We are creating the file, not reading it. 

925 # Check if we ought to create the file. 

926 _check_version(version) 

927 # Ensure that the given dtype is an authentic dtype object rather 

928 # than just something that can be interpreted as a dtype object. 

929 dtype = numpy.dtype(dtype) 

930 if dtype.hasobject: 

931 msg = "Array can't be memory-mapped: Python objects in dtype." 

932 raise ValueError(msg) 

933 d = dict( 

934 descr=dtype_to_descr(dtype), 

935 fortran_order=fortran_order, 

936 shape=shape, 

937 ) 

938 # If we got here, then it should be safe to create the file. 

939 with open(os.fspath(filename), mode+'b') as fp: 

940 _write_array_header(fp, d, version) 

941 offset = fp.tell() 

942 else: 

943 # Read the header of the file first. 

944 with open(os.fspath(filename), 'rb') as fp: 

945 version = read_magic(fp) 

946 _check_version(version) 

947 

948 shape, fortran_order, dtype = _read_array_header( 

949 fp, version, max_header_size=max_header_size) 

950 if dtype.hasobject: 

951 msg = "Array can't be memory-mapped: Python objects in dtype." 

952 raise ValueError(msg) 

953 offset = fp.tell() 

954 

955 if fortran_order: 

956 order = 'F' 

957 else: 

958 order = 'C' 

959 

960 # We need to change a write-only mode to a read-write mode since we've 

961 # already written data to the file. 

962 if mode == 'w+': 

963 mode = 'r+' 

964 

965 marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order, 

966 mode=mode, offset=offset) 

967 

968 return marray 

969 

970 

971def _read_bytes(fp, size, error_template="ran out of data"): 

972 """ 

973 Read from file-like object until size bytes are read. 

974 Raises ValueError if not EOF is encountered before size bytes are read. 

975 Non-blocking objects only supported if they derive from io objects. 

976 

977 Required as e.g. ZipExtFile in python 2.6 can return less data than 

978 requested. 

979 """ 

980 data = bytes() 

981 while True: 

982 # io files (default in python3) return None or raise on 

983 # would-block, python2 file will truncate, probably nothing can be 

984 # done about that. note that regular files can't be non-blocking 

985 try: 

986 r = fp.read(size - len(data)) 

987 data += r 

988 if len(r) == 0 or len(data) == size: 

989 break 

990 except BlockingIOError: 

991 pass 

992 if len(data) != size: 

993 msg = "EOF: reading %s, expected %d bytes got %d" 

994 raise ValueError(msg % (error_template, size, len(data))) 

995 else: 

996 return data 

997 

998 

999def isfileobj(f): 

1000 if not isinstance(f, (io.FileIO, io.BufferedReader, io.BufferedWriter)): 

1001 return False 

1002 try: 

1003 # BufferedReader/Writer may raise OSError when 

1004 # fetching `fileno()` (e.g. when wrapping BytesIO). 

1005 f.fileno() 

1006 return True 

1007 except OSError: 

1008 return False