
1""" 

2Binary serialization 

3 

4NPY format 

5========== 

6 

7A simple format for saving numpy arrays to disk with the full 

8information about them. 

9 

10The ``.npy`` format is the standard binary file format in NumPy for 

11persisting a *single* arbitrary NumPy array on disk. The format stores all 

12of the shape and dtype information necessary to reconstruct the array 

13correctly even on another machine with a different architecture. 

14The format is designed to be as simple as possible while achieving 

15its limited goals. 

16 

17The ``.npz`` format is the standard format for persisting *multiple* NumPy 

18arrays on disk. A ``.npz`` file is a zip file containing multiple ``.npy`` 

19files, one for each array. 

20 

21Capabilities 

22------------ 

23 

24- Can represent all NumPy arrays including nested record arrays and 

25 object arrays. 

26 

27- Represents the data in its native binary form. 

28 

29- Supports Fortran-contiguous arrays directly. 

30 

31- Stores all of the necessary information to reconstruct the array 

32 including shape and dtype on a machine of a different 

33 architecture. Both little-endian and big-endian arrays are 

34 supported, and a file with little-endian numbers will yield 

35 a little-endian array on any machine reading the file. The 

36 types are described in terms of their actual sizes. For example, 

37 if a machine with a 64-bit C "long int" writes out an array with 

38 "long ints", a reading machine with 32-bit C "long ints" will yield 

39 an array with 64-bit integers. 

40 

41- Is straightforward to reverse engineer. Datasets often live longer than 

42 the programs that created them. A competent developer should be 

43 able to create a solution in their preferred programming language to 

44 read most ``.npy`` files that they have been given without much 

45 documentation. 

46 

47- Allows memory-mapping of the data. See `open_memmap`. 

48 

49- Can be read from a filelike stream object instead of an actual file. 

50 

51- Stores object arrays, i.e. arrays containing elements that are arbitrary 

52 Python objects. Files with object arrays are not to be mmapable, but 

53 can be read and written to disk. 

54 

55Limitations 

56----------- 

57 

58- Arbitrary subclasses of numpy.ndarray are not completely preserved. 

59 Subclasses will be accepted for writing, but only the array data will 

60 be written out. A regular numpy.ndarray object will be created 

61 upon reading the file. 

62 

63.. warning:: 

64 

65 Due to limitations in the interpretation of structured dtypes, dtypes 

66 with fields with empty names will have the names replaced by 'f0', 'f1', 

67 etc. Such arrays will not round-trip through the format entirely 

68 accurately. The data is intact; only the field names will differ. We are 

69 working on a fix for this. This fix will not require a change in the 

70 file format. The arrays with such structures can still be saved and 

71 restored, and the correct dtype may be restored by using the 

72 ``loadedarray.view(correct_dtype)`` method. 

73 

74File extensions 

75--------------- 

76 

77We recommend using the ``.npy`` and ``.npz`` extensions for files saved 

78in this format. This is by no means a requirement; applications may wish 

79to use these file formats but use an extension specific to the 

80application. In the absence of an obvious alternative, however, 

81we suggest using ``.npy`` and ``.npz``. 

82 

83Version numbering 

84----------------- 

85 

86The version numbering of these formats is independent of NumPy version 

87numbering. If the format is upgraded, the code in `numpy.io` will still 

88be able to read and write Version 1.0 files. 

89 

90Format Version 1.0 

91------------------ 

92 

93The first 6 bytes are a magic string: exactly ``\\x93NUMPY``. 

94 

95The next 1 byte is an unsigned byte: the major version number of the file 

96format, e.g. ``\\x01``. 

97 

98The next 1 byte is an unsigned byte: the minor version number of the file 

99format, e.g. ``\\x00``. Note: the version of the file format is not tied 

100to the version of the numpy package. 

101 

102The next 2 bytes form a little-endian unsigned short int: the length of 

103the header data HEADER_LEN. 

104 

105The next HEADER_LEN bytes form the header data describing the array's 

106format. It is an ASCII string which contains a Python literal expression 

107of a dictionary. It is terminated by a newline (``\\n``) and padded with 

108spaces (``\\x20``) to make the total of 

109``len(magic string) + 2 + len(length) + HEADER_LEN`` be evenly divisible 

110by 64 for alignment purposes. 

111 

112The dictionary contains three keys: 

113 

114 "descr" : dtype.descr 

115 An object that can be passed as an argument to the `numpy.dtype` 

116 constructor to create the array's dtype. 

117 "fortran_order" : bool 

118 Whether the array data is Fortran-contiguous or not. Since 

119 Fortran-contiguous arrays are a common form of non-C-contiguity, 

120 we allow them to be written directly to disk for efficiency. 

121 "shape" : tuple of int 

122 The shape of the array. 

123 

124For repeatability and readability, the dictionary keys are sorted in 

125alphabetic order. This is for convenience only. A writer SHOULD implement 

126this if possible. A reader MUST NOT depend on this. 
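
For illustration, the header dictionary for a little-endian float64 array
of shape (2, 3) stored in C order would read as follows (a sketch; writers
may append extra trailing spaces as padding)::

    {'descr': '<f8', 'fortran_order': False, 'shape': (2, 3), }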

Following the header comes the array data. If the dtype contains Python
objects (i.e. ``dtype.hasobject is True``), then the data is a Python
pickle of the array. Otherwise the data is the contiguous (either C-
or Fortran-, depending on ``fortran_order``) bytes of the array.
Consumers can figure out the number of bytes by multiplying the number
of elements given by the shape (noting that ``shape=()`` means there is
1 element) by ``dtype.itemsize``.
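
As a minimal sketch, the first bytes of a Version 1.0 file written by
``numpy.save`` can be inspected like this::

    >>> import io
    >>> import numpy as np
    >>> f = io.BytesIO()
    >>> np.save(f, np.arange(3))
    >>> f.getvalue()[:8]
    b'\\x93NUMPY\\x01\\x00'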

Format Version 2.0
------------------

The version 1.0 format only allowed the array header to have a total size of
65535 bytes. This can be exceeded by structured arrays with a large number of
columns. The version 2.0 format extends the header size to 4 GiB.
`numpy.save` will automatically save in 2.0 format if the data requires it,
else it will always use the more compatible 1.0 format.

The description of the fourth element of the header therefore has become:
"The next 4 bytes form a little-endian unsigned int: the length of the header
data HEADER_LEN."

Format Version 3.0
------------------

This version replaces the ASCII string (which in practice was latin1) with
a utf8-encoded string, so it supports structured types with any unicode
field names.

Notes
-----
The ``.npy`` format, including motivation for creating it and a comparison of
alternatives, is described in the
:doc:`"npy-format" NEP <neps:nep-0001-npy-format>`; however, details have
evolved with time and this document is more current.

"""

import io
import os
import pickle
import warnings

import numpy
from numpy.lib._utils_impl import drop_metadata


__all__ = []


EXPECTED_KEYS = {'descr', 'fortran_order', 'shape'}
MAGIC_PREFIX = b'\x93NUMPY'
MAGIC_LEN = len(MAGIC_PREFIX) + 2
ARRAY_ALIGN = 64  # plausible values are powers of 2 between 16 and 4096
BUFFER_SIZE = 2**18  # size of buffer for reading npz files in bytes
# allow growth within the address space of a 64 bit machine along one axis
GROWTH_AXIS_MAX_DIGITS = 21  # = len(str(8*2**64-1)) hypothetical int1 dtype

# difference between version 1.0 and 2.0 is a 4 byte (I) header length
# instead of 2 bytes (H) allowing storage of large structured arrays
_header_size_info = {
    (1, 0): ('<H', 'latin1'),
    (2, 0): ('<I', 'latin1'),
    (3, 0): ('<I', 'utf8'),
}

# Python's literal_eval is not actually safe for large inputs, since parsing
# may become slow or even cause interpreter crashes.
# This is an arbitrary, low limit which should make it safe in practice.
_MAX_HEADER_SIZE = 10000


def _check_version(version):
    if version not in [(1, 0), (2, 0), (3, 0), None]:
        msg = "we only support format version (1,0), (2,0), and (3,0), not %s"
        raise ValueError(msg % (version,))


def magic(major, minor):
    """ Return the magic string for the given file format version.

    Parameters
    ----------
    major : int in [0, 255]
    minor : int in [0, 255]

    Returns
    -------
    magic : bytes

    Raises
    ------
    ValueError if the version cannot be formatted.
    """
    if major < 0 or major > 255:
        raise ValueError("major version must be 0 <= major < 256")
    if minor < 0 or minor > 255:
        raise ValueError("minor version must be 0 <= minor < 256")
    return MAGIC_PREFIX + bytes([major, minor])


def read_magic(fp):
    """ Read the magic string to get the version of the file format.

    Parameters
    ----------
    fp : filelike object

    Returns
    -------
    major : int
    minor : int
    """
    magic_str = _read_bytes(fp, MAGIC_LEN, "magic string")
    if magic_str[:-2] != MAGIC_PREFIX:
        msg = "the magic string is not correct; expected %r, got %r"
        raise ValueError(msg % (MAGIC_PREFIX, magic_str[:-2]))
    major, minor = magic_str[-2:]
    return major, minor
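
# Illustrative round trip through `magic` and `read_magic` (a sketch, not
# part of the module's behavior):
#
#     >>> import io
#     >>> read_magic(io.BytesIO(magic(1, 0)))
#     (1, 0)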


def dtype_to_descr(dtype):
    """
    Get a serializable descriptor from the dtype.

    The .descr attribute of a dtype object cannot be round-tripped through
    the dtype() constructor. Simple types, like dtype('float32'), have
    a descr which looks like a record array with one field with '' as
    a name. The dtype() constructor interprets this as a request to give
    a default name. Instead, we construct a descriptor that can be passed
    to dtype().

    Parameters
    ----------
    dtype : dtype
        The dtype of the array that will be written to disk.

    Returns
    -------
    descr : object
        An object that can be passed to `numpy.dtype()` in order to
        replicate the input dtype.

    """
    # NOTE: drop_metadata may not return the right dtype e.g. for user
    # dtypes. In that case our code below would fail the same, though.
    new_dtype = drop_metadata(dtype)
    if new_dtype is not dtype:
        warnings.warn("metadata on a dtype is not saved to an npy/npz. "
                      "Use another format (such as pickle) to store it.",
                      UserWarning, stacklevel=2)
    if dtype.names is not None:
        # This is a record array. The .descr is fine. XXX: parts of the
        # record array with an empty name, like padding bytes, still get
        # fiddled with. This needs to be fixed in the C implementation of
        # dtype().
        return dtype.descr
    elif not type(dtype)._legacy:
        # this must be a user-defined dtype since numpy does not yet expose
        # any non-legacy dtypes in the public API
        #
        # non-legacy dtypes don't yet have __array_interface__
        # support. Instead, as a hack, we use pickle to save the array, and
        # lie that the dtype is object. When the array is loaded, the
        # descriptor is unpickled with the array and the object dtype in the
        # header is discarded.
        #
        # a future NEP should define a way to serialize user-defined
        # descriptors and ideally work out the possible security implications
        warnings.warn("Custom dtypes are saved as python objects using the "
                      "pickle protocol. Loading this file requires "
                      "allow_pickle=True to be set.",
                      UserWarning, stacklevel=2)
        return "|O"
    else:
        return dtype.str
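
# Sketch of typical outputs (the byte order shown assumes a little-endian
# machine):
#
#     >>> dtype_to_descr(numpy.dtype('float64'))
#     '<f8'
#     >>> dtype_to_descr(numpy.dtype([('a', '<i4'), ('b', '<f8')]))
#     [('a', '<i4'), ('b', '<f8')]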


def descr_to_dtype(descr):
    """
    Returns a dtype based off the given description.

    This is essentially the reverse of `~lib.format.dtype_to_descr`. It will
    remove the valueless padding fields (unnamed fields of void type) that
    ``dtype.descr`` can contain, and then convert the description to its
    corresponding dtype.

    Parameters
    ----------
    descr : object
        The object retrieved by dtype.descr. Can be passed to
        `numpy.dtype` in order to replicate the input dtype.

    Returns
    -------
    dtype : dtype
        The dtype constructed by the description.

    """
    if isinstance(descr, str):
        # No padding removal needed
        return numpy.dtype(descr)
    elif isinstance(descr, tuple):
        # subtype, will always have a shape descr[1]
        dt = descr_to_dtype(descr[0])
        return numpy.dtype((dt, descr[1]))

    titles = []
    names = []
    formats = []
    offsets = []
    offset = 0
    for field in descr:
        if len(field) == 2:
            name, descr_str = field
            dt = descr_to_dtype(descr_str)
        else:
            name, descr_str, shape = field
            dt = numpy.dtype((descr_to_dtype(descr_str), shape))

        # Ignore padding bytes, which will be void bytes with '' as name
        # Once support for blank names is removed, only "if name == ''" needed
        is_pad = (name == '' and dt.type is numpy.void and dt.names is None)
        if not is_pad:
            title, name = name if isinstance(name, tuple) else (None, name)
            titles.append(title)
            names.append(name)
            formats.append(dt)
            offsets.append(offset)
        offset += dt.itemsize

    return numpy.dtype({'names': names, 'formats': formats, 'titles': titles,
                        'offsets': offsets, 'itemsize': offset})
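
# Round-trip sketch: the descr written to the header reconstructs the
# original dtype.
#
#     >>> dt = numpy.dtype([('a', '<i4'), ('b', '<f8')])
#     >>> descr_to_dtype(dtype_to_descr(dt)) == dt
#     True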


def header_data_from_array_1_0(array):
    """ Get the dictionary of header metadata from a numpy.ndarray.

    Parameters
    ----------
    array : numpy.ndarray

    Returns
    -------
    d : dict
        This has the appropriate entries for writing its string representation
        to the header of the file.
    """
    d = {'shape': array.shape}
    if array.flags.c_contiguous:
        d['fortran_order'] = False
    elif array.flags.f_contiguous:
        d['fortran_order'] = True
    else:
        # Totally non-contiguous data. We will have to make it C-contiguous
        # before writing. Note that we need to test for C_CONTIGUOUS first
        # because a 1-D array is both C_CONTIGUOUS and F_CONTIGUOUS.
        d['fortran_order'] = False

    d['descr'] = dtype_to_descr(array.dtype)
    return d
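
# Sketch (a little-endian machine is assumed for the descr value):
#
#     >>> header_data_from_array_1_0(numpy.zeros((2, 3)))
#     {'shape': (2, 3), 'fortran_order': False, 'descr': '<f8'}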


def _wrap_header(header, version):
    """
    Takes a stringified header, and attaches the prefix and padding to it
    """
    import struct
    assert version is not None
    fmt, encoding = _header_size_info[version]
    header = header.encode(encoding)
    hlen = len(header) + 1
    padlen = ARRAY_ALIGN - ((MAGIC_LEN + struct.calcsize(fmt) + hlen) % ARRAY_ALIGN)
    try:
        header_prefix = magic(*version) + struct.pack(fmt, hlen + padlen)
    except struct.error:
        msg = "Header length {} too big for version={}".format(hlen, version)
        raise ValueError(msg) from None

    # Pad the header with spaces and a final newline such that the magic
    # string, the header-length short and the header are aligned on an
    # ARRAY_ALIGN byte boundary. This supports memory mapping of dtypes
    # aligned up to ARRAY_ALIGN on systems like Linux where mmap()
    # offset must be page-aligned (i.e. the beginning of the file).
    return header_prefix + header + b' '*padlen + b'\n'
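
# Alignment property (a sketch): the wrapped header always ends on an
# ARRAY_ALIGN-byte boundary, so the array data that follows is aligned.
#
#     >>> len(_wrap_header("{}", (1, 0))) % ARRAY_ALIGN
#     0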


def _wrap_header_guess_version(header):
    """
    Like `_wrap_header`, but chooses an appropriate version given the contents
    """
    try:
        return _wrap_header(header, (1, 0))
    except ValueError:
        pass

    try:
        ret = _wrap_header(header, (2, 0))
    except UnicodeEncodeError:
        pass
    else:
        warnings.warn("Stored array in format 2.0. It can only be "
                      "read by NumPy >= 1.9", UserWarning, stacklevel=2)
        return ret

    header = _wrap_header(header, (3, 0))
    warnings.warn("Stored array in format 3.0. It can only be "
                  "read by NumPy >= 1.17", UserWarning, stacklevel=2)
    return header
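
# Version escalation sketch: a short ASCII header fits version 1.0, whose
# version bytes are b'\x01\x00' at offset 6.
#
#     >>> _wrap_header_guess_version("{}")[6:8]
#     b'\x01\x00'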


def _write_array_header(fp, d, version=None):
    """ Write the header for an array to a filelike object.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string representation
        to the header of the file.
    version : tuple or None
        None means use oldest that works. Providing an explicit version will
        raise a ValueError if the format does not allow saving this data.
        Default: None
    """
    header = ["{"]
    for key, value in sorted(d.items()):
        # Need to use repr here, since we eval these when reading
        header.append("'%s': %s, " % (key, repr(value)))
    header.append("}")
    header = "".join(header)

    # Add some spare space so that the array header can be modified in-place
    # when changing the array size, e.g. when growing it by appending data at
    # the end.
    shape = d['shape']
    header += " " * ((GROWTH_AXIS_MAX_DIGITS - len(repr(
        shape[-1 if d['fortran_order'] else 0]
    ))) if len(shape) > 0 else 0)

    if version is None:
        header = _wrap_header_guess_version(header)
    else:
        header = _wrap_header(header, version)
    fp.write(header)
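
# Sketch: writing a header to an in-memory buffer produces a 64-byte-aligned
# block starting with the magic string.
#
#     >>> buf = io.BytesIO()
#     >>> _write_array_header(buf, {'descr': '<f8', 'fortran_order': False,
#     ...                           'shape': (3,)})
#     >>> buf.getvalue()[:6], len(buf.getvalue()) % ARRAY_ALIGN
#     (b'\x93NUMPY', 0)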


def write_array_header_1_0(fp, d):
    """ Write the header for an array using the 1.0 format.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (1, 0))


def write_array_header_2_0(fp, d):
    """ Write the header for an array using the 2.0 format.
    The 2.0 format allows storing very large structured arrays.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (2, 0))


def read_array_header_1_0(fp, max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array header from a filelike object using the 1.0 file format
    version.

    This will leave the file object located just after the header.

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval()` for details.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        Whether the array data in the file is stored in Fortran
        (column-major) order rather than C order.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    return _read_array_header(
        fp, version=(1, 0), max_header_size=max_header_size)
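
# Sketch of a write/read round trip for a version 1.0 header. Note that the
# reader expects the file position to be just past the magic string.
#
#     >>> buf = io.BytesIO()
#     >>> write_array_header_1_0(buf, {'descr': '<f8', 'fortran_order': False,
#     ...                              'shape': (3,)})
#     >>> _ = buf.seek(MAGIC_LEN)
#     >>> read_array_header_1_0(buf)
#     ((3,), False, dtype('float64'))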


def read_array_header_2_0(fp, max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array header from a filelike object using the 2.0 file format
    version.

    This will leave the file object located just after the header.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval()` for details.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        Whether the array data in the file is stored in Fortran
        (column-major) order rather than C order.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    return _read_array_header(
        fp, version=(2, 0), max_header_size=max_header_size)


def _filter_header(s):
    """Clean up 'L' suffixes in npy header integers.

    Cleans up the 'L' in strings representing integers. Needed to allow npy
    headers produced in Python 2 to be read in Python 3.

    Parameters
    ----------
    s : string
        Npy file header.

    Returns
    -------
    header : str
        Cleaned up header.

    """
    import tokenize
    from io import StringIO

    tokens = []
    last_token_was_number = False
    for token in tokenize.generate_tokens(StringIO(s).readline):
        token_type = token[0]
        token_string = token[1]
        if (last_token_was_number and
                token_type == tokenize.NAME and
                token_string == "L"):
            continue
        else:
            tokens.append(token)
        last_token_was_number = (token_type == tokenize.NUMBER)
    return tokenize.untokenize(tokens)
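
# Sketch: a Python 2 header with long-integer suffixes becomes parseable
# (untokenize may shift whitespace slightly, which literal_eval tolerates).
#
#     >>> import ast
#     >>> ast.literal_eval(_filter_header("{'shape': (10L, 3L)}"))
#     {'shape': (10, 3)}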


def _read_array_header(fp, version, max_header_size=_MAX_HEADER_SIZE):
    """
    see read_array_header_1_0
    """
    # Read the little-endian unsigned integer (2 bytes for version 1.0,
    # 4 bytes otherwise) that gives the length of the header.
    import ast
    import struct
    hinfo = _header_size_info.get(version)
    if hinfo is None:
        raise ValueError("Invalid version {!r}".format(version))
    hlength_type, encoding = hinfo

    hlength_str = _read_bytes(fp, struct.calcsize(hlength_type), "array header length")
    header_length = struct.unpack(hlength_type, hlength_str)[0]
    header = _read_bytes(fp, header_length, "array header")
    header = header.decode(encoding)
    if len(header) > max_header_size:
        raise ValueError(
            f"Header info length ({len(header)}) is large and may not be safe "
            "to load securely.\n"
            "To allow loading, adjust `max_header_size` or fully trust "
            "the `.npy` file using `allow_pickle=True`.\n"
            "For safety against large resource use or crashes, sandboxing "
            "may be necessary.")

    # The header is a pretty-printed string representation of a literal
    # Python dictionary, padded with spaces and a trailing newline to an
    # ARRAY_ALIGN byte boundary. The keys are strings.
    #   "shape" : tuple of int
    #   "fortran_order" : bool
    #   "descr" : dtype.descr
    # Versions (2, 0) and (1, 0) could have been created by a Python 2
    # implementation before header filtering was implemented.
    #
    # For performance reasons, we try without _filter_header first though
    try:
        d = ast.literal_eval(header)
    except SyntaxError as e:
        if version <= (2, 0):
            header = _filter_header(header)
            try:
                d = ast.literal_eval(header)
            except SyntaxError as e2:
                msg = "Cannot parse header: {!r}"
                raise ValueError(msg.format(header)) from e2
            else:
                warnings.warn(
                    "Reading `.npy` or `.npz` file required additional "
                    "header parsing as it was created on Python 2. Save the "
                    "file again to speed up loading and avoid this warning.",
                    UserWarning, stacklevel=4)
        else:
            msg = "Cannot parse header: {!r}"
            raise ValueError(msg.format(header)) from e
    if not isinstance(d, dict):
        msg = "Header is not a dictionary: {!r}"
        raise ValueError(msg.format(d))

    if EXPECTED_KEYS != d.keys():
        keys = sorted(d.keys())
        msg = "Header does not contain the correct keys: {!r}"
        raise ValueError(msg.format(keys))

    # Sanity-check the values.
    if (not isinstance(d['shape'], tuple) or
            not all(isinstance(x, int) for x in d['shape'])):
        msg = "shape is not valid: {!r}"
        raise ValueError(msg.format(d['shape']))
    if not isinstance(d['fortran_order'], bool):
        msg = "fortran_order is not a valid bool: {!r}"
        raise ValueError(msg.format(d['fortran_order']))
    try:
        dtype = descr_to_dtype(d['descr'])
    except TypeError as e:
        msg = "descr is not a valid dtype descriptor: {!r}"
        raise ValueError(msg.format(d['descr'])) from e

    return d['shape'], d['fortran_order'], dtype


def write_array(fp, array, version=None, allow_pickle=True, pickle_kwargs=None):
    """
    Write an array to an NPY file, including a header.

    If the array is neither C-contiguous nor Fortran-contiguous AND the
    file_like object is not a real file object, this function will have to
    copy data in memory.

    Parameters
    ----------
    fp : file_like object
        An open, writable file object, or similar object with a
        ``.write()`` method.
    array : ndarray
        The array to write to disk.
    version : (int, int) or None, optional
        The version number of the format. None means use the oldest
        supported version that is able to store the data. Default: None
    allow_pickle : bool, optional
        Whether to allow writing pickled data. Default: True
    pickle_kwargs : dict, optional
        Additional keyword arguments to pass to pickle.dump, excluding
        'protocol'. These are only useful when pickling objects in object
        arrays on Python 3 to a Python 2 compatible format.

    Raises
    ------
    ValueError
        If the array cannot be persisted. This includes the case of
        allow_pickle=False and `array` being an object array.
    Various other errors
        If the array contains Python objects as part of its dtype, the
        process of pickling them may raise various errors if the objects
        are not picklable.

    """
    _check_version(version)
    _write_array_header(fp, header_data_from_array_1_0(array), version)

    if array.itemsize == 0:
        buffersize = 0
    else:
        # Set buffer size to 16 MiB to hide the Python loop overhead.
        buffersize = max(16 * 1024 ** 2 // array.itemsize, 1)

    dtype_class = type(array.dtype)

    if array.dtype.hasobject or not dtype_class._legacy:
        # We contain Python objects so we cannot write out the data
        # directly. Instead, we will pickle it out
        if not allow_pickle:
            if array.dtype.hasobject:
                raise ValueError("Object arrays cannot be saved when "
                                 "allow_pickle=False")
            if not dtype_class._legacy:
                raise ValueError("User-defined dtypes cannot be saved "
                                 "when allow_pickle=False")
        if pickle_kwargs is None:
            pickle_kwargs = {}
        pickle.dump(array, fp, protocol=3, **pickle_kwargs)
    elif array.flags.f_contiguous and not array.flags.c_contiguous:
        if isfileobj(fp):
            array.T.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='F'):
                fp.write(chunk.tobytes('C'))
    else:
        if isfileobj(fp):
            array.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='C'):
                fp.write(chunk.tobytes('C'))
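
# Sketch: writing a small array to an in-memory stream.
#
#     >>> buf = io.BytesIO()
#     >>> write_array(buf, numpy.arange(3), version=(1, 0))
#     >>> buf.getvalue()[:8]
#     b'\x93NUMPY\x01\x00'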


def read_array(fp, allow_pickle=False, pickle_kwargs=None, *,
               max_header_size=_MAX_HEADER_SIZE):
    """
    Read an array from an NPY file.

    Parameters
    ----------
    fp : file_like object
        If this is not a real file object, then this may take extra memory
        and time.
    allow_pickle : bool, optional
        Whether to allow reading pickled data. Default: False

        .. versionchanged:: 1.16.3
            Made default False in response to CVE-2019-6446.

    pickle_kwargs : dict
        Additional keyword arguments to pass to pickle.load. These are only
        useful when loading object arrays saved on Python 2 when using
        Python 3.
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval()` for details.
        This option is ignored when `allow_pickle` is passed. In that case
        the file is by definition trusted and the limit is unnecessary.

    Returns
    -------
    array : ndarray
        The array from the data on disk.

    Raises
    ------
    ValueError
        If the data is invalid, or allow_pickle=False and the file contains
        an object array.

    """
    if allow_pickle:
        # Effectively ignore max_header_size, since `allow_pickle` indicates
        # that the input is fully trusted.
        max_header_size = 2**64

    version = read_magic(fp)
    _check_version(version)
    shape, fortran_order, dtype = _read_array_header(
        fp, version, max_header_size=max_header_size)
    if len(shape) == 0:
        count = 1
    else:
        count = numpy.multiply.reduce(shape, dtype=numpy.int64)

    # Now read the actual data.
    if dtype.hasobject:
        # The array contained Python objects. We need to unpickle the data.
        if not allow_pickle:
            raise ValueError("Object arrays cannot be loaded when "
                             "allow_pickle=False")
        if pickle_kwargs is None:
            pickle_kwargs = {}
        try:
            array = pickle.load(fp, **pickle_kwargs)
        except UnicodeError as err:
            # Friendlier error message
            raise UnicodeError("Unpickling a python object failed: %r\n"
                               "You may need to pass the encoding= option "
                               "to numpy.load" % (err,)) from err
    else:
        if isfileobj(fp):
            # We can use the fast fromfile() function.
            array = numpy.fromfile(fp, dtype=dtype, count=count)
        else:
            # This is not a real file. We have to read it the
            # memory-intensive way.
            # crc32 module fails on reads greater than 2 ** 32 bytes,
            # breaking large reads from gzip streams. Chunk reads to
            # BUFFER_SIZE bytes to avoid issue and reduce memory overhead
            # of the read. In non-chunked case count < max_read_count, so
            # only one read is performed.

            # Use np.ndarray instead of np.empty since the latter does
            # not correctly instantiate zero-width string dtypes; see
            # https://github.com/numpy/numpy/pull/6430
            array = numpy.ndarray(count, dtype=dtype)

            if dtype.itemsize > 0:
                # If dtype.itemsize == 0 then there's nothing more to read
                max_read_count = BUFFER_SIZE // min(BUFFER_SIZE, dtype.itemsize)

                for i in range(0, count, max_read_count):
                    read_count = min(max_read_count, count - i)
                    read_size = int(read_count * dtype.itemsize)
                    data = _read_bytes(fp, read_size, "array data")
                    array[i:i+read_count] = numpy.frombuffer(data, dtype=dtype,
                                                             count=read_count)

        if fortran_order:
            array.shape = shape[::-1]
            array = array.transpose()
        else:
            array.shape = shape

    return array
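
# Sketch of a full in-memory round trip through write_array/read_array:
#
#     >>> buf = io.BytesIO()
#     >>> write_array(buf, numpy.arange(4).reshape(2, 2))
#     >>> _ = buf.seek(0)
#     >>> read_array(buf)
#     array([[0, 1],
#            [2, 3]])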


def open_memmap(filename, mode='r+', dtype=None, shape=None,
                fortran_order=False, version=None, *,
                max_header_size=_MAX_HEADER_SIZE):
    """
    Open a .npy file as a memory-mapped array.

    This may be used to read an existing file or create a new one.

    Parameters
    ----------
    filename : str or path-like
        The name of the file on disk. This may *not* be a file-like
        object.
    mode : str, optional
        The mode in which to open the file; the default is 'r+'. In
        addition to the standard file modes, 'c' is also accepted to mean
        "copy on write." See `memmap` for the available mode strings.
    dtype : data-type, optional
        The data type of the array if we are creating a new file in "write"
        mode; otherwise, `dtype` is ignored. The default value is None,
        which results in a data-type of `float64`.
    shape : tuple of int
        The shape of the array if we are creating a new file in "write"
        mode, in which case this parameter is required. Otherwise, this
        parameter is ignored and is thus optional.
    fortran_order : bool, optional
        Whether the array should be Fortran-contiguous (True) or
        C-contiguous (False, the default) if we are creating a new file in
        "write" mode.
    version : tuple of int (major, minor) or None
        If the mode is a "write" mode, then this is the version of the file
        format used to create the file. None means use the oldest
        supported version that is able to store the data. Default: None
    max_header_size : int, optional
        Maximum allowed size of the header. Large headers may not be safe
        to load securely and thus require explicitly passing a larger value.
        See :py:func:`ast.literal_eval()` for details.

    Returns
    -------
    marray : memmap
        The memory-mapped array.

    Raises
    ------
    ValueError
        If the data or the mode is invalid.
    OSError
        If the file is not found or cannot be opened correctly.

    See Also
    --------
    numpy.memmap

    """
    if isfileobj(filename):
        raise ValueError("Filename must be a string or a path-like object."
                         " Memmap cannot use existing file handles.")

    if 'w' in mode:
        # We are creating the file, not reading it.
        # Check if we ought to create the file.
        _check_version(version)
        # Ensure that the given dtype is an authentic dtype object rather
        # than just something that can be interpreted as a dtype object.
        dtype = numpy.dtype(dtype)
        if dtype.hasobject:
            msg = "Array can't be memory-mapped: Python objects in dtype."
            raise ValueError(msg)
        d = dict(
            descr=dtype_to_descr(dtype),
            fortran_order=fortran_order,
            shape=shape,
        )
        # If we got here, then it should be safe to create the file.
        with open(os.fspath(filename), mode+'b') as fp:
            _write_array_header(fp, d, version)
            offset = fp.tell()
    else:
        # Read the header of the file first.
        with open(os.fspath(filename), 'rb') as fp:
            version = read_magic(fp)
            _check_version(version)

            shape, fortran_order, dtype = _read_array_header(
                fp, version, max_header_size=max_header_size)
            if dtype.hasobject:
                msg = "Array can't be memory-mapped: Python objects in dtype."
                raise ValueError(msg)
            offset = fp.tell()

    if fortran_order:
        order = 'F'
    else:
        order = 'C'

    # We need to change a write-only mode to a read-write mode since we've
    # already written data to the file.
    if mode == 'w+':
        mode = 'r+'

    marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
                          mode=mode, offset=offset)

    return marray
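
# Usage sketch (the path is illustrative, not part of the module):
#
#     >>> import os.path
#     >>> import tempfile
#     >>> path = os.path.join(tempfile.mkdtemp(), 'example.npy')
#     >>> m = open_memmap(path, mode='w+', dtype='<f8', shape=(3,))
#     >>> m[:] = [1, 2, 3]
#     >>> m.flush()
#     >>> float(open_memmap(path, mode='r')[2])
#     3.0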


def _read_bytes(fp, size, error_template="ran out of data"):
    """
    Read from file-like object until size bytes are read.
    Raises ValueError if EOF is encountered before size bytes are read.
    Non-blocking objects only supported if they derive from io objects.

    Required as e.g. ZipExtFile in python 2.6 can return less data than
    requested.
    """
    data = bytes()
    while True:
        # io files (default in python3) return None or raise on
        # would-block, python2 file will truncate, probably nothing can be
        # done about that. note that regular files can't be non-blocking
        try:
            r = fp.read(size - len(data))
            data += r
            if len(r) == 0 or len(data) == size:
                break
        except BlockingIOError:
            pass
    if len(data) != size:
        msg = "EOF: reading %s, expected %d bytes got %d"
        raise ValueError(msg % (error_template, size, len(data)))
    else:
        return data
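
# Sketch: a complete read returns the bytes, a short read raises.
#
#     >>> _read_bytes(io.BytesIO(b'abc'), 3)
#     b'abc'
#     >>> _read_bytes(io.BytesIO(b'ab'), 3)
#     Traceback (most recent call last):
#         ...
#     ValueError: EOF: reading ran out of data, expected 3 bytes got 2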


def isfileobj(f):
    if not isinstance(f, (io.FileIO, io.BufferedReader, io.BufferedWriter)):
        return False
    try:
        # BufferedReader/Writer may raise OSError when
        # fetching `fileno()` (e.g. when wrapping BytesIO).
        f.fileno()
        return True
    except OSError:
        return False
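
# Sketch: in-memory streams are not "real" files, so callers fall back to
# the buffered read/write paths above.
#
#     >>> isfileobj(io.BytesIO())
#     False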