Coverage for /pythoncovmergedfiles/medio/medio/usr/local/lib/python3.11/site-packages/ftfy/bad_codecs/__init__.py: 86%

22 statements  

r"""
The `ftfy.bad_codecs` module gives Python the ability to decode some common,
flawed encodings.

Python does not want you to be sloppy with your text. Its encoders and decoders
("codecs") follow the relevant standards whenever possible, which means that
when you get text that *doesn't* follow those standards, you'll probably fail
to decode it. Or you might succeed at decoding it for implementation-specific
reasons, which is perhaps worse.

There are some encodings out there that Python wishes didn't exist, which are
widely used outside of Python:

- "utf-8-variants", a family of not-quite-UTF-8 encodings, including the
  ever-popular CESU-8 and "Java modified UTF-8".
- "Sloppy" versions of character map encodings, where bytes that don't map to
  anything will instead map to the Unicode character with the same number.

Simply importing this module, or in fact any part of the `ftfy` package, will
make these new "bad codecs" available to Python through the standard Codecs
API. You never have to actually call any functions inside `ftfy.bad_codecs`.

However, if you want to call something because your code checker insists on it,
you can call ``ftfy.bad_codecs.ok()``.

A quick example of decoding text that's encoded in CESU-8:

    >>> import ftfy.bad_codecs
    >>> print(b'\xed\xa0\xbd\xed\xb8\x8d'.decode('utf-8-variants'))
    😍
"""

import codecs
from encodings import normalize_encoding
from typing import Optional

_CACHE: dict[str, codecs.CodecInfo] = {}

# Define some aliases for 'utf-8-variants'. All hyphens get turned into
# underscores, because of `normalize_encoding`.
UTF8_VAR_NAMES = (
    "utf_8_variants",
    "utf8_variants",
    "utf_8_variant",
    "utf8_variant",
    "utf_8_var",
    "utf8_var",
    "cesu_8",
    "cesu8",
    "java_utf_8",
    "java_utf8",
)

def search_function(encoding: str) -> Optional[codecs.CodecInfo]:
    """
    Register our "bad codecs" with Python's codecs API. This involves adding
    a search function that takes in an encoding name, and returns a codec
    for that encoding if it knows one, or None if it doesn't.

    The encodings this will match are:

    - Encodings of the form 'sloppy-windows-NNNN' or 'sloppy-iso-8859-N',
      where the non-sloppy version is an encoding that leaves some bytes
      unmapped to characters.
    - The 'utf-8-variants' encoding, which has the several aliases seen
      above.
    """
    if encoding in _CACHE:
        return _CACHE[encoding]

    norm_encoding = normalize_encoding(encoding)
    codec = None
    if norm_encoding in UTF8_VAR_NAMES:
        from ftfy.bad_codecs.utf8_variants import CODEC_INFO

        codec = CODEC_INFO
    elif norm_encoding.startswith("sloppy_"):
        from ftfy.bad_codecs.sloppy import CODECS

        codec = CODECS.get(norm_encoding)

    if codec is not None:
        _CACHE[encoding] = codec

    return codec

def ok() -> None:
    """
    A feel-good function that gives you something to call after importing
    this package.

    Why is this here? Pyflakes. Pyflakes gets upset when you import a module
    and appear not to use it. It doesn't know that you're using it when
    you use the ``str.encode`` and ``bytes.decode`` methods with certain
    encodings.
    """

codecs.register(search_function)
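
The registration pattern used above can be tried without ftfy installed. The following is only a minimal sketch of the idea, not ftfy's actual implementation: the codec name `demo_sloppy_1252` and the helpers `demo_search`, `_decode`, and `_encode` are hypothetical names invented for this illustration. It assumes a "sloppy" table can be built by decoding each byte with Python's own `cp1252` codec and falling back to the code point with the same number where `cp1252` leaves the byte unmapped.

```python
import codecs
from encodings import normalize_encoding
from typing import Optional

# Hypothetical codec name, used only for this illustration (not part of ftfy).
DEMO_NAME = "demo_sloppy_1252"


def _build_decoding_table() -> str:
    """Build a 256-character table: cp1252 where defined, chr(byte) otherwise."""
    chars = []
    for byte in range(256):
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # cp1252 leaves this byte unmapped, so use the same-numbered code point.
            chars.append(chr(byte))
    return "".join(chars)


_DECODING_TABLE = _build_decoding_table()
_ENCODING_TABLE = codecs.charmap_build(_DECODING_TABLE)


def _decode(data, errors="strict"):
    # Returns a (text, bytes_consumed) pair, as the codecs API expects.
    return codecs.charmap_decode(data, errors, _DECODING_TABLE)


def _encode(text, errors="strict"):
    # Returns a (bytes, characters_consumed) pair.
    return codecs.charmap_encode(text, errors, _ENCODING_TABLE)


def demo_search(encoding: str) -> Optional[codecs.CodecInfo]:
    """Return a codec for the demo name, or None so other search functions can try."""
    if normalize_encoding(encoding) == DEMO_NAME:
        return codecs.CodecInfo(name=DEMO_NAME, encode=_encode, decode=_decode)
    return None


codecs.register(demo_search)
```

Once the search function is registered, `b"\x81".decode("demo-sloppy-1252")` resolves through `demo_search`, because the hyphenated spelling is normalized to underscores before the comparison. Returning `None` for every other name matters: it lets Python continue on to the remaining registered search functions.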