Coverage for /pythoncovmergedfiles/medio/medio/usr/local/lib/python3.8/site-packages/wcwidth/wcwidth.py: 76%

100 statements  

coverage.py v7.3.2, created at 2023-12-08 06:33 +0000

"""
This is a python implementation of wcwidth() and wcswidth().

https://github.com/jquast/wcwidth

from Markus Kuhn's C code, retrieved from:

    http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

This is an implementation of wcwidth() and wcswidth() (defined in
IEEE Std 1003.1-2001) for Unicode.

http://www.opengroup.org/onlinepubs/007904975/functions/wcwidth.html
http://www.opengroup.org/onlinepubs/007904975/functions/wcswidth.html

In fixed-width output devices, Latin characters all occupy a single
"cell" position of equal width, whereas ideographic CJK characters
occupy two such cells. Interoperability between terminal-line
applications and (teletype-style) character terminals using the
UTF-8 encoding requires agreement on which character should advance
the cursor by how many cell positions. No established formal
standards exist at present on which Unicode character shall occupy
how many cell positions on character terminals. These routines are
a first attempt at defining such behavior based on simple rules
applied to data provided by the Unicode Consortium.

For some graphical characters, the Unicode standard explicitly
defines a character-cell width via the definition of the East Asian
FullWidth (F), Wide (W), Half-width (H), and Narrow (Na) classes.
In all these cases, there is no ambiguity about which width a
terminal shall use. For characters in the East Asian Ambiguous (A)
class, the width choice depends purely on a preference of backward
compatibility with either historic CJK or Western practice.
Choosing single-width for these characters is easy to justify as
the appropriate long-term solution, as the CJK practice of
displaying these characters as double-width comes from historic
implementation simplicity (8-bit encoded characters were displayed
single-width and 16-bit ones double-width, even for Greek,
Cyrillic, etc.) and not any typographic considerations.

Much less clear is the choice of width for the Not East Asian
(Neutral) class. Existing practice does not dictate a width for any
of these characters. It would nevertheless make sense
typographically to allocate two character cells to characters such
as for instance EM SPACE or VOLUME INTEGRAL, which cannot be
represented adequately with a single-width glyph. The following
routines at present merely assign a single-cell width to all
neutral characters, in the interest of simplicity. This is not
entirely satisfactory and should be reconsidered before
establishing a formal standard in this area. At the moment, the
question of which Not East Asian (Neutral) characters should be
represented by double-width glyphs cannot yet be answered by
applying a simple rule from the Unicode database content. Setting
up a proper standard for the behavior of UTF-8 character terminals
will require a careful analysis not only of each Unicode character,
but also of each presentation form, something the author of these
routines has so far avoided.

http://www.unicode.org/unicode/reports/tr11/

Latest version: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
"""

from __future__ import division

# std imports
import os
import sys
import warnings

# local
from .table_vs16 import VS16_NARROW_TO_WIDE
from .table_wide import WIDE_EASTASIAN
from .table_zero import ZERO_WIDTH
from .unicode_versions import list_versions

try:
    # std imports
    from functools import lru_cache
except ImportError:
    # lru_cache was added in Python 3.2
    # 3rd party
    from backports.functools_lru_cache import lru_cache

# True on Python 3; used to decide whether version strings are returned as
# unicode or as python 2 byte-string ``str``.
_PY3 = sys.version_info[0] >= 3


def _bisearch(ucs, table):
    """
    Auxiliary function for binary search in interval table.

    :arg int ucs: Ordinal value of unicode character.
    :arg list table: List of starting and ending ranges of ordinal values,
        in form of ``[(start, end), ...]``.
    :rtype: int
    :returns: 1 if ordinal value ucs is found within lookup table, else 0.
    """
    lbound = 0
    ubound = len(table) - 1

    if ucs < table[0][0] or ucs > table[ubound][1]:
        return 0
    while ubound >= lbound:
        mid = (lbound + ubound) // 2
        if ucs > table[mid][1]:
            lbound = mid + 1
        elif ucs < table[mid][0]:
            ubound = mid - 1
        else:
            return 1

    return 0
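
# Illustration only, not part of the library: a minimal sketch of the interval
# lookup performed by ``_bisearch``. The names ``in_intervals`` and
# ``DEMO_TABLE`` are hypothetical, and the three ranges are a hand-picked
# sample; the real tables are far longer. The sketch assumes the table is
# sorted and non-overlapping, and unlike the function above it also tolerates
# an empty table.

```python
# Hypothetical miniature table of inclusive (start, end) code point ranges.
DEMO_TABLE = [
    (0x0300, 0x036F),  # Combining Diacritical Marks block
    (0x200B, 0x200F),  # zero width space .. right-to-left mark
    (0xFE00, 0xFE0F),  # Variation Selectors block
]


def in_intervals(ucs, table):
    """Return 1 if ``ucs`` falls inside any (start, end) range, else 0."""
    lo, hi = 0, len(table) - 1
    # Quick rejection: empty table, or value outside the overall span.
    if not table or ucs < table[0][0] or ucs > table[hi][1]:
        return 0
    while lo <= hi:
        mid = (lo + hi) // 2
        start, end = table[mid]
        if ucs > end:
            lo = mid + 1      # value lies to the right of this range
        elif ucs < start:
            hi = mid - 1      # value lies to the left of this range
        else:
            return 1          # start <= ucs <= end
    return 0                  # fell into a gap between two ranges


if __name__ == "__main__":
    for code_point in (0x0041, 0x0301, 0x2000, 0xFE0F):
        print(hex(code_point), in_intervals(code_point, DEMO_TABLE))
```

# Returning an int rather than a bool mirrors ``_bisearch``, whose result is
# added directly to a width elsewhere in this module.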


@lru_cache(maxsize=1000)
def wcwidth(wc, unicode_version='auto'):
    r"""
    Given one Unicode character, return its printable length on a terminal.

    :param str wc: A single Unicode character.
    :param str unicode_version: A Unicode version number, such as
        ``'6.0.0'``. A list of version levels supported by wcwidth
        is returned by :func:`list_versions`.

        Any version string may be specified without error -- the nearest
        matching version is selected.  When ``'auto'`` (default), the
        ``UNICODE_VERSION`` environment variable is used if defined,
        otherwise the highest Unicode version level is used, as it is
        for ``'latest'``.
    :return: The width, in cells, necessary to display the Unicode
        character ``wc``.  Returns 0 if the ``wc`` argument has
        no printable effect on a terminal (such as NUL '\0'), -1 if ``wc`` is
        not printable, or has an indeterminate effect on the terminal, such as
        a control character.  Otherwise, the number of column positions the
        character occupies on a graphic terminal (1 or 2) is returned.
    :rtype: int

    See :ref:`Specification` for details of cell measurement.
    """
    ucs = ord(wc) if wc else 0

    # small optimization: early return of 1 for printable ASCII, this provides
    # approximately 40% performance improvement for mostly-ascii documents, with
    # less than 1% impact to others.
    if 32 <= ucs < 0x7f:
        return 1

    # C0/C1 control characters are -1 for compatibility with POSIX-like calls
    if (ucs and ucs < 32) or 0x07F <= ucs < 0x0A0:
        return -1

    _unicode_version = _wcmatch_version(unicode_version)

    # Zero width
    if _bisearch(ucs, ZERO_WIDTH[_unicode_version]):
        return 0

    # 1 or 2 width
    return 1 + _bisearch(ucs, WIDE_EASTASIAN[_unicode_version])
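
# Illustration only, not part of the library: a minimal sketch of the order in
# which ``wcwidth`` classifies a character (printable ASCII, then control
# characters, then zero-width, then wide). ``demo_width``, ``_in_ranges`` and
# the two ``DEMO_*`` tables are hypothetical stand-ins; the real function
# looks up version-specific tables by binary search, and whether a given code
# point appears in them depends on the Unicode version selected.

```python
# Hypothetical sample tables of inclusive (start, end) code point ranges.
DEMO_ZERO_WIDTH = [
    (0x0000, 0x0000),  # NUL, assumed here to be listed as zero width
    (0x0300, 0x036F),  # Combining Diacritical Marks block
    (0x200B, 0x200F),  # zero width space .. right-to-left mark
]
DEMO_WIDE = [
    (0x1100, 0x115F),  # Hangul Jamo leading consonants
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs block
]


def _in_ranges(ucs, ranges):
    """Linear scan, adequate for a demo table of a few entries."""
    return any(start <= ucs <= end for start, end in ranges)


def demo_width(char):
    """Return -1, 0, 1 or 2 using the same decision order as above."""
    ucs = ord(char) if char else 0
    if 32 <= ucs < 0x7F:
        return 1                      # printable ASCII fast path
    if (ucs and ucs < 32) or 0x7F <= ucs < 0xA0:
        return -1                     # C0/C1 control characters and DEL
    if _in_ranges(ucs, DEMO_ZERO_WIDTH):
        return 0                      # combining marks and similar
    return 2 if _in_ranges(ucs, DEMO_WIDE) else 1


if __name__ == "__main__":
    for sample in (u'A', u'\n', u'\u0301', u'\u4e2d', u'\u00e9'):
        print(repr(sample), demo_width(sample))
```

# The order matters: the control-character test runs before any table lookup,
# so a control character yields -1 regardless of table content.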


def wcswidth(pwcs, n=None, unicode_version='auto'):
    """
    Given a unicode string, return its printable length on a terminal.

    :param str pwcs: Measure width of given unicode string.
    :param int n: When ``n`` is None (default), return the length of the
        entire string, otherwise the width of the first ``n`` characters
        specified.
    :param str unicode_version: An explicit definition of the unicode version
        level to use for determination, may be ``auto`` (default), which uses
        the Environment Variable, ``UNICODE_VERSION`` if defined, or the latest
        available unicode version, otherwise.
    :rtype: int
    :returns: The width, in cells, needed to display the first ``n`` characters
        of the unicode string ``pwcs``.  Returns ``-1`` if any C0 or C1 control
        character is encountered.

    See :ref:`Specification` for details of cell measurement.
    """
    # this 'n' argument is a holdover for POSIX function
    _unicode_version = None
    end = len(pwcs) if n is None else n
    width = 0
    idx = 0
    last_measured_char = None
    while idx < end:
        char = pwcs[idx]
        if char == u'\u200D':
            # Zero Width Joiner, do not measure this or next character
            idx += 2
            continue
        if char == u'\uFE0F' and last_measured_char:
            # on variation selector 16 (VS16) following another character,
            # conditionally add '1' to the measured width if that character is
            # known to be converted from narrow to wide by the VS16 character.
            if _unicode_version is None:
                _unicode_version = _wcversion_value(_wcmatch_version(unicode_version))
            if _unicode_version >= (9, 0, 0):
                width += _bisearch(ord(last_measured_char), VS16_NARROW_TO_WIDE["9.0.0"])
                last_measured_char = None
            idx += 1
            continue
        # measure character at current index
        wcw = wcwidth(char, unicode_version)
        if wcw < 0:
            # early return -1 on C0 and C1 control characters
            return wcw
        if wcw > 0:
            # track last character measured to contain a cell, so that
            # subsequent VS-16 modifiers may be understood
            last_measured_char = char
        width += wcw
        idx += 1
    return width
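
# Illustration only, not part of the library: a simplified sketch of how the
# loop above treats Zero Width Joiner and VS16 sequences. The names
# ``demo_string_width``, ``demo_char_width``, ``DEMO_WIDE`` and
# ``DEMO_VS16_WIDENS`` are hypothetical, the sets are tiny samples, and the
# Unicode version check, the ``n`` argument and the -1 early return are
# omitted.

```python
ZWJ = u'\u200D'    # Zero Width Joiner
VS16 = u'\uFE0F'   # Variation Selector-16

# Hypothetical stand-ins for the real wide and VS16 tables.
DEMO_WIDE = {0x1F467, 0x1F468, 0x1F469}   # girl, man, woman emoji
DEMO_VS16_WIDENS = {0x2764}               # heavy black heart


def demo_char_width(char):
    """Width of one character: 0 for VS16, 2 for sample wide, else 1."""
    if char == VS16:
        return 0
    return 2 if ord(char) in DEMO_WIDE else 1


def demo_string_width(text):
    width = 0
    idx = 0
    last_measured = None
    while idx < len(text):
        char = text[idx]
        if char == ZWJ:
            # skip the joiner and the character joined onto the sequence
            idx += 2
            continue
        if char == VS16 and last_measured:
            # widen the previous character by one cell when the table says so
            width += 1 if ord(last_measured) in DEMO_VS16_WIDENS else 0
            last_measured = None
            idx += 1
            continue
        char_width = demo_char_width(char)
        if char_width > 0:
            last_measured = char
        width += char_width
        idx += 1
    return width


if __name__ == "__main__":
    family = u'\U0001F468\u200D\U0001F469\u200D\U0001F467'
    print(demo_string_width(family))
    print(demo_string_width(u'\u2764\uFE0F'))
```

# In this sketch a ZWJ sequence is measured as the width of its first member
# only, because each joiner consumes the character that follows it.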


@lru_cache(maxsize=128)
def _wcversion_value(ver_string):
    """
    Integer-mapped value of given dotted version string.

    :param str ver_string: Unicode version string, of form ``n.n.n``.
    :rtype: tuple(int)
    :returns: tuple of integers, one per dotted part, ``tuple(int, [...])``.
    """
    retval = tuple(map(int, (ver_string.split('.'))))
    return retval


@lru_cache(maxsize=8)
def _wcmatch_version(given_version):
    """
    Return nearest matching supported Unicode version level.

    If an exact match is not determined, the nearest lower version level is
    returned.  For example, given supported levels ``4.1.0`` and ``5.0.0``,
    and a version string of ``4.9.9``, then ``4.1.0`` is selected and
    returned.  A warning is emitted when the given value cannot be parsed,
    or when it is lower than any supported version level:

    >>> _wcmatch_version('4.9.9')
    '4.1.0'
    >>> _wcmatch_version('8.0')
    '8.0.0'
    >>> _wcmatch_version('1')
    '4.1.0'

    :param str given_version: given version for compare, may be ``auto``
        (default), to select Unicode Version from Environment Variable,
        ``UNICODE_VERSION``. If the environment variable is not set, then the
        latest is used.
    :rtype: str
    :returns: unicode string, or non-unicode ``str`` type for python 2
        when given ``version`` is also type ``str``.
    """
    # Design note: the choice to return the same type that is given certainly
    # complicates it for python 2 str-type, but allows us to define an api that
    # uses 'string-type' for unicode version level definitions, so all of our
    # example code works with all versions of python.
    #
    # That, along with the string-to-numeric and comparisons of earliest,
    # latest, matching, or nearest, greatly complicates this function.
    # The performance cost is somewhat mitigated by memoization.
    _return_str = not _PY3 and isinstance(given_version, str)

    if _return_str:
        # avoid list-comprehension to work around a coverage issue:
        # https://github.com/nedbat/coveragepy/issues/753
        unicode_versions = list(map(lambda ucs: ucs.encode(), list_versions()))
    else:
        unicode_versions = list_versions()
    latest_version = unicode_versions[-1]

    if given_version in (u'auto', 'auto'):
        given_version = os.environ.get(
            'UNICODE_VERSION',
            'latest' if not _return_str else latest_version.encode())

    if given_version in (u'latest', 'latest'):
        # default match, when given as 'latest', use the latest unicode
        # version specification level supported.
        return latest_version if not _return_str else latest_version.encode()

    if given_version in unicode_versions:
        # exact match, downstream has specified an explicit matching version
        # matching any value of list_versions().
        return given_version if not _return_str else given_version.encode()

    # The user's version is not supported by ours. We return the newest unicode
    # version level that we support below their given value.
    try:
        cmp_given = _wcversion_value(given_version)

    except ValueError:
        # submitted value raises ValueError in int(), warn and use latest.
        warnings.warn("UNICODE_VERSION value, {given_version!r}, is invalid. "
                      "Value should be in form of `integer[.]+', the latest "
                      "supported unicode version {latest_version!r} has been "
                      "inferred.".format(given_version=given_version,
                                         latest_version=latest_version))
        return latest_version if not _return_str else latest_version.encode()

    # given version is less than any available version, return earliest
    # version.
    earliest_version = unicode_versions[0]
    cmp_earliest_version = _wcversion_value(earliest_version)

    if cmp_given <= cmp_earliest_version:
        # this probably isn't what you wanted, the oldest wcwidth.c you will
        # find in the wild is likely version 5 or 6, which we both support,
        # but it's better than not saying anything at all.
        warnings.warn("UNICODE_VERSION value, {given_version!r}, is lower "
                      "than any available unicode version. Returning lowest "
                      "version level, {earliest_version!r}".format(
                          given_version=given_version,
                          earliest_version=earliest_version))
        return earliest_version if not _return_str else earliest_version.encode()

    # create list of versions which are less than or equal to given version,
    # and return the tail value, which is the highest level we may support,
    # or the latest value we support, when completely unmatched or higher
    # than any supported version.
    #
    # function will never complete, always returns.
    for idx, unicode_version in enumerate(unicode_versions):
        # look ahead to next value
        try:
            cmp_next_version = _wcversion_value(unicode_versions[idx + 1])
        except IndexError:
            # at end of list, return latest version
            return latest_version if not _return_str else latest_version.encode()

        # Maybe our given version has fewer parts, as in tuple(8, 0), than the
        # next compare version tuple(8, 0, 0). Test for an exact match by
        # comparison of only the leading dotted piece(s): (8, 0) == (8, 0).
        if cmp_given == cmp_next_version[:len(cmp_given)]:
            return unicode_versions[idx + 1]

        # Or, if any next value is greater than our given support level
        # version, return the current value in index.  Even though it must
        # be less than the given value, it's our closest possible match. That
        # is, 4.1 is returned for given 4.9.9, where 4.1 and 5.0 are available.
        if cmp_next_version > cmp_given:
            return unicode_version
    assert False, ("Code path unreachable", given_version, unicode_versions)  # pragma: no cover
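
# Illustration only, not part of the library: a simplified sketch of the
# nearest-version matching above. ``nearest_version``, ``version_key`` and
# ``DEMO_VERSIONS`` are hypothetical names, and the version list is an assumed
# subset rather than the output of ``list_versions()``. The sketch leaves out
# the ``auto`` environment lookup, the python 2 str handling, the warnings and
# the memoization.

```python
# Assumed, abbreviated list of supported versions in ascending order.
DEMO_VERSIONS = ['4.1.0', '5.0.0', '8.0.0', '9.0.0']


def version_key(ver_string):
    """Map '8.0.0' to (8, 0, 0) so versions compare numerically."""
    return tuple(int(part) for part in ver_string.split('.'))


def nearest_version(given, versions=DEMO_VERSIONS):
    if given == 'latest':
        return versions[-1]
    if given in versions:
        return given                      # exact match
    try:
        want = version_key(given)
    except ValueError:
        return versions[-1]               # unparsable: fall back to latest
    if want <= version_key(versions[0]):
        return versions[0]                # older than anything supported
    for current, following in zip(versions, versions[1:]):
        next_key = version_key(following)
        if want == next_key[:len(want)]:
            return following              # '8.0' matches '8.0.0'
        if next_key > want:
            return current                # nearest lower version
    return versions[-1]                   # newer than anything supported


if __name__ == "__main__":
    for request in ('4.9.9', '8.0', '1', '99', 'bogus'):
        print(request, '->', nearest_version(request))
```

# Comparing integer tuples rather than strings avoids the ordering problem
# where the string '10.0.0' sorts before '9.0.0'.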