Coverage for /pythoncovmergedfiles/medio/medio/usr/local/lib/python3.11/site-packages/wcwidth/wcwidth.py: 77%

92 statements  

1""" 

2This is a python implementation of wcwidth() and wcswidth(). 

3 

4https://github.com/jquast/wcwidth 

5 

6from Markus Kuhn's C code, retrieved from: 

7 

8 http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c 

9 

10This is an implementation of wcwidth() and wcswidth() (defined in 

11IEEE Std 1003.1-2001) for Unicode. 

12 

13http://www.opengroup.org/onlinepubs/007904975/functions/wcwidth.html 

14http://www.opengroup.org/onlinepubs/007904975/functions/wcswidth.html 

15 

16In fixed-width output devices, Latin characters all occupy a single 

17"cell" position of equal width, whereas ideographic CJK characters 

18occupy two such cells. Interoperability between terminal-line 

19applications and (teletype-style) character terminals using the 

20UTF-8 encoding requires agreement on which character should advance 

21the cursor by how many cell positions. No established formal 

22standards exist at present on which Unicode character shall occupy 

23how many cell positions on character terminals. These routines are 

24a first attempt at defining such behavior based on simple rules 

25applied to data provided by the Unicode Consortium. 

26 

27For some graphical characters, the Unicode standard explicitly 

28defines a character-cell width via the definition of the East Asian 

29FullWidth (F), Wide (W), Half-width (H), and Narrow (Na) classes. 

30In all these cases, there is no ambiguity about which width a 

31terminal shall use. For characters in the East Asian Ambiguous (A) 

32class, the width choice depends purely on a preference of backward 

33compatibility with either historic CJK or Western practice. 

34Choosing single-width for these characters is easy to justify as 

35the appropriate long-term solution, as the CJK practice of 

36displaying these characters as double-width comes from historic 

37implementation simplicity (8-bit encoded characters were displayed 

38single-width and 16-bit ones double-width, even for Greek, 

39Cyrillic, etc.) and not any typographic considerations. 

40 

41Much less clear is the choice of width for the Not East Asian 

42(Neutral) class. Existing practice does not dictate a width for any 

43of these characters. It would nevertheless make sense 

44typographically to allocate two character cells to characters such 

45as for instance EM SPACE or VOLUME INTEGRAL, which cannot be 

46represented adequately with a single-width glyph. The following 

47routines at present merely assign a single-cell width to all 

48neutral characters, in the interest of simplicity. This is not 

49entirely satisfactory and should be reconsidered before 

50establishing a formal standard in this area. At the moment, the 

51decision which Not East Asian (Neutral) characters should be 

52represented by double-width glyphs cannot yet be answered by 

53applying a simple rule from the Unicode database content. Setting 

54up a proper standard for the behavior of UTF-8 character terminals 

55will require a careful analysis not only of each Unicode character, 

56but also of each presentation form, something the author of these 

57routines has so far avoided doing. 

58 

59http://www.unicode.org/unicode/reports/tr11/ 

60 

61Latest version: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c 

62""" 

63 

64# std imports 

65import os 

66import warnings 

67from functools import lru_cache 

68 

69# local 

70from .table_vs16 import VS16_NARROW_TO_WIDE 

71from .table_wide import WIDE_EASTASIAN 

72from .table_zero import ZERO_WIDTH 

73from .unicode_versions import list_versions 

74 

75 

76def _bisearch(ucs, table): 

77 """ 

78 Auxiliary function for binary search in interval table. 

79 

80 :arg int ucs: Ordinal value of unicode character. 

81 :arg list table: List of starting and ending ranges of ordinal values, 

82 in form of ``[(start, end), ...]``. 

83 :rtype: int 

84 :returns: 1 if ordinal value ucs is found within lookup table, else 0. 

85 """ 

86 lbound = 0 

87 ubound = len(table) - 1 

88 

89 if ucs < table[0][0] or ucs > table[ubound][1]: 

90 return 0 

91 while ubound >= lbound: 

92 mid = (lbound + ubound) // 2 

93 if ucs > table[mid][1]: 

94 lbound = mid + 1 

95 elif ucs < table[mid][0]: 

96 ubound = mid - 1 

97 else: 

98 return 1 

99 

100 return 0 
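
The interval lookup above can be exercised in isolation. The following is a minimal sketch, assuming a sorted table of non-overlapping, inclusive ``(start, end)`` ranges; ``bisearch`` and ``DEMO_TABLE`` are illustrative names and the table values are hand-picked for the example, not the library's generated data.

```python
def bisearch(ucs, table):
    """Return 1 if ``ucs`` lies inside any inclusive (start, end) range, else 0."""
    lbound = 0
    ubound = len(table) - 1

    # Fast rejection: below the first range or above the last one.
    if ucs < table[0][0] or ucs > table[ubound][1]:
        return 0
    while ubound >= lbound:
        mid = (lbound + ubound) // 2
        if ucs > table[mid][1]:
            lbound = mid + 1      # target is to the right of this range
        elif ucs < table[mid][0]:
            ubound = mid - 1      # target is to the left of this range
        else:
            return 1              # start <= ucs <= end
    return 0


# Hypothetical table for illustration only (sorted, non-overlapping).
DEMO_TABLE = [(0x0300, 0x036F), (0x200B, 0x200F), (0xFE00, 0xFE0F)]

inside = bisearch(0x0301, DEMO_TABLE)    # within the first range
in_gap = bisearch(0x1000, DEMO_TABLE)    # between two ranges
below = bisearch(0x0041, DEMO_TABLE)     # before the first range
on_edge = bisearch(0xFE0F, DEMO_TABLE)   # inclusive upper bound
```

Because the ranges are inclusive at both ends, a code point equal to a range's ``end`` value counts as a match.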

101 

102 

103@lru_cache(maxsize=1000) 

104def wcwidth(wc, unicode_version='auto'): 

105 r""" 

106 Given one Unicode character, return its printable length on a terminal. 

107 

108 :param str wc: A single Unicode character. 

109 :param str unicode_version: A Unicode version number, such as 

110 ``'6.0.0'``. A list of version levels supported by wcwidth 

111 is returned by :func:`list_versions`. 

112 

113 Any version string may be specified without error -- the nearest 

114 matching version is selected. When ``auto`` (default), the 

115 ``UNICODE_VERSION`` environment variable is used if set, otherwise the highest Unicode version level. 

116 :return: The width, in cells, necessary to display the Unicode 

117 character ``wc``. Returns 0 if the ``wc`` argument has 

118 no printable effect on a terminal (such as NUL '\0'), -1 if ``wc`` is 

119 not printable, or has an indeterminate effect on the terminal, such as 

120 a control character. Otherwise, the number of column positions the 

121 character occupies on a graphic terminal (1 or 2) is returned. 

122 :rtype: int 

123 

124 See :ref:`Specification` for details of cell measurement. 

125 """ 

126 ucs = ord(wc) if wc else 0 

127 

128 # small optimization: early return of 1 for printable ASCII, this provides 

129 # approximately 40% performance improvement for mostly-ascii documents, with 

130 # less than 1% impact to others. 

131 if 32 <= ucs < 0x7f: 

132 return 1 

133 

134 # C0/C1 control characters are -1 for compatibility with POSIX-like calls 

135 if ucs and ucs < 32 or 0x07F <= ucs < 0x0A0: 

136 return -1 

137 

138 _unicode_version = _wcmatch_version(unicode_version) 

139 

140 # Zero width 

141 if _bisearch(ucs, ZERO_WIDTH[_unicode_version]): 

142 return 0 

143 

144 # 1 or 2 width 

145 return 1 + _bisearch(ucs, WIDE_EASTASIAN[_unicode_version]) 
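
The order of checks in ``wcwidth()`` (printable ASCII, then control characters, then zero-width, then wide) can be illustrated without the generated tables. This is a rough sketch only: it swaps in the standard library's ``unicodedata`` module for the ``ZERO_WIDTH`` and ``WIDE_EASTASIAN`` lookups, so its results follow the Unicode version bundled with the interpreter and will differ from the library for some characters. ``sketch_wcwidth`` is a hypothetical name.

```python
import unicodedata


def sketch_wcwidth(ch):
    """Approximate cell width of one character: -1, 0, 1, or 2."""
    ucs = ord(ch) if ch else 0

    # Printable ASCII is always one cell.
    if 32 <= ucs < 0x7F:
        return 1

    # C0 and C1 control characters (and DEL) have no defined width.
    if (ucs and ucs < 32) or 0x7F <= ucs < 0xA0:
        return -1

    # Assumption: treat NUL, combining marks, and format characters as
    # zero width. The library uses a generated table instead.
    if ucs == 0 or unicodedata.category(ch) in ('Mn', 'Me', 'Cf'):
        return 0

    # East Asian Wide (W) and Fullwidth (F) occupy two cells.
    return 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1


ascii_width = sketch_wcwidth('a')
bell_width = sketch_wcwidth('\x07')
combining_width = sketch_wcwidth('\u0301')   # COMBINING ACUTE ACCENT
cjk_width = sketch_wcwidth('\u4e2d')         # CJK ideograph
nul_width = sketch_wcwidth('\x00')
```

The ASCII check comes first purely as a fast path; the control-character check must come before any table lookup so that those characters report ``-1`` rather than a width.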

146 

147 

148def wcswidth(pwcs, n=None, unicode_version='auto'): 

149 """ 

150 Given a unicode string, return its printable length on a terminal. 

151 

152 :param str pwcs: Measure width of given unicode string. 

153 :param int n: When ``n`` is None (default), return the length of the entire 

154 string, otherwise only the first ``n`` characters are measured. This 

155 argument exists only for compatibility with the C POSIX function 

156 signature. It is suggested instead to use python's string slicing 

157 capability, ``wcswidth(pwcs[:n])`` 

158 :param str unicode_version: An explicit definition of the unicode version 

159 level to use for determination, may be ``auto`` (default), which uses 

160 the Environment Variable, ``UNICODE_VERSION`` if defined, or the latest 

161 available unicode version, otherwise. 

162 :rtype: int 

163 :returns: The width, in cells, needed to display the first ``n`` characters 

164 of the unicode string ``pwcs``. Returns ``-1`` for C0 and C1 control 

165 characters! 

166 

167 See :ref:`Specification` for details of cell measurement. 

168 """ 

169 # this 'n' argument is a holdover for POSIX function 

170 _unicode_version = None 

171 end = len(pwcs) if n is None else n 

172 width = 0 

173 idx = 0 

174 last_measured_char = None 

175 while idx < end: 

176 char = pwcs[idx] 

177 if char == '\u200D': 

178 # Zero Width Joiner, do not measure this or next character 

179 idx += 2 

180 continue 

181 if char == '\uFE0F' and last_measured_char: 

182 # on variation selector 16 (VS16) following another character, 

183 # conditionally add '1' to the measured width if that character is 

184 # known to be converted from narrow to wide by the VS16 character. 

185 if _unicode_version is None: 

186 _unicode_version = _wcversion_value(_wcmatch_version(unicode_version)) 

187 if _unicode_version >= (9, 0, 0): 

188 width += _bisearch(ord(last_measured_char), VS16_NARROW_TO_WIDE["9.0.0"]) 

189 last_measured_char = None 

190 idx += 1 

191 continue 

192 # measure character at current index 

193 wcw = wcwidth(char, unicode_version) 

194 if wcw < 0: 

195 # early return -1 on C0 and C1 control characters 

196 return wcw 

197 if wcw > 0: 

198 # track last character measured to contain a cell, so that 

199 # subsequent VS-16 modifiers may be understood 

200 last_measured_char = char 

201 width += wcw 

202 idx += 1 

203 return width 
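
The string-level loop above layers three rules on top of per-character widths: skip a Zero Width Joiner together with the character after it, let VS16 widen a preceding narrow character, and return ``-1`` as soon as a control character is seen. The sketch below reproduces that control flow under stated assumptions: ``char_width`` is a ``unicodedata``-based stand-in for ``wcwidth()``, and ``DEMO_VS16_NARROW_TO_WIDE`` is a one-entry illustrative set, not the library's table.

```python
import unicodedata

# Hypothetical stand-in for VS16_NARROW_TO_WIDE (illustration only).
DEMO_VS16_NARROW_TO_WIDE = {0x2764}  # HEAVY BLACK HEART


def char_width(ch):
    """Stand-in for wcwidth(): approximate width via unicodedata."""
    ucs = ord(ch)
    if ucs == 0:
        return 0
    if ucs < 32 or 0x7F <= ucs < 0xA0:
        return -1
    if unicodedata.category(ch) in ('Mn', 'Me', 'Cf'):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1


def sketch_wcswidth(text):
    width = 0
    idx = 0
    last_measured = None
    while idx < len(text):
        ch = text[idx]
        if ch == '\u200d':
            # Zero Width Joiner: skip it and the character it joins.
            idx += 2
            continue
        if ch == '\ufe0f' and last_measured:
            # VS16 widens the previous character if it is listed.
            if ord(last_measured) in DEMO_VS16_NARROW_TO_WIDE:
                width += 1
            last_measured = None
            idx += 1
            continue
        w = char_width(ch)
        if w < 0:
            return w          # control character: whole string is -1
        if w > 0:
            last_measured = ch
        width += w
        idx += 1
    return width


plain = sketch_wcswidth('abc')
cjk = sketch_wcswidth('\u4e2d\u6587')
with_bell = sketch_wcswidth('a\x07b')
heart_vs16 = sketch_wcswidth('\u2764\ufe0f')
joined = sketch_wcswidth('a\u200db')
```

As the docstring suggests, measuring a prefix is more naturally written with slicing, for example ``sketch_wcswidth(text[:n])``, than with a separate ``n`` argument.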

204 

205 

206@lru_cache(maxsize=128) 

207def _wcversion_value(ver_string): 

208 """ 

209 Integer-mapped value of given dotted version string. 

210 

211 :param str ver_string: Unicode version string, of form ``n.n.n``. 

212 :rtype: tuple(int) 

213 :returns: tuple of integers, ``tuple(int, [...])``. 

214 """ 

215 retval = tuple(map(int, (ver_string.split('.')))) 

216 return retval 
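
Converting the dotted string to a tuple of integers is what makes version comparison work numerically rather than lexically. A small standalone sketch follows; ``version_value`` is an illustrative copy of the conversion, not an import from the package.

```python
def version_value(ver_string):
    """Map a dotted version string such as '9.0.0' to a tuple of ints."""
    return tuple(map(int, ver_string.split('.')))


# Tuples compare element by element, so 10.x sorts after 9.x,
# which plain string comparison would get wrong.
numeric_order = version_value('10.0.0') > version_value('9.0.0')
string_order = '10.0.0' > '9.0.0'

# A shorter version can be matched against a prefix of a longer one.
short = version_value('8.0')
prefix_match = short == version_value('8.0.0')[:len(short)]

# Non-numeric input raises ValueError from int().
try:
    version_value('latest')
    raised = False
except ValueError:
    raised = True
```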

217 

218 

219@lru_cache(maxsize=8) 

220def _wcmatch_version(given_version): 

221 """ 

222 Return nearest matching supported Unicode version level. 

223 

224 If an exact match is not determined, the nearest lowest version level is 

225 returned after a warning is emitted. For example, given supported levels 

226 ``4.1.0`` and ``5.0.0``, and a version string of ``4.9.9``, then ``4.1.0`` 

227 is selected and returned: 

228 

229 >>> _wcmatch_version('4.9.9') 

230 '4.1.0' 

231 >>> _wcmatch_version('8.0') 

232 '8.0.0' 

233 >>> _wcmatch_version('1') 

234 '4.1.0' 

235 

236 :param str given_version: given version for compare, may be ``auto`` 

237 (default), to select Unicode Version from Environment Variable, 

238 ``UNICODE_VERSION``. If the environment variable is not set, then the 

239 latest is used. 

240 :rtype: str 

241 :returns: the nearest matching supported Unicode version string. 

242 """ 

243 # Design note: the choice to return the same type that is given certainly 

244 # complicates it for python 2 str-type, but allows us to define an api that 

245 # uses 'string-type' for unicode version level definitions, so all of our 

246 # example code works with all versions of python. 

247 # 

248 # That, along with the string-to-numeric and comparisons of earliest, 

249 # latest, matching, or nearest, greatly complicates this function. 

250 # The performance cost is somewhat curbed by memoization. 

251 

252 unicode_versions = list_versions() 

253 latest_version = unicode_versions[-1] 

254 

255 if given_version == 'auto': 

256 given_version = os.environ.get( 

257 'UNICODE_VERSION', 

258 'latest') 

259 

260 if given_version == 'latest': 

261 # default match, when given as 'latest', use the latest unicode 

262 # version specification level supported. 

263 return latest_version 

264 

265 if given_version in unicode_versions: 

266 # exact match, downstream has specified an explicit matching version 

267 # matching any value of list_versions(). 

268 return given_version 

269 

270 # The user's version is not supported by ours. We return the newest unicode 

271 # version level that we support below their given value. 

272 try: 

273 cmp_given = _wcversion_value(given_version) 

274 

275 except ValueError: 

276 # submitted value raises ValueError in int(), warn and use latest. 

277 warnings.warn("UNICODE_VERSION value, {given_version!r}, is invalid. " 

278 "Value should be in form of `integer[.]+', the latest " 

279 "supported unicode version {latest_version!r} has been " 

280 "inferred.".format(given_version=given_version, 

281 latest_version=latest_version)) 

282 return latest_version 

283 

284 # given version is less than any available version, return earliest 

285 # version. 

286 earliest_version = unicode_versions[0] 

287 cmp_earliest_version = _wcversion_value(earliest_version) 

288 

289 if cmp_given <= cmp_earliest_version: 

290 # this probably isn't what you wanted, the oldest wcwidth.c you will 

291 # find in the wild is likely version 5 or 6, both of which we support, 

292 # but it's better than not saying anything at all. 

293 warnings.warn("UNICODE_VERSION value, {given_version!r}, is lower " 

294 "than any available unicode version. Returning lowest " 

295 "version level, {earliest_version!r}".format( 

296 given_version=given_version, 

297 earliest_version=earliest_version)) 

298 return earliest_version 

299 

300 # create list of versions which are less than or equal to given version, 

301 # and return the tail value, which is the highest level we may support, 

302 # or the latest value we support, when completely unmatched or higher 

303 # than any supported version. 

304 # 

305 # the loop never runs to completion; it always returns. 

306 for idx, unicode_version in enumerate(unicode_versions): 

307 # look ahead to next value 

308 try: 

309 cmp_next_version = _wcversion_value(unicode_versions[idx + 1]) 

310 except IndexError: 

311 # at end of list, return latest version 

312 return latest_version 

313 

314 # Maybe our given version has fewer parts, as in tuple(8, 0), than the 

315 # next compare version tuple(8, 0, 0). Test for an exact match by 

316 # comparison of only the leading dotted piece(s): (8, 0) == (8, 0). 

317 if cmp_given == cmp_next_version[:len(cmp_given)]: 

318 return unicode_versions[idx + 1] 

319 

320 # Or, if any next value is greater than our given support level 

321 # version, return the current value in index. Even though it must 

322 # be less than the given value, it's our closest possible match. That 

323 # is, 4.1 is returned for given 4.9.9, where 4.1 and 5.0 are available. 

324 if cmp_next_version > cmp_given: 

325 return unicode_version 

326 assert False, ("Code path unreachable", given_version, unicode_versions) # pragma: no cover