Coverage for /pythoncovmergedfiles/medio/medio/usr/local/lib/python3.11/site-packages/docutils/utils/smartquotes.py: 17%

244 statements  

1#! /usr/bin/env python3 

2# :Id: $Id$ 

3# :Copyright: © 2010-2023 Günter Milde, 

4# original `SmartyPants`_: © 2003 John Gruber 

5# smartypants.py: © 2004, 2007 Chad Miller 

6# :Maintainer: docutils-develop@lists.sourceforge.net 

7# :License: Released under the terms of the `2-Clause BSD license`_, in short: 

8# 

9# Copying and distribution of this file, with or without modification, 

10# are permitted in any medium without royalty provided the copyright 

11# notices and this notice are preserved. 

12# This file is offered as-is, without any warranty. 

13# 

14# .. _2-Clause BSD license: https://opensource.org/licenses/BSD-2-Clause 

15 

16 

17r""" 

18========================= 

19Smart Quotes for Docutils 

20========================= 

21 

22Synopsis 

23======== 

24 

25"SmartyPants" is a free web publishing plug-in for Movable Type, Blosxom, and 

26BBEdit that easily translates plain ASCII punctuation characters into "smart" 

27typographic punctuation characters. 

28 

29``smartquotes.py`` is an adaption of "SmartyPants" to Docutils_. 

30 

31* Using Unicode instead of HTML entities for typographic punctuation 

32 characters, it works for any output format that supports Unicode. 

33* Supports `language specific quote characters`__. 

34 

35__ https://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks 

36 

37 

38Authors 

39======= 

40 

41`John Gruber`_ did all of the hard work of writing this software in Perl for 

42`Movable Type`_ and almost all of this useful documentation. `Chad Miller`_ 

43ported it to Python to use with Pyblosxom_. 

44Adapted to Docutils_ by Günter Milde. 

45 

46Additional Credits 

47================== 

48 

49Portions of the SmartyPants original work are based on Brad Choate's nifty 

50MTRegex plug-in. `Brad Choate`_ also contributed a few bits of source code to 

51this plug-in. Brad Choate is a fine hacker indeed. 

52 

53`Jeremy Hedley`_ and `Charles Wiltgen`_ deserve mention for exemplary beta 

54testing of the original SmartyPants. 

55 

56`Rael Dornfest`_ ported SmartyPants to Blosxom. 

57 

58.. _Brad Choate: http://bradchoate.com/ 

59.. _Jeremy Hedley: http://antipixel.com/ 

60.. _Charles Wiltgen: http://playbacktime.com/ 

61.. _Rael Dornfest: http://raelity.org/ 

62 

63 

64Copyright and License 

65===================== 

66 

67SmartyPants_ license (3-Clause BSD license): 

68 

69 Copyright (c) 2003 John Gruber (http://daringfireball.net/) 

70 All rights reserved. 

71 

72 Redistribution and use in source and binary forms, with or without 

73 modification, are permitted provided that the following conditions are 

74 met: 

75 

76 * Redistributions of source code must retain the above copyright 

77 notice, this list of conditions and the following disclaimer. 

78 

79 * Redistributions in binary form must reproduce the above copyright 

80 notice, this list of conditions and the following disclaimer in 

81 the documentation and/or other materials provided with the 

82 distribution. 

83 

84 * Neither the name "SmartyPants" nor the names of its contributors 

85 may be used to endorse or promote products derived from this 

86 software without specific prior written permission. 

87 

88 This software is provided by the copyright holders and contributors 

89 "as is" and any express or implied warranties, including, but not 

90 limited to, the implied warranties of merchantability and fitness for 

91 a particular purpose are disclaimed. In no event shall the copyright 

92 owner or contributors be liable for any direct, indirect, incidental, 

93 special, exemplary, or consequential damages (including, but not 

94 limited to, procurement of substitute goods or services; loss of use, 

95 data, or profits; or business interruption) however caused and on any 

96 theory of liability, whether in contract, strict liability, or tort 

97 (including negligence or otherwise) arising in any way out of the use 

98 of this software, even if advised of the possibility of such damage. 

99 

100smartypants.py license (2-Clause BSD license): 

101 

102 smartypants.py is a derivative work of SmartyPants. 

103 

104 Redistribution and use in source and binary forms, with or without 

105 modification, are permitted provided that the following conditions are 

106 met: 

107 

108 * Redistributions of source code must retain the above copyright 

109 notice, this list of conditions and the following disclaimer. 

110 

111 * Redistributions in binary form must reproduce the above copyright 

112 notice, this list of conditions and the following disclaimer in 

113 the documentation and/or other materials provided with the 

114 distribution. 

115 

116 This software is provided by the copyright holders and contributors 

117 "as is" and any express or implied warranties, including, but not 

118 limited to, the implied warranties of merchantability and fitness for 

119 a particular purpose are disclaimed. In no event shall the copyright 

120 owner or contributors be liable for any direct, indirect, incidental, 

121 special, exemplary, or consequential damages (including, but not 

122 limited to, procurement of substitute goods or services; loss of use, 

123 data, or profits; or business interruption) however caused and on any 

124 theory of liability, whether in contract, strict liability, or tort 

125 (including negligence or otherwise) arising in any way out of the use 

126 of this software, even if advised of the possibility of such damage. 

127 

128.. _John Gruber: http://daringfireball.net/ 

129.. _Chad Miller: http://web.chad.org/ 

130 

131.. _Pyblosxom: http://pyblosxom.bluesock.org/ 

132.. _SmartyPants: http://daringfireball.net/projects/smartypants/ 

133.. _Movable Type: http://www.movabletype.org/ 

134.. _2-Clause BSD license: https://opensource.org/licenses/BSD-2-Clause 

135.. _Docutils: https://docutils.sourceforge.io/ 

136 

137Description 

138=========== 

139 

140SmartyPants can perform the following transformations: 

141 

142- Straight quotes ( " and ' ) into "curly" quote characters 

143- Backticks-style quotes (\`\`like this'') into "curly" quote characters 

144- Dashes (``--`` and ``---``) into en- and em-dash entities 

145- Three consecutive dots (``...`` or ``. . .``) into an ellipsis ``…``. 

146 

147This means you can write, edit, and save your posts using plain old 

148ASCII straight quotes, plain dashes, and plain dots, but your published 

149posts (and final HTML output) will appear with smart quotes, em-dashes, 

150and proper ellipses. 

151 

152Backslash Escapes 

153================= 

154 

155If you need to use literal straight quotes (or plain hyphens and periods), 

156`smartquotes` accepts the following backslash escape sequences to force 

157ASCII-punctuation. Note that you need two backslashes in "docstrings", as

158Python expands them, too. 

159 

160======== ========= 

161Escape Character 

162======== ========= 

163``\\`` \\ 

164``\\"`` \\" 

165``\\'`` \\' 

166``\\.`` \\. 

167``\\-`` \\- 

168``\\``` \\` 

169======== ========= 

170 

171This is useful, for example, when you want to use straight quotes as 

172foot and inch marks: 6\\'2\\" tall; a 17\\" iMac. 

173 

174 

175Caveats 

176======= 

177 

178Why You Might Not Want to Use Smart Quotes in Your Weblog 

179--------------------------------------------------------- 

180 

181For one thing, you might not care. 

182 

183Most normal, mentally stable individuals do not take notice of proper 

184typographic punctuation. Many design and typography nerds, however, break 

185out in a nasty rash when they encounter, say, a restaurant sign that uses 

186a straight apostrophe to spell "Joe's". 

187 

188If you're the sort of person who just doesn't care, you might well want to 

189continue not caring. Using straight quotes -- and sticking to the 7-bit 

190ASCII character set in general -- is certainly a simpler way to live. 

191 

192Even if you *do* care about accurate typography, you still might want to 

193think twice before educating the quote characters in your weblog. One side 

194effect of publishing curly quote characters is that it makes your 

195weblog a bit harder for others to quote from using copy-and-paste. What 

196happens is that when someone copies text from your blog, the copied text 

197contains the 8-bit curly quote characters (as well as the 8-bit characters 

198for em-dashes and ellipses, if you use these options). These characters 

199are not standard across different text encoding methods, which is why they 

200need to be encoded as characters. 

201 

202People copying text from your weblog, however, may not notice that you're 

203using curly quotes, and they'll go ahead and paste the unencoded 8-bit 

204characters copied from their browser into an email message or their own 

205weblog. When pasted as raw "smart quotes", these characters are likely to 

206get mangled beyond recognition. 

207 

208That said, my own opinion is that any decent text editor or email client 

209makes it easy to stupefy smart quote characters into their 7-bit 

210equivalents, and I don't consider it my problem if you're using an 

211indecent text editor or email client. 

212 

213 

214Algorithmic Shortcomings 

215------------------------ 

216 

217One situation in which quotes will get curled the wrong way is when 

218apostrophes are used at the start of leading contractions. For example:: 

219 

220 'Twas the night before Christmas. 

221 

222In the case above, SmartyPants will turn the apostrophe into an opening 

223secondary quote, when in fact it should be the `RIGHT SINGLE QUOTATION MARK` 

224character which is also "the preferred character to use for apostrophe" 

225(Unicode). I don't think this problem can be solved in the general case -- 

226every word processor I've tried gets this wrong as well. In such cases, it's 

227best to insert the `RIGHT SINGLE QUOTATION MARK` (’) by hand.

228 

229In English, the same character is used for apostrophe and closing secondary 

230quote (both plain and "smart" ones). For other locales (French, Italian,

231Swiss, ...) "smart" secondary closing quotes differ from the curly apostrophe. 

232 

233 .. class:: language-fr 

234 

235 Il dit : "C'est 'super' !" 

236 

237If the apostrophe is used at the end of a word, it cannot be distinguished 

238from a secondary quote by the algorithm. Therefore, a text like:: 

239 

240 .. class:: language-de-CH 

241 

242 "Er sagt: 'Ich fass' es nicht.'" 

243 

244will get a single closing guillemet instead of an apostrophe. 

245 

246This can be prevented by use of the `RIGHT SINGLE QUOTATION MARK` in

247the source:: 

248 

249 - "Er sagt: 'Ich fass' es nicht.'" 

250 + "Er sagt: 'Ich fass’ es nicht.'" 

251 

252 

253Version History 

254=============== 

255 

2561.10 2023-11-18 

257 - Pre-compile regexps once, not with every call of `educateQuotes()` 

258 (patch #206 by Chris Sewell). Simplify regexps. 

259 

2601.9 2022-03-04 

261 - Code cleanup. Require Python 3. 

262 

2631.8.1 2017-10-25 

264 - Use open quote after Unicode whitespace, ZWSP, and ZWNJ. 

265 - Code cleanup. 

266 

2671.8: 2017-04-24 

268 - Command line front-end. 

269 

2701.7.1: 2017-03-19 

271 - Update and extend language-dependent quotes. 

272 - Differentiate apostrophe from single quote. 

273 

2741.7: 2012-11-19 

275 - Internationalization: language-dependent quotes. 

276 

2771.6.1: 2012-11-06 

278 - Refactor code, code cleanup, 

279 - `educate_tokens()` generator as interface for Docutils. 

280 

2811.6: 2010-08-26 

282 - Adaption to Docutils: 

283 - Use Unicode instead of HTML entities, 

284 - Remove code special to pyblosxom. 

285 

2861.5_1.6: Fri, 27 Jul 2007 07:06:40 -0400 

287 - Fixed bug where blocks of precious unalterable text were instead

288 interpreted. Thanks to Le Roux and Dirk van Oosterbosch. 

289 

2901.5_1.5: Sat, 13 Aug 2005 15:50:24 -0400 

291 - Fix bogus magical quotation when there is no hint that the 

292 user wants it, e.g., in "21st century". Thanks to Nathan Hamblen. 

293 - Be smarter about quotes before terminating numbers in an en-dash'ed 

294 range. 

295 

2961.5_1.4: Thu, 10 Feb 2005 20:24:36 -0500 

297 - Fix a date-processing bug, as reported by jacob childress. 

298 - Begin a test-suite for ensuring correct output. 

299 - Removed import of "string", since I didn't really need it. 

300 (This was my first ever Python program. Sue me!)

301 

3021.5_1.3: Wed, 15 Sep 2004 18:25:58 -0400 

303 - Abort processing if the flavour is in forbidden-list. Default of 

304 [ "rss" ] (Idea of Wolfgang SCHNERRING.) 

305 - Remove stray virgules from en-dashes. Patch by Wolfgang SCHNERRING. 

306 

3071.5_1.2: Mon, 24 May 2004 08:14:54 -0400 

308 - Some single quotes weren't replaced properly. Diff-tesuji played 

309 by Benjamin GEIGER. 

310 

3111.5_1.1: Sun, 14 Mar 2004 14:38:28 -0500 

312 - Support upcoming pyblosxom 0.9 plugin verification feature. 

313 

3141.5_1.0: Tue, 09 Mar 2004 08:08:35 -0500 

315 - Initial release 

316""" 

317 

318from __future__ import annotations 

319 

320import re 

321import sys 

322 

323 

324options = r""" 

325Options 

326======= 

327 

328Numeric values are the easiest way to configure SmartyPants' behavior: 

329 

330:0: Suppress all transformations. (Do nothing.) 

331 

332:1: Performs default SmartyPants transformations: quotes (including 

333 \`\`backticks'' -style), em-dashes, and ellipses. "``--``" (dash dash) 

334 is used to signify an em-dash; there is no support for en-dashes.

335 

336:2: Same as smarty_pants="1", except that it uses the old-school typewriter 

337 shorthand for dashes: "``--``" (dash dash) for en-dashes, "``---``" 

338 (dash dash dash) 

339 for em-dashes. 

340 

341:3: Same as smarty_pants="2", but inverts the shorthand for dashes: 

342 "``--``" (dash dash) for em-dashes, and "``---``" (dash dash dash) for 

343 en-dashes. 

344 

345:-1: Stupefy mode. Reverses the SmartyPants transformation process, turning 

346 the characters produced by SmartyPants into their ASCII equivalents. 

347 E.g. the LEFT DOUBLE QUOTATION MARK (“) is turned into a simple 

348 double-quote (\"), "—" is turned into two dashes, etc. 

349 

350 

351The following single-character attribute values can be combined to toggle 

352individual transformations from within the smarty_pants attribute. For 

353example, ``"1"`` is equivalent to ``"qBde"``. 

354 

355:q: Educates normal quote characters: (") and ('). 

356 

357:b: Educates \`\`backticks'' -style double quotes. 

358 

359:B: Educates \`\`backticks'' -style double quotes and \`single' quotes. 

360 

361:d: Educates em-dashes. 

362 

363:D: Educates em-dashes and en-dashes, using old-school typewriter 

364 shorthand: (dash dash) for en-dashes, (dash dash dash) for em-dashes. 

365 

366:i: Educates em-dashes and en-dashes, using inverted old-school typewriter 

367 shorthand: (dash dash) for em-dashes, (dash dash dash) for en-dashes. 

368 

369:e: Educates ellipses. 

370 

371:w: Translates any instance of ``&quot;`` into a normal double-quote

372 character. This should be of no interest to most people, but 

373 of particular interest to anyone who writes their posts using 

374 Dreamweaver, as Dreamweaver inexplicably uses this entity to represent 

375 a literal double-quote character. SmartyPants only educates normal 

376 quotes, not entities (because ordinarily, entities are used for 

377 the explicit purpose of representing the specific character they 

378 represent). The "w" option must be used in conjunction with one (or 

379 both) of the other quote options ("q" or "b"). Thus, if you wish to 

380 apply all SmartyPants transformations (quotes, en- and em-dashes, and 

381 ellipses) and also translate ``&quot;`` entities into regular quotes

382 so SmartyPants can educate them, you should pass the following to the 

383 smarty_pants attribute: 

384""" 

385 

386 

387class smartchars: 

388 """Smart quotes and dashes""" 

389 

390 endash = '–' # EN DASH 

391 emdash = '—' # EM DASH 

392 ellipsis = '…' # HORIZONTAL ELLIPSIS 

393 apostrophe = '’' # RIGHT SINGLE QUOTATION MARK 

394 

395 # quote characters (language-specific, set in __init__()) 

396 # https://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks 

397 # https://de.wikipedia.org/wiki/Anf%C3%BChrungszeichen#Andere_Sprachen 

398 # https://fr.wikipedia.org/wiki/Guillemet 

399 # https://typographisme.net/post/Les-espaces-typographiques-et-le-web 

400 # https://www.btb.termiumplus.gc.ca/tpv2guides/guides/redac/index-fra.html 

401 # https://en.wikipedia.org/wiki/Hebrew_punctuation#Quotation_marks 

402 # [7] https://www.tustep.uni-tuebingen.de/bi/bi00/bi001t1-anfuehrung.pdf 

403 # [8] https://www.korrekturavdelingen.no/anforselstegn.htm 

404 # [9] Typografisk håndbok. Oslo: Spartacus. 2000. s. 67. ISBN 8243001530. 

405 # [10] https://www.typografi.org/sitat/sitatart.html 

406 # [11] https://mk.wikipedia.org/wiki/Правопис_и_правоговор_на_македонскиот_јазик # noqa:E501 

407 # [12] https://hrvatska-tipografija.com/polunavodnici/ 

408 # [13] https://pl.wikipedia.org/wiki/Cudzys%C5%82%C3%B3w 

409 # 

410 # See also configuration option "smartquote-locales". 

411 quotes = { 

412 'af': '“”‘’', 

413 'af-x-altquot': '„”‚’', 

414 'bg': '„“‚‘', # https://bg.wikipedia.org/wiki/Кавички 

415 'ca': '«»“”', 

416 'ca-x-altquot': '“”‘’', 

417 'cs': '„“‚‘', 

418 'cs-x-altquot': '»«›‹', 

419 'da': '»«›‹', 

420 'da-x-altquot': '„“‚‘', 

421 # 'da-x-altquot2': '””’’', 

422 'de': '„“‚‘', 

423 'de-x-altquot': '»«›‹', 

424 'de-ch': '«»‹›', 

425 'el': '«»“”', # '«»‟”' https://hal.science/hal-02101618 

426 'en': '“”‘’', 

427 'en-uk-x-altquot': '‘’“”', # Attention: " → ‘ and ' → “ ! 

428 'eo': '“”‘’', 

429 'es': '«»“”', 

430 'es-x-altquot': '“”‘’', 

431 'et': '„“‚‘', # no secondary quote listed in 

432 'et-x-altquot': '«»‹›', # the sources above (wikipedia.org) 

433 'eu': '«»‹›', 

434 'fi': '””’’', 

435 'fi-x-altquot': '»»››', 

436 'fr': ('« ', ' »', '“', '”'), # full no-break space 

437 'fr-x-altquot': ('« ', ' »', '“', '”'), # narrow no-break space 

438 'fr-ch': '«»‹›', # https://typoguide.ch/ 

439 'fr-ch-x-altquot': ('« ', ' »', '‹ ', ' ›'), # narrow no-break space # noqa:E501 

440 'gl': '«»“”', 

441 'he': '”“»«', # Hebrew is RTL, test position: 

442 'he-x-altquot': '„”‚’', # low quotation marks are opening. 

443 # 'he-x-altquot': '“„‘‚', # RTL: low quotation marks opening 

444 'hr': '„”‘’', # Croatian [12] 

445 'hr-x-altquot': '»«›‹', 

446 'hsb': '„“‚‘', 

447 'hsb-x-altquot': '»«›‹', 

448 'hu': '„”«»', 

449 'is': '„“‚‘', 

450 'it': '«»“”', 

451 'it-ch': '«»‹›', 

452 'it-x-altquot': '“”‘’', 

453 # 'it-x-altquot2': '“„‘‚', # [7] in headlines 

454 'ja': '「」『』', 

455 'ko': '“”‘’', 

456 'lt': '„“‚‘', 

457 'lv': '„“‚‘', 

458 'mk': '„“‚‘', # Macedonian [11] 

459 'nl': '“”‘’', 

460 'nl-x-altquot': '„”‚’', 

461 # 'nl-x-altquot2': '””’’', 

462 'nb': '«»’’', # Norsk bokmål (canonical form 'no') 

463 'nn': '«»’’', # Nynorsk [10] 

464 'nn-x-altquot': '«»‘’', # [8], [10] 

465 # 'nn-x-altquot2': '«»«»', # [9], [10] 

466 # 'nn-x-altquot3': '„“‚‘', # [10] 

467 'no': '«»’’', # Norsk bokmål [10] 

468 'no-x-altquot': '«»‘’', # [8], [10] 

469 # 'no-x-altquot2': '«»«»', # [9], [10]

470 # 'no-x-altquot3': '„“‚‘', # [10] 

471 'pl': '„”«»', 

472 'pl-x-altquot': '«»‚’', 

473 # 'pl-x-altquot2': '„”‚’', # [13] 

474 'pt': '«»“”', 

475 'pt-br': '“”‘’', 

476 'ro': '„”«»', 

477 'ru': '«»„“', 

478 'sh': '„”‚’', # Serbo-Croatian 

479 'sh-x-altquot': '»«›‹', 

480 'sk': '„“‚‘', # Slovak 

481 'sk-x-altquot': '»«›‹', 

482 'sl': '„“‚‘', # Slovenian 

483 'sl-x-altquot': '»«›‹', 

484 'sq': '«»‹›', # Albanian 

485 'sq-x-altquot': '“„‘‚', 

486 'sr': '„”’’', 

487 'sr-x-altquot': '»«›‹', 

488 'sv': '””’’', 

489 'sv-x-altquot': '»»››', 

490 'tr': '“”‘’', 

491 'tr-x-altquot': '«»‹›', 

492 # 'tr-x-altquot2': '“„‘‚', # [7] antiquated? 

493 'uk': '«»„“', 

494 'uk-x-altquot': '„“‚‘', 

495 'zh-cn': '“”‘’', 

496 'zh-tw': '「」『』', 

497 } 

498 

499 def __init__(self, language='en') -> None: 

500 self.language = language 

501 try: 

502 (self.opquote, self.cpquote, 

503 self.osquote, self.csquote) = self.quotes[language.lower()] 

504 except KeyError: 

505 self.opquote, self.cpquote, self.osquote, self.csquote = '""\'\'' 
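The lookup in `smartchars.__init__()` can be tried out in isolation. The following is a minimal sketch, not the Docutils code itself: `QUOTES` and `lookup_quotes` are hypothetical names, the table is a three-entry subset, and the no-break spaces in the French entry are an assumption based on the inline comment in the listing.

```python
# Illustrative subset of the quote table; the names QUOTES and
# lookup_quotes are hypothetical and not part of Docutils.
QUOTES = {
    'en': '\u201c\u201d\u2018\u2019',   # “ ” ‘ ’
    'de': '\u201e\u201c\u201a\u2018',   # „ “ ‚ ‘
    # Assumption: the French quotes carry NO-BREAK SPACE (U+00A0) padding.
    'fr': ('\u00ab\u00a0', '\u00a0\u00bb', '\u201c', '\u201d'),
}


def lookup_quotes(language='en'):
    """Return (open primary, close primary, open secondary, close secondary)."""
    try:
        # A 4-character string unpacks into four names just like a 4-tuple.
        opq, cpq, osq, csq = QUOTES[language.lower()]
    except KeyError:
        # Unknown language tag: fall back to plain ASCII quotes.
        opq, cpq, osq, csq = '""\'\''
    return opq, cpq, osq, csq


print(lookup_quotes('DE'))
print(lookup_quotes('xx'))
```

Storing most entries as four-character strings keeps the table compact; a tuple is only needed where a "quote" is more than one character long.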

506 

507 

508class RegularExpressions: 

509 # character classes: 

510 _CH_CLASSES = {'open': '[([{]', # opening braces 

511 'close': r'[^\s]', # everything except whitespace 

512 'punct': r"""[-!" #\$\%'()*+,.\/:;<=>?\@\[\\\]\^_`{|}~]""", 

513 'dash': r'[-–—]', 

514 'sep': '[\\s\u200B\u200C]', # Whitespace, ZWSP, ZWNJ 

515 } 

516 START_SINGLE = re.compile(r"^'(?=%s\\B)" % _CH_CLASSES['punct']) 

517 START_DOUBLE = re.compile(r'^"(?=%s\\B)' % _CH_CLASSES['punct']) 

518 ADJACENT_1 = re.compile('"\'(?=\\w)') 

519 ADJACENT_2 = re.compile('\'"(?=\\w)') 

520 OPEN_SINGLE = re.compile(r"(%(open)s|%(dash)s)'(?=%(punct)s? )" 

521 % _CH_CLASSES) 

522 OPEN_DOUBLE = re.compile(r'(%(open)s|%(dash)s)"(?=%(punct)s? )' 

523 % _CH_CLASSES) 

524 DECADE = re.compile(r"'(?=\d{2}s)") 

525 APOSTROPHE = re.compile(r"(?<=(\w|\d))'(?=\w)") 

526 OPENING_SECONDARY = re.compile(""" 

527 (# ?<= # look behind fails: requires fixed-width pattern 

528 %(sep)s | # a whitespace char, or 

529 %(open)s | # opening brace, or 

530 %(dash)s # em/en-dash 

531 ) 

532 ' # the quote 

533 (?=\\w|%(punct)s) # word character or punctuation 

534 """ % _CH_CLASSES, re.VERBOSE) 

535 CLOSING_SECONDARY = re.compile(r"(?<!\s)'") 

536 OPENING_PRIMARY = re.compile(""" 

537 ( 

538 %(sep)s | # a whitespace or zero-width separating char, or

539 %(open)s | # opening brace, or

540 %(dash)s # em/en-dash 

541 ) 

542 " # the quote, followed by 

543 (?=\\w|%(punct)s) # a word character or punctuation 

544 """ % _CH_CLASSES, re.VERBOSE) 

545 CLOSING_PRIMARY = re.compile(r""" 

546 ( 

547 (?<!\s)" | # no whitespace before 

548 "(?=\s) # whitespace after

549 ) 

550 """, re.VERBOSE) 

551 

552 

553regexes = RegularExpressions() 

554 

555 

556default_smartypants_attr = '1' 

557 

558 

559def smartyPants(text, attr=default_smartypants_attr, language='en'): 

560 """Main function for "traditional" use.""" 

561 

562 return "".join(t for t in educate_tokens(tokenize(text), attr, language)) 

563 

564 

565def educate_tokens(text_tokens, attr=default_smartypants_attr, language='en'): 

566 """Return iterator that "educates" the items of `text_tokens`.""" 

567 # Parse attributes: 

568 # 0 : do nothing 

569 # 1 : set all 

570 # 2 : set all, using old school en- and em- dash shortcuts 

571 # 3 : set all, using inverted old school en and em- dash shortcuts 

572 # 

573 # q : quotes 

574 # b : backtick quotes (``double'' only) 

575 # B : backtick quotes (``double'' and `single') 

576 # d : dashes 

577 # D : old school dashes 

578 # i : inverted old school dashes 

579 # e : ellipses 

580 # w : convert &quot; entities to " for Dreamweaver users 

581 

582 convert_quot = False # translate &quot; entities into normal quotes? 

583 do_dashes = False 

584 do_backticks = False 

585 do_quotes = False 

586 do_ellipses = False 

587 do_stupefy = False 

588 

589 # if attr == "0": # pass tokens unchanged (see below). 

590 if attr == '1': # Do everything, turn all options on. 

591 do_quotes = True 

592 do_backticks = True 

593 do_dashes = 1 

594 do_ellipses = True 

595 elif attr == '2': 

596 # Do everything, turn all options on, use old school dash shorthand. 

597 do_quotes = True 

598 do_backticks = True 

599 do_dashes = 2 

600 do_ellipses = True 

601 elif attr == '3': 

602 # Do everything, use inverted old school dash shorthand. 

603 do_quotes = True 

604 do_backticks = True 

605 do_dashes = 3 

606 do_ellipses = True 

607 elif attr == '-1': # Special "stupefy" mode. 

608 do_stupefy = True 

609 else: 

610 if 'q' in attr: do_quotes = True # noqa: E701 

611 if 'b' in attr: do_backticks = True # noqa: E701 

612 if 'B' in attr: do_backticks = 2 # noqa: E701 

613 if 'd' in attr: do_dashes = 1 # noqa: E701 

614 if 'D' in attr: do_dashes = 2 # noqa: E701 

615 if 'i' in attr: do_dashes = 3 # noqa: E701 

616 if 'e' in attr: do_ellipses = True # noqa: E701 

617 if 'w' in attr: convert_quot = True # noqa: E701 

618 

619 prev_token_last_char = ' ' 

620 # Last character of the previous text token. Used as 

621 # context to curl leading quote characters correctly. 

622 

623 for (ttype, text) in text_tokens: 

624 

625 # skip HTML and/or XML tags as well as empty text tokens 

626 # without updating the last character 

627 if ttype == 'tag' or not text: 

628 yield text 

629 continue 

630 

631 # skip literal text (math, literal, raw, ...) 

632 if ttype == 'literal': 

633 prev_token_last_char = text[-1:] 

634 yield text 

635 continue 

636 

637 last_char = text[-1:] # Remember last char before processing. 

638 

639 text = processEscapes(text) 

640 

641 if convert_quot: 

642 text = text.replace('&quot;', '"') 

643 

644 if do_dashes == 1: 

645 text = educateDashes(text) 

646 elif do_dashes == 2: 

647 text = educateDashesOldSchool(text) 

648 elif do_dashes == 3: 

649 text = educateDashesOldSchoolInverted(text) 

650 

651 if do_ellipses: 

652 text = educateEllipses(text) 

653 

654 # Note: backticks need to be processed before quotes. 

655 if do_backticks: 

656 text = educateBackticks(text, language) 

657 

658 if do_backticks == 2: 

659 text = educateSingleBackticks(text, language) 

660 

661 if do_quotes: 

662 # Replace plain quotes in context to prevent conversion to 

663 # 2-character sequence in French. 

664 context = prev_token_last_char.replace('"', ';').replace("'", ';') 

665 text = educateQuotes(context+text, language)[1:] 

666 

667 if do_stupefy: 

668 text = stupefyEntities(text, language) 

669 

670 # Remember last char as context for the next token 

671 prev_token_last_char = last_char 

672 

673 text = processEscapes(text, restore=True) 

674 

675 yield text 
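The attribute parsing at the top of `educate_tokens()` can be summarised as a small pure function. This is a sketch under assumptions: `parse_attr` and the flag names are hypothetical, and integers stand in for the mixed `True`/`1`/`2` values used in the listing.

```python
def parse_attr(attr):
    """Map a SmartyPants attribute string to a dict of feature flags."""
    flags = {
        'quotes': False,
        'backticks': 0,       # 0 = off, 1 = double only, 2 = double and single
        'dashes': 0,          # 0 = off, 1 = default, 2 = old school, 3 = inverted
        'ellipses': False,
        'convert_quot': False,
        'stupefy': False,
    }
    if attr in ('1', '2', '3'):
        # Numeric presets switch everything on and differ only in dash mode.
        flags.update(quotes=True, backticks=1, dashes=int(attr), ellipses=True)
    elif attr == '-1':
        flags['stupefy'] = True
    else:
        # Single-letter toggles; '0' contains none of them, so nothing is set.
        if 'q' in attr:
            flags['quotes'] = True
        if 'b' in attr:
            flags['backticks'] = 1
        if 'B' in attr:
            flags['backticks'] = 2
        if 'd' in attr:
            flags['dashes'] = 1
        if 'D' in attr:
            flags['dashes'] = 2
        if 'i' in attr:
            flags['dashes'] = 3
        if 'e' in attr:
            flags['ellipses'] = True
        if 'w' in attr:
            flags['convert_quot'] = True
    return flags


print(parse_attr('1') == parse_attr('qbde'))
```

In this sketch, as in the listing, the preset `'1'` enables only double backtick quotes, so it matches `'qbde'` rather than `'qBde'`.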

676 

677 

678def educateQuotes(text, language='en'): 

679 """ 

680 Parameter: - text string (unicode or bytes). 

681 - language (`BCP 47` language tag.) 

682 Returns: The `text`, with "educated" curly quote characters. 

683 

684 Example input: "Isn't this fun?" 

685 Example output: “Isn’t this fun?”

686 """ 

687 smart = smartchars(language) 

688 

689 if not re.search('[-"\']', text): 

690 return text 

691 

692 # Special case if the very first character is a quote 

693 # followed by punctuation at a non-word-break. Use closing quotes. 

694 # TODO: example (when does this match?) 

695 text = regexes.START_SINGLE.sub(smart.csquote, text) 

696 text = regexes.START_DOUBLE.sub(smart.cpquote, text) 

697 

698 # Special case for adjacent quotes 

699 # like "'Quoted' words in a larger quote." 

700 text = regexes.ADJACENT_1.sub(smart.opquote+smart.osquote, text) 

701 text = regexes.ADJACENT_2.sub(smart.osquote+smart.opquote, text) 

702 

703 # Special case: "opening character" followed by quote, 

704 # optional punctuation and space like "[", '(', or '-'. 

705 text = regexes.OPEN_SINGLE.sub(r'\1%s'%smart.csquote, text) 

706 text = regexes.OPEN_DOUBLE.sub(r'\1%s'%smart.cpquote, text) 

707 

708 # Special case for decade abbreviations (the '80s): 

709 if language.startswith('en'): # TODO similar cases in other languages? 

710 text = regexes.DECADE.sub(smart.apostrophe, text) 

711 

712 # Get most opening secondary quotes: 

713 text = regexes.OPENING_SECONDARY.sub(r'\1'+smart.osquote, text) 

714 

715 # In many locales, secondary closing quotes are different from apostrophe: 

716 if smart.csquote != smart.apostrophe: 

717 text = regexes.APOSTROPHE.sub(smart.apostrophe, text) 

718 # TODO: keep track of quoting level to recognize apostrophe in, e.g., 

719 # "Ich fass' es nicht." 

720 

721 text = regexes.CLOSING_SECONDARY.sub(smart.csquote, text) 

722 

723 # Any remaining secondary quotes should be opening ones: 

724 text = text.replace(r"'", smart.osquote) 

725 

726 # Get most opening primary quotes: 

727 text = regexes.OPENING_PRIMARY.sub(r'\1'+smart.opquote, text) 

728 

729 # primary closing quotes: 

730 text = regexes.CLOSING_PRIMARY.sub(smart.cpquote, text) 

731 

732 # Any remaining quotes should be opening ones. 

733 text = text.replace(r'"', smart.opquote) 

734 

735 return text 
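Two of the simpler patterns used by `educateQuotes()`, the decade abbreviation and the word-internal apostrophe, can be exercised on their own. This is a minimal sketch: `curl_apostrophes` is a hypothetical helper, and the lookbehind is written as `\w` alone because `\d` is already a subset of `\w`.

```python
import re

APOSTROPHE_CHAR = '\u2019'  # RIGHT SINGLE QUOTATION MARK

# A quote followed by two digits and "s", as in the '80s.
DECADE = re.compile(r"'(?=\d{2}s)")
# A quote with a word character on both sides, as in Isn't.
APOSTROPHE = re.compile(r"(?<=\w)'(?=\w)")


def curl_apostrophes(text):
    """Replace decade and word-internal straight quotes with an apostrophe."""
    text = DECADE.sub(APOSTROPHE_CHAR, text)
    return APOSTROPHE.sub(APOSTROPHE_CHAR, text)


print(curl_apostrophes("Isn't the '80s sound fun?"))
print(curl_apostrophes("'Twas the night before Christmas."))
```

The second call leaves the leading quote untouched, which illustrates the "'Twas" caveat from the module docstring: neither pattern can tell a leading apostrophe from an opening quote.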

736 

737 

738def educateBackticks(text, language='en'): 

739 """ 

740 Parameter: String (unicode or bytes). 

741 Returns: The `text`, with ``backticks'' -style double quotes 

742 translated into curly quote characters.

743 Example input: ``Isn't this fun?'' 

744 Example output: “Isn't this fun?”

745 """ 

746 smart = smartchars(language) 

747 

748 text = text.replace(r'``', smart.opquote) 

749 text = text.replace(r"''", smart.cpquote) 

750 return text 

751 

752 

753def educateSingleBackticks(text, language='en'): 

754 """ 

755 Parameter: String (unicode or bytes). 

756 Returns: The `text`, with `backticks' -style single quotes 

757 translated into curly quote characters.

758 

759 Example input: `Isn't this fun?' 

760 Example output: ‘Isn’t this fun?’ 

761 """ 

762 smart = smartchars(language) 

763 

764 text = text.replace(r'`', smart.osquote) 

765 text = text.replace(r"'", smart.csquote) 

766 return text 

767 

768 

769def educateDashes(text): 

770 """ 

771 Parameter: String (unicode or bytes). 

772 Returns: The `text`, with each instance of "--" translated to 

773 an em-dash character. 

774 """ 

775 

776 text = text.replace(r'---', smartchars.endash) # en (yes, backwards) 

777 text = text.replace(r'--', smartchars.emdash) # em (yes, backwards) 

778 return text 

779 

780 

781def educateDashesOldSchool(text): 

782 """ 

783 Parameter: String (unicode or bytes). 

784 Returns: The `text`, with each instance of "--" translated to 

785 an en-dash character, and each "---" translated to 

786 an em-dash character. 

787 """ 

788 

789 text = text.replace(r'---', smartchars.emdash) 

790 text = text.replace(r'--', smartchars.endash) 

791 return text 
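The dash functions all replace `---` before `--`, and the order matters. The sketch below uses hypothetical function names and reproduces the old-school mapping next to a deliberately wrong variant to show what happens when the shorter pattern runs first.

```python
EN_DASH = '\u2013'
EM_DASH = '\u2014'


def dashes_old_school(text):
    """'---' becomes an em dash, '--' an en dash (longest pattern first)."""
    text = text.replace('---', EM_DASH)
    return text.replace('--', EN_DASH)


def dashes_wrong_order(text):
    """Same mapping with the replacements swapped, for comparison only."""
    text = text.replace('--', EN_DASH)
    return text.replace('---', EM_DASH)


print(dashes_old_school('pp. 1--9 --- done'))
print(repr(dashes_wrong_order('a---b')))
```

With the wrong order, the first two hyphens of `---` are consumed as an en dash and a stray hyphen is left behind, so the em dash replacement never matches.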

792 

793 

794def educateDashesOldSchoolInverted(text): 

795 """ 

796 Parameter: String (unicode or bytes). 

797 Returns: The `text`, with each instance of "--" translated to 

798 an em-dash character, and each "---" translated to 

799 an en-dash character. Two reasons why: First, unlike the 

800 en- and em-dash syntax supported by 

801 educateDashesOldSchool(), it's compatible with existing

802 entries written before SmartyPants 1.1, back when "--" was 

803 only used for em-dashes. Second, em-dashes are more 

804 common than en-dashes, and so it sort of makes sense that 

805 the shortcut should be shorter to type. (Thanks to Aaron 

806 Swartz for the idea.) 

807 """ 

808 text = text.replace(r'---', smartchars.endash) # en

809 text = text.replace(r'--', smartchars.emdash) # em

810 return text 

811 

812 

813def educateEllipses(text): 

814 """ 

815 Parameter: String (unicode or bytes). 

816 Returns: The `text`, with each instance of "..." translated to 

817 an ellipsis character. 

818 

819 Example input: Huh...? 

820 Example output: Huh…? 

821 """ 

822 

823 text = text.replace(r'...', smartchars.ellipsis) 

824 text = text.replace(r'. . .', smartchars.ellipsis) 

825 return text 

826 

827 

828def stupefyEntities(text, language='en'): 

829 """ 

830 Parameter: String (unicode or bytes). 

831 Returns: The `text`, with each SmartyPants character translated to 

832 its ASCII counterpart. 

833 

834 Example input: “Hello — world.” 

835 Example output: "Hello -- world." 

836 """ 

837 smart = smartchars(language) 

838 

839 text = text.replace(smart.endash, "-") 

840 text = text.replace(smart.emdash, "--") 

841 text = text.replace(smart.osquote, "'") # open secondary quote 

842 text = text.replace(smart.csquote, "'") # close secondary quote 

843 text = text.replace(smart.opquote, '"') # open primary quote 

844 text = text.replace(smart.cpquote, '"') # close primary quote 

845 text = text.replace(smart.ellipsis, '...') 

846 

847 return text 
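The docstring example can be reproduced with a small standalone sketch. It assumes the English quote characters and uses a hypothetical table `SMART_TO_ASCII` in place of the `smartchars(language)` lookup, so it only illustrates the replacement chain, not the language handling:

```python
# Assumed English smart characters; the module takes these from smartchars.
SMART_TO_ASCII = (
    ('\u2013', '-'),    # en dash
    ('\u2014', '--'),   # em dash
    ('\u2018', "'"),    # open secondary quote
    ('\u2019', "'"),    # close secondary quote
    ('\u201c', '"'),    # open primary quote
    ('\u201d', '"'),    # close primary quote
    ('\u2026', '...'),  # ellipsis
)


def stupefy(text):
    """Replace each smart character by its ASCII counterpart."""
    for smart, plain in SMART_TO_ASCII:
        text = text.replace(smart, plain)
    return text


result = stupefy('\u201cHello \u2014 world.\u201d')
print(result)
```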

848 

849 

850def processEscapes(text, restore=False): 

851 r""" 

852 Parameter: String (unicode or bytes). 

853 Returns: The `text` after processing the following backslash 

854 escape sequences. This is useful if you want to force a "dumb" 

855 quote or other character to appear. 

856 

857 Escape Value 

858 ------ ----- 

859 \\ &#92; 

860 \" &#34; 

861 \' &#39; 

862 \. &#46; 

863 \- &#45; 

864 \` &#96; 

865 """ 

866 replacements = ((r'\\', r'&#92;'), 

867 (r'\"', r'&#34;'), 

868 (r"\'", r'&#39;'), 

869 (r'\.', r'&#46;'), 

870 (r'\-', r'&#45;'), 

871 (r'\`', r'&#96;')) 

872 if restore: 

873 for (ch, rep) in replacements: 

874 text = text.replace(rep, ch[1]) 

875 else: 

876 for (ch, rep) in replacements: 

877 text = text.replace(ch, rep) 

878 

879 return text 
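A standalone sketch of the two directions may help. It copies the replacement table from the function above under the hypothetical names `REPLACEMENTS` and `process_escapes`. Note that the restore pass yields the bare character (the second character of each escape sequence), not the original backslash escape:

```python
REPLACEMENTS = (
    (r'\\', '&#92;'),
    (r'\"', '&#34;'),
    (r"\'", '&#39;'),
    (r'\.', '&#46;'),
    (r'\-', '&#45;'),
    (r'\`', '&#96;'),
)


def process_escapes(text, restore=False):
    for escape, entity in REPLACEMENTS:
        if restore:
            # escape[1] is the character that follows the backslash
            text = text.replace(entity, escape[1])
        else:
            text = text.replace(escape, entity)
    return text


source = r'say \"hi\" \- ok'
protected = process_escapes(source)                 # quotes and hyphen hidden
restored = process_escapes(protected, restore=True)  # literal characters back
print(protected)
print(restored)
```

The intermediate form keeps the escaped characters out of reach of the quote and dash conversions; restoring afterwards leaves them as plain ASCII.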

880 

881 

882def tokenize(text): 

883 """ 

884 Parameter: String containing HTML markup. 

885 Returns: An iterator that yields the tokens comprising the input 

886 string. Each token is either a tag (possibly with nested 

887 tags contained therein, such as <a href="<MTFoo>">), or a 

888 run of text between tags. Each yielded element is a 

889 two-element tuple; the first is either 'tag' or 'text'; 

890 the second is the actual value. 

891 

892 Based on the _tokenize() subroutine from Brad Choate's MTRegex plugin. 

893 """ 

894 tag_soup = re.compile(r'([^<]*)(<[^>]*>)') 

895 token_match = tag_soup.search(text) 

896 previous_end = 0 

897 

898 while token_match is not None: 

899 if token_match.group(1): 

900 yield 'text', token_match.group(1) 

901 yield 'tag', token_match.group(2) 

902 previous_end = token_match.end() 

903 token_match = tag_soup.search(text, token_match.end()) 

904 

905 if previous_end < len(text): 

906 yield 'text', text[previous_end:] 
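The generator can be tried in isolation. The following is a standalone copy for illustration, with the pattern moved to a module-level constant `TAG_SOUP` (a hypothetical name); the logic is otherwise the same as above:

```python
import re

TAG_SOUP = re.compile(r'([^<]*)(<[^>]*>)')


def tokenize(text):
    """Yield ('text', ...) and ('tag', ...) tuples for an HTML-ish string."""
    token_match = TAG_SOUP.search(text)
    previous_end = 0
    while token_match is not None:
        if token_match.group(1):
            yield 'text', token_match.group(1)
        yield 'tag', token_match.group(2)
        previous_end = token_match.end()
        token_match = TAG_SOUP.search(text, token_match.end())
    if previous_end < len(text):
        # trailing text after the last tag (or the whole input if no tag)
        yield 'text', text[previous_end:]


tokens = list(tokenize('Hello <b>"world"</b>!'))
nested = list(tokenize('<a href="<MTFoo>">'))
print(tokens)
print(nested)
```

As far as this sketch shows, the simple pattern ends a tag at the first `>`, so the nested example from the docstring comes out as a tag token followed by a short text token rather than as one tag.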

907 

908 

909if __name__ == "__main__": 

910 

911 import itertools 

912 import locale 

913 try: 

914 locale.setlocale(locale.LC_ALL, '') # set to user defaults 

915 defaultlanguage = locale.getlocale()[0] or 'en' # may be None (e.g. "C" locale) 

916 except: # NoQA: E722 (catchall) 

917 defaultlanguage = 'en' 

918 

919 # Normalize and drop unsupported subtags: 

920 defaultlanguage = defaultlanguage.lower().replace('-', '_') 

921 # split (except singletons, which mark the following tag as non-standard): 

922 defaultlanguage = re.sub(r'_([a-zA-Z0-9])_', r'_\1-', defaultlanguage) 

923 _subtags = list(defaultlanguage.split('_')) 

924 _basetag = _subtags.pop(0) 

925 # find all combinations of subtags 

926 for n in range(len(_subtags), 0, -1): 

927 for tags in itertools.combinations(_subtags, n): 

928 _tag = '-'.join((_basetag, *tags)) 

929 if _tag in smartchars.quotes: 

930 defaultlanguage = _tag 

931 break 

932 else: 

933 if _basetag in smartchars.quotes: 

934 defaultlanguage = _basetag 

935 else: 

936 defaultlanguage = 'en' 
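The fallback from a locale name to a supported language tag can be sketched as a function. This is an illustration only: `KNOWN` is a made-up stand-in for the keys of `smartchars.quotes`, `best_tag` is a hypothetical name, and the sketch returns as soon as a candidate matches instead of using the for/else structure above:

```python
import itertools
import re

# Hypothetical set of supported tags (stand-in for smartchars.quotes).
KNOWN = {'en', 'de', 'de-ch'}


def best_tag(locale_name, known=KNOWN):
    # Normalize: lower case, '_' as the separator between subtags.
    tag = locale_name.lower().replace('-', '_')
    # Keep a singleton (e.g. 'x') attached to the subtag that follows it.
    tag = re.sub(r'_([a-zA-Z0-9])_', r'_\1-', tag)
    subtags = tag.split('_')
    base = subtags.pop(0)
    # Try the longest subtag combinations first.
    for n in range(len(subtags), 0, -1):
        for tags in itertools.combinations(subtags, n):
            candidate = '-'.join((base, *tags))
            if candidate in known:
                return candidate
    return base if base in known else 'en'


print(best_tag('de_CH'), best_tag('de_AT'), best_tag('fr_FR'))
```

A region that has its own entry is kept, an unknown region falls back to the base language, and an unknown base language falls back to 'en'.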

937 

938 import argparse 

939 parser = argparse.ArgumentParser( 

940 description='Filter <input> making ASCII punctuation "smart".') 

941 # TODO: require input arg or other means to print USAGE instead of waiting. 

942 # parser.add_argument("input", help="Input stream, use '-' for stdin.") 

943 parser.add_argument("-a", "--action", default="1", 

944 help="what to do with the input (see --actionhelp)") 

945 parser.add_argument("-e", "--encoding", default="utf-8", 

946 help="text encoding") 

947 parser.add_argument("-l", "--language", default=defaultlanguage, 

948 help="text language (BCP47 tag), " 

949 f"Default: {defaultlanguage}") 

950 parser.add_argument("-q", "--alternative-quotes", action="store_true", 

951 help="use alternative quote style") 

952 parser.add_argument("--doc", action="store_true", 

953 help="print documentation") 

954 parser.add_argument("--actionhelp", action="store_true", 

955 help="list available actions") 

956 parser.add_argument("--stylehelp", action="store_true", 

957 help="list available quote styles") 

958 parser.add_argument("--test", action="store_true", 

959 help="perform short self-test") 

960 args = parser.parse_args() 

961 

962 if args.doc: 

963 print(__doc__) 

964 elif args.actionhelp: 

965 print(options) 

966 elif args.stylehelp: 

967 print() 

968 print("Available styles (primary open/close, secondary open/close)") 

969 print("language tag quotes") 

970 print("============ ======") 

971 for key in sorted(smartchars.quotes.keys()): 

972 print("%-14s %s" % (key, smartchars.quotes[key])) 

973 elif args.test: 

974 # Unit test output goes to stderr. 

975 import unittest 

976 

977 class TestSmartypantsAllAttributes(unittest.TestCase): 

978 # the default attribute is "1", which means "all". 

979 def test_dates(self) -> None: 

980 self.assertEqual(smartyPants("1440-80's"), "1440-80’s") 

981 self.assertEqual(smartyPants("1440-'80s"), "1440-’80s") 

982 self.assertEqual(smartyPants("1440---'80s"), "1440–’80s") 

983 self.assertEqual(smartyPants("1960's"), "1960’s") 

984 self.assertEqual(smartyPants("one two '60s"), "one two ’60s") 

985 self.assertEqual(smartyPants("'60s"), "’60s") 

986 

987 def test_educated_quotes(self) -> None: 

988 self.assertEqual(smartyPants('"Isn\'t this fun?"'), 

989 '“Isn’t this fun?”') 

990 

991 def test_html_tags(self) -> None: 

992 text = '<a src="foo">more</a>' 

993 self.assertEqual(smartyPants(text), text) 

994 

995 suite = unittest.TestLoader().loadTestsFromTestCase( 

996 TestSmartypantsAllAttributes) 

997 unittest.TextTestRunner().run(suite) 

998 

999 else: 

1000 if args.alternative_quotes: 

1001 if '-x-altquot' in args.language: 

1002 args.language = args.language.replace('-x-altquot', '') 

1003 else: 

1004 args.language += '-x-altquot' 

1005 text = sys.stdin.read() 

1006 print(smartyPants(text, attr=args.action, language=args.language))
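The `--alternative-quotes` switch toggles the `-x-altquot` private-use subtag on the language tag rather than always appending it. A minimal sketch of that toggle, using a hypothetical helper `toggle_altquot` that is not part of the module:

```python
ALTQUOT = '-x-altquot'


def toggle_altquot(language):
    """Switch between a language's default and alternative quote style."""
    if ALTQUOT in language:
        return language.replace(ALTQUOT, '')
    return language + ALTQUOT


print(toggle_altquot('de'))
print(toggle_altquot('de-x-altquot'))
```

Applying the switch to a tag that already selects the alternative style therefore selects the default style again.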