Coverage for /pythoncovmergedfiles/medio/medio/usr/local/lib/python3.8/site-packages/docutils/utils/smartquotes.py: 46%

231 statements  

coverage.py v7.2.7, created at 2023-06-07 06:06 +0000

1#!/usr/bin/python3 

2# :Id: $Id$ 

3# :Copyright: © 2010 Günter Milde, 

4# original `SmartyPants`_: © 2003 John Gruber 

5# smartypants.py: © 2004, 2007 Chad Miller 

6# :Maintainer: docutils-develop@lists.sourceforge.net 

7# :License: Released under the terms of the `2-Clause BSD license`_, in short: 

8# 

9# Copying and distribution of this file, with or without modification, 

10# are permitted in any medium without royalty provided the copyright 

11# notices and this notice are preserved. 

12# This file is offered as-is, without any warranty. 

13# 

14# .. _2-Clause BSD license: https://opensource.org/licenses/BSD-2-Clause 

15 

16 

17r""" 

18========================= 

19Smart Quotes for Docutils 

20========================= 

21 

22Synopsis 

23======== 

24 

25"SmartyPants" is a free web publishing plug-in for Movable Type, Blosxom, and 

26BBEdit that easily translates plain ASCII punctuation characters into "smart" 

27typographic punctuation characters. 

28 

29``smartquotes.py`` is an adaption of "SmartyPants" to Docutils_. 

30 

31* Using Unicode instead of HTML entities for typographic punctuation 

32 characters, it works for any output format that supports Unicode. 

33* Supports `language specific quote characters`__. 

34 

35__ https://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks 

36 

37 

38Authors 

39======= 

40 

41`John Gruber`_ did all of the hard work of writing this software in Perl for 

42`Movable Type`_ and almost all of this useful documentation. `Chad Miller`_ 

43ported it to Python to use with Pyblosxom_. 

44Adapted to Docutils_ by Günter Milde. 

45 

46Additional Credits 

47================== 

48 

49Portions of the SmartyPants original work are based on Brad Choate's nifty 

50MTRegex plug-in. `Brad Choate`_ also contributed a few bits of source code to 

51this plug-in. Brad Choate is a fine hacker indeed. 

52 

53`Jeremy Hedley`_ and `Charles Wiltgen`_ deserve mention for exemplary beta 

54testing of the original SmartyPants. 

55 

56`Rael Dornfest`_ ported SmartyPants to Blosxom. 

57 

58.. _Brad Choate: http://bradchoate.com/ 

59.. _Jeremy Hedley: http://antipixel.com/ 

60.. _Charles Wiltgen: http://playbacktime.com/ 

61.. _Rael Dornfest: http://raelity.org/ 

62 

63 

64Copyright and License 

65===================== 

66 

67SmartyPants_ license (3-Clause BSD license): 

68 

69 Copyright (c) 2003 John Gruber (http://daringfireball.net/) 

70 All rights reserved. 

71 

72 Redistribution and use in source and binary forms, with or without 

73 modification, are permitted provided that the following conditions are 

74 met: 

75 

76 * Redistributions of source code must retain the above copyright 

77 notice, this list of conditions and the following disclaimer. 

78 

79 * Redistributions in binary form must reproduce the above copyright 

80 notice, this list of conditions and the following disclaimer in 

81 the documentation and/or other materials provided with the 

82 distribution. 

83 

84 * Neither the name "SmartyPants" nor the names of its contributors 

85 may be used to endorse or promote products derived from this 

86 software without specific prior written permission. 

87 

88 This software is provided by the copyright holders and contributors 

89 "as is" and any express or implied warranties, including, but not 

90 limited to, the implied warranties of merchantability and fitness for 

91 a particular purpose are disclaimed. In no event shall the copyright 

92 owner or contributors be liable for any direct, indirect, incidental, 

93 special, exemplary, or consequential damages (including, but not 

94 limited to, procurement of substitute goods or services; loss of use, 

95 data, or profits; or business interruption) however caused and on any 

96 theory of liability, whether in contract, strict liability, or tort 

97 (including negligence or otherwise) arising in any way out of the use 

98 of this software, even if advised of the possibility of such damage. 

99 

100smartypants.py license (2-Clause BSD license): 

101 

102 smartypants.py is a derivative work of SmartyPants. 

103 

104 Redistribution and use in source and binary forms, with or without 

105 modification, are permitted provided that the following conditions are 

106 met: 

107 

108 * Redistributions of source code must retain the above copyright 

109 notice, this list of conditions and the following disclaimer. 

110 

111 * Redistributions in binary form must reproduce the above copyright 

112 notice, this list of conditions and the following disclaimer in 

113 the documentation and/or other materials provided with the 

114 distribution. 

115 

116 This software is provided by the copyright holders and contributors 

117 "as is" and any express or implied warranties, including, but not 

118 limited to, the implied warranties of merchantability and fitness for 

119 a particular purpose are disclaimed. In no event shall the copyright 

120 owner or contributors be liable for any direct, indirect, incidental, 

121 special, exemplary, or consequential damages (including, but not 

122 limited to, procurement of substitute goods or services; loss of use, 

123 data, or profits; or business interruption) however caused and on any 

124 theory of liability, whether in contract, strict liability, or tort 

125 (including negligence or otherwise) arising in any way out of the use 

126 of this software, even if advised of the possibility of such damage. 

127 

128.. _John Gruber: http://daringfireball.net/ 

129.. _Chad Miller: http://web.chad.org/ 

130 

131.. _Pyblosxom: http://pyblosxom.bluesock.org/ 

132.. _SmartyPants: http://daringfireball.net/projects/smartypants/ 

133.. _Movable Type: http://www.movabletype.org/ 

134.. _2-Clause BSD license: https://opensource.org/licenses/BSD-2-Clause 

135.. _Docutils: https://docutils.sourceforge.io/ 

136 

137Description 

138=========== 

139 

140SmartyPants can perform the following transformations: 

141 

142- Straight quotes ( " and ' ) into "curly" quote characters 

143- Backticks-style quotes (\`\`like this'') into "curly" quote characters 

144- Dashes (``--`` and ``---``) into en- and em-dash entities 

145- Three consecutive dots (``...`` or ``. . .``) into an ellipsis entity 

146 

147This means you can write, edit, and save your posts using plain old 

148ASCII straight quotes, plain dashes, and plain dots, but your published 

149posts (and final HTML output) will appear with smart quotes, em-dashes, 

150and proper ellipses. 

151 

152SmartyPants does not modify characters within ``<pre>``, ``<code>``, ``<kbd>``, 

153``<math>`` or ``<script>`` tag blocks. Typically, these tags are used to 

154display text where smart quotes and other "smart punctuation" would not be 

155appropriate, such as source code or example markup. 

156 

157 

158Backslash Escapes 

159================= 

160 

161If you need to use literal straight quotes (or plain hyphens and periods), 

162`smartquotes` accepts the following backslash escape sequences to force 

163ASCII punctuation. Note that you need two backslashes, because Docutils 

164processes backslash escapes, too. 

165 

166======== ========= 

167Escape Character 

168======== ========= 

169``\\`` \\ 

170``\\"`` \\" 

171``\\'`` \\' 

172``\\.`` \\. 

173``\\-`` \\- 

174``\\``` \\` 

175======== ========= 

176 

177This is useful, for example, when you want to use straight quotes as 

178foot and inch marks: 6\\'2\\" tall; a 17\\" iMac. 

179 

180 

181Caveats 

182======= 

183 

184Why You Might Not Want to Use Smart Quotes in Your Weblog 

185--------------------------------------------------------- 

186 

187For one thing, you might not care. 

188 

189Most normal, mentally stable individuals do not take notice of proper 

190typographic punctuation. Many design and typography nerds, however, break 

191out in a nasty rash when they encounter, say, a restaurant sign that uses 

192a straight apostrophe to spell "Joe's". 

193 

194If you're the sort of person who just doesn't care, you might well want to 

195continue not caring. Using straight quotes -- and sticking to the 7-bit 

196ASCII character set in general -- is certainly a simpler way to live. 

197 

198Even if you *do* care about accurate typography, you still might want to 

199think twice before educating the quote characters in your weblog. One side 

200effect of publishing curly quote characters is that it makes your 

201weblog a bit harder for others to quote from using copy-and-paste. What 

202happens is that when someone copies text from your blog, the copied text 

203contains the 8-bit curly quote characters (as well as the 8-bit characters 

204for em-dashes and ellipses, if you use these options). These characters 

205are not standard across different text encoding methods, which is why they 

206need to be encoded as characters. 

207 

208People copying text from your weblog, however, may not notice that you're 

209using curly quotes, and they'll go ahead and paste the unencoded 8-bit 

210characters copied from their browser into an email message or their own 

211weblog. When pasted as raw "smart quotes", these characters are likely to 

212get mangled beyond recognition. 

213 

214That said, my own opinion is that any decent text editor or email client 

215makes it easy to stupefy smart quote characters into their 7-bit 

216equivalents, and I don't consider it my problem if you're using an 

217indecent text editor or email client. 

218 

219 

220Algorithmic Shortcomings 

221------------------------ 

222 

223One situation in which quotes will get curled the wrong way is when 

224apostrophes are used at the start of leading contractions. For example:: 

225 

226 'Twas the night before Christmas. 

227 

228In the case above, SmartyPants will turn the apostrophe into an opening 

229secondary quote, when in fact it should be the `RIGHT SINGLE QUOTATION MARK` 

230character which is also "the preferred character to use for apostrophe" 

231(Unicode). I don't think this problem can be solved in the general case -- 

232every word processor I've tried gets this wrong as well. In such cases, it's 

233best to insert the `RIGHT SINGLE QUOTATION MARK` (’) by hand. 

234 

235In English, the same character is used for apostrophe and closing secondary 

236quote (both plain and "smart" ones). For other locales (French, Italian, 

237Swiss, ...) "smart" secondary closing quotes differ from the curly apostrophe. 

238 

239 .. class:: language-fr 

240 

241 Il dit : "C'est 'super' !" 

242 

243If the apostrophe is used at the end of a word, it cannot be distinguished 

244from a secondary quote by the algorithm. Therefore, a text like:: 

245 

246 .. class:: language-de-CH 

247 

248 "Er sagt: 'Ich fass' es nicht.'" 

249 

250will get a single closing guillemet instead of an apostrophe. 

251 

252This can be prevented by use of the `RIGHT SINGLE QUOTATION MARK` in 

253the source:: 

254 

255 - "Er sagt: 'Ich fass' es nicht.'" 

256 + "Er sagt: 'Ich fass’ es nicht.'" 

257 

258 

259Version History 

260=============== 

261 

2621.9 2022-03-04 

263 - Code cleanup. Require Python 3. 

264 

2651.8.1 2017-10-25 

266 - Use open quote after Unicode whitespace, ZWSP, and ZWNJ. 

267 - Code cleanup. 

268 

2691.8: 2017-04-24 

270 - Command line front-end. 

271 

2721.7.1: 2017-03-19 

273 - Update and extend language-dependent quotes. 

274 - Differentiate apostrophe from single quote. 

275 

2761.7: 2012-11-19 

277 - Internationalization: language-dependent quotes. 

278 

2791.6.1: 2012-11-06 

280 - Refactor code, code cleanup, 

281 - `educate_tokens()` generator as interface for Docutils. 

282 

2831.6: 2010-08-26 

284 - Adaption to Docutils: 

285 - Use Unicode instead of HTML entities, 

286 - Remove code special to pyblosxom. 

287 

2881.5_1.6: Fri, 27 Jul 2007 07:06:40 -0400 

289 - Fixed bug where blocks of precious unalterable text were instead 

290 interpreted. Thanks to Le Roux and Dirk van Oosterbosch. 

291 

2921.5_1.5: Sat, 13 Aug 2005 15:50:24 -0400 

293 - Fix bogus magical quotation when there is no hint that the 

294 user wants it, e.g., in "21st century". Thanks to Nathan Hamblen. 

295 - Be smarter about quotes before terminating numbers in an en-dash'ed 

296 range. 

297 

2981.5_1.4: Thu, 10 Feb 2005 20:24:36 -0500 

299 - Fix a date-processing bug, as reported by jacob childress. 

300 - Begin a test-suite for ensuring correct output. 

301 - Removed import of "string", since I didn't really need it. 

302 (This was my first ever Python program. Sue me!) 

303 

3041.5_1.3: Wed, 15 Sep 2004 18:25:58 -0400 

305 - Abort processing if the flavour is in forbidden-list. Default of 

306 [ "rss" ] (Idea of Wolfgang SCHNERRING.) 

307 - Remove stray virgules from en-dashes. Patch by Wolfgang SCHNERRING. 

308 

3091.5_1.2: Mon, 24 May 2004 08:14:54 -0400 

310 - Some single quotes weren't replaced properly. Diff-tesuji played 

311 by Benjamin GEIGER. 

312 

3131.5_1.1: Sun, 14 Mar 2004 14:38:28 -0500 

314 - Support upcoming pyblosxom 0.9 plugin verification feature. 

315 

3161.5_1.0: Tue, 09 Mar 2004 08:08:35 -0500 

317 - Initial release 

318""" 

319 

320import re 

321import sys 

322 

323 

324options = r""" 

325Options 

326======= 

327 

328Numeric values are the easiest way to configure SmartyPants' behavior: 

329 

330:0: Suppress all transformations. (Do nothing.) 

331 

332:1: Performs default SmartyPants transformations: quotes (including 

333 \`\`backticks'' -style), em-dashes, and ellipses. "``--``" (dash dash) 

334 is used to signify an em-dash; there is no support for en-dashes. 

335 

336:2: Same as smarty_pants="1", except that it uses the old-school typewriter 

337 shorthand for dashes: "``--``" (dash dash) for en-dashes, "``---``" 

338 (dash dash dash) 

339 for em-dashes. 

340 

341:3: Same as smarty_pants="2", but inverts the shorthand for dashes: 

342 "``--``" (dash dash) for em-dashes, and "``---``" (dash dash dash) for 

343 en-dashes. 

344 

345:-1: Stupefy mode. Reverses the SmartyPants transformation process, turning 

346 the characters produced by SmartyPants into their ASCII equivalents. 

347 E.g. the LEFT DOUBLE QUOTATION MARK (“) is turned into a simple 

348 double-quote (\"), "—" is turned into two dashes, etc. 

349 

350 

351The following single-character attribute values can be combined to toggle 

352individual transformations from within the smarty_pants attribute. For 

353example, ``"1"`` is equivalent to ``"qBde"``. 

354 

355:q: Educates normal quote characters: (") and ('). 

356 

357:b: Educates \`\`backticks'' -style double quotes. 

358 

359:B: Educates \`\`backticks'' -style double quotes and \`single' quotes. 

360 

361:d: Educates em-dashes. 

362 

363:D: Educates em-dashes and en-dashes, using old-school typewriter 

364 shorthand: (dash dash) for en-dashes, (dash dash dash) for em-dashes. 

365 

366:i: Educates em-dashes and en-dashes, using inverted old-school typewriter 

367 shorthand: (dash dash) for em-dashes, (dash dash dash) for en-dashes. 

368 

369:e: Educates ellipses. 

370 

371:w: Translates any instance of ``&quot;`` into a normal double-quote 

372 character. This should be of no interest to most people, but 

373 of particular interest to anyone who writes their posts using 

374 Dreamweaver, as Dreamweaver inexplicably uses this entity to represent 

375 a literal double-quote character. SmartyPants only educates normal 

376 quotes, not entities (because ordinarily, entities are used for 

377 the explicit purpose of representing the specific character they 

378 represent). The "w" option must be used in conjunction with one (or 

379 both) of the other quote options ("q" or "b"). Thus, if you wish to 

380 apply all SmartyPants transformations (quotes, en- and em-dashes, and 

381 ellipses) and also translate ``&quot;`` entities into regular quotes 

382 so SmartyPants can educate them, you should pass the following to the 

383 smarty_pants attribute: 

384""" 

385 

386 

387class smartchars: 

388 """Smart quotes and dashes""" 

389 

390 endash = '–' # "&#8211;" EN DASH 

391 emdash = '—' # "&#8212;" EM DASH 

392 ellipsis = '…' # "&#8230;" HORIZONTAL ELLIPSIS 

393 apostrophe = '’' # "&#8217;" RIGHT SINGLE QUOTATION MARK 

394 

395 # quote characters (language-specific, set in __init__()) 

396 # https://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks 

397 # https://de.wikipedia.org/wiki/Anf%C3%BChrungszeichen#Andere_Sprachen 

398 # https://fr.wikipedia.org/wiki/Guillemet 

399 # https://typographisme.net/post/Les-espaces-typographiques-et-le-web 

400 # https://www.btb.termiumplus.gc.ca/tpv2guides/guides/redac/index-fra.html 

401 # https://en.wikipedia.org/wiki/Hebrew_punctuation#Quotation_marks 

402 # [7] https://www.tustep.uni-tuebingen.de/bi/bi00/bi001t1-anfuehrung.pdf 

403 # [8] https://www.korrekturavdelingen.no/anforselstegn.htm 

404 # [9] Typografisk håndbok. Oslo: Spartacus. 2000. s. 67. ISBN 8243001530. 

405 # [10] https://www.typografi.org/sitat/sitatart.html 

406 # [11] https://mk.wikipedia.org/wiki/Правопис_и_правоговор_на_македонскиот_јазик # noqa:E501 

407 # [12] https://hrvatska-tipografija.com/polunavodnici/ 

408 # [13] https://pl.wikipedia.org/wiki/Cudzys%C5%82%C3%B3w 

409 # 

410 # See also configuration option "smartquote-locales". 

411 quotes = { 

412 'af': '“”‘’', 

413 'af-x-altquot': '„”‚’', 

414 'bg': '„“‚‘', # https://bg.wikipedia.org/wiki/Кавички 

415 'ca': '«»“”', 

416 'ca-x-altquot': '“”‘’', 

417 'cs': '„“‚‘', 

418 'cs-x-altquot': '»«›‹', 

419 'da': '»«›‹', 

420 'da-x-altquot': '„“‚‘', 

421 # 'da-x-altquot2': '””’’', 

422 'de': '„“‚‘', 

423 'de-x-altquot': '»«›‹', 

424 'de-ch': '«»‹›', 

425 'el': '«»“”', # '«»‟”' https://hal.science/hal-02101618 

426 'en': '“”‘’', 

427 'en-uk-x-altquot': '‘’“”', # Attention: " → ‘ and ' → “ ! 

428 'eo': '“”‘’', 

429 'es': '«»“”', 

430 'es-x-altquot': '“”‘’', 

431 'et': '„“‚‘', # no secondary quote listed in 

432 'et-x-altquot': '«»‹›', # the sources above (wikipedia.org) 

433 'eu': '«»‹›', 

434 'fi': '””’’', 

435 'fi-x-altquot': '»»››', 

436 'fr': ('« ', ' »', '“', '”'), # full no-break space 

437 'fr-x-altquot': ('« ', ' »', '“', '”'), # narrow no-break space 

438 'fr-ch': '«»‹›', # https://typoguide.ch/ 

439 'fr-ch-x-altquot': ('« ', ' »', '‹ ', ' ›'), # narrow no-break space # noqa:E501 

440 'gl': '«»“”', 

441 'he': '”“»«', # Hebrew is RTL, test position: 

442 'he-x-altquot': '„”‚’', # low quotation marks are opening. 

443 # 'he-x-altquot': '“„‘‚', # RTL: low quotation marks opening 

444 'hr': '„”‘’', # Croatian [12] 

445 'hr-x-altquot': '»«›‹', 

446 'hsb': '„“‚‘', 

447 'hsb-x-altquot': '»«›‹', 

448 'hu': '„”«»', 

449 'is': '„“‚‘', 

450 'it': '«»“”', 

451 'it-ch': '«»‹›', 

452 'it-x-altquot': '“”‘’', 

453 # 'it-x-altquot2': '“„‘‚', # [7] in headlines 

454 'ja': '「」『』', 

455 'ko': '“”‘’', 

456 'lt': '„“‚‘', 

457 'lv': '„“‚‘', 

458 'mk': '„“‚‘', # Macedonian [11] 

459 'nl': '“”‘’', 

460 'nl-x-altquot': '„”‚’', 

461 # 'nl-x-altquot2': '””’’', 

462 'nb': '«»’’', # Norsk bokmål (canonical form 'no') 

463 'nn': '«»’’', # Nynorsk [10] 

464 'nn-x-altquot': '«»‘’', # [8], [10] 

465 # 'nn-x-altquot2': '«»«»', # [9], [10] 

466 # 'nn-x-altquot3': '„“‚‘', # [10] 

467 'no': '«»’’', # Norsk bokmål [10] 

468 'no-x-altquot': '«»‘’', # [8], [10] 

469 # 'no-x-altquot2': '«»«»', # [9], [10] 

470 # 'no-x-altquot3': '„“‚‘', # [10] 

471 'pl': '„”«»', 

472 'pl-x-altquot': '«»‚’', 

473 # 'pl-x-altquot2': '„”‚’', # [13] 

474 'pt': '«»“”', 

475 'pt-br': '“”‘’', 

476 'ro': '„”«»', 

477 'ru': '«»„“', 

478 'sh': '„”‚’', # Serbo-Croatian 

479 'sh-x-altquot': '»«›‹', 

480 'sk': '„“‚‘', # Slovak 

481 'sk-x-altquot': '»«›‹', 

482 'sl': '„“‚‘', # Slovenian 

483 'sl-x-altquot': '»«›‹', 

484 'sq': '«»‹›', # Albanian 

485 'sq-x-altquot': '“„‘‚', 

486 'sr': '„”’’', 

487 'sr-x-altquot': '»«›‹', 

488 'sv': '””’’', 

489 'sv-x-altquot': '»»››', 

490 'tr': '“”‘’', 

491 'tr-x-altquot': '«»‹›', 

492 # 'tr-x-altquot2': '“„‘‚', # [7] antiquated? 

493 'uk': '«»„“', 

494 'uk-x-altquot': '„“‚‘', 

495 'zh-cn': '“”‘’', 

496 'zh-tw': '「」『』', 

497 } 

498 

499 def __init__(self, language='en'): 

500 self.language = language 

501 try: 

502 (self.opquote, self.cpquote, 

503 self.osquote, self.csquote) = self.quotes[language.lower()] 

504 except KeyError: 

505 self.opquote, self.cpquote, self.osquote, self.csquote = '""\'\'' 
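
The lookup in `__init__()` relies on each table entry unpacking into exactly four items, whether it is a four-character string or a four-element tuple. A minimal sketch of that idea, assuming a trimmed-down table (`QUOTES`) and a helper name (`lookup_quotes`) that are hypothetical and not part of the module:

```python
# Hypothetical, shortened copy of the quotes table, for illustration only.
QUOTES = {
    'en': '“”‘’',
    'de': '„“‚‘',
    'de-ch': '«»‹›',
}


def lookup_quotes(language='en'):
    """Return (open primary, close primary, open secondary, close secondary)."""
    try:
        # A 4-character string unpacks into four 1-character strings.
        opq, cpq, osq, csq = QUOTES[language.lower()]
    except KeyError:
        # Unknown language tag: fall back to plain ASCII quotes.
        opq, cpq, osq, csq = '""\'\''
    return opq, cpq, osq, csq
```

Because the key is lower-cased before the lookup, `lookup_quotes('DE')` and `lookup_quotes('de')` give the same result in this sketch.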

506 

507 

508default_smartypants_attr = '1' 

509 

510 

511def smartyPants(text, attr=default_smartypants_attr, language='en'): 

512 """Main function for "traditional" use.""" 

513 

514 return "".join(t for t in educate_tokens(tokenize(text), attr, language)) 

515 

516 

517def educate_tokens(text_tokens, attr=default_smartypants_attr, language='en'): 

518 """Return iterator that "educates" the items of `text_tokens`.""" 

519 # Parse attributes: 

520 # 0 : do nothing 

521 # 1 : set all 

522 # 2 : set all, using old school en- and em- dash shortcuts 

523 # 3 : set all, using inverted old school en and em- dash shortcuts 

524 # 

525 # q : quotes 

526 # b : backtick quotes (``double'' only) 

527 # B : backtick quotes (``double'' and `single') 

528 # d : dashes 

529 # D : old school dashes 

530 # i : inverted old school dashes 

531 # e : ellipses 

532 # w : convert &quot; entities to " for Dreamweaver users 

533 

534 convert_quot = False # translate &quot; entities into normal quotes? 

535 do_dashes = False 

536 do_backticks = False 

537 do_quotes = False 

538 do_ellipses = False 

539 do_stupefy = False 

540 

541 # if attr == "0": # pass tokens unchanged (see below). 

542 if attr == '1': # Do everything, turn all options on. 

543 do_quotes = True 

544 do_backticks = True 

545 do_dashes = 1 

546 do_ellipses = True 

547 elif attr == '2': 

548 # Do everything, turn all options on, use old school dash shorthand. 

549 do_quotes = True 

550 do_backticks = True 

551 do_dashes = 2 

552 do_ellipses = True 

553 elif attr == '3': 

554 # Do everything, use inverted old school dash shorthand. 

555 do_quotes = True 

556 do_backticks = True 

557 do_dashes = 3 

558 do_ellipses = True 

559 elif attr == '-1': # Special "stupefy" mode. 

560 do_stupefy = True 

561 else: 

562 if 'q' in attr: do_quotes = True # noqa: E701 

563 if 'b' in attr: do_backticks = True # noqa: E701 

564 if 'B' in attr: do_backticks = 2 # noqa: E701 

565 if 'd' in attr: do_dashes = 1 # noqa: E701 

566 if 'D' in attr: do_dashes = 2 # noqa: E701 

567 if 'i' in attr: do_dashes = 3 # noqa: E701 

568 if 'e' in attr: do_ellipses = True # noqa: E701 

569 if 'w' in attr: convert_quot = True # noqa: E701 

570 

571 prev_token_last_char = ' ' 

572 # Last character of the previous text token. Used as 

573 # context to curl leading quote characters correctly. 

574 

575 for (ttype, text) in text_tokens: 

576 

577 # skip HTML and/or XML tags as well as empty text tokens 

578 # without updating the last character 

579 if ttype == 'tag' or not text: 

580 yield text 

581 continue 

582 

583 # skip literal text (math, literal, raw, ...) 

584 if ttype == 'literal': 

585 prev_token_last_char = text[-1:] 

586 yield text 

587 continue 

588 

589 last_char = text[-1:] # Remember last char before processing. 

590 

591 text = processEscapes(text) 

592 

593 if convert_quot: 

594 text = text.replace('&quot;', '"') 

595 

596 if do_dashes == 1: 

597 text = educateDashes(text) 

598 elif do_dashes == 2: 

599 text = educateDashesOldSchool(text) 

600 elif do_dashes == 3: 

601 text = educateDashesOldSchoolInverted(text) 

602 

603 if do_ellipses: 

604 text = educateEllipses(text) 

605 

606 # Note: backticks need to be processed before quotes. 

607 if do_backticks: 

608 text = educateBackticks(text, language) 

609 

610 if do_backticks == 2: 

611 text = educateSingleBackticks(text, language) 

612 

613 if do_quotes: 

614 # Replace plain quotes in context to prevent conversion to 

615 # 2-character sequence in French. 

616 context = prev_token_last_char.replace('"', ';').replace("'", ';') 

617 text = educateQuotes(context+text, language)[1:] 

618 

619 if do_stupefy: 

620 text = stupefyEntities(text, language) 

621 

622 # Remember last char as context for the next token 

623 prev_token_last_char = last_char 

624 

625 text = processEscapes(text, restore=True) 

626 

627 yield text 
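
The attribute parsing at the top of `educate_tokens()` can be read as a small mapping from the attribute string to option flags. The sketch below mirrors the branches shown in the listing; the function name `parse_attr` and the dictionary layout are assumptions made for illustration, not part of the module.

```python
def parse_attr(attr):
    """Map an attribute string to option flags (illustrative sketch)."""
    flags = {'quotes': False, 'backticks': False, 'dashes': False,
             'ellipses': False, 'convert_quot': False, 'stupefy': False}
    if attr in ('1', '2', '3'):
        # Numeric presets switch everything on; the digit picks the dash style.
        flags.update(quotes=True, backticks=True,
                     dashes=int(attr), ellipses=True)
    elif attr == '-1':
        flags['stupefy'] = True
    else:
        # Letters are checked in a fixed order, so a later match
        # (for example 'B' after 'b') overrides an earlier one.
        letter_table = (
            ('q', 'quotes', True),
            ('b', 'backticks', True),
            ('B', 'backticks', 2),
            ('d', 'dashes', 1),
            ('D', 'dashes', 2),
            ('i', 'dashes', 3),
            ('e', 'ellipses', True),
            ('w', 'convert_quot', True),
        )
        for letter, key, value in letter_table:
            if letter in attr:
                flags[key] = value
    return flags
```

In this sketch, as in the listing, `'0'` matches none of the letters and leaves every flag off. Note also that `'1'` sets the backtick flag to `True` while `'qBde'` sets it to `2`, so as the code is written the two are not strictly identical.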

628 

629 

630def educateQuotes(text, language='en'): 

631 """ 

632 Parameter: - text string (unicode or bytes). 

633 - language (`BCP 47` language tag.) 

634 Returns: The `text`, with "educated" curly quote characters. 

635 

636 Example input: "Isn't this fun?" 

637 Example output: “Isn’t this fun?” 

638 """ 

639 

640 smart = smartchars(language) 

641 ch_classes = {'open': '[([{]', # opening braces 

642 'close': r'[^\s]', # everything except whitespace 

643 'punct': r"""[-!" #\$\%'()*+,.\/:;<=>?\@\[\\\]\^_`{|}~]""", 

644 'dash': '[-–—]' # hyphen and em/en dashes 

645 r'|&[mn]dash;|&\#8211;|&\#8212;|&\#x201[34];', 

646 'sep': '[\\s\u200B\u200C]|&nbsp;', # Whitespace, ZWSP, ZWNJ 

647 } 

648 

649 # Special case if the very first character is a quote 

650 # followed by punctuation at a non-word-break. Use closing quotes. 

651 # TODO: example (when does this match?) 

652 text = re.sub(r"^'(?=%s\\B)" % ch_classes['punct'], smart.csquote, text) 

653 text = re.sub(r'^"(?=%s\\B)' % ch_classes['punct'], smart.cpquote, text) 

654 

655 # Special case for adjacent quotes 

656 # like "'Quoted' words in a larger quote." 

657 text = re.sub('"\'(?=\\w)', smart.opquote+smart.osquote, text) 

658 text = re.sub('\'"(?=\\w)', smart.osquote+smart.opquote, text) 

659 

660 # Special case: "opening character" followed by quote, 

661 # optional punctuation and space like "[", '(', or '-'. 

662 text = re.sub(r"(%(open)s|%(dash)s)'(?=%(punct)s? )" % ch_classes, 

663 r'\1%s'%smart.csquote, text) 

664 text = re.sub(r'(%(open)s|%(dash)s)"(?=%(punct)s? )' % ch_classes, 

665 r'\1%s'%smart.cpquote, text) 

666 

667 # Special case for decade abbreviations (the '80s): 

668 if language.startswith('en'): # TODO similar cases in other languages? 

669 text = re.sub(r"'(?=\d{2}s)", smart.apostrophe, text) 

670 

671 # Get most opening secondary quotes: 

672 opening_secondary_quotes_regex = re.compile(""" 

673 (# ?<= # look behind fails: requires fixed-width pattern 

674 %(sep)s | # a whitespace char, or 

675 %(open)s | # opening brace, or 

676 %(dash)s # em/en-dash 

677 ) 

678 ' # the quote 

679 (?=\\w|%(punct)s) # word character or punctuation 

680 """ % ch_classes, re.VERBOSE) 

681 

682 text = opening_secondary_quotes_regex.sub(r'\1'+smart.osquote, text) 

683 

684 # In many locales, secondary closing quotes are different from apostrophe: 

685 if smart.csquote != smart.apostrophe: 

686 apostrophe_regex = re.compile(r"(?<=(\w|\d))'(?=\w)") 

687 text = apostrophe_regex.sub(smart.apostrophe, text) 

688 # TODO: keep track of quoting level to recognize apostrophe in, e.g., 

689 # "Ich fass' es nicht." 

690 

691 closing_secondary_quotes_regex = re.compile(r"(?<!\s)'") 

692 text = closing_secondary_quotes_regex.sub(smart.csquote, text) 

693 

694 # Any remaining secondary quotes should be opening ones: 

695 text = text.replace(r"'", smart.osquote) 

696 

697 # Get most opening primary quotes: 

698 opening_primary_quotes_regex = re.compile(""" 

699 ( 

700 %(sep)s | # a whitespace char, or 

701 %(open)s | # opening brace, or 

702 %(dash)s # em/en-dash 

703 ) 

704 " # the quote, followed by 

705 (?=\\w|%(punct)s) # a word character or punctuation 

706 """ % ch_classes, re.VERBOSE) 

707 

708 text = opening_primary_quotes_regex.sub(r'\1'+smart.opquote, text) 

709 

710 # primary closing quotes: 

711 closing_primary_quotes_regex = re.compile(r""" 

712 ( 

713 (?<!\s)" | # no whitespace before 

714 "(?=\s) # followed by whitespace 

715 ) 

716 """, re.VERBOSE) 

717 text = closing_primary_quotes_regex.sub(smart.cpquote, text) 

718 

719 # Any remaining quotes should be opening ones. 

720 text = text.replace(r'"', smart.opquote) 

721 

722 return text 
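
The order of the primary-quote passes matters: opening quotes are matched first, then closing ones, and whatever is left is treated as opening. A much simplified sketch of that ordering for English double quotes only (the name `curl_double` is hypothetical, and the real function handles many more cases such as braces, dashes and secondary quotes):

```python
import re


def curl_double(text, opq='\u201c', cpq='\u201d'):
    """Curl straight double quotes using a simplified three-pass rule."""
    # Prepend a space as context, similar to how the caller passes the
    # last character of the previous token; it is stripped again below.
    text = ' ' + text
    # Pass 1: whitespace, quote, word character -> opening quote.
    text = re.sub(r'(\s)"(?=\w)', r'\1' + opq, text)
    # Pass 2: no whitespace before, or whitespace after -> closing quote.
    text = re.sub(r'(?<!\s)"|"(?=\s)', cpq, text)
    # Pass 3: anything left over is treated as an opening quote.
    text = text.replace('"', opq)
    return text[1:]
```

Without the leading context character, a quote at the very start of the string would satisfy the "no whitespace before" rule and be curled as a closing quote.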

723 

724 

725def educateBackticks(text, language='en'): 

726 """ 

727 Parameter: String (unicode or bytes). 

728 Returns: The `text`, with ``backticks'' -style double quotes 

729 translated into curly quote characters. 

730 Example input: ``Isn't this fun?'' 

731 Example output: “Isn't this fun?” 

732 """ 

733 smart = smartchars(language) 

734 

735 text = text.replace(r'``', smart.opquote) 

736 text = text.replace(r"''", smart.cpquote) 

737 return text 

738 

739 

740def educateSingleBackticks(text, language='en'): 

741 """ 

742 Parameter: String (unicode or bytes). 

743 Returns: The `text`, with `backticks' -style single quotes 

744 translated into curly quote characters. 

745 

746 Example input: `Isn't this fun?' 

747 Example output: ‘Isn’t this fun?’ 

748 """ 

749 smart = smartchars(language) 

750 

751 text = text.replace(r'`', smart.osquote) 

752 text = text.replace(r"'", smart.csquote) 

753 return text 

754 

755 

756def educateDashes(text): 

757 """ 

758 Parameter: String (unicode or bytes). 

759 Returns: The `text`, with each instance of "--" translated to 

760 an em-dash character. 

761 """ 

762 

763 text = text.replace(r'---', smartchars.endash) # en (yes, backwards) 

764 text = text.replace(r'--', smartchars.emdash) # em (yes, backwards) 

765 return text 

766 

767 

768def educateDashesOldSchool(text): 

769 """ 

770 Parameter: String (unicode or bytes). 

771 Returns: The `text`, with each instance of "--" translated to 

772 an en-dash character, and each "---" translated to 

773 an em-dash character. 

774 """ 

775 

776 text = text.replace(r'---', smartchars.emdash) 

777 text = text.replace(r'--', smartchars.endash) 

778 return text 

779 

780 

781def educateDashesOldSchoolInverted(text): 

782 """ 

783 Parameter: String (unicode or bytes). 

784 Returns: The `text`, with each instance of "--" translated to 

785 an em-dash character, and each "---" translated to 

786 an en-dash character. Two reasons why: First, unlike the 

787 en- and em-dash syntax supported by 

788 EducateDashesOldSchool(), it's compatible with existing 

789 entries written before SmartyPants 1.1, back when "--" was 

790 only used for em-dashes. Second, em-dashes are more 

791 common than en-dashes, and so it sort of makes sense that 

792 the shortcut should be shorter to type. (Thanks to Aaron 

793 Swartz for the idea.) 

794 """ 

795 text = text.replace(r'---', smartchars.endash) # en 

796 text = text.replace(r'--', smartchars.emdash) # em 

797 return text 
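
All three dash functions replace the three-hyphen pattern before the two-hyphen one. A small sketch of why that order matters, using hypothetical helper names and the "old school" mapping:

```python
EN_DASH = '\u2013'
EM_DASH = '\u2014'


def dashes_old_school(text):
    """Longest pattern first: '---' -> em dash, then '--' -> en dash."""
    text = text.replace('---', EM_DASH)
    text = text.replace('--', EN_DASH)
    return text


def dashes_wrong_order(text):
    """Shortest pattern first: '---' can never match afterwards."""
    text = text.replace('--', EN_DASH)
    text = text.replace('---', EM_DASH)
    return text
```

With the wrong order, the first two hyphens of `---` are consumed as an en dash and a stray hyphen is left behind.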

798 

799 

800def educateEllipses(text): 

801 """ 

802 Parameter: String (unicode or bytes). 

803 Returns: The `text`, with each instance of "..." translated to 

804 an ellipsis character. 

805 

806 Example input: Huh...? 

807 Example output: Huh…? 

808 """ 

809 

810 text = text.replace(r'...', smartchars.ellipsis) 

811 text = text.replace(r'. . .', smartchars.ellipsis) 

812 return text 

813 

814 

815def stupefyEntities(text, language='en'): 

816 """ 

817 Parameter: String (unicode or bytes). 

818 Returns: The `text`, with each SmartyPants character translated to 

819 its ASCII counterpart. 

820 

821 Example input: “Hello — world.” 

822 Example output: "Hello -- world." 

823 """ 

824 smart = smartchars(language) 

825 

826 text = text.replace(smart.endash, "-") 

827 text = text.replace(smart.emdash, "--") 

828 text = text.replace(smart.osquote, "'") # open secondary quote 

829 text = text.replace(smart.csquote, "'") # close secondary quote 

830 text = text.replace(smart.opquote, '"') # open primary quote 

831 text = text.replace(smart.cpquote, '"') # close primary quote 

832 text = text.replace(smart.ellipsis, '...') 

833 

834 return text 

835 

836 

837def processEscapes(text, restore=False): 

838 r""" 

839 Parameter: String (unicode or bytes). 

840 Returns: The `text`, after processing the following backslash 

841 escape sequences. This is useful if you want to force a "dumb" 

842 quote or other character to appear. 

843 

844 Escape Value 

845 ------ ----- 

846 \\ &#92; 

847 \" &#34; 

848 \' &#39; 

849 \. &#46; 

850 \- &#45; 

851 \` &#96; 

852 """ 

853 replacements = ((r'\\', r'&#92;'), 

854 (r'\"', r'&#34;'), 

855 (r"\'", r'&#39;'), 

856 (r'\.', r'&#46;'), 

857 (r'\-', r'&#45;'), 

858 (r'\`', r'&#96;')) 

859 if restore: 

860 for (ch, rep) in replacements: 

861 text = text.replace(rep, ch[1]) 

862 else: 

863 for (ch, rep) in replacements: 

864 text = text.replace(ch, rep) 

865 

866 return text 
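
# Illustration (not part of smartquotes.py): a self-contained sketch of the
# escape/restore round trip described above. REPLACEMENTS mirrors the table
# in the docstring; process_escapes_sketch is a hypothetical name.

```python
REPLACEMENTS = (
    (r'\\', '&#92;'),
    (r'\"', '&#34;'),
    (r"\'", '&#39;'),
    (r'\.', '&#46;'),
    (r'\-', '&#45;'),
    (r'\`', '&#96;'),
)


def process_escapes_sketch(text, restore=False):
    """Hide backslash-escaped characters as entities, or bring them back."""
    for escape, entity in REPLACEMENTS:
        if restore:
            # escape[1] is the bare character after the backslash
            text = text.replace(entity, escape[1])
        else:
            text = text.replace(escape, entity)
    return text


source = r'He said \"hi\" \-\- ok'
escaped = process_escapes_sketch(source)
restored = process_escapes_sketch(escaped, restore=True)
print(escaped)   # He said &#34;hi&#34; &#45;&#45; ok
print(restored)  # He said "hi" -- ok
```

# While the characters are stored as entities, the quote and dash
# replacements cannot see them; restoring afterwards yields the "dumb"
# characters without their backslashes.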

867 

868 

869def tokenize(text): 

870 """ 

871 Parameter: String containing HTML markup. 

872 Returns: An iterator that yields the tokens comprising the input 

873 string. Each token is either a tag (possibly with nested 

874 tags contained therein, such as <a href="<MTFoo>">), or a 

875 run of text between tags. Each yielded element is a 

876 two-element tuple; the first is either 'tag' or 'text'; 

877 the second is the actual value. 

878 

879 Based on the _tokenize() subroutine from Brad Choate's MTRegex plugin. 

880 """ 

881 tag_soup = re.compile(r'([^<]*)(<[^>]*>)') 

882 token_match = tag_soup.search(text) 

883 previous_end = 0 

884 

885 while token_match is not None: 

886 if token_match.group(1): 

887 yield 'text', token_match.group(1) 

888 yield 'tag', token_match.group(2) 

889 previous_end = token_match.end() 

890 token_match = tag_soup.search(text, token_match.end()) 

891 

892 if previous_end < len(text): 

893 yield 'text', text[previous_end:] 
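
# Illustration (not part of smartquotes.py): a standalone sketch of the same
# tag/text splitting, using re.finditer instead of the explicit search loop.
# tokenize_sketch is a hypothetical name; the pattern is the one used above.

```python
import re

TAG_SOUP = re.compile(r'([^<]*)(<[^>]*>)')


def tokenize_sketch(text):
    """Yield ('text', ...) and ('tag', ...) tuples in document order."""
    previous_end = 0
    for match in TAG_SOUP.finditer(text):
        if match.group(1):
            yield 'text', match.group(1)
        yield 'tag', match.group(2)
        previous_end = match.end()
    if previous_end < len(text):
        # trailing text after the last tag (or the whole string if no tags)
        yield 'text', text[previous_end:]


tokens = list(tokenize_sketch('See <a href="x">this -- link</a> now'))
for kind, value in tokens:
    print(kind, repr(value))

nested = list(tokenize_sketch('<a href="<MTFoo>">'))
```

# Only the 'text' tokens would be passed on to the quote and dash educators;
# 'tag' tokens are emitted unchanged. Note that with this particular pattern
# `[^>]*` stops at the first ">", so the nested example is split into a tag
# token followed by a short text token.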

894 

895 

896if __name__ == "__main__": 

897 

898 import itertools 

899 import locale 

900 try: 

901 locale.setlocale(locale.LC_ALL, '') # set to user defaults 

902 defaultlanguage = locale.getlocale()[0] or 'en' # getlocale() may return (None, None) 

903 except: # noqa catchall 

904 defaultlanguage = 'en' 

905 

906 # Normalize and drop unsupported subtags: 

907 defaultlanguage = defaultlanguage.lower().replace('-', '_') 

908 # split (except singletons, which mark the following tag as non-standard): 

909 defaultlanguage = re.sub(r'_([a-zA-Z0-9])_', r'_\1-', defaultlanguage) 

910 _subtags = defaultlanguage.split('_') 

911 _basetag = _subtags.pop(0) 

912 # find all combinations of subtags 

913 for n in range(len(_subtags), 0, -1): 

914 for tags in itertools.combinations(_subtags, n): 

915 _tag = '-'.join((_basetag, *tags)) 

916 if _tag in smartchars.quotes: 

917 defaultlanguage = _tag 

918 break 

919 else: 

920 if _basetag in smartchars.quotes: 

921 defaultlanguage = _basetag 

922 else: 

923 defaultlanguage = 'en' 

924 

925 import argparse 

926 parser = argparse.ArgumentParser( 

927 description='Filter <input> making ASCII punctuation "smart".') 

928 # TODO: require input arg or other means to print USAGE instead of waiting. 

929 # parser.add_argument("input", help="Input stream, use '-' for stdin.") 

930 parser.add_argument("-a", "--action", default="1", 

931 help="what to do with the input (see --actionhelp)") 

932 parser.add_argument("-e", "--encoding", default="utf-8", 

933 help="text encoding") 

934 parser.add_argument("-l", "--language", default=defaultlanguage, 

935 help="text language (BCP47 tag), " 

936 f"Default: {defaultlanguage}") 

937 parser.add_argument("-q", "--alternative-quotes", action="store_true", 

938 help="use alternative quote style") 

939 parser.add_argument("--doc", action="store_true", 

940 help="print documentation") 

941 parser.add_argument("--actionhelp", action="store_true", 

942 help="list available actions") 

943 parser.add_argument("--stylehelp", action="store_true", 

944 help="list available quote styles") 

945 parser.add_argument("--test", action="store_true", 

946 help="perform short self-test") 

947 args = parser.parse_args() 

948 

949 if args.doc: 

950 print(__doc__) 

951 elif args.actionhelp: 

952 print(options) 

953 elif args.stylehelp: 

954 print() 

955 print("Available styles (primary open/close, secondary open/close)") 

956 print("language tag quotes") 

957 print("============ ======") 

958 for key in sorted(smartchars.quotes.keys()): 

959 print("%-14s %s" % (key, smartchars.quotes[key])) 

960 elif args.test: 

961 # Unit test output goes to stderr. 

962 import unittest 

963 

964 class TestSmartypantsAllAttributes(unittest.TestCase): 

965 # the default attribute is "1", which means "all". 

966 def test_dates(self): 

967 self.assertEqual(smartyPants("1440-80's"), "1440-80’s") 

968 self.assertEqual(smartyPants("1440-'80s"), "1440-’80s") 

969 self.assertEqual(smartyPants("1440---'80s"), "1440–’80s") 

970 self.assertEqual(smartyPants("1960's"), "1960’s") 

971 self.assertEqual(smartyPants("one two '60s"), "one two ’60s") 

972 self.assertEqual(smartyPants("'60s"), "’60s") 

973 

974 def test_educated_quotes(self): 

975 self.assertEqual(smartyPants('"Isn\'t this fun?"'), 

976 '“Isn’t this fun?”') 

977 

978 def test_html_tags(self): 

979 text = '<a src="foo">more</a>' 

980 self.assertEqual(smartyPants(text), text) 

981 

982 suite = unittest.TestLoader().loadTestsFromTestCase( 

983 TestSmartypantsAllAttributes) 

984 unittest.TextTestRunner().run(suite) 

985 

986 else: 

987 if args.alternative_quotes: 

988 if '-x-altquot' in args.language: 

989 args.language = args.language.replace('-x-altquot', '') 

990 else: 

991 args.language += '-x-altquot' 

992 text = sys.stdin.read() 

993 print(smartyPants(text, attr=args.action, language=args.language))