1#! /usr/bin/env python3
2# :Id: $Id$
3# :Copyright: © 2010-2023 Günter Milde,
4# original `SmartyPants`_: © 2003 John Gruber
5# smartypants.py: © 2004, 2007 Chad Miller
6# :Maintainer: docutils-develop@lists.sourceforge.net
7# :License: Released under the terms of the `2-Clause BSD license`_, in short:
8#
9# Copying and distribution of this file, with or without modification,
10# are permitted in any medium without royalty provided the copyright
11# notices and this notice are preserved.
12# This file is offered as-is, without any warranty.
13#
14# .. _2-Clause BSD license: https://opensource.org/licenses/BSD-2-Clause
15
16
17r"""
18=========================
19Smart Quotes for Docutils
20=========================
21
22Synopsis
23========
24
25"SmartyPants" is a free web publishing plug-in for Movable Type, Blosxom, and
26BBEdit that easily translates plain ASCII punctuation characters into "smart"
27typographic punctuation characters.
28
29``smartquotes.py`` is an adaption of "SmartyPants" to Docutils_.
30
31* Using Unicode instead of HTML entities for typographic punctuation
32 characters, it works for any output format that supports Unicode.
33* Supports `language specific quote characters`__.
34
35__ https://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks
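
As a rough illustration of language-specific quotes (a minimal sketch, not
this module's API: the names ``QUOTES``, ``ASCII_FALLBACK`` and
``quote_chars`` are hypothetical), each language tag can map to four
characters in the order primary open/close, secondary open/close, with
plain ASCII quotes as the fallback for unknown tags:

```python
# Hypothetical two-entry table; the real table covers many more languages.
QUOTES = {
    'en': '\u201c\u201d\u2018\u2019',  # English curly quotes
    'de': '\u201e\u201c\u201a\u2018',  # German low-high quotes
}
ASCII_FALLBACK = ('"', '"', "'", "'")


def quote_chars(language):
    # Look up the tag case-insensitively; unknown tags keep ASCII quotes.
    chars = QUOTES.get(language.lower())
    if chars is None:
        return ASCII_FALLBACK
    return tuple(chars)


primary_open, primary_close, secondary_open, secondary_close = quote_chars('de')
print(primary_open + 'Hallo' + primary_close)
```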
36
37
38Authors
39=======
40
41`John Gruber`_ did all of the hard work of writing this software in Perl for
42`Movable Type`_ and almost all of this useful documentation. `Chad Miller`_
43ported it to Python to use with Pyblosxom_.
44Adapted to Docutils_ by Günter Milde.
45
46Additional Credits
47==================
48
49Portions of the SmartyPants original work are based on Brad Choate's nifty
50MTRegex plug-in. `Brad Choate`_ also contributed a few bits of source code to
51this plug-in. Brad Choate is a fine hacker indeed.
52
53`Jeremy Hedley`_ and `Charles Wiltgen`_ deserve mention for exemplary beta
54testing of the original SmartyPants.
55
56`Rael Dornfest`_ ported SmartyPants to Blosxom.
57
58.. _Brad Choate: http://bradchoate.com/
59.. _Jeremy Hedley: http://antipixel.com/
60.. _Charles Wiltgen: http://playbacktime.com/
61.. _Rael Dornfest: http://raelity.org/
62
63
64Copyright and License
65=====================
66
67SmartyPants_ license (3-Clause BSD license):
68
69 Copyright (c) 2003 John Gruber (http://daringfireball.net/)
70 All rights reserved.
71
72 Redistribution and use in source and binary forms, with or without
73 modification, are permitted provided that the following conditions are
74 met:
75
76 * Redistributions of source code must retain the above copyright
77 notice, this list of conditions and the following disclaimer.
78
79 * Redistributions in binary form must reproduce the above copyright
80 notice, this list of conditions and the following disclaimer in
81 the documentation and/or other materials provided with the
82 distribution.
83
84 * Neither the name "SmartyPants" nor the names of its contributors
85 may be used to endorse or promote products derived from this
86 software without specific prior written permission.
87
88 This software is provided by the copyright holders and contributors
89 "as is" and any express or implied warranties, including, but not
90 limited to, the implied warranties of merchantability and fitness for
91 a particular purpose are disclaimed. In no event shall the copyright
92 owner or contributors be liable for any direct, indirect, incidental,
93 special, exemplary, or consequential damages (including, but not
94 limited to, procurement of substitute goods or services; loss of use,
95 data, or profits; or business interruption) however caused and on any
96 theory of liability, whether in contract, strict liability, or tort
97 (including negligence or otherwise) arising in any way out of the use
98 of this software, even if advised of the possibility of such damage.
99
100smartypants.py license (2-Clause BSD license):
101
102 smartypants.py is a derivative work of SmartyPants.
103
104 Redistribution and use in source and binary forms, with or without
105 modification, are permitted provided that the following conditions are
106 met:
107
108 * Redistributions of source code must retain the above copyright
109 notice, this list of conditions and the following disclaimer.
110
111 * Redistributions in binary form must reproduce the above copyright
112 notice, this list of conditions and the following disclaimer in
113 the documentation and/or other materials provided with the
114 distribution.
115
116 This software is provided by the copyright holders and contributors
117 "as is" and any express or implied warranties, including, but not
118 limited to, the implied warranties of merchantability and fitness for
119 a particular purpose are disclaimed. In no event shall the copyright
120 owner or contributors be liable for any direct, indirect, incidental,
121 special, exemplary, or consequential damages (including, but not
122 limited to, procurement of substitute goods or services; loss of use,
123 data, or profits; or business interruption) however caused and on any
124 theory of liability, whether in contract, strict liability, or tort
125 (including negligence or otherwise) arising in any way out of the use
126 of this software, even if advised of the possibility of such damage.
127
128.. _John Gruber: http://daringfireball.net/
129.. _Chad Miller: http://web.chad.org/
130
131.. _Pyblosxom: http://pyblosxom.bluesock.org/
132.. _SmartyPants: http://daringfireball.net/projects/smartypants/
133.. _Movable Type: http://www.movabletype.org/
134.. _2-Clause BSD license: https://opensource.org/licenses/BSD-2-Clause
135.. _Docutils: https://docutils.sourceforge.io/
136
137Description
138===========
139
140SmartyPants can perform the following transformations:
141
142- Straight quotes ( " and ' ) into "curly" quote characters
143- Backticks-style quotes (\`\`like this'') into "curly" quote characters
- Dashes (``--`` and ``---``) into en- and em-dash characters
145- Three consecutive dots (``...`` or ``. . .``) into an ellipsis ``…``.
146
147This means you can write, edit, and save your posts using plain old
148ASCII straight quotes, plain dashes, and plain dots, but your published
149posts (and final HTML output) will appear with smart quotes, em-dashes,
150and proper ellipses.
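
The dash and ellipsis rules can be sketched with plain string replacement.
This is an illustration only, assuming the old-school typewriter shorthand;
the helper name ``educate_punctuation`` is hypothetical. The one subtle
point is ordering: the longer sequence has to be replaced first.

```python
EN_DASH = '\u2013'
EM_DASH = '\u2014'
ELLIPSIS = '\u2026'


def educate_punctuation(text):
    # Replace '---' before '--'; otherwise every '---' would turn into
    # an en-dash followed by a stray hyphen.
    text = text.replace('---', EM_DASH)
    text = text.replace('--', EN_DASH)
    # Both the compact and the spaced form become a single ellipsis.
    text = text.replace('...', ELLIPSIS)
    text = text.replace('. . .', ELLIPSIS)
    return text


print(educate_punctuation('pages 3--5 --- wait...'))
```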
151
152Backslash Escapes
153=================
154
If you need to use literal straight quotes (or plain hyphens and periods),
`smartquotes` accepts the following backslash escape sequences to force
ASCII punctuation. Note that you need two backslashes in "docstrings", as
Python expands them, too.
159
========  =========
Escape    Character
========  =========
``\\``    \\
``\\"``   \\"
``\\'``   \\'
``\\.``   \\.
``\\-``   \\-
``\\```   \\`
========  =========
170
171This is useful, for example, when you want to use straight quotes as
172foot and inch marks: 6\\'2\\" tall; a 17\\" iMac.
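
One way such escapes can be honoured is to hide the escaped characters
behind placeholders before any transformation runs and to put the literal
characters back afterwards. The sketch below assumes numeric character
references as placeholders; the names ``ESCAPES``, ``protect`` and
``restore`` are hypothetical.

```python
# Each escape sequence maps to a placeholder that the quote, dash and
# ellipsis rules will not touch (an assumption of this sketch).
ESCAPES = (
    (r'\\', '&#92;'),
    (r'\"', '&#34;'),
    (r"\'", '&#39;'),
    (r'\.', '&#46;'),
    (r'\-', '&#45;'),
    (r'\`', '&#96;'),
)


def protect(text):
    # Hide escaped characters before the text is transformed.
    for escape, placeholder in ESCAPES:
        text = text.replace(escape, placeholder)
    return text


def restore(text):
    # Put back the literal character, dropping the backslash.
    for escape, placeholder in ESCAPES:
        text = text.replace(placeholder, escape[1])
    return text


hidden = protect(r"6\'2\" tall")
print(hidden)
print(restore(hidden))
```

Anything between ``protect()`` and ``restore()`` sees no straight quotes at
the escaped positions, so they come out unchanged.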
173
174
175Caveats
176=======
177
178Why You Might Not Want to Use Smart Quotes in Your Weblog
179---------------------------------------------------------
180
181For one thing, you might not care.
182
183Most normal, mentally stable individuals do not take notice of proper
184typographic punctuation. Many design and typography nerds, however, break
185out in a nasty rash when they encounter, say, a restaurant sign that uses
186a straight apostrophe to spell "Joe's".
187
188If you're the sort of person who just doesn't care, you might well want to
189continue not caring. Using straight quotes -- and sticking to the 7-bit
190ASCII character set in general -- is certainly a simpler way to live.
191
192Even if you *do* care about accurate typography, you still might want to
193think twice before educating the quote characters in your weblog. One side
194effect of publishing curly quote characters is that it makes your
195weblog a bit harder for others to quote from using copy-and-paste. What
196happens is that when someone copies text from your blog, the copied text
197contains the 8-bit curly quote characters (as well as the 8-bit characters
198for em-dashes and ellipses, if you use these options). These characters
199are not standard across different text encoding methods, which is why they
200need to be encoded as characters.
201
202People copying text from your weblog, however, may not notice that you're
203using curly quotes, and they'll go ahead and paste the unencoded 8-bit
204characters copied from their browser into an email message or their own
205weblog. When pasted as raw "smart quotes", these characters are likely to
206get mangled beyond recognition.
207
208That said, my own opinion is that any decent text editor or email client
209makes it easy to stupefy smart quote characters into their 7-bit
210equivalents, and I don't consider it my problem if you're using an
211indecent text editor or email client.
212
213
214Algorithmic Shortcomings
215------------------------
216
217One situation in which quotes will get curled the wrong way is when
218apostrophes are used at the start of leading contractions. For example::
219
220 'Twas the night before Christmas.
221
222In the case above, SmartyPants will turn the apostrophe into an opening
223secondary quote, when in fact it should be the `RIGHT SINGLE QUOTATION MARK`
224character which is also "the preferred character to use for apostrophe"
225(Unicode). I don't think this problem can be solved in the general case --
every word processor I've tried gets this wrong as well. In such cases, it's
best to insert the `RIGHT SINGLE QUOTATION MARK` (’) by hand.
228
229In English, the same character is used for apostrophe and closing secondary
quote (both plain and "smart" ones). For other locales (French, Italian,
231Swiss, ...) "smart" secondary closing quotes differ from the curly apostrophe.
232
233 .. class:: language-fr
234
235 Il dit : "C'est 'super' !"
236
237If the apostrophe is used at the end of a word, it cannot be distinguished
238from a secondary quote by the algorithm. Therefore, a text like::
239
240 .. class:: language-de-CH
241
242 "Er sagt: 'Ich fass' es nicht.'"
243
244will get a single closing guillemet instead of an apostrophe.
245
This can be prevented by use of the `RIGHT SINGLE QUOTATION MARK` in
247the source::
248
249 - "Er sagt: 'Ich fass' es nicht.'"
250 + "Er sagt: 'Ich fass’ es nicht.'"
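
The limitation can be illustrated with a simplified rule (a sketch only;
``MID_WORD`` and ``mark_apostrophes`` are hypothetical names): a straight
quote between two word characters is safely an apostrophe, whereas a quote
at the end of a word looks exactly like a closing secondary quote.

```python
import re

# A quote with a word character on both sides is treated as an apostrophe.
MID_WORD = re.compile(r"(?<=\w)'(?=\w)")
APOSTROPHE = '\u2019'  # RIGHT SINGLE QUOTATION MARK


def mark_apostrophes(text):
    # Only mid-word quotes are converted; all others are left for the
    # opening/closing quote rules.
    return MID_WORD.sub(APOSTROPHE, text)


print(mark_apostrophes("C'est 'super'"))
print(mark_apostrophes("Ich fass' es nicht."))
```

The second call returns its input unchanged: nothing in the immediate
context tells the word-final apostrophe apart from a closing quote.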
251
252
253Version History
254===============
255
2561.10 2023-11-18
257 - Pre-compile regexps once, not with every call of `educateQuotes()`
258 (patch #206 by Chris Sewell). Simplify regexps.
259
2601.9 2022-03-04
261 - Code cleanup. Require Python 3.
262
2631.8.1 2017-10-25
264 - Use open quote after Unicode whitespace, ZWSP, and ZWNJ.
265 - Code cleanup.
266
2671.8: 2017-04-24
268 - Command line front-end.
269
2701.7.1: 2017-03-19
271 - Update and extend language-dependent quotes.
272 - Differentiate apostrophe from single quote.
273
2741.7: 2012-11-19
275 - Internationalization: language-dependent quotes.
276
2771.6.1: 2012-11-06
278 - Refactor code, code cleanup,
279 - `educate_tokens()` generator as interface for Docutils.
280
2811.6: 2010-08-26
282 - Adaption to Docutils:
283 - Use Unicode instead of HTML entities,
284 - Remove code special to pyblosxom.
285
2861.5_1.6: Fri, 27 Jul 2007 07:06:40 -0400
    - Fixed bug where blocks of precious unalterable text were instead
      interpreted. Thanks to Le Roux and Dirk van Oosterbosch.
289
2901.5_1.5: Sat, 13 Aug 2005 15:50:24 -0400
291 - Fix bogus magical quotation when there is no hint that the
292 user wants it, e.g., in "21st century". Thanks to Nathan Hamblen.
293 - Be smarter about quotes before terminating numbers in an en-dash'ed
294 range.
295
2961.5_1.4: Thu, 10 Feb 2005 20:24:36 -0500
297 - Fix a date-processing bug, as reported by jacob childress.
298 - Begin a test-suite for ensuring correct output.
299 - Removed import of "string", since I didn't really need it.
      (This was my first ever Python program. Sue me!)
301
3021.5_1.3: Wed, 15 Sep 2004 18:25:58 -0400
303 - Abort processing if the flavour is in forbidden-list. Default of
304 [ "rss" ] (Idea of Wolfgang SCHNERRING.)
305 - Remove stray virgules from en-dashes. Patch by Wolfgang SCHNERRING.
306
3071.5_1.2: Mon, 24 May 2004 08:14:54 -0400
308 - Some single quotes weren't replaced properly. Diff-tesuji played
309 by Benjamin GEIGER.
310
3111.5_1.1: Sun, 14 Mar 2004 14:38:28 -0500
312 - Support upcoming pyblosxom 0.9 plugin verification feature.
313
3141.5_1.0: Tue, 09 Mar 2004 08:08:35 -0500
315 - Initial release
316"""
317
318import re
319import sys
320
321
322options = r"""
323Options
324=======
325
326Numeric values are the easiest way to configure SmartyPants' behavior:
327
328:0: Suppress all transformations. (Do nothing.)
329
:1: Performs default SmartyPants transformations: quotes (including
    \`\`backticks'' -style), em-dashes, and ellipses. "``--``" (dash dash)
    is used to signify an em-dash; there is no support for en-dashes.

:2: Same as smarty_pants="1", except that it uses the old-school typewriter
    shorthand for dashes: "``--``" (dash dash) for en-dashes, "``---``"
    (dash dash dash) for em-dashes.
338
339:3: Same as smarty_pants="2", but inverts the shorthand for dashes:
340 "``--``" (dash dash) for em-dashes, and "``---``" (dash dash dash) for
341 en-dashes.
342
343:-1: Stupefy mode. Reverses the SmartyPants transformation process, turning
344 the characters produced by SmartyPants into their ASCII equivalents.
345 E.g. the LEFT DOUBLE QUOTATION MARK (“) is turned into a simple
346 double-quote (\"), "—" is turned into two dashes, etc.
347
348
349The following single-character attribute values can be combined to toggle
350individual transformations from within the smarty_pants attribute. For
351example, ``"1"`` is equivalent to ``"qBde"``.
352
353:q: Educates normal quote characters: (") and (').
354
355:b: Educates \`\`backticks'' -style double quotes.
356
357:B: Educates \`\`backticks'' -style double quotes and \`single' quotes.
358
359:d: Educates em-dashes.
360
361:D: Educates em-dashes and en-dashes, using old-school typewriter
362 shorthand: (dash dash) for en-dashes, (dash dash dash) for em-dashes.
363
364:i: Educates em-dashes and en-dashes, using inverted old-school typewriter
365 shorthand: (dash dash) for em-dashes, (dash dash dash) for en-dashes.
366
367:e: Educates ellipses.
368
:w: Translates any instance of ``&quot;`` into a normal double-quote
    character. This should be of no interest to most people, but
    of particular interest to anyone who writes their posts using
    Dreamweaver, as Dreamweaver inexplicably uses this entity to represent
    a literal double-quote character. SmartyPants only educates normal
    quotes, not entities (because ordinarily, entities are used for
    the explicit purpose of representing the specific character they
    represent). The "w" option must be used in conjunction with one (or
    both) of the other quote options ("q" or "b"). Thus, if you wish to
    apply all SmartyPants transformations (quotes, en- and em-dashes, and
    ellipses) and also translate ``&quot;`` entities into regular quotes
    so SmartyPants can educate them, you should pass the following to the
    smarty_pants attribute:
382"""
383
384
385class smartchars:
386 """Smart quotes and dashes"""
387
388 endash = '–' # EN DASH
389 emdash = '—' # EM DASH
390 ellipsis = '…' # HORIZONTAL ELLIPSIS
391 apostrophe = '’' # RIGHT SINGLE QUOTATION MARK
392
393 # quote characters (language-specific, set in __init__())
394 # https://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks
395 # https://de.wikipedia.org/wiki/Anf%C3%BChrungszeichen#Andere_Sprachen
396 # https://fr.wikipedia.org/wiki/Guillemet
397 # https://typographisme.net/post/Les-espaces-typographiques-et-le-web
398 # https://www.btb.termiumplus.gc.ca/tpv2guides/guides/redac/index-fra.html
399 # https://en.wikipedia.org/wiki/Hebrew_punctuation#Quotation_marks
400 # [7] https://www.tustep.uni-tuebingen.de/bi/bi00/bi001t1-anfuehrung.pdf
401 # [8] https://www.korrekturavdelingen.no/anforselstegn.htm
402 # [9] Typografisk håndbok. Oslo: Spartacus. 2000. s. 67. ISBN 8243001530.
403 # [10] https://www.typografi.org/sitat/sitatart.html
404 # [11] https://mk.wikipedia.org/wiki/Правопис_и_правоговор_на_македонскиот_јазик # noqa:E501
405 # [12] https://hrvatska-tipografija.com/polunavodnici/
406 # [13] https://pl.wikipedia.org/wiki/Cudzys%C5%82%C3%B3w
407 #
408 # See also configuration option "smartquote-locales".
409 quotes = {
410 'af': '“”‘’',
411 'af-x-altquot': '„”‚’',
412 'bg': '„“‚‘', # https://bg.wikipedia.org/wiki/Кавички
413 'ca': '«»“”',
414 'ca-x-altquot': '“”‘’',
415 'cs': '„“‚‘',
416 'cs-x-altquot': '»«›‹',
417 'da': '»«›‹',
418 'da-x-altquot': '„“‚‘',
419 # 'da-x-altquot2': '””’’',
420 'de': '„“‚‘',
421 'de-x-altquot': '»«›‹',
422 'de-ch': '«»‹›',
423 'el': '«»“”', # '«»‟”' https://hal.science/hal-02101618
424 'en': '“”‘’',
425 'en-uk-x-altquot': '‘’“”', # Attention: " → ‘ and ' → “ !
426 'eo': '“”‘’',
427 'es': '«»“”',
428 'es-x-altquot': '“”‘’',
429 'et': '„“‚‘', # no secondary quote listed in
430 'et-x-altquot': '«»‹›', # the sources above (wikipedia.org)
431 'eu': '«»‹›',
432 'fi': '””’’',
433 'fi-x-altquot': '»»››',
        'fr': ('«\u00a0', '\u00a0»', '“', '”'),  # full no-break space
        'fr-x-altquot': ('«\u202f', '\u202f»', '“', '”'),  # narrow no-break space
        'fr-ch': '«»‹›',  # https://typoguide.ch/
        'fr-ch-x-altquot': ('«\u202f', '\u202f»', '‹\u202f', '\u202f›'),  # narrow no-break space # noqa:E501
438 'gl': '«»“”',
439 'he': '”“»«', # Hebrew is RTL, test position:
440 'he-x-altquot': '„”‚’', # low quotation marks are opening.
441 # 'he-x-altquot': '“„‘‚', # RTL: low quotation marks opening
442 'hr': '„”‘’', # Croatian [12]
443 'hr-x-altquot': '»«›‹',
444 'hsb': '„“‚‘',
445 'hsb-x-altquot': '»«›‹',
446 'hu': '„”«»',
447 'is': '„“‚‘',
448 'it': '«»“”',
449 'it-ch': '«»‹›',
450 'it-x-altquot': '“”‘’',
451 # 'it-x-altquot2': '“„‘‚', # [7] in headlines
452 'ja': '「」『』',
453 'ko': '“”‘’',
454 'lt': '„“‚‘',
455 'lv': '„“‚‘',
456 'mk': '„“‚‘', # Macedonian [11]
457 'nl': '“”‘’',
458 'nl-x-altquot': '„”‚’',
459 # 'nl-x-altquot2': '””’’',
460 'nb': '«»’’', # Norsk bokmål (canonical form 'no')
461 'nn': '«»’’', # Nynorsk [10]
462 'nn-x-altquot': '«»‘’', # [8], [10]
463 # 'nn-x-altquot2': '«»«»', # [9], [10]
464 # 'nn-x-altquot3': '„“‚‘', # [10]
465 'no': '«»’’', # Norsk bokmål [10]
466 'no-x-altquot': '«»‘’', # [8], [10]
        # 'no-x-altquot2': '«»«»',  # [9], [10]
468 # 'no-x-altquot3': '„“‚‘', # [10]
469 'pl': '„”«»',
470 'pl-x-altquot': '«»‚’',
471 # 'pl-x-altquot2': '„”‚’', # [13]
472 'pt': '«»“”',
473 'pt-br': '“”‘’',
474 'ro': '„”«»',
475 'ru': '«»„“',
476 'sh': '„”‚’', # Serbo-Croatian
477 'sh-x-altquot': '»«›‹',
478 'sk': '„“‚‘', # Slovak
479 'sk-x-altquot': '»«›‹',
480 'sl': '„“‚‘', # Slovenian
481 'sl-x-altquot': '»«›‹',
482 'sq': '«»‹›', # Albanian
483 'sq-x-altquot': '“„‘‚',
484 'sr': '„”’’',
485 'sr-x-altquot': '»«›‹',
486 'sv': '””’’',
487 'sv-x-altquot': '»»››',
488 'tr': '“”‘’',
489 'tr-x-altquot': '«»‹›',
490 # 'tr-x-altquot2': '“„‘‚', # [7] antiquated?
491 'uk': '«»„“',
492 'uk-x-altquot': '„“‚‘',
493 'zh-cn': '“”‘’',
494 'zh-tw': '「」『』',
495 }
496
497 def __init__(self, language='en') -> None:
498 self.language = language
499 try:
500 (self.opquote, self.cpquote,
501 self.osquote, self.csquote) = self.quotes[language.lower()]
502 except KeyError:
503 self.opquote, self.cpquote, self.osquote, self.csquote = '""\'\''
504
505
506class RegularExpressions:
507 # character classes:
508 _CH_CLASSES = {'open': '[([{]', # opening braces
509 'close': r'[^\s]', # everything except whitespace
510 'punct': r"""[-!" #\$\%'()*+,.\/:;<=>?\@\[\\\]\^_`{|}~]""",
511 'dash': r'[-–—]',
512 'sep': '[\\s\u200B\u200C]', # Whitespace, ZWSP, ZWNJ
513 }
514 START_SINGLE = re.compile(r"^'(?=%s\\B)" % _CH_CLASSES['punct'])
515 START_DOUBLE = re.compile(r'^"(?=%s\\B)' % _CH_CLASSES['punct'])
516 ADJACENT_1 = re.compile('"\'(?=\\w)')
517 ADJACENT_2 = re.compile('\'"(?=\\w)')
518 OPEN_SINGLE = re.compile(r"(%(open)s|%(dash)s)'(?=%(punct)s? )"
519 % _CH_CLASSES)
520 OPEN_DOUBLE = re.compile(r'(%(open)s|%(dash)s)"(?=%(punct)s? )'
521 % _CH_CLASSES)
522 DECADE = re.compile(r"'(?=\d{2}s)")
523 APOSTROPHE = re.compile(r"(?<=(\w|\d))'(?=\w)")
524 OPENING_SECONDARY = re.compile("""
525 (# ?<= # look behind fails: requires fixed-width pattern
526 %(sep)s | # a whitespace char, or
527 %(open)s | # opening brace, or
528 %(dash)s # em/en-dash
529 )
530 ' # the quote
531 (?=\\w|%(punct)s) # word character or punctuation
532 """ % _CH_CLASSES, re.VERBOSE)
533 CLOSING_SECONDARY = re.compile(r"(?<!\s)'")
    OPENING_PRIMARY = re.compile("""
                    (
                      %(sep)s  |  # a whitespace or zero-width separating char, or
                      %(open)s |  # opening brace, or
                      %(dash)s    # em/en-dash
                    )
                    "                 # the quote, followed by
                    (?=\\w|%(punct)s) # a word character or punctuation
                    """ % _CH_CLASSES, re.VERBOSE)
543 CLOSING_PRIMARY = re.compile(r"""
544 (
545 (?<!\s)" | # no whitespace before
546 "(?=\s) # whitespace behind
547 )
548 """, re.VERBOSE)
549
550
551regexes = RegularExpressions()
552
553
554default_smartypants_attr = '1'
555
556
557def smartyPants(text, attr=default_smartypants_attr, language='en'):
558 """Main function for "traditional" use."""
559
    return "".join(educate_tokens(tokenize(text), attr, language))
561
562
563def educate_tokens(text_tokens, attr=default_smartypants_attr, language='en'):
564 """Return iterator that "educates" the items of `text_tokens`."""
565 # Parse attributes:
566 # 0 : do nothing
567 # 1 : set all
568 # 2 : set all, using old school en- and em- dash shortcuts
569 # 3 : set all, using inverted old school en and em- dash shortcuts
570 #
571 # q : quotes
572 # b : backtick quotes (``double'' only)
573 # B : backtick quotes (``double'' and `single')
574 # d : dashes
575 # D : old school dashes
576 # i : inverted old school dashes
577 # e : ellipses
    # w : convert &quot; entities to " for Dreamweaver users

    convert_quot = False  # translate &quot; entities into normal quotes?
581 do_dashes = False
582 do_backticks = False
583 do_quotes = False
584 do_ellipses = False
585 do_stupefy = False
586
587 # if attr == "0": # pass tokens unchanged (see below).
588 if attr == '1': # Do everything, turn all options on.
589 do_quotes = True
590 do_backticks = True
591 do_dashes = 1
592 do_ellipses = True
593 elif attr == '2':
594 # Do everything, turn all options on, use old school dash shorthand.
595 do_quotes = True
596 do_backticks = True
597 do_dashes = 2
598 do_ellipses = True
599 elif attr == '3':
600 # Do everything, use inverted old school dash shorthand.
601 do_quotes = True
602 do_backticks = True
603 do_dashes = 3
604 do_ellipses = True
605 elif attr == '-1': # Special "stupefy" mode.
606 do_stupefy = True
607 else:
608 if 'q' in attr: do_quotes = True # noqa: E701
609 if 'b' in attr: do_backticks = True # noqa: E701
610 if 'B' in attr: do_backticks = 2 # noqa: E701
611 if 'd' in attr: do_dashes = 1 # noqa: E701
612 if 'D' in attr: do_dashes = 2 # noqa: E701
613 if 'i' in attr: do_dashes = 3 # noqa: E701
614 if 'e' in attr: do_ellipses = True # noqa: E701
615 if 'w' in attr: convert_quot = True # noqa: E701
616
617 prev_token_last_char = ' '
618 # Last character of the previous text token. Used as
619 # context to curl leading quote characters correctly.
620
621 for (ttype, text) in text_tokens:
622
623 # skip HTML and/or XML tags as well as empty text tokens
624 # without updating the last character
625 if ttype == 'tag' or not text:
626 yield text
627 continue
628
629 # skip literal text (math, literal, raw, ...)
630 if ttype == 'literal':
631 prev_token_last_char = text[-1:]
632 yield text
633 continue
634
635 last_char = text[-1:] # Remember last char before processing.
636
637 text = processEscapes(text)
638
        if convert_quot:
            text = text.replace('&quot;', '"')
641
642 if do_dashes == 1:
643 text = educateDashes(text)
644 elif do_dashes == 2:
645 text = educateDashesOldSchool(text)
646 elif do_dashes == 3:
647 text = educateDashesOldSchoolInverted(text)
648
649 if do_ellipses:
650 text = educateEllipses(text)
651
652 # Note: backticks need to be processed before quotes.
653 if do_backticks:
654 text = educateBackticks(text, language)
655
656 if do_backticks == 2:
657 text = educateSingleBackticks(text, language)
658
659 if do_quotes:
660 # Replace plain quotes in context to prevent conversion to
661 # 2-character sequence in French.
662 context = prev_token_last_char.replace('"', ';').replace("'", ';')
663 text = educateQuotes(context+text, language)[1:]
664
665 if do_stupefy:
666 text = stupefyEntities(text, language)
667
668 # Remember last char as context for the next token
669 prev_token_last_char = last_char
670
671 text = processEscapes(text, restore=True)
672
673 yield text
674
675
676def educateQuotes(text, language='en'):
677 """
678 Parameter: - text string (unicode or bytes).
679 - language (`BCP 47` language tag.)
680 Returns: The `text`, with "educated" curly quote characters.
681
    Example input:  "Isn't this fun?"
    Example output: “Isn’t this fun?”
684 """
685 smart = smartchars(language)
686
687 if not re.search('[-"\']', text):
688 return text
689
690 # Special case if the very first character is a quote
691 # followed by punctuation at a non-word-break. Use closing quotes.
692 # TODO: example (when does this match?)
693 text = regexes.START_SINGLE.sub(smart.csquote, text)
694 text = regexes.START_DOUBLE.sub(smart.cpquote, text)
695
696 # Special case for adjacent quotes
697 # like "'Quoted' words in a larger quote."
698 text = regexes.ADJACENT_1.sub(smart.opquote+smart.osquote, text)
699 text = regexes.ADJACENT_2.sub(smart.osquote+smart.opquote, text)
700
701 # Special case: "opening character" followed by quote,
702 # optional punctuation and space like "[", '(', or '-'.
703 text = regexes.OPEN_SINGLE.sub(r'\1%s'%smart.csquote, text)
704 text = regexes.OPEN_DOUBLE.sub(r'\1%s'%smart.cpquote, text)
705
706 # Special case for decade abbreviations (the '80s):
707 if language.startswith('en'): # TODO similar cases in other languages?
708 text = regexes.DECADE.sub(smart.apostrophe, text)
709
710 # Get most opening secondary quotes:
711 text = regexes.OPENING_SECONDARY.sub(r'\1'+smart.osquote, text)
712
713 # In many locales, secondary closing quotes are different from apostrophe:
714 if smart.csquote != smart.apostrophe:
715 text = regexes.APOSTROPHE.sub(smart.apostrophe, text)
716 # TODO: keep track of quoting level to recognize apostrophe in, e.g.,
717 # "Ich fass' es nicht."
718
719 text = regexes.CLOSING_SECONDARY.sub(smart.csquote, text)
720
721 # Any remaining secondary quotes should be opening ones:
722 text = text.replace(r"'", smart.osquote)
723
724 # Get most opening primary quotes:
725 text = regexes.OPENING_PRIMARY.sub(r'\1'+smart.opquote, text)
726
727 # primary closing quotes:
728 text = regexes.CLOSING_PRIMARY.sub(smart.cpquote, text)
729
730 # Any remaining quotes should be opening ones.
731 text = text.replace(r'"', smart.opquote)
732
733 return text
734
735
736def educateBackticks(text, language='en'):
737 """
738 Parameter: String (unicode or bytes).
    Returns:    The `text`, with ``backticks'' -style double quotes
                translated into curly quote characters.
    Example input:  ``Isn't this fun?''
    Example output: “Isn't this fun?”
743 """
744 smart = smartchars(language)
745
746 text = text.replace(r'``', smart.opquote)
747 text = text.replace(r"''", smart.cpquote)
748 return text
749
750
751def educateSingleBackticks(text, language='en'):
752 """
753 Parameter: String (unicode or bytes).
    Returns:    The `text`, with `backticks' -style single quotes
                translated into curly quote characters.
756
757 Example input: `Isn't this fun?'
758 Example output: ‘Isn’t this fun?’
759 """
760 smart = smartchars(language)
761
762 text = text.replace(r'`', smart.osquote)
763 text = text.replace(r"'", smart.csquote)
764 return text
765
766
767def educateDashes(text):
768 """
769 Parameter: String (unicode or bytes).
770 Returns: The `text`, with each instance of "--" translated to
771 an em-dash character.
772 """
773
774 text = text.replace(r'---', smartchars.endash) # en (yes, backwards)
775 text = text.replace(r'--', smartchars.emdash) # em (yes, backwards)
776 return text
777
778
779def educateDashesOldSchool(text):
780 """
781 Parameter: String (unicode or bytes).
782 Returns: The `text`, with each instance of "--" translated to
783 an en-dash character, and each "---" translated to
784 an em-dash character.
785 """
786
787 text = text.replace(r'---', smartchars.emdash)
788 text = text.replace(r'--', smartchars.endash)
789 return text
790
791
792def educateDashesOldSchoolInverted(text):
793 """
794 Parameter: String (unicode or bytes).
795 Returns: The `text`, with each instance of "--" translated to
796 an em-dash character, and each "---" translated to
797 an en-dash character. Two reasons why: First, unlike the
798 en- and em-dash syntax supported by
                educateDashesOldSchool(), it's compatible with existing
800 entries written before SmartyPants 1.1, back when "--" was
801 only used for em-dashes. Second, em-dashes are more
802 common than en-dashes, and so it sort of makes sense that
803 the shortcut should be shorter to type. (Thanks to Aaron
804 Swartz for the idea.)
805 """
    text = text.replace(r'---', smartchars.endash)  # en
    text = text.replace(r'--', smartchars.emdash)  # em
808 return text
809
810
811def educateEllipses(text):
812 """
813 Parameter: String (unicode or bytes).
814 Returns: The `text`, with each instance of "..." translated to
815 an ellipsis character.
816
817 Example input: Huh...?
818 Example output: Huh…?
819 """
820
821 text = text.replace(r'...', smartchars.ellipsis)
822 text = text.replace(r'. . .', smartchars.ellipsis)
823 return text
824
825
826def stupefyEntities(text, language='en'):
827 """
828 Parameter: String (unicode or bytes).
829 Returns: The `text`, with each SmartyPants character translated to
830 its ASCII counterpart.
831
832 Example input: “Hello — world.”
833 Example output: "Hello -- world."
834 """
835 smart = smartchars(language)
836
837 text = text.replace(smart.endash, "-")
838 text = text.replace(smart.emdash, "--")
839 text = text.replace(smart.osquote, "'") # open secondary quote
840 text = text.replace(smart.csquote, "'") # close secondary quote
841 text = text.replace(smart.opquote, '"') # open primary quote
842 text = text.replace(smart.cpquote, '"') # close primary quote
843 text = text.replace(smart.ellipsis, '...')
844
845 return text
846
847
848def processEscapes(text, restore=False):
849 r"""
850 Parameter: String (unicode or bytes).
    Returns:    The `text` after processing the following backslash
                escape sequences. This is useful if you want to force a "dumb"
                quote or other character to appear.

                Escape  Value
                ------  -----
                \\      &#92;
                \"      &#34;
                \'      &#39;
                \.      &#46;
                \-      &#45;
                \`      &#96;
    """
    # Escaped characters are hidden behind numeric character references
    # while the text is transformed and restored afterwards.
    replacements = ((r'\\', r'&#92;'),
                    (r'\"', r'&#34;'),
                    (r"\'", r'&#39;'),
                    (r'\.', r'&#46;'),
                    (r'\-', r'&#45;'),
                    (r'\`', r'&#96;'))
870 if restore:
871 for (ch, rep) in replacements:
872 text = text.replace(rep, ch[1])
873 else:
874 for (ch, rep) in replacements:
875 text = text.replace(ch, rep)
876
877 return text
878
879
880def tokenize(text):
881 """
882 Parameter: String containing HTML markup.
    Returns:    An iterator that yields the tokens comprising the input
                string. Each token is either a tag (possibly with nested
                tags contained therein, such as <a href="<MTFoo>">) or a
                run of text between tags. Each yielded element is a
                two-element tuple; the first is either 'tag' or 'text';
                the second is the actual value.
889
890 Based on the _tokenize() subroutine from Brad Choate's MTRegex plugin.
891 """
892 tag_soup = re.compile(r'([^<]*)(<[^>]*>)')
893 token_match = tag_soup.search(text)
894 previous_end = 0
895
896 while token_match is not None:
897 if token_match.group(1):
898 yield 'text', token_match.group(1)
899 yield 'tag', token_match.group(2)
900 previous_end = token_match.end()
901 token_match = tag_soup.search(text, token_match.end())
902
903 if previous_end < len(text):
904 yield 'text', text[previous_end:]
905
906
907if __name__ == "__main__":
908
909 import itertools
910 import locale
    try:
        locale.setlocale(locale.LC_ALL, '')  # set to user defaults
        # getlocale() may return (None, None), e.g. for the "C" locale
        defaultlanguage = locale.getlocale()[0] or 'en'
    except Exception:  # fall back to English if the locale is unusable
        defaultlanguage = 'en'
916
917 # Normalize and drop unsupported subtags:
918 defaultlanguage = defaultlanguage.lower().replace('-', '_')
919 # split (except singletons, which mark the following tag as non-standard):
920 defaultlanguage = re.sub(r'_([a-zA-Z0-9])_', r'_\1-', defaultlanguage)
921 _subtags = list(defaultlanguage.split('_'))
922 _basetag = _subtags.pop(0)
923 # find all combinations of subtags
924 for n in range(len(_subtags), 0, -1):
925 for tags in itertools.combinations(_subtags, n):
926 _tag = '-'.join((_basetag, *tags))
927 if _tag in smartchars.quotes:
928 defaultlanguage = _tag
929 break
930 else:
931 if _basetag in smartchars.quotes:
932 defaultlanguage = _basetag
933 else:
934 defaultlanguage = 'en'
935
936 import argparse
937 parser = argparse.ArgumentParser(
938 description='Filter <input> making ASCII punctuation "smart".')
939 # TODO: require input arg or other means to print USAGE instead of waiting.
940 # parser.add_argument("input", help="Input stream, use '-' for stdin.")
941 parser.add_argument("-a", "--action", default="1",
942 help="what to do with the input (see --actionhelp)")
943 parser.add_argument("-e", "--encoding", default="utf-8",
944 help="text encoding")
945 parser.add_argument("-l", "--language", default=defaultlanguage,
946 help="text language (BCP47 tag), "
947 f"Default: {defaultlanguage}")
948 parser.add_argument("-q", "--alternative-quotes", action="store_true",
949 help="use alternative quote style")
950 parser.add_argument("--doc", action="store_true",
951 help="print documentation")
952 parser.add_argument("--actionhelp", action="store_true",
953 help="list available actions")
954 parser.add_argument("--stylehelp", action="store_true",
955 help="list available quote styles")
956 parser.add_argument("--test", action="store_true",
957 help="perform short self-test")
958 args = parser.parse_args()
959
960 if args.doc:
961 print(__doc__)
962 elif args.actionhelp:
963 print(options)
964 elif args.stylehelp:
965 print()
966 print("Available styles (primary open/close, secondary open/close)")
967 print("language tag quotes")
968 print("============ ======")
969 for key in sorted(smartchars.quotes.keys()):
970 print("%-14s %s" % (key, smartchars.quotes[key]))
971 elif args.test:
972 # Unit test output goes to stderr.
973 import unittest
974
975 class TestSmartypantsAllAttributes(unittest.TestCase):
976 # the default attribute is "1", which means "all".
977 def test_dates(self) -> None:
978 self.assertEqual(smartyPants("1440-80's"), "1440-80’s")
979 self.assertEqual(smartyPants("1440-'80s"), "1440-’80s")
980 self.assertEqual(smartyPants("1440---'80s"), "1440–’80s")
981 self.assertEqual(smartyPants("1960's"), "1960’s")
982 self.assertEqual(smartyPants("one two '60s"), "one two ’60s")
983 self.assertEqual(smartyPants("'60s"), "’60s")
984
985 def test_educated_quotes(self) -> None:
986 self.assertEqual(smartyPants('"Isn\'t this fun?"'),
987 '“Isn’t this fun?”')
988
989 def test_html_tags(self) -> None:
990 text = '<a src="foo">more</a>'
991 self.assertEqual(smartyPants(text), text)
992
993 suite = unittest.TestLoader().loadTestsFromTestCase(
994 TestSmartypantsAllAttributes)
995 unittest.TextTestRunner().run(suite)
996
997 else:
998 if args.alternative_quotes:
999 if '-x-altquot' in args.language:
1000 args.language = args.language.replace('-x-altquot', '')
1001 else:
1002 args.language += '-x-altquot'
1003 text = sys.stdin.read()
1004 print(smartyPants(text, attr=args.action, language=args.language))