Coverage for /pythoncovmergedfiles/medio/medio/usr/local/lib/python3.8/site-packages/docutils/utils/smartquotes.py: 46%
231 statements
« prev ^ index » next coverage.py v7.2.7, created at 2023-06-07 06:06 +0000
« prev ^ index » next coverage.py v7.2.7, created at 2023-06-07 06:06 +0000
1#!/usr/bin/python3
2# :Id: $Id$
3# :Copyright: © 2010 Günter Milde,
4# original `SmartyPants`_: © 2003 John Gruber
5# smartypants.py: © 2004, 2007 Chad Miller
6# :Maintainer: docutils-develop@lists.sourceforge.net
7# :License: Released under the terms of the `2-Clause BSD license`_, in short:
8#
9# Copying and distribution of this file, with or without modification,
10# are permitted in any medium without royalty provided the copyright
11# notices and this notice are preserved.
12# This file is offered as-is, without any warranty.
13#
14# .. _2-Clause BSD license: https://opensource.org/licenses/BSD-2-Clause
17r"""
18=========================
19Smart Quotes for Docutils
20=========================
22Synopsis
23========
25"SmartyPants" is a free web publishing plug-in for Movable Type, Blosxom, and
26BBEdit that easily translates plain ASCII punctuation characters into "smart"
27typographic punctuation characters.
29``smartquotes.py`` is an adaption of "SmartyPants" to Docutils_.
31* Using Unicode instead of HTML entities for typographic punctuation
32 characters, it works for any output format that supports Unicode.
33* Supports `language specific quote characters`__.
35__ https://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks
38Authors
39=======
41`John Gruber`_ did all of the hard work of writing this software in Perl for
42`Movable Type`_ and almost all of this useful documentation. `Chad Miller`_
43ported it to Python to use with Pyblosxom_.
44Adapted to Docutils_ by Günter Milde.
46Additional Credits
47==================
49Portions of the SmartyPants original work are based on Brad Choate's nifty
50MTRegex plug-in. `Brad Choate`_ also contributed a few bits of source code to
51this plug-in. Brad Choate is a fine hacker indeed.
53`Jeremy Hedley`_ and `Charles Wiltgen`_ deserve mention for exemplary beta
54testing of the original SmartyPants.
56`Rael Dornfest`_ ported SmartyPants to Blosxom.
58.. _Brad Choate: http://bradchoate.com/
59.. _Jeremy Hedley: http://antipixel.com/
60.. _Charles Wiltgen: http://playbacktime.com/
61.. _Rael Dornfest: http://raelity.org/
64Copyright and License
65=====================
67SmartyPants_ license (3-Clause BSD license):
69 Copyright (c) 2003 John Gruber (http://daringfireball.net/)
70 All rights reserved.
72 Redistribution and use in source and binary forms, with or without
73 modification, are permitted provided that the following conditions are
74 met:
76 * Redistributions of source code must retain the above copyright
77 notice, this list of conditions and the following disclaimer.
79 * Redistributions in binary form must reproduce the above copyright
80 notice, this list of conditions and the following disclaimer in
81 the documentation and/or other materials provided with the
82 distribution.
84 * Neither the name "SmartyPants" nor the names of its contributors
85 may be used to endorse or promote products derived from this
86 software without specific prior written permission.
88 This software is provided by the copyright holders and contributors
89 "as is" and any express or implied warranties, including, but not
90 limited to, the implied warranties of merchantability and fitness for
91 a particular purpose are disclaimed. In no event shall the copyright
92 owner or contributors be liable for any direct, indirect, incidental,
93 special, exemplary, or consequential damages (including, but not
94 limited to, procurement of substitute goods or services; loss of use,
95 data, or profits; or business interruption) however caused and on any
96 theory of liability, whether in contract, strict liability, or tort
97 (including negligence or otherwise) arising in any way out of the use
98 of this software, even if advised of the possibility of such damage.
100smartypants.py license (2-Clause BSD license):
102 smartypants.py is a derivative work of SmartyPants.
104 Redistribution and use in source and binary forms, with or without
105 modification, are permitted provided that the following conditions are
106 met:
108 * Redistributions of source code must retain the above copyright
109 notice, this list of conditions and the following disclaimer.
111 * Redistributions in binary form must reproduce the above copyright
112 notice, this list of conditions and the following disclaimer in
113 the documentation and/or other materials provided with the
114 distribution.
116 This software is provided by the copyright holders and contributors
117 "as is" and any express or implied warranties, including, but not
118 limited to, the implied warranties of merchantability and fitness for
119 a particular purpose are disclaimed. In no event shall the copyright
120 owner or contributors be liable for any direct, indirect, incidental,
121 special, exemplary, or consequential damages (including, but not
122 limited to, procurement of substitute goods or services; loss of use,
123 data, or profits; or business interruption) however caused and on any
124 theory of liability, whether in contract, strict liability, or tort
125 (including negligence or otherwise) arising in any way out of the use
126 of this software, even if advised of the possibility of such damage.
128.. _John Gruber: http://daringfireball.net/
129.. _Chad Miller: http://web.chad.org/
131.. _Pyblosxom: http://pyblosxom.bluesock.org/
132.. _SmartyPants: http://daringfireball.net/projects/smartypants/
133.. _Movable Type: http://www.movabletype.org/
134.. _2-Clause BSD license: https://opensource.org/licenses/BSD-2-Clause
135.. _Docutils: https://docutils.sourceforge.io/
137Description
138===========
140SmartyPants can perform the following transformations:
142- Straight quotes ( " and ' ) into "curly" quote characters
143- Backticks-style quotes (\`\`like this'') into "curly" quote characters
144- Dashes (``--`` and ``---``) into en- and em-dash entities
145- Three consecutive dots (``...`` or ``. . .``) into an ellipsis entity
147This means you can write, edit, and save your posts using plain old
148ASCII straight quotes, plain dashes, and plain dots, but your published
149posts (and final HTML output) will appear with smart quotes, em-dashes,
150and proper ellipses.
152SmartyPants does not modify characters within ``<pre>``, ``<code>``, ``<kbd>``,
153``<math>`` or ``<script>`` tag blocks. Typically, these tags are used to
154display text where smart quotes and other "smart punctuation" would not be
155appropriate, such as source code or example markup.
158Backslash Escapes
159=================
161If you need to use literal straight quotes (or plain hyphens and periods),
162`smartquotes` accepts the following backslash escape sequences to force
163ASCII-punctuation. Mind, that you need two backslashes as Docutils expands it,
164too.
166======== =========
167Escape Character
168======== =========
169``\\`` \\
170``\\"`` \\"
171``\\'`` \\'
172``\\.`` \\.
173``\\-`` \\-
174``\\``` \\`
175======== =========
177This is useful, for example, when you want to use straight quotes as
178foot and inch marks: 6\\'2\\" tall; a 17\\" iMac.
181Caveats
182=======
184Why You Might Not Want to Use Smart Quotes in Your Weblog
185---------------------------------------------------------
187For one thing, you might not care.
189Most normal, mentally stable individuals do not take notice of proper
190typographic punctuation. Many design and typography nerds, however, break
191out in a nasty rash when they encounter, say, a restaurant sign that uses
192a straight apostrophe to spell "Joe's".
194If you're the sort of person who just doesn't care, you might well want to
195continue not caring. Using straight quotes -- and sticking to the 7-bit
196ASCII character set in general -- is certainly a simpler way to live.
198Even if you *do* care about accurate typography, you still might want to
199think twice before educating the quote characters in your weblog. One side
200effect of publishing curly quote characters is that it makes your
201weblog a bit harder for others to quote from using copy-and-paste. What
202happens is that when someone copies text from your blog, the copied text
203contains the 8-bit curly quote characters (as well as the 8-bit characters
204for em-dashes and ellipses, if you use these options). These characters
205are not standard across different text encoding methods, which is why they
206need to be encoded as characters.
208People copying text from your weblog, however, may not notice that you're
209using curly quotes, and they'll go ahead and paste the unencoded 8-bit
210characters copied from their browser into an email message or their own
211weblog. When pasted as raw "smart quotes", these characters are likely to
212get mangled beyond recognition.
214That said, my own opinion is that any decent text editor or email client
215makes it easy to stupefy smart quote characters into their 7-bit
216equivalents, and I don't consider it my problem if you're using an
217indecent text editor or email client.
220Algorithmic Shortcomings
221------------------------
223One situation in which quotes will get curled the wrong way is when
224apostrophes are used at the start of leading contractions. For example::
226 'Twas the night before Christmas.
228In the case above, SmartyPants will turn the apostrophe into an opening
229secondary quote, when in fact it should be the `RIGHT SINGLE QUOTATION MARK`
230character which is also "the preferred character to use for apostrophe"
231(Unicode). I don't think this problem can be solved in the general case --
232every word processor I've tried gets this wrong as well. In such cases, it's
233best to inset the `RIGHT SINGLE QUOTATION MARK` (’) by hand.
235In English, the same character is used for apostrophe and closing secondary
236quote (both plain and "smart" ones). For other locales (French, Italean,
237Swiss, ...) "smart" secondary closing quotes differ from the curly apostrophe.
239 .. class:: language-fr
241 Il dit : "C'est 'super' !"
243If the apostrophe is used at the end of a word, it cannot be distinguished
244from a secondary quote by the algorithm. Therefore, a text like::
246 .. class:: language-de-CH
248 "Er sagt: 'Ich fass' es nicht.'"
250will get a single closing guillemet instead of an apostrophe.
252This can be prevented by use use of the `RIGHT SINGLE QUOTATION MARK` in
253the source::
255 - "Er sagt: 'Ich fass' es nicht.'"
256 + "Er sagt: 'Ich fass’ es nicht.'"
259Version History
260===============
2621.9 2022-03-04
263 - Code cleanup. Require Python 3.
2651.8.1 2017-10-25
266 - Use open quote after Unicode whitespace, ZWSP, and ZWNJ.
267 - Code cleanup.
2691.8: 2017-04-24
270 - Command line front-end.
2721.7.1: 2017-03-19
273 - Update and extend language-dependent quotes.
274 - Differentiate apostrophe from single quote.
2761.7: 2012-11-19
277 - Internationalization: language-dependent quotes.
2791.6.1: 2012-11-06
280 - Refactor code, code cleanup,
281 - `educate_tokens()` generator as interface for Docutils.
2831.6: 2010-08-26
284 - Adaption to Docutils:
285 - Use Unicode instead of HTML entities,
286 - Remove code special to pyblosxom.
2881.5_1.6: Fri, 27 Jul 2007 07:06:40 -0400
289 - Fixed bug where blocks of precious unalterable text was instead
290 interpreted. Thanks to Le Roux and Dirk van Oosterbosch.
2921.5_1.5: Sat, 13 Aug 2005 15:50:24 -0400
293 - Fix bogus magical quotation when there is no hint that the
294 user wants it, e.g., in "21st century". Thanks to Nathan Hamblen.
295 - Be smarter about quotes before terminating numbers in an en-dash'ed
296 range.
2981.5_1.4: Thu, 10 Feb 2005 20:24:36 -0500
299 - Fix a date-processing bug, as reported by jacob childress.
300 - Begin a test-suite for ensuring correct output.
301 - Removed import of "string", since I didn't really need it.
302 (This was my first every Python program. Sue me!)
3041.5_1.3: Wed, 15 Sep 2004 18:25:58 -0400
305 - Abort processing if the flavour is in forbidden-list. Default of
306 [ "rss" ] (Idea of Wolfgang SCHNERRING.)
307 - Remove stray virgules from en-dashes. Patch by Wolfgang SCHNERRING.
3091.5_1.2: Mon, 24 May 2004 08:14:54 -0400
310 - Some single quotes weren't replaced properly. Diff-tesuji played
311 by Benjamin GEIGER.
3131.5_1.1: Sun, 14 Mar 2004 14:38:28 -0500
314 - Support upcoming pyblosxom 0.9 plugin verification feature.
3161.5_1.0: Tue, 09 Mar 2004 08:08:35 -0500
317 - Initial release
318"""
320import re
321import sys
324options = r"""
325Options
326=======
328Numeric values are the easiest way to configure SmartyPants' behavior:
330:0: Suppress all transformations. (Do nothing.)
332:1: Performs default SmartyPants transformations: quotes (including
333 \`\`backticks'' -style), em-dashes, and ellipses. "``--``" (dash dash)
334 is used to signify an em-dash; there is no support for en-dashes
336:2: Same as smarty_pants="1", except that it uses the old-school typewriter
337 shorthand for dashes: "``--``" (dash dash) for en-dashes, "``---``"
338 (dash dash dash)
339 for em-dashes.
341:3: Same as smarty_pants="2", but inverts the shorthand for dashes:
342 "``--``" (dash dash) for em-dashes, and "``---``" (dash dash dash) for
343 en-dashes.
345:-1: Stupefy mode. Reverses the SmartyPants transformation process, turning
346 the characters produced by SmartyPants into their ASCII equivalents.
347 E.g. the LEFT DOUBLE QUOTATION MARK (“) is turned into a simple
348 double-quote (\"), "—" is turned into two dashes, etc.
351The following single-character attribute values can be combined to toggle
352individual transformations from within the smarty_pants attribute. For
353example, ``"1"`` is equivalent to ``"qBde"``.
355:q: Educates normal quote characters: (") and (').
357:b: Educates \`\`backticks'' -style double quotes.
359:B: Educates \`\`backticks'' -style double quotes and \`single' quotes.
361:d: Educates em-dashes.
363:D: Educates em-dashes and en-dashes, using old-school typewriter
364 shorthand: (dash dash) for en-dashes, (dash dash dash) for em-dashes.
366:i: Educates em-dashes and en-dashes, using inverted old-school typewriter
367 shorthand: (dash dash) for em-dashes, (dash dash dash) for en-dashes.
369:e: Educates ellipses.
371:w: Translates any instance of ``"`` into a normal double-quote
372 character. This should be of no interest to most people, but
373 of particular interest to anyone who writes their posts using
374 Dreamweaver, as Dreamweaver inexplicably uses this entity to represent
375 a literal double-quote character. SmartyPants only educates normal
376 quotes, not entities (because ordinarily, entities are used for
377 the explicit purpose of representing the specific character they
378 represent). The "w" option must be used in conjunction with one (or
379 both) of the other quote options ("q" or "b"). Thus, if you wish to
380 apply all SmartyPants transformations (quotes, en- and em-dashes, and
381 ellipses) and also translate ``"`` entities into regular quotes
382 so SmartyPants can educate them, you should pass the following to the
383 smarty_pants attribute:
384"""
387class smartchars:
388 """Smart quotes and dashes"""
390 endash = '–' # "–" EN DASH
391 emdash = '—' # "—" EM DASH
392 ellipsis = '…' # "…" HORIZONTAL ELLIPSIS
393 apostrophe = '’' # "’" RIGHT SINGLE QUOTATION MARK
395 # quote characters (language-specific, set in __init__())
396 # https://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks
397 # https://de.wikipedia.org/wiki/Anf%C3%BChrungszeichen#Andere_Sprachen
398 # https://fr.wikipedia.org/wiki/Guillemet
399 # https://typographisme.net/post/Les-espaces-typographiques-et-le-web
400 # https://www.btb.termiumplus.gc.ca/tpv2guides/guides/redac/index-fra.html
401 # https://en.wikipedia.org/wiki/Hebrew_punctuation#Quotation_marks
402 # [7] https://www.tustep.uni-tuebingen.de/bi/bi00/bi001t1-anfuehrung.pdf
403 # [8] https://www.korrekturavdelingen.no/anforselstegn.htm
404 # [9] Typografisk håndbok. Oslo: Spartacus. 2000. s. 67. ISBN 8243001530.
405 # [10] https://www.typografi.org/sitat/sitatart.html
406 # [11] https://mk.wikipedia.org/wiki/Правопис_и_правоговор_на_македонскиот_јазик # noqa:E501
407 # [12] https://hrvatska-tipografija.com/polunavodnici/
408 # [13] https://pl.wikipedia.org/wiki/Cudzys%C5%82%C3%B3w
409 #
410 # See also configuration option "smartquote-locales".
411 quotes = {
412 'af': '“”‘’',
413 'af-x-altquot': '„”‚’',
414 'bg': '„“‚‘', # https://bg.wikipedia.org/wiki/Кавички
415 'ca': '«»“”',
416 'ca-x-altquot': '“”‘’',
417 'cs': '„“‚‘',
418 'cs-x-altquot': '»«›‹',
419 'da': '»«›‹',
420 'da-x-altquot': '„“‚‘',
421 # 'da-x-altquot2': '””’’',
422 'de': '„“‚‘',
423 'de-x-altquot': '»«›‹',
424 'de-ch': '«»‹›',
425 'el': '«»“”', # '«»‟”' https://hal.science/hal-02101618
426 'en': '“”‘’',
427 'en-uk-x-altquot': '‘’“”', # Attention: " → ‘ and ' → “ !
428 'eo': '“”‘’',
429 'es': '«»“”',
430 'es-x-altquot': '“”‘’',
431 'et': '„“‚‘', # no secondary quote listed in
432 'et-x-altquot': '«»‹›', # the sources above (wikipedia.org)
433 'eu': '«»‹›',
434 'fi': '””’’',
435 'fi-x-altquot': '»»››',
436 'fr': ('« ', ' »', '“', '”'), # full no-break space
437 'fr-x-altquot': ('« ', ' »', '“', '”'), # narrow no-break space
438 'fr-ch': '«»‹›', # https://typoguide.ch/
439 'fr-ch-x-altquot': ('« ', ' »', '‹ ', ' ›'), # narrow no-break space # noqa:E501
440 'gl': '«»“”',
441 'he': '”“»«', # Hebrew is RTL, test position:
442 'he-x-altquot': '„”‚’', # low quotation marks are opening.
443 # 'he-x-altquot': '“„‘‚', # RTL: low quotation marks opening
444 'hr': '„”‘’', # Croatian [12]
445 'hr-x-altquot': '»«›‹',
446 'hsb': '„“‚‘',
447 'hsb-x-altquot': '»«›‹',
448 'hu': '„”«»',
449 'is': '„“‚‘',
450 'it': '«»“”',
451 'it-ch': '«»‹›',
452 'it-x-altquot': '“”‘’',
453 # 'it-x-altquot2': '“„‘‚', # [7] in headlines
454 'ja': '「」『』',
455 'ko': '“”‘’',
456 'lt': '„“‚‘',
457 'lv': '„“‚‘',
458 'mk': '„“‚‘', # Macedonian [11]
459 'nl': '“”‘’',
460 'nl-x-altquot': '„”‚’',
461 # 'nl-x-altquot2': '””’’',
462 'nb': '«»’’', # Norsk bokmål (canonical form 'no')
463 'nn': '«»’’', # Nynorsk [10]
464 'nn-x-altquot': '«»‘’', # [8], [10]
465 # 'nn-x-altquot2': '«»«»', # [9], [10]
466 # 'nn-x-altquot3': '„“‚‘', # [10]
467 'no': '«»’’', # Norsk bokmål [10]
468 'no-x-altquot': '«»‘’', # [8], [10]
469 # 'no-x-altquot2': '«»«»', # [9], [10
470 # 'no-x-altquot3': '„“‚‘', # [10]
471 'pl': '„”«»',
472 'pl-x-altquot': '«»‚’',
473 # 'pl-x-altquot2': '„”‚’', # [13]
474 'pt': '«»“”',
475 'pt-br': '“”‘’',
476 'ro': '„”«»',
477 'ru': '«»„“',
478 'sh': '„”‚’', # Serbo-Croatian
479 'sh-x-altquot': '»«›‹',
480 'sk': '„“‚‘', # Slovak
481 'sk-x-altquot': '»«›‹',
482 'sl': '„“‚‘', # Slovenian
483 'sl-x-altquot': '»«›‹',
484 'sq': '«»‹›', # Albanian
485 'sq-x-altquot': '“„‘‚',
486 'sr': '„”’’',
487 'sr-x-altquot': '»«›‹',
488 'sv': '””’’',
489 'sv-x-altquot': '»»››',
490 'tr': '“”‘’',
491 'tr-x-altquot': '«»‹›',
492 # 'tr-x-altquot2': '“„‘‚', # [7] antiquated?
493 'uk': '«»„“',
494 'uk-x-altquot': '„“‚‘',
495 'zh-cn': '“”‘’',
496 'zh-tw': '「」『』',
497 }
499 def __init__(self, language='en'):
500 self.language = language
501 try:
502 (self.opquote, self.cpquote,
503 self.osquote, self.csquote) = self.quotes[language.lower()]
504 except KeyError:
505 self.opquote, self.cpquote, self.osquote, self.csquote = '""\'\''
508default_smartypants_attr = '1'
511def smartyPants(text, attr=default_smartypants_attr, language='en'):
512 """Main function for "traditional" use."""
514 return "".join(t for t in educate_tokens(tokenize(text), attr, language))
517def educate_tokens(text_tokens, attr=default_smartypants_attr, language='en'):
518 """Return iterator that "educates" the items of `text_tokens`."""
519 # Parse attributes:
520 # 0 : do nothing
521 # 1 : set all
522 # 2 : set all, using old school en- and em- dash shortcuts
523 # 3 : set all, using inverted old school en and em- dash shortcuts
524 #
525 # q : quotes
526 # b : backtick quotes (``double'' only)
527 # B : backtick quotes (``double'' and `single')
528 # d : dashes
529 # D : old school dashes
530 # i : inverted old school dashes
531 # e : ellipses
532 # w : convert " entities to " for Dreamweaver users
534 convert_quot = False # translate " entities into normal quotes?
535 do_dashes = False
536 do_backticks = False
537 do_quotes = False
538 do_ellipses = False
539 do_stupefy = False
541 # if attr == "0": # pass tokens unchanged (see below).
542 if attr == '1': # Do everything, turn all options on.
543 do_quotes = True
544 do_backticks = True
545 do_dashes = 1
546 do_ellipses = True
547 elif attr == '2':
548 # Do everything, turn all options on, use old school dash shorthand.
549 do_quotes = True
550 do_backticks = True
551 do_dashes = 2
552 do_ellipses = True
553 elif attr == '3':
554 # Do everything, use inverted old school dash shorthand.
555 do_quotes = True
556 do_backticks = True
557 do_dashes = 3
558 do_ellipses = True
559 elif attr == '-1': # Special "stupefy" mode.
560 do_stupefy = True
561 else:
562 if 'q' in attr: do_quotes = True # noqa: E701
563 if 'b' in attr: do_backticks = True # noqa: E701
564 if 'B' in attr: do_backticks = 2 # noqa: E701
565 if 'd' in attr: do_dashes = 1 # noqa: E701
566 if 'D' in attr: do_dashes = 2 # noqa: E701
567 if 'i' in attr: do_dashes = 3 # noqa: E701
568 if 'e' in attr: do_ellipses = True # noqa: E701
569 if 'w' in attr: convert_quot = True # noqa: E701
571 prev_token_last_char = ' '
572 # Last character of the previous text token. Used as
573 # context to curl leading quote characters correctly.
575 for (ttype, text) in text_tokens:
577 # skip HTML and/or XML tags as well as empty text tokens
578 # without updating the last character
579 if ttype == 'tag' or not text:
580 yield text
581 continue
583 # skip literal text (math, literal, raw, ...)
584 if ttype == 'literal':
585 prev_token_last_char = text[-1:]
586 yield text
587 continue
589 last_char = text[-1:] # Remember last char before processing.
591 text = processEscapes(text)
593 if convert_quot:
594 text = text.replace('"', '"')
596 if do_dashes == 1:
597 text = educateDashes(text)
598 elif do_dashes == 2:
599 text = educateDashesOldSchool(text)
600 elif do_dashes == 3:
601 text = educateDashesOldSchoolInverted(text)
603 if do_ellipses:
604 text = educateEllipses(text)
606 # Note: backticks need to be processed before quotes.
607 if do_backticks:
608 text = educateBackticks(text, language)
610 if do_backticks == 2:
611 text = educateSingleBackticks(text, language)
613 if do_quotes:
614 # Replace plain quotes in context to prevent conversion to
615 # 2-character sequence in French.
616 context = prev_token_last_char.replace('"', ';').replace("'", ';')
617 text = educateQuotes(context+text, language)[1:]
619 if do_stupefy:
620 text = stupefyEntities(text, language)
622 # Remember last char as context for the next token
623 prev_token_last_char = last_char
625 text = processEscapes(text, restore=True)
627 yield text
630def educateQuotes(text, language='en'):
631 """
632 Parameter: - text string (unicode or bytes).
633 - language (`BCP 47` language tag.)
634 Returns: The `text`, with "educated" curly quote characters.
636 Example input: "Isn't this fun?"
637 Example output: “Isn’t this fun?“;
638 """
640 smart = smartchars(language)
641 ch_classes = {'open': '[([{]', # opening braces
642 'close': r'[^\s]', # everything except whitespace
643 'punct': r"""[-!" #\$\%'()*+,.\/:;<=>?\@\[\\\]\^_`{|}~]""",
644 'dash': '[-–—]' # hyphen and em/en dashes
645 r'|&[mn]dash;|&\#8211;|&\#8212;|&\#x201[34];',
646 'sep': '[\\s\u200B\u200C]| ', # Whitespace, ZWSP, ZWNJ
647 }
649 # Special case if the very first character is a quote
650 # followed by punctuation at a non-word-break. Use closing quotes.
651 # TODO: example (when does this match?)
652 text = re.sub(r"^'(?=%s\\B)" % ch_classes['punct'], smart.csquote, text)
653 text = re.sub(r'^"(?=%s\\B)' % ch_classes['punct'], smart.cpquote, text)
655 # Special case for adjacent quotes
656 # like "'Quoted' words in a larger quote."
657 text = re.sub('"\'(?=\\w)', smart.opquote+smart.osquote, text)
658 text = re.sub('\'"(?=\\w)', smart.osquote+smart.opquote, text)
660 # Special case: "opening character" followed by quote,
661 # optional punctuation and space like "[", '(', or '-'.
662 text = re.sub(r"(%(open)s|%(dash)s)'(?=%(punct)s? )" % ch_classes,
663 r'\1%s'%smart.csquote, text)
664 text = re.sub(r'(%(open)s|%(dash)s)"(?=%(punct)s? )' % ch_classes,
665 r'\1%s'%smart.cpquote, text)
667 # Special case for decade abbreviations (the '80s):
668 if language.startswith('en'): # TODO similar cases in other languages?
669 text = re.sub(r"'(?=\d{2}s)", smart.apostrophe, text)
671 # Get most opening secondary quotes:
672 opening_secondary_quotes_regex = re.compile("""
673 (# ?<= # look behind fails: requires fixed-width pattern
674 %(sep)s | # a whitespace char, or
675 %(open)s | # opening brace, or
676 %(dash)s # em/en-dash
677 )
678 ' # the quote
679 (?=\\w|%(punct)s) # word character or punctuation
680 """ % ch_classes, re.VERBOSE)
682 text = opening_secondary_quotes_regex.sub(r'\1'+smart.osquote, text)
684 # In many locales, secondary closing quotes are different from apostrophe:
685 if smart.csquote != smart.apostrophe:
686 apostrophe_regex = re.compile(r"(?<=(\w|\d))'(?=\w)")
687 text = apostrophe_regex.sub(smart.apostrophe, text)
688 # TODO: keep track of quoting level to recognize apostrophe in, e.g.,
689 # "Ich fass' es nicht."
691 closing_secondary_quotes_regex = re.compile(r"(?<!\s)'")
692 text = closing_secondary_quotes_regex.sub(smart.csquote, text)
694 # Any remaining secondary quotes should be opening ones:
695 text = text.replace(r"'", smart.osquote)
697 # Get most opening primary quotes:
698 opening_primary_quotes_regex = re.compile("""
699 (
700 %(sep)s | # a whitespace char, or
701 %(open)s | # zero width separating char, or
702 %(dash)s # em/en-dash
703 )
704 " # the quote, followed by
705 (?=\\w|%(punct)s) # a word character or punctuation
706 """ % ch_classes, re.VERBOSE)
708 text = opening_primary_quotes_regex.sub(r'\1'+smart.opquote, text)
710 # primary closing quotes:
711 closing_primary_quotes_regex = re.compile(r"""
712 (
713 (?<!\s)" | # no whitespace before
714 "(?=\s) # whitespace behind
715 )
716 """, re.VERBOSE)
717 text = closing_primary_quotes_regex.sub(smart.cpquote, text)
719 # Any remaining quotes should be opening ones.
720 text = text.replace(r'"', smart.opquote)
722 return text
725def educateBackticks(text, language='en'):
726 """
727 Parameter: String (unicode or bytes).
728 Returns: The `text`, with ``backticks'' -style double quotes
729 translated into HTML curly quote entities.
730 Example input: ``Isn't this fun?''
731 Example output: “Isn't this fun?“;
732 """
733 smart = smartchars(language)
735 text = text.replace(r'``', smart.opquote)
736 text = text.replace(r"''", smart.cpquote)
737 return text
740def educateSingleBackticks(text, language='en'):
741 """
742 Parameter: String (unicode or bytes).
743 Returns: The `text`, with `backticks' -style single quotes
744 translated into HTML curly quote entities.
746 Example input: `Isn't this fun?'
747 Example output: ‘Isn’t this fun?’
748 """
749 smart = smartchars(language)
751 text = text.replace(r'`', smart.osquote)
752 text = text.replace(r"'", smart.csquote)
753 return text
756def educateDashes(text):
757 """
758 Parameter: String (unicode or bytes).
759 Returns: The `text`, with each instance of "--" translated to
760 an em-dash character.
761 """
763 text = text.replace(r'---', smartchars.endash) # en (yes, backwards)
764 text = text.replace(r'--', smartchars.emdash) # em (yes, backwards)
765 return text
768def educateDashesOldSchool(text):
769 """
770 Parameter: String (unicode or bytes).
771 Returns: The `text`, with each instance of "--" translated to
772 an en-dash character, and each "---" translated to
773 an em-dash character.
774 """
776 text = text.replace(r'---', smartchars.emdash)
777 text = text.replace(r'--', smartchars.endash)
778 return text
781def educateDashesOldSchoolInverted(text):
782 """
783 Parameter: String (unicode or bytes).
784 Returns: The `text`, with each instance of "--" translated to
785 an em-dash character, and each "---" translated to
786 an en-dash character. Two reasons why: First, unlike the
787 en- and em-dash syntax supported by
788 EducateDashesOldSchool(), it's compatible with existing
789 entries written before SmartyPants 1.1, back when "--" was
790 only used for em-dashes. Second, em-dashes are more
791 common than en-dashes, and so it sort of makes sense that
792 the shortcut should be shorter to type. (Thanks to Aaron
793 Swartz for the idea.)
794 """
795 text = text.replace(r'---', smartchars.endash) # em
796 text = text.replace(r'--', smartchars.emdash) # en
797 return text
800def educateEllipses(text):
801 """
802 Parameter: String (unicode or bytes).
803 Returns: The `text`, with each instance of "..." translated to
804 an ellipsis character.
806 Example input: Huh...?
807 Example output: Huh…?
808 """
810 text = text.replace(r'...', smartchars.ellipsis)
811 text = text.replace(r'. . .', smartchars.ellipsis)
812 return text
815def stupefyEntities(text, language='en'):
816 """
817 Parameter: String (unicode or bytes).
818 Returns: The `text`, with each SmartyPants character translated to
819 its ASCII counterpart.
821 Example input: “Hello — world.”
822 Example output: "Hello -- world."
823 """
824 smart = smartchars(language)
826 text = text.replace(smart.endash, "-")
827 text = text.replace(smart.emdash, "--")
828 text = text.replace(smart.osquote, "'") # open secondary quote
829 text = text.replace(smart.csquote, "'") # close secondary quote
830 text = text.replace(smart.opquote, '"') # open primary quote
831 text = text.replace(smart.cpquote, '"') # close primary quote
832 text = text.replace(smart.ellipsis, '...')
834 return text
837def processEscapes(text, restore=False):
838 r"""
839 Parameter: String (unicode or bytes).
840 Returns: The `text`, with after processing the following backslash
841 escape sequences. This is useful if you want to force a "dumb"
842 quote or other character to appear.
844 Escape Value
845 ------ -----
846 \\ \
847 \" "
848 \' '
849 \. .
850 \- -
851 \` `
852 """
853 replacements = ((r'\\', r'\'),
854 (r'\"', r'"'),
855 (r"\'", r'''),
856 (r'\.', r'.'),
857 (r'\-', r'-'),
858 (r'\`', r'`'))
859 if restore:
860 for (ch, rep) in replacements:
861 text = text.replace(rep, ch[1])
862 else:
863 for (ch, rep) in replacements:
864 text = text.replace(ch, rep)
866 return text
869def tokenize(text):
870 """
871 Parameter: String containing HTML markup.
872 Returns: An iterator that yields the tokens comprising the input
873 string. Each token is either a tag (possibly with nested,
874 tags contained therein, such as <a href="<MTFoo>">, or a
875 run of text between tags. Each yielded element is a
876 two-element tuple; the first is either 'tag' or 'text';
877 the second is the actual value.
879 Based on the _tokenize() subroutine from Brad Choate's MTRegex plugin.
880 """
881 tag_soup = re.compile(r'([^<]*)(<[^>]*>)')
882 token_match = tag_soup.search(text)
883 previous_end = 0
885 while token_match is not None:
886 if token_match.group(1):
887 yield 'text', token_match.group(1)
888 yield 'tag', token_match.group(2)
889 previous_end = token_match.end()
890 token_match = tag_soup.search(text, token_match.end())
892 if previous_end < len(text):
893 yield 'text', text[previous_end:]
896if __name__ == "__main__":
898 import itertools
899 import locale
900 try:
901 locale.setlocale(locale.LC_ALL, '') # set to user defaults
902 defaultlanguage = locale.getlocale()[0]
903 except: # noqa catchall
904 defaultlanguage = 'en'
906 # Normalize and drop unsupported subtags:
907 defaultlanguage = defaultlanguage.lower().replace('-', '_')
908 # split (except singletons, which mark the following tag as non-standard):
909 defaultlanguage = re.sub(r'_([a-zA-Z0-9])_', r'_\1-', defaultlanguage)
910 _subtags = [subtag for subtag in defaultlanguage.split('_')]
911 _basetag = _subtags.pop(0)
912 # find all combinations of subtags
913 for n in range(len(_subtags), 0, -1):
914 for tags in itertools.combinations(_subtags, n):
915 _tag = '-'.join((_basetag, *tags))
916 if _tag in smartchars.quotes:
917 defaultlanguage = _tag
918 break
919 else:
920 if _basetag in smartchars.quotes:
921 defaultlanguage = _basetag
922 else:
923 defaultlanguage = 'en'
925 import argparse
926 parser = argparse.ArgumentParser(
927 description='Filter <input> making ASCII punctuation "smart".')
928 # TODO: require input arg or other means to print USAGE instead of waiting.
929 # parser.add_argument("input", help="Input stream, use '-' for stdin.")
930 parser.add_argument("-a", "--action", default="1",
931 help="what to do with the input (see --actionhelp)")
932 parser.add_argument("-e", "--encoding", default="utf-8",
933 help="text encoding")
934 parser.add_argument("-l", "--language", default=defaultlanguage,
935 help="text language (BCP47 tag), "
936 f"Default: {defaultlanguage}")
937 parser.add_argument("-q", "--alternative-quotes", action="store_true",
938 help="use alternative quote style")
939 parser.add_argument("--doc", action="store_true",
940 help="print documentation")
941 parser.add_argument("--actionhelp", action="store_true",
942 help="list available actions")
943 parser.add_argument("--stylehelp", action="store_true",
944 help="list available quote styles")
945 parser.add_argument("--test", action="store_true",
946 help="perform short self-test")
947 args = parser.parse_args()
949 if args.doc:
950 print(__doc__)
951 elif args.actionhelp:
952 print(options)
953 elif args.stylehelp:
954 print()
955 print("Available styles (primary open/close, secondary open/close)")
956 print("language tag quotes")
957 print("============ ======")
958 for key in sorted(smartchars.quotes.keys()):
959 print("%-14s %s" % (key, smartchars.quotes[key]))
960 elif args.test:
961 # Unit test output goes to stderr.
962 import unittest
964 class TestSmartypantsAllAttributes(unittest.TestCase):
965 # the default attribute is "1", which means "all".
966 def test_dates(self):
967 self.assertEqual(smartyPants("1440-80's"), "1440-80’s")
968 self.assertEqual(smartyPants("1440-'80s"), "1440-’80s")
969 self.assertEqual(smartyPants("1440---'80s"), "1440–’80s")
970 self.assertEqual(smartyPants("1960's"), "1960’s")
971 self.assertEqual(smartyPants("one two '60s"), "one two ’60s")
972 self.assertEqual(smartyPants("'60s"), "’60s")
974 def test_educated_quotes(self):
975 self.assertEqual(smartyPants('"Isn\'t this fun?"'),
976 '“Isn’t this fun?”')
978 def test_html_tags(self):
979 text = '<a src="foo">more</a>'
980 self.assertEqual(smartyPants(text), text)
982 suite = unittest.TestLoader().loadTestsFromTestCase(
983 TestSmartypantsAllAttributes)
984 unittest.TextTestRunner().run(suite)
986 else:
987 if args.alternative_quotes:
988 if '-x-altquot' in args.language:
989 args.language = args.language.replace('-x-altquot', '')
990 else:
991 args.language += '-x-altquot'
992 text = sys.stdin.read()
993 print(smartyPants(text, attr=args.action, language=args.language))