degrotesque — A tiny web type setter
Abstract: A small script to improve the appearance of HTML text.
Topics: tools, libraries, web, typography, downloads, open source, programming languages/Python
© Copyright Daniel Krajzewicz, 19.07.2021 21:13, cc by
Introduction
I write most of my HTML pages using a plain text editor. Although I use quotes and various punctuation marks, they always start out as plain ASCII characters. To make the text a bit prettier, I wrote a small script which parses an HTML text and replaces those plain ASCII characters with nicer, typographic representations. Below, you will find some notes about this tool.
But maybe you'd like to have an example:
The script would convert "Well - that's not what I had expected." into: “Well — that's not what I had expected.” I think it looks nicer.
(Uhm, uhm, for those who don't see it, the starting and ending quotes have been replaced by “ and ”, respectively, the ' by ' and the - by an —.)
Download and Installation
The current version is degrotesque-1.4.
You may install degrotesque using
python -m pip install degrotesque
You may download a copy or fork the code on degrotesque's GitHub page.
Besides, you may download the current release degrotesque-1.4 here:
Licence
The tool is licensed under the LGPL v3.
Summary
Well, have fun. If you have any comments, ideas, or issues, please submit them to degrotesque's issue tracker on GitHub.
Documentation
Usage
degrotesque is currently implemented in Python and is started on the command line. The option -i <PATH> / --input <PATH> tells the script which file(s) to read; you may name either a file or a folder here. If the option -r / --recursive is set, the given folder is processed recursively.
The tool processes only HTML files and their derivatives. The extensions of the file types that are processed by default are given in Appendix A. You may name other extensions of files to process using the -e <EXTENSION>[,<EXTENSION>]* / --extensions <EXTENSION>[,<EXTENSION>]* option.
The files are read one by one. The replacement of plain ASCII characters by nicer ones is based on a chosen set of “actions”. Known and default actions are given in Appendix B. You may select the actions to apply using the -a <ACTION>[,<ACTION>]* / --actions <ACTION>[,<ACTION>]* option.
The files are assumed to be encoded as “UTF-8” by default. You may change the encoding using the option -E <ENCODING> / --encoding <ENCODING>.
The script does not change the quotation marks inside HTML tags, of course. Likewise, the contents of several elements, given in Appendix C, are skipped. You may change the list of elements whose contents shall not be processed using the option -s <ELEMENT_NAME>[,<ELEMENT_NAME>]* / --skip <ELEMENT_NAME>[,<ELEMENT_NAME>]*.
After the actions have been applied to its contents, the file is saved. By default, the original file is kept as a backup under the same name with the suffix “.orig”. You may suppress the creation of these backup files using the option -B / --no-backup.
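The save step with its backup can be sketched roughly as follows. This is not the tool's own code: the function name save_with_backup and its parameters are hypothetical, and the sketch only assumes that the backup is the original file name plus “.orig”.

```python
import os
import shutil


def save_with_backup(path, new_content, encoding="utf-8", backup=True):
    """Write new_content to path, optionally keeping the old file as path + '.orig'.

    Hypothetical helper for illustration; not part of degrotesque.
    """
    if backup and os.path.exists(path):
        # keep the untouched original next to the processed file
        shutil.copyfile(path, path + ".orig")
    with open(path, "w", encoding=encoding) as fd:
        fd.write(new_content)
```

Passing backup=False would correspond to what the -B / --no-backup option is described to do.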
The default actions are: quotes.english, dashes, ellipsis, math, apostrophe.
Options
- --input/-i <PATH>: The file or the folder to process
- --encoding/-E <ENCODING>: The assumed encoding of the files
- --recursive/-r: Set if the folder, if given, shall be processed recursively
- --no-backup/-B: Set if no backup files shall be generated
- --actions/-a <ACTION_NAME>[,<ACTION_NAME>]*: The names of the actions that shall be applied
- --extensions/-e <EXTENSION>[,<EXTENSION>]*: The extensions of files that shall be processed
- --skip/-s <ELEMENT_NAME>[,<ELEMENT_NAME>]*: Elements whose contents shall not be changed
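For readers who want to mirror these options in their own wrapper script, here is a minimal argparse sketch. It is an illustration only and not necessarily how degrotesque parses its command line; build_parser and split_list are hypothetical names, and the default encoding of "utf-8" is taken from the Usage section.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="degrotesque")
    parser.add_argument("-i", "--input", metavar="PATH",
                        help="the file or the folder to process")
    parser.add_argument("-E", "--encoding", metavar="ENCODING", default="utf-8",
                        help="the assumed encoding of the files")
    parser.add_argument("-r", "--recursive", action="store_true",
                        help="process the given folder recursively")
    parser.add_argument("-B", "--no-backup", action="store_true",
                        help="do not generate backup files")
    parser.add_argument("-a", "--actions", metavar="ACTIONS",
                        help="comma-separated names of the actions to apply")
    parser.add_argument("-e", "--extensions", metavar="EXTENSIONS",
                        help="comma-separated extensions of files to process")
    parser.add_argument("-s", "--skip", metavar="ELEMENTS",
                        help="comma-separated elements whose contents stay unchanged")
    return parser


def split_list(value):
    """Turn a comma-separated option value into a list; None stays None."""
    if value is None:
        return None
    return [item.strip() for item in value.split(",") if item.strip()]


# example: process the folder "site" recursively with two action sets
args = build_parser().parse_args(["-i", "site", "-r", "-a", "quotes.french,math"])
print(args.input, args.recursive, split_list(args.actions))
```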
Further Documentation
- The web page is located at: http://www.krajzewicz.de/blog/degrotesque.php
- The PyPI page is located at: https://pypi.org/project/degrotesque/
- The github repository is located at: https://github.com/dkrajzew/degrotesque
- The issue tracker is located at: https://github.com/dkrajzew/degrotesque/issues
- The Travis CI page is located at: https://travis-ci.com/github/dkrajzew/degrotesque
- The code documentation (pydoc) is located at: http://www.krajzewicz.de/blog/degrotesque.html
Application Programming Interface - API
You may also embed degrotesque within your own applications. The usage is straightforward:

    import degrotesque
    # build the degrotesque instance with default values
    # (note: this rebinds the name "degrotesque" from the module to the instance)
    degrotesque = degrotesque.Degrotesque()
    # apply degrotesque
    prettyString = degrotesque.prettify(uglyString)
The default values can be replaced using some of the class's methods:

    # change the actions to apply (by naming them)
    # here: apply french quotes and math symbols
    degrotesque.setActions("quotes.french,math")
    # change the elements whose contents shall be skipped
    # here: skip the contents of "code", "script", and "style" elements
    degrotesque.setToSkip("code,script,style")
You may also consult the degrotesque pydoc code documentation.
Implementation Notes
- I tried Genshi, BeautifulSoup, and lxml. All of them failed to keep the code unchanged. So the parser just skips HTML elements and the contents of some special elements, see above. This works in most cases.
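The idea of skipping tags and the contents of some elements can be illustrated with a small regex-based sketch. This is not the parser used by degrotesque; apply_outside_tags is a hypothetical name, and the sketch only handles plain elements such as script, code, style, and pre, not the ?php, %, or !-- sections listed in Appendix C.

```python
import re

TAG = re.compile(r"(<[^>]*>)")                   # anything that looks like a tag
NAME = re.compile(r"<(/?)\s*([A-Za-z][\w-]*)")   # closing slash and element name
SKIPPED = {"script", "code", "style", "pre"}


def apply_outside_tags(html, transform, skipped=SKIPPED):
    """Apply transform() to the text between tags only.

    Tags are copied verbatim, as is everything inside skipped elements.
    """
    out = []
    skipping = None  # name of the element currently being skipped, if any
    for index, part in enumerate(TAG.split(html)):
        if index % 2 == 1:
            # odd parts are the captured tags; never touch them
            out.append(part)
            match = NAME.match(part)
            name = match.group(2).lower() if match else None
            closing = bool(match and match.group(1))
            if skipping is None and name in skipped and not closing:
                skipping = name
            elif closing and skipping == name:
                skipping = None
        elif skipping is not None:
            out.append(part)             # inside a skipped element
        else:
            out.append(transform(part))  # ordinary text
    return "".join(out)


html = '<p class="a">"x" - y</p><code>a - b</code>'
print(apply_outside_tags(html, lambda t: t.replace(" - ", " \u2014 ")))
```

Because the tags themselves are copied unchanged, the quotation marks around attribute values stay as they are, and the dash inside the code element is left alone.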
Appendices
Appendix A: Default Extensions
Files with the following extensions are parsed by default:
- html, htm, xhtml,
- php, phtml, phtm, php2, php3, php4, php5,
- asp,
- jsp, jspx,
- shtml, shtm, sht, stm,
- vbhtml,
- ppthtml,
- ssp, jhtml
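Selecting the files to process by extension could look like the sketch below. The function name collect_files is hypothetical, and two behaviours are assumptions rather than documented facts: a file named explicitly is returned regardless of its extension, and extensions are compared case-insensitively.

```python
import os

DEFAULT_EXTENSIONS = [
    "html", "htm", "xhtml",
    "php", "phtml", "phtm", "php2", "php3", "php4", "php5",
    "asp", "jsp", "jspx",
    "shtml", "shtm", "sht", "stm",
    "vbhtml", "ppthtml", "ssp", "jhtml",
]


def collect_files(path, recursive=False, extensions=DEFAULT_EXTENSIONS):
    """Return the files under path whose extension is listed in extensions."""
    if os.path.isfile(path):
        return [path]  # assumption: an explicitly named file is always taken
    wanted = {ext.lower() for ext in extensions}
    found = []
    for root, _dirs, files in os.walk(path):
        for name in sorted(files):
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            if ext in wanted:
                found.append(os.path.join(root, name))
        if not recursive:
            break  # os.walk yields the top folder first; stop after it
    return found
```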
Appendix B: Named Actions
The following action sets are currently implemented.
Please note that the actions are realised using regular expressions. For better readability, the table below shows only the visible changes, not the expressions themselves.
| From Opening String | From Closing String | To Opening String | To Closing String |
|---|---|---|---|
| quotes.english | | | |
| ' | ' | ‘ | ’ |
| " | " | “ | ” |
| quotes.french | | | |
| < | > | ‹ | › |
| << | >> | « | » |
| quotes.german | | | |
| ' | ' | ‚ | ’ |
| " | " | „ | ” |
| to_quotes | | | |
| ' | ' | <q> | </q> |
| " | " | <q> | </q> |
| << | >> | <q> | </q> |
| < | > | <q> | </q> |
| commercial | | | |
| (c) | | © | |
| (r) | | ® | |
| (tm) | | ™ | |
| dashes | | | |
| - | | — | |
| <NUMBER>-<NUMBER> | | <NUMBER>–<NUMBER> | |
| bullets | | | |
| * | | • | |
| ellipsis | | | |
| ... | | … | |
| apostrophe | | | |
| ' | | ' | |
| math | | | |
| +/- | | ± | |
| 1/2 | | ½ | |
| 1/4 | | ¼ | |
| 3/4 | | ¾ | |
| ~ | | ≈ | |
| != | | ≠ | |
| <= | | ≤ | |
| >= | | ≥ | |
| <NUMBER>*<NUMBER> | | <NUMBER>×<NUMBER> | |
| <NUMBER>x<NUMBER> | | <NUMBER>×<NUMBER> | |
| <NUMBER>/<NUMBER> | | <NUMBER>÷<NUMBER> | |
| dagger | | | |
| ** | | ‡ | |
| * | | † | |
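To give an impression of how such actions can be realised with regular expressions, here is a small sketch covering a few of the rows above. The patterns are simplified illustrations written for this example; the expressions degrotesque actually uses may differ, and ACTIONS and apply_actions are hypothetical names.

```python
import re

# Illustrative patterns only; not the expressions used by degrotesque itself.
ACTIONS = {
    "ellipsis": [
        (r"\.\.\.", "\u2026"),           # ... becomes a horizontal ellipsis
    ],
    "dashes": [
        (r"(?<=\d)-(?=\d)", "\u2013"),   # en dash between two digits
        (r" - ", " \u2014 "),            # em dash between words
    ],
    "math": [
        (r"\+/-", "\u00b1"),             # plus-minus sign
        (r"!=", "\u2260"),               # not-equal sign
        (r"(?<=\d)x(?=\d)", "\u00d7"),   # multiplication sign between digits
    ],
}


def apply_actions(text, names):
    """Apply the named action sets to text, one replacement after the other."""
    for name in names:
        for pattern, replacement in ACTIONS[name]:
            text = re.sub(pattern, replacement, text)
    return text


print(apply_actions("Wait... pages 10-12 - about 3x4 +/- 1",
                    ["ellipsis", "dashes", "math"]))
```

The order of the action sets matters in such a scheme: the number range has to be handled before the generic dash, otherwise both rules would compete for the same hyphen.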
Appendix C: Skipped Elements
The contents of the following elements are not processed by default:
- script
- code
- style
- pre
- ?
- ?php
- %
- %=
- %@
- %--
- %!
- !--
Appendix D: Masking Action Set
The “masks” action set masks some patterns to protect them from replacements. When one of these patterns matches, the matching string is kept unchanged. The patterns are given in the following. Please note that a number in { } brackets gives the number of occurrences of the preceding element.
- 978-<NUMBER>-<NUMBER>-<NUMBER>-<NUMBER>{1}<NO_NUMBER>: avoid ISBN replacement
- 979-<NUMBER>-<NUMBER>-<NUMBER>-<NUMBER>{1}<NO_NUMBER>: avoid ISBN replacement
- <NUMBER>-<NUMBER>-<NUMBER>-<NUMBER>{1}<NO_NUMBER>: avoid ISBN replacement
- ISSN <NUMBER>{4}-<NUMBER>{4}: avoid ISSN replacement
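A rough sketch of how such masking can work: find the masked spans first and apply the replacements only to the text in between. The regular expression is a simplified reading of the list above, not the tool's own; MASK and apply_with_masks are hypothetical names, and treating the end of the string like a <NO_NUMBER> is an assumption.

```python
import re

# Simplified mask patterns modelled on the list above (assumptions).
MASK = re.compile(
    r"97[89]-\d+-\d+-\d+-\d(?!\d)"   # ISBN starting with 978- or 979-
    r"|\d+-\d+-\d+-\d(?!\d)"         # ISBN without such a prefix
    r"|ISSN \d{4}-\d{4}"             # ISSN
)


def apply_with_masks(text, transform):
    """Apply transform() to everything except the masked spans."""
    out = []
    pos = 0
    for match in MASK.finditer(text):
        out.append(transform(text[pos:match.start()]))  # unmasked part
        out.append(match.group(0))                      # masked part, kept as is
        pos = match.end()
    out.append(transform(text[pos:]))
    return "".join(out)


def en_dash(text):
    """Replace a hyphen between two digits by an en dash."""
    return re.sub(r"(?<=\d)-(?=\d)", "\u2013", text)


print(apply_with_masks("pp. 10-12, ISBN 978-3-16-148410-0, ISSN 1234-5678", en_dash))
```

Here the page range gets its en dash, while the hyphens inside the ISBN and the ISSN stay untouched.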
