Escapology for
Pythonistas
Alex Willmer
This talk is not about Unicode
Watch http://bit.ly/unipain to stop your Unicode pain
Special Characters
u'\0' (NUL)
u'\t' u'\n' (Control characters)
< > & ' ",
é æ €
SQL
Special characters
' String literal delimiter " ` [ ] Identifier delimiter /* */ // -- %
({ }) Comment delimiter
More...
Escape mechanism
- Parameterised query (Bind variables)
Photo:
Glenn Euloth
Wrong
cursor.execute(
u"SELECT name, ... FROM users "
u"WHERE username = '" + username + u"'")
cursor.execute(
u"SELECT name, ... FROM users "
u"WHERE username = '%s'"
% (username,))
def escape(s):
return s.replace(u"'", u'\\')
cursor.execute(
u"SELECT name, ... FROM users "
u"WHERE username = '%s'"
% (escape(username),))
Correct
cursor.execute(u"SELECT name, ... FROM users "
u"WHERE username = ?",
(username,))
Gotchas/tips
- paramstyle depends on database
- Dynamic SQL have to string interpolate identifiers don't interpolate data from external sources
- Don't use BLOBs - write binary data to a file, store the full path in the database
- For automatic encoding/decoding of Unicode in Postgresql register two type handlers
import psycopg2 import psycopg2.extensions psycopg2.extensions.register_type(psycopg2.extensions.UNICODE) psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY) conn = psycopg2.connect(...) cursor = conn.cursor() cursor.execute(u"SELECT name, ... FROM users " u"WHERE username = %s", (username,))
CSV
Special characters:
, \t | : Field delimiter
\n \n\f \f Row delimiter " ' Field quote Space
Escape mechanisms
- Quote fields
- Repeat quote character within fields
Wrong
data = [[42.0, u'Tartan paint',
u"Björk's DIY", u'€19.95'], [1, u'Hammer, left handed',
u'"Crazy" Eddie', u'¥2000']] with open('output.csv', 'w') as csv_out: for row in data:
item = unicode(item) line = ','.join(item.encode('utf-8') for item in row) csv_out.write(line + '\n')
Correct, but fragile
# -*- coding: utf-8 -*- import csv, codecs cStringIO class UTF8Recoder: """Iterator that reads an encoded stream and reencodes the input to UTF-8""" ... class UnicodeWriter: """A CSV writer which will write rows to CSV file "f", which is encoded in the given encoding."""
... data = [[42.0, u'Tartan paint', u"Björk's DIY", u'€19.95'], [1, u'Hammer, left handed', u'"Crazy" Eddie', u'¥2000']] with codecs.open('wildgeese.csv', 'wb') as csv_out: csv_out.write(codecs.BOM_UTF8) writer = UnicodeWriter(csv_out) writer.writerows(data)
Gotchas
- There's no universal standard for CSV, just common practise
- Excel will still spoil decimals & digit grouping, due to regional settings
e.g. 3.14159 in UK, 3,14159 in Sweden
Correct
Give up on TSV and CSV, use .xls with xlwt
# -*- coding: utf-8 -*- import xlwt data = [[4.0, u'Tartan paint',
u"Björk's DIY", u'€19.95'], [1, u'Hammer, left handed',
u'"Crazy" Eddie', u'¥2000']] workbook = xlwt.Workbook(encoding='utf-8') sheet = workbook.add_sheet(u'Sheet 1¼') for i, row in enumerate(data): for j, item in enumerate(row): sheet.write(i, j, item) workbook.save('wildgeese.xls')
JSON
What's special
{ } Object delimiter
[ ] Array delimiter
" String delimiter>
Escape mechanisms
- Quote all strings
- \uNNNN escape within strings
Wrong
data = {u'text': u'ℙƴ☂ℌøἤ',
u'number': 42,
u'bool': True,
u'sequence': (1,2,3,4)}
repr(data)
Correct
data = {u'text': u'ℙƴ☂ℌøἤ', u'number': 42, u'bool': True, u'sequence': (1,2,3,4)}
json.dumps(data)
Correct
data = {u'text': u'ℙƴ☂ℌøἤ', u'number': 42, u'bool': True, u'sequence': (1,2,3,4)}
json.dumps(data, ensure_ascii=False)
Gotchas
-
The json module will serialise and deserialise JSON that's invalid according to the RFC
-
JSON has no date type or tuple type
HTML/XML
Special characters:
& < > ' " Syntactic elements
` (parsed by Internet Explorer)
Escape mechanisms:
- Quote attributes (id="mainform")
- Use character entities 10 ÷ 2 = 5
-
Use character entity references 10 - 12 < 0
Wrong:
data = {u'name': u'Alice "Ali" Gator', u'role': u'CEO, <&> design Inc.'}
u'<a href="..." title="%(name)s">%(role)s</a>' % data
Wrong:
data = {u'name':
cgi.escape(u'Alice "Ali" Gator'), u'role':
cgi.escape(u'CEO, <&> design Inc.')}
u'<a href="..." title="%(name)s">%(role)s</a>' % data
Wrong
data = {u'name':
cgi.escape(u'Bob "Croc" O\'Doyle',
quote=True), u'role':
cgi.escape(u'CEO, <&> design Inc.')}
u'<a href="..." title="%(name)s">%(role)s</a>' % data
Correct (but fragile):
from xml.sax.saxutils import quoteattr
data = {
u'name':
quoteattr(u'Bob "Croc" O\'Doyle'), u'role':
cgi.escape(u'CEO, <&> design Inc.')}
u'<a href="..." title="%(name)s">%(role)s</a>' % data
Correct:
import jinja2 data = {u'name': u'Bob "Croc" O\'Doyle', u'role': u'CEO, <&> design Inc.'}
env = jinja2.Environment(autoescape=True) template = env.from_string( u'<a href="..." title="{{name}}">'
u'{{role}}</a>') template.render(data)
Correct:
import genshi.template
data = {u'name': u'Bob "Croc" O\'Doyle', u'role': u'CEO, <&> design Inc.'} template = genshi.template.MarkupTemplate( u'<a href="..." title="${name}">'
u'${role}</a>') stream = template.generate(**data) stream.render('html')
Gotchas
HTML
<![CDATA[ ]]> is not a useful escape mechanism,
it's just a convenience for hand written markup
XML 1.0 (XHTML)
Control characters are not valid,
even escaped as character entities
http://www.w3.org/TR/REC-xml/#charsets
XML 1.1
All unicode characters execept NUL are valid, some restricted characters must be escaped
http://www.w3.org/TR/xml11/#charsets
URL/POST
Special characters
Nearly everything, depending on section of the URL and schema
Escape mechanis
- Encode as UTF8
- Percent escape reserved and non-ascii bytes
Regular Expressions
Wrong
needle = u'needle+thread@example.com' haystack = (u'hay@straw.com\n' u'needle+thread@example.com\n' u'christian@bale.com') match = re.search(u'^%s$' % needle,
haystack,
re.UNICODE|re.MULTILINE)
Correct
needle = u'needle+thread@example.com' haystack = (u'hay@straw.com\n' u'needle+thread@example.com\n' u'christian@bale.com') match = re.search(u'^%s$'
% re.escape(needle),
haystack, re.UNICODE|re.MULTILINE)
Gotchas
- re.escape() is only written/tested for Python's re engine
- The re module uses a back tracking engine, re.search() can take exponential time for even simple patterns
Tips
- Use the re2 package for services open to DOS attacks
https://pypi.python.org/pypi/re2/
id attribute
HTML 4
([A-Za-z]) ([A-Za-z] | [0-9] | [-_:.])+
XHTML 1
(Letter | '_' | ':') (Letter | Digit | [-_:.] | CombiningChar | Extender)
HTML 5
([^ #x09#x0A#x0C#x0D])+
class attribute
HTML 4
Any characters (CDATA list), I recommend no whitespace
HTML 5
([^ #x09#x0A#x0C#x0D])+
XHTML 1
Any characters (with < and & escaped), I recommend no whitespace
Thank you
Slides:
https://bit.ly/escapology
Email:
al.willmer@cgi.com
Twitter:
@moreati
Escapology for Pythonistas
By Alex Willmer
Escapology for Pythonistas
The finer points of getting bytes in and out of JSON, SQL, URLs and other hard to reach places
- 1,414