Escapology for Pythonistas


Alex Willmer

This talk is not about Unicode


Watch http://bit.ly/unipain to stop your Unicode pain


Special Characters

    u'\0' (NUL) 
    u'\t' u'\n' (Control characters)
    < > & ' "
    é æ €


SQL

Special characters

'             String literal delimiter
" ` [ ]       Identifier delimiter
/* */ // -- % ({ })         Comment delimiters, and more...


Escape mechanism

  1. Parameterised query (Bind variables)
 
Photo:  Glenn Euloth

Wrong

cursor.execute(
    u"SELECT name, ... FROM users "
    u"WHERE username = '" + username + u"'")

cursor.execute(
    u"SELECT name, ... FROM users "
    u"WHERE username = '%s'"
    % (username,))

def escape(s):
    return s.replace(u"'", u"\\'")

cursor.execute(
    u"SELECT name, ... FROM users "
    u"WHERE username = '%s'"
    % (escape(username),))

Correct

cursor.execute(u"SELECT name, ... FROM users "
               u"WHERE username = ?",
               (username,))

Gotchas/tips

  • paramstyle depends on database
  • Dynamic SQL: identifiers (table/column names) can't be bound as parameters, so they must be string-interpolated; never interpolate data from external sources
  • Don't use BLOBs - write binary data to a file, store the full path in the database
  • For automatic encoding/decoding of Unicode in Postgresql register two type handlers
    import psycopg2
    import psycopg2.extensions
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
    conn = psycopg2.connect(...)
    cursor = conn.cursor()
    cursor.execute(u"SELECT name, ... FROM users "
                   u"WHERE username = %s",
                   (username,)) 
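As a concrete illustration of the qmark paramstyle, here is a sketch using the stdlib sqlite3 module (the table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE users (username TEXT, name TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("bob", "Bob O'Doyle"))

# The driver binds the value itself: the apostrophe in "O'Doyle" needed
# no escaping above, and an injection attempt below stays inert data.
username = "bob' OR '1'='1"
rows = conn.execute("SELECT name FROM users WHERE username = ?",
                    (username,)).fetchall()
# rows == [] -- the malicious string matched nothing
```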

CSV

Special characters:

, \t | : Space   Field delimiter
\n \r\n \r       Row delimiter
" '              Field quote

Escape mechanisms

  1. Quote fields
  2. Repeat quote character within fields
Wrong

data = [[42.0, u'Tartan paint', u"Björk's DIY", u'€19.95'],
        [1, u'Hammer, left handed', u'"Crazy" Eddie', u'¥2000']]

with open('output.csv', 'w') as csv_out:
    for row in data:
        line = ','.join(unicode(item).encode('utf-8')
                        for item in row)
        csv_out.write(line + '\n')

Correct, but fragile

# -*- coding: utf-8 -*-
import csv, codecs, cStringIO

class UTF8Recoder:
    """Iterator that reads an encoded stream and reencodes the input to UTF-8"""
    ...

class UnicodeWriter:
    """A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding."""
    ...

data = [[42.0, u'Tartan paint', u"Björk's DIY", u'€19.95'],
        [1, u'Hammer, left handed', u'"Crazy" Eddie', u'¥2000']]

with codecs.open('wildgeese.csv', 'wb') as csv_out:
    csv_out.write(codecs.BOM_UTF8)
    writer = UnicodeWriter(csv_out)
    writer.writerows(data)
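For what it's worth, on Python 3 the stdlib csv module handles Unicode natively, so the recoder dance above is unnecessary. A minimal sketch, writing to an in-memory buffer:

```python
import csv
import io

data = [[42.0, 'Tartan paint', "Björk's DIY", '€19.95'],
        [1, 'Hammer, left handed', '"Crazy" Eddie', '¥2000']]

# newline='' hands row-delimiter control to the csv module; fields
# containing the delimiter or quote char get quoted, and embedded
# quote characters are doubled.
buf = io.StringIO(newline='')
writer = csv.writer(buf)
writer.writerows(data)
output = buf.getvalue()
```

With a real file, pass `open('output.csv', 'w', newline='', encoding='utf-8')` instead of the StringIO.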

Gotchas

  • There's no universal standard for CSV, just common practice
  • Excel will still spoil decimals & digit grouping, due to regional settings
    e.g. 3.14159 in UK, 3,14159 in Sweden

Correct

Give up on TSV and CSV, use .xls with xlwt

# -*- coding: utf-8 -*-
import xlwt

data = [[4.0, u'Tartan paint', u"Björk's DIY", u'€19.95'],
        [1, u'Hammer, left handed', u'"Crazy" Eddie', u'¥2000']]

workbook = xlwt.Workbook(encoding='utf-8')
sheet = workbook.add_sheet(u'Sheet 1¼')
for i, row in enumerate(data):
    for j, item in enumerate(row):
        sheet.write(i, j, item)

workbook.save('wildgeese.xls')

JSON

Special characters

{ }     Object delimiter
[ ]     Array delimiter
"       String delimiter

Escape mechanisms

  1. Quote all strings
  2. \uNNNN escape within strings

Wrong

data = {u'text': u'ℙƴ☂ℌøἤ', 
        u'number': 42,
        u'bool': True,
        u'sequence': (1,2,3,4)}
repr(data)

Correct

data = {u'text': u'ℙƴ☂ℌøἤ', 
        u'number': 42,
        u'bool': True,
        u'sequence': (1,2,3,4)}
json.dumps(data)

Correct

data = {u'text': u'ℙƴ☂ℌøἤ', 
        u'number': 42,
        u'bool': True,
        u'sequence': (1,2,3,4)}
json.dumps(data, ensure_ascii=False) 

Gotchas

  • The json module will serialise and deserialise JSON that's invalid according to the RFC
  • JSON has no date type or tuple type
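Both gotchas can be seen in one sketch: a date needs a fallback serialiser via `default=` (the `encode_extra` name is mine), and a tuple silently becomes an array:

```python
import datetime
import json

# JSON has no date type, so json.dumps() calls this for anything
# it can't serialise natively; tuples need no help, but come back
# from json.loads() as lists, not tuples.
def encode_extra(obj):
    if isinstance(obj, (datetime.date, datetime.datetime)):
        return obj.isoformat()
    raise TypeError('%r is not JSON serializable' % obj)

data = {'when': datetime.date(2013, 5, 25), 'sequence': (1, 2, 3, 4)}
text = json.dumps(data, default=encode_extra, sort_keys=True)
# text == '{"sequence": [1, 2, 3, 4], "when": "2013-05-25"}'
```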


HTML/XML

Special characters:

& < > ' "   Syntactic elements
`           (parsed by Internet Explorer)

Escape mechanisms:

  1. Quote attributes (id="mainform")
  2. Use character entities 10 &#247; 2 = 5 
  3. Use character entity references 10 - 12 &lt; 0


Wrong:

data = {u'name': u'Alice "Ali" Gator',
        u'role': u'CEO, <&> design Inc.'}
u'<a href="..." title="%(name)s">%(role)s</a>' % data

Wrong:

data = {u'name': cgi.escape(u'Alice "Ali" Gator'),
        u'role': cgi.escape(u'CEO, <&> design Inc.')}
u'<a href="..." title="%(name)s">%(role)s</a>' % data

Wrong

data = {u'name': cgi.escape(u'Bob "Croc" O\'Doyle', quote=True),
        u'role': cgi.escape(u'CEO, <&> design Inc.')}
u'<a href="..." title="%(name)s">%(role)s</a>' % data

Correct (but fragile):

from xml.sax.saxutils import quoteattr
data = {u'name': quoteattr(u'Bob "Croc" O\'Doyle'),
        u'role': cgi.escape(u'CEO, <&> design Inc.')}
u'<a href="..." title=%(name)s>%(role)s</a>' % data
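A Python 3 aside: cgi.escape was deprecated and removed (in 3.8). Its replacement, html.escape, escapes both quote characters by default, so the quote=True pitfall above goes away. A sketch:

```python
import html

name = 'Bob "Croc" O\'Doyle'
role = 'CEO, <&> design Inc.'

# html.escape() handles & < > and, by default, both " and ',
# so the result is safe in double- or single-quoted attributes.
link = '<a href="..." title="%s">%s</a>' % (html.escape(name),
                                            html.escape(role))
```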

Correct:

import jinja2
data = {u'name': u'Bob "Croc" O\'Doyle',
        u'role': u'CEO, <&> design Inc.'}
env = jinja2.Environment(autoescape=True)
template = env.from_string(
    u'<a href="..." title="{{name}}">'
    u'{{role}}</a>')
template.render(data) 

Correct:

import genshi.template
data = {u'name': u'Bob "Croc" O\'Doyle',
        u'role': u'CEO, <&> design Inc.'}

template = genshi.template.MarkupTemplate(
    u'<a href="..." title="${name}">'
    u'${role}</a>')
stream = template.generate(**data)
stream.render('html')

Gotchas

HTML

<![CDATA[ ]]> is not a useful escape mechanism,
it's just a convenience for hand written markup

XML 1.0 (XHTML)

Control characters are not valid,
even escaped as character entities
http://www.w3.org/TR/REC-xml/#charsets

XML 1.1

All Unicode characters except NUL are valid; some restricted characters must be escaped
http://www.w3.org/TR/xml11/#charsets

URL/POST

Special characters

Nearly everything, depending on section of the URL and schema

Escape mechanisms

  1. Encode as UTF-8
  2. Percent-escape reserved and non-ASCII bytes
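Both steps, as a sketch with the stdlib (Python 3 names; the path and query values are invented):

```python
from urllib.parse import quote, urlencode

# quote() encodes to UTF-8, then percent-escapes reserved and
# non-ASCII bytes; safe='/' keeps the path separators intact.
path = quote('/docs/19.95€', safe='/')
# path == '/docs/19.95%E2%82%AC'

# For query strings and POST bodies, urlencode() does both steps
# per key/value pair and joins with & and =.
query = urlencode({'q': 'café', 'page': 1})
# query == 'q=caf%C3%A9&page=1'
```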

Regular Expressions

Wrong

needle = u'needle+thread@example.com'
haystack = (u'hay@straw.com\n'
            u'needle+thread@example.com\n'
            u'christian@bale.com')
match = re.search(u'^%s$' % needle,
                  haystack,
                  re.UNICODE|re.MULTILINE)

Correct

needle = u'needle+thread@example.com'
haystack = (u'hay@straw.com\n'
            u'needle+thread@example.com\n'
            u'christian@bale.com')
match = re.search(u'^%s$' % re.escape(needle),
                  haystack,
                  re.UNICODE|re.MULTILINE)
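To see why the escaping matters: unescaped, + and . in the address are metacharacters, so the pattern also matches strings that are not the needle at all (the mangled address below is made up for the demonstration):

```python
import re

needle = u'needle+thread@example.com'

# Unescaped, 'e+' means "one or more e" and '.' matches any character,
# so this unrelated string slips through...
loose = re.search(u'^%s$' % needle,
                  u'needleethread@exampleXcom')
# ...while re.escape() backslash-escapes every metacharacter,
# making the pattern match only the literal address.
strict = re.search(u'^%s$' % re.escape(needle),
                   u'needleethread@exampleXcom')
# loose matches; strict is None
```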

Gotchas

  • re.escape() is only written/tested for Python's re engine
  • The re module uses a backtracking engine; re.search() can take exponential time for even simple patterns

Tips

  • Use the re2 package for services open to DoS attacks
    https://pypi.python.org/pypi/re2/

id attribute

HTML 4 ([A-Za-z]) ([A-Za-z] | [0-9] | [-_:.])+
XHTML 1 (Letter | '_' | ':') (Letter | Digit | [-_:.] | CombiningChar | Extender)
HTML 5 ([^ #x09#x0A#x0C#x0D])+

class attribute

HTML 4 Any characters (CDATA list), I recommend no whitespace
HTML 5 ([^ #x09#x0A#x0C#x0D])+
XHTML 1 Any characters (with < and & escaped), I recommend no whitespace
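The HTML 4 id production above, as a quick validation sketch (the pattern name is mine; note HTML 4 actually allows single-character ids, hence * rather than the + shown):

```python
import re

# HTML 4 id: a letter, then any number of letters, digits, or [-_:.]
HTML4_ID = re.compile(r'^[A-Za-z][A-Za-z0-9\-_:.]*$')

valid = bool(HTML4_ID.match('mainform'))           # True
starts_with_digit = bool(HTML4_ID.match('2fast'))  # False
```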

Thank you



Twitter: @moreati


The finer points of getting bytes in and out of JSON, SQL, URLs and other hard to reach places