Escapology for

Pythonistas

Alex Willmer

This talk is not about Unicode

Watch http://bit.ly/unipain to stop your Unicode pain

Special Characters

    u'\0' (NUL) 
    u'\t' u'\n' (Control characters)
    < > & ' ", 
    é æ €

SQL

Special characters

'             String literal delimiter
" ` [ ]       Identifier delimiter
/* */ // -- % ({ })         Comment delimiterMore...

Escape mechanism

Parameterised query (Bind variables)

Photo: Glenn Euloth

Wrong

cursor.execute(
    u"SELECT name, ... FROM users "
    u"WHERE username = '" + username + u"'")


    cursor.execute(
    u"SELECT name, ... FROM users "
    u"WHERE username = '%s'"
    % (username,))


    def escape(s):
    return s.replace(u"'", u'\\')

cursor.execute(
    u"SELECT name, ... FROM users "
    u"WHERE username = '%s'"
    % (escape(username),))

Correct

cursor.execute(u"SELECT name, ... FROM users "
               u"WHERE username = ?",
               (username,))

Gotchas/tips

paramstyle depends on database
Dynamic SQL have to string interpolate identifiers don't interpolate data from external sources
Don't use BLOBs - write binary data to a file, store the full path in the database

For automatic encoding/decoding of Unicode in Postgresql register two type handlers

import psycopg2
import psycopg2.extensions
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
conn = psycopg2.connect(...)
cursor = conn.cursor()
cursor.execute(u"SELECT name, ... FROM users "
               u"WHERE username = %s",
               (username,))

CSV

Special characters:

, \t | :    Field delimiter
\n \n\f \f  Row delimiter
" '         Field quote
            Space

Escape mechanisms

Quote fields

Repeat quote character within fields

Wrong

data = [[42.0, u'Tartan paint',         u"Björk's DIY", u'€19.95'],
        [1, u'Hammer, left handed',          u'"Crazy" Eddie', u'¥2000']]

with open('output.csv', 'w') as csv_out:
    for row in data:        item = unicode(item)
        line = ','.join(item.encode('utf-8')
                        for item in row)
    csv_out.write(line + '\n')

Correct, but fragile

# -*- coding: utf-8 -*-
import csv, codecs cStringIO

class UTF8Recoder:
    """Iterator that reads an encoded stream and reencodes the input to UTF-8"""
    ...

class UnicodeWriter:
    """A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding."""    ...
data = [[42.0, u'Tartan paint', u"Björk's DIY", u'€19.95'],
        [1, u'Hammer, left handed', u'"Crazy" Eddie', u'¥2000']]

with codecs.open('wildgeese.csv', 'wb') as csv_out:
    csv_out.write(codecs.BOM_UTF8)
    writer = UnicodeWriter(csv_out)
    writer.writerows(data)

Gotchas

There's no universal standard for CSV, just common practise
Excel will still spoil decimals & digit grouping, due to regional settings
e.g. 3.14159 in UK, 3,14159 in Sweden

Correct

Give up on TSV and CSV, use .xls with xlwt

# -*- coding: utf-8 -*-
import xlwt

data = [[4.0, u'Tartan paint',         u"Björk's DIY", u'€19.95'],
        [1, u'Hammer, left handed',         u'"Crazy" Eddie', u'¥2000']]

workbook = xlwt.Workbook(encoding='utf-8')
sheet = workbook.add_sheet(u'Sheet 1¼')
for i, row in enumerate(data):
	for j, item in enumerate(row):
		sheet.write(i, j, item)

workbook.save('wildgeese.xls')

JSON

What's special

{ }     Object delimiter
[ ]     Array delimiter
"       String delimiter>

Escape mechanisms

Quote all strings
\uNNNN escape within strings

Wrong

data = {u'text': u'ℙƴ☂ℌøἤ', 
        u'number': 42,
        u'bool': True,
        u'sequence': (1,2,3,4)}
repr(data)

Correct

data = {u'text': u'ℙƴ☂ℌøἤ', 
        u'number': 42,
        u'bool': True,
        u'sequence': (1,2,3,4)}
json.dumps(data)

Correct

data = {u'text': u'ℙƴ☂ℌøἤ', 
        u'number': 42,
        u'bool': True,
        u'sequence': (1,2,3,4)}
json.dumps(data, ensure_ascii=False)

Gotchas

The json module will serialise and deserialise JSON that's invalid according to the RFC
JSON has no date type or tuple type

HTML/XML

Special characters:

& < > ' "   Syntactic elements
`           (parsed by Internet Explorer)‫

Escape mechanisms:

Quote attributes (id="mainform")
Use character entities 10 ÷ 2 = 5
Use character entity references 10 - 12 < 0

Wrong:

data = {u'name': u'Alice "Ali" Gator',
        u'role': u'CEO, <&> design Inc.'}u'<a href="..." title="%(name)s">%(role)s</a>' % data

Wrong:

data = {u'name':        cgi.escape(u'Alice "Ali" Gator'),
        u'role':        cgi.escape(u'CEO, <&> design Inc.')}u'<a href="..." title="%(name)s">%(role)s</a>' % data

Wrong

data = {u'name':        cgi.escape(u'Bob "Croc" O\'Doyle',                   quote=True),
    u'role':        cgi.escape(u'CEO, <&> design Inc.')}
u'<a href="..." title="%(name)s">%(role)s</a>' % data

Correct (but fragile):

from xml.sax.saxutils import quoteattrdata = {    u'name':    quoteattr(u'Bob "Croc" O\'Doyle'),
    u'role':    cgi.escape(u'CEO, <&> design Inc.')}
u'<a href="..." title="%(name)s">%(role)s</a>' % data

Correct:

import jinja2
data = {u'name': u'Bob "Croc" O\'Doyle',
        u'role': u'CEO, <&> design Inc.'}
env = jinja2.Environment(autoescape=True)
template = env.from_string(
    u'<a href="..." title="{{name}}">'    u'{{role}}</a>')
template.render(data)

Correct:

import genshi.template
data = {u'name': u'Bob "Croc" O\'Doyle',
        u'role': u'CEO, <&> design Inc.'}

template = genshi.template.MarkupTemplate(
    u'<a href="..." title="${name}">'    u'${role}</a>')
stream = template.generate(**data)
stream.render('html')

Gotchas

HTML

<![CDATA[ ]]> is not a useful escape mechanism,
it's just a convenience for hand written markup

XML 1.0 (XHTML)

Control characters are not valid,
even escaped as character entities
http://www.w3.org/TR/REC-xml/#charsets

XML 1.1

All unicode characters execept NUL are valid, some restricted characters must be escaped
http://www.w3.org/TR/xml11/#charsets

URL/POST

Special characters

Nearly everything, depending on section of the URL and schema

Escape mechanis

Encode as UTF8
Percent escape reserved and non-ascii bytes

Regular Expressions

Wrong

needle = u'needle+thread@example.com'
haystack = (u'hay@straw.com\n'
            u'needle+thread@example.com\n'
            u'christian@bale.com')
match = re.search(u'^%s$' % needle,                  haystack,                  re.UNICODE|re.MULTILINE)

Correct

needle = u'needle+thread@example.com'
haystack = (u'hay@straw.com\n'
            u'needle+thread@example.com\n'
            u'christian@bale.com')
match = re.search(u'^%s$'                  % re.escape(needle),                  haystack,
                  re.UNICODE|re.MULTILINE)

Gotchas

re.escape() is only written/tested for Python's re engine
The re module uses a back tracking engine, re.search() can take exponential time for even simple patterns

Tips

Use the re2 package for services open to DOS attacks
https://pypi.python.org/pypi/re2/

id attribute

HTML 4 ([A-Za-z]) ([A-Za-z] | [0-9] | [-_:.])+

HTML 5 ([^ #x09#x0A#x0C#x0D])+

class attribute

HTML 4 Any characters (CDATA list), I recommend no whitespace

HTML 5 ([^ #x09#x0A#x0C#x0D])+

XHTML 1 Any characters (with < and & escaped), I recommend no whitespace

Thank you

Slides: https://bit.ly/escapology

Email: al.willmer@cgi.com

Twitter: @moreati

Escapology for Pythonistas

By Alex Willmer

Escapology for Pythonistas

The finer points of getting bytes in and out of JSON, SQL, URLs and other hard to reach places

1,561

Alex Willmer

moreati

Escapology for

Pythonistas

Alex Willmer

This talk is not about Unicode

Special Characters

SQL

Special characters

Escape mechanism

Wrong

Correct

Gotchas/tips

CSV

Special characters:

Escape mechanisms

Correct, but fragile

Gotchas

Correct

JSON

What's special

Escape mechanisms

Wrong

Correct

Correct

Gotchas

HTML/XML

Special characters:

Escape mechanisms:

Wrong:

Wrong:

Wrong

Correct (but fragile):

Correct:

Correct:

Gotchas

HTML

XML 1.0 (XHTML)

XML 1.1

URL/POST

Special characters

Escape mechanis

Regular Expressions

Wrong

Correct

Gotchas

Tips

id attribute

class attribute

Thank you

Escapology for Pythonistas

More from Alex Willmer