Elixir 1.6
Chapter 11
Strings and Binaries
String Literals
Elixir has two kinds of string:
'single-quoted' and "double-quoted".
They differ in their internal representation
but have many things in common.
- Strings can hold characters in UTF-8 encoding.
- They may contain escape sequences.
- They allow interpolation using the syntax #{...}.
- Characters that would otherwise have special meaning can be escaped with a backslash.
- They support heredocs.
String Literals
Escape sequences:
\a | BEL (0x07) | \b | BS (0x08) | \d | DEL (0x7f) |
---|---|---|---|---|---|
\e | ESC (0x1b) | \f | FF (0x0c) | \n | NL (0x0a) |
\r |
CR (0x0d) |
\s |
SP (0x20) |
\t |
TAB (0x09) |
\v |
VT (0x0b) |
\uhhh |
1–6 hex digits |
\xhh |
2 hex digits |
String Literals
Interpolation #{...}
iex> name = "dave"
"dave"
iex> "Hello, #{String.capitalize name}!"
"Hello, Dave!"
String Literals
The heredoc notation: Triple the string delimiter (”’ or """)
IO.puts "starting"
IO.write """
my
string
"""
IO.puts "ending"
# produces:
starting
my
string
ending
Heredocs are used extensively to add documentation to functions and modules.
Sigils
Like Ruby, Elixir has an
alternative syntax for some literals.
A sigil starts with a tilde,
followed by an upper or lowercase letter,
some delimited content, and perhaps some options.
The delimiters can be
<…>, {…}, […], (…), |…|, /…/, "…", and ’…’.
Sigils
The letter determines the sigil’s type:
- ~C A character list with no escaping or interpolation
- ~c A character list, escaped and interpolated just like a single-quoted string
- ~R A regular expression with no escaping or interpolation
- ~r A regular expression, escaped and interpolated
- ~S A string with no escaping or interpolation
- ~s A string, escaped and interpolated just like a double-quoted string
- ~W A list of whitespace-delimited words, with no escaping or interpolation
- ~w A list of whitespace-delimited words, with escaping and interpolation
Sigils
iex> ~C[1\n2#{1+2}]
'1\\n2\#{1+2}'
iex> ~c"1\n2#{1+2}"
'1\n23'
iex> ~S[1\n2#{1+2}]
"1\\n2\#{1+2}"
iex> ~s/1\n2#{1+2}/
"1\n23"
iex> ~W[the c#{'a'}t sat on the mat]
["the", "c\#{'a'}t", "sat", "on", "the", "mat"]
iex> ~w[the c#{'a'}t sat on the mat]
["the", "cat", "sat", "on", "the", "mat"]
Sigils
iex> ~w[the c#{'a'}t sat on the mat]a
[:the, :cat, :sat, :on, :the, :mat]
iex> ~w[the c#{'a'}t sat on the mat]c
['the', 'cat', 'sat', 'on', 'the', 'mat']
iex> ~w[the c#{'a'}t sat on the mat]s
["the", "cat", "sat", "on", "the", "mat"]
The ~W and ~w sigils take an optional type specifier,
a, c, or s, which determines whether it returns
atoms, a list, or strings of characters.
Elixir does not check the nesting of delimiters, so the sigil ~s{a{b} is the three-character string a{b.
Sigils
iex> ~w"""
...> the
...> cat
...> sat
...> """
["the", "cat", "sat"]
If the opening delimiter is three single or three double quotes, the sigil is treated as a heredoc.
If you want to specify modifiers with heredoc sigils (most commonly you’d do this with ~r), add them after the trailing delimiter.
iex> ~r"""
...> hello
...> """i
~r/hello\n/i
The Name “strings”
In Elixir, the convention is that we call only
double-quoted strings “strings.”
The single-quoted form is a character list.
The single and double-quoted forms are very different, and libraries that work on strings work only on the double-quoted form.
Single-Quoted Strings—
Lists of Character Codes
Single-quoted strings are represented as a list of integer values, each value corresponding to a codepoint in the string. We refer to them as character lists (or char lists).
iex> str = 'wombat'
'wombat'
iex> is_list str
true
iex> length str
6
iex> Enum.reverse str
'tabmow'
This is confusing: iex says it is a list, but it shows the value as a string.
iex prints a list of integers as a string if it believes
each number in the list is a printable character.
Single-Quoted Strings—
Lists of Character Codes
iex> [ 67, 65, 84 ]
CAT
iex> str = 'wombat'
'wombat'
iex> :io.format "~w~n", [ str ]
[119,111,109,98,97,116]
:ok
iex> List.to_tuple str
{119, 111, 109, 98, 97, 116}
iex> str ++ [0]
[119, 111, 109, 98, 97, 116, 0]
The ~w in the format string forces str to be written as an Erlang term—the underlying list of integers.
The ~n is a newline.
str ++ [0] creates a new character list with a null byte at the end. iex no longer thinks all the bytes are printable, and so returns the underlying character codes.
Single-Quoted Strings—
Lists of Character Codes
iex> '∂x/∂y'
[8706, 120, 47, 8706, 121]
iex> 'pole' ++ 'vault'
'polevault'
iex> 'pole' -- 'vault'
'poe'
iex> List.zip [ 'abc', '123' ]
[{97, 49}, {98, 50}, {99, 51}]
iex> [ head | tail ] = 'cat'
'cat'
iex> head
99
iex> tail
'at'
iex> [ head | tail ]
'cat'
Why is the head of ’cat’ 99 and not c?. A char list is just a list of integer character codes,
so each individual entry is a number.
It happens that 99 is the code for a lowercase c.
Single-Quoted Strings—
Lists of Character Codes
defmodule Parse do
def number([ ?- | tail ]), do: _number_digits(tail, 0) * -1
def number([ ?+ | tail ]), do: _number_digits(tail, 0)
def number(str), do: _number_digits(str, 0)
defp _number_digits([], value), do: value
defp _number_digits([ digit | tail ], value)
when digit in '0123456789' do
_number_digits(tail, value*10 + digit - ?0)
end
defp _number_digits([ non_digit | _ ], _) do
raise "Invalid digit '#{[non_digit]}'"
end
end
The notation ?c returns the integer code for the character c. This is often useful when employing patterns to extract information from character lists.
Single-Quoted Strings—
Lists of Character Codes
iex> c("parse.exs")
[Parse]
iex> Parse.number('123')
123
iex> Parse.number('-123')
-123
iex> Parse.number('+123')
123
iex> Parse.number('+9')
9
iex> Parse.number('+a')
** (RuntimeError) Invalid digit 'a”
A simple module that parses the character-list representation of an optionally signed decimal number.
Binaries
iex> b = << 1, 2, 3 >>
<<1, 2, 3>>
iex> byte_size b
3
iex> bit_size b
24
# you can also specify bit size
iex> b = << 1::size(2), 1::size(3) >> # 01 001
<<9::size(5)>> # = 9 (base 10)
iex> byte_size b
1
iex> bit_size b
5
The binary type represents a sequence of bits.
A binary literal looks like << term,… >>.
The simplest term is just a number from 0 to 255. The numbers are stored as successive bytes in the binary.
Binaries
iex> int = << 1 >>
<<1>>
iex> float = << 2.5 :: float >>
<<64, 4, 0, 0, 0, 0, 0, 0>>
iex> mix = << int :: binary, float :: binary >>
<<1, 64, 4, 0, 0, 0, 0, 0, 0>>
# IEEE 754 float has a sign bit, 11 bits of exponent,
# and 52 bits of mantissa.
# The exponent is biased by 1023,
# and the mantissa is a fraction with the top bit assumed to be 1
iex> << sign::size(1), exp::size(11), mantissa::size(52) >> = << 3.14159::float >>
iex> (1 + mantissa / :math.pow(2, 52)) * :math.pow(2, exp-1023)
3.14159
You can store integers, floats,
and other binaries in binaries.
Double-Quoted Strings Are Binaries
The contents of a double-quoted string (dqs) are stored as a consecutive sequence of bytes in UTF-8 encoding. This does have two implications.
- because UTF-8 characters can take more than a single byte to represent, the size of the binary is not necessarily the length of the string.
- because you’re no longer using lists, you need to learn and work with the binary syntax alongside the list syntax in your code.
Double-Quoted Strings Are Binaries
iex> dqs = "∂x/∂y"
"∂x/∂y"
iex> String.length dqs
5
iex> byte_size dqs
9
iex> String.at(dqs, 0)
"∂"
iex> String.codepoints(dqs)
["∂", "x", "/", "∂", "y"]
iex> String.split(dqs, "/")
["∂x", "∂y"]
Strings and Elixir Libraries
When Elixir library documentation uses the word string (and most of the time it uses the word binary),
it means double-quoted strings.
The String module defines functions that work with double-quoted strings.
A single grapheme can consist of multiple codepoints that may be perceived as a single character by readers. For example, the “é” grapheme can be represented either as a single “e with acute” codepoint (like above), or as the letter “e” followed by a “combining acute accent” (two codepoints).
Graphemes can be locale dependent.
Strings and Elixir Libraries
iex> String.capitalize "école"
"École"
iex> String.capitalize "ÎÎÎÎÎ"
"Îîîîî"
at(str, offset)
Returns the grapheme at the given offset (starting at 0). Negative offsets count from the end of the string.
iex> String.at("∂og", 0)
"∂"
iex> String.at("∂og", -1)
"g"
capitalize(str)
Converts str to lowercase,
and then capitalizes the first character.
Strings and Elixir Libraries
iex> String.downcase "ØRSteD"
"ørsted"
codepoints(str)
Returns the codepoints in str.
iex> String.codepoints("José's ∂øg")
["J", "o", "s", "é", "'", "s", " ", "∂", "ø", "g"]
downcase(str)
Converts str to lowercase.
Strings and Elixir Libraries
iex> String.ends_with? "string", ["elix", "stri", "ring"]
true
duplicate(str, n)
Returns a string containing n copies of str.
iex> String.duplicate "Ho! ", 3
"Ho! Ho! Ho! "
ends_with?(str, suffix | [ suffixes ])
True if str ends with any of the given suffixes.
Strings and Elixir Libraries
first(str)
Returns the first grapheme from str.
iex> String.first "∂og"
"∂"
Strings and Elixir Libraries
iex> String.codepoints "noe\u0308l"
["n", "o", "e", "¨", "l"]
iex> String.graphemes "noe\u0308l"
["n", "o", "ë", "l"]
graphemes(str)
Returns the graphemes in the string. This is different from the codepoints function, which lists combining characters separately.
The following example uses a combining diaeresis along with the letter “e” to represent “ë”.
Strings and Elixir Libraries
iex> String.length "∂x/∂y"
5
last(str)
Returns the last grapheme from str.
iex> String.last "∂og"
"g"
length(str)
Returns the number of graphemes in str.
Strings and Elixir Libraries
iex> String.lstrip "\t\f Hello\t\n"
"Hello\t\n"
ljust(str, new_length, padding \\ " ")
Returns a new string, at least new_length characters long, containing str left-justified
and padded with padding.
iex> String.ljust("cat", 5)
"cat "
lstrip(str)
Removes leading whitespace from str.
Strings and Elixir Libraries
iex> String.next_codepoint("∂og")
{"∂", "og"}
lstrip(str, character)
Removes leading copies of character (an integer codepoint) from str.
iex> String.lstrip "!!!SALE!!!", ?!
"SALE!!!"
next_codepoint(str)
Splits str into its leading codepoint and the rest, or nil if str is empty. This may be used as the basis of an iterator.
Strings and Elixir Libraries
printable?(str)
Returns true if str contains only printable characters.
iex> String.printable? "José"
true
iex> String.printable? "\x{0000} a null"
false
Strings and Elixir Libraries
replace(str, pattern, replacement,
options \\ [global: true, insert_replaced: nil])
Replaces pattern with replacement
in str under control of options.
If the :global option is true, all occurrences of the pattern are replaced; otherwise only the first is replaced.
If :insert_replaced is a number, the pattern is inserted into the replacement at that offset.
If the option is a list, it is inserted multiple times.
Strings and Elixir Libraries
iex> String.replace "the cat on the mat", "at", "AT"
"the cAT on the mAT"
iex> String.replace "the cat on the mat", "at", "AT", global: false
"the cAT on the mat"
iex> String.replace "the cat on the mat", "at", "AT", insert_replaced: 0
“the catAT on the matAT"
iex> String.replace "the cat on the mat", "at", "AT", insert_replaced: [0,2]
"the catATat on the matATat"
replace(str, pattern, replacement, options \\ [global: true, insert_replaced: nil])
Strings and Elixir Libraries
iex> String.rjust("cat", 5, ?>)
">>cat"
reverse(str)
Reverses the graphemes in a string.
iex> String.reverse "pupils"
"slipup"
iex> String.reverse "∑ƒ÷∂"
"∂÷ƒ∑"
rjust(str, new_length, padding \\ 32)
Returns a new string, at least new_length characters long, containing str right-justified
and padded with padding.
Strings and Elixir Libraries
iex> String.rstrip "!!!SALE!!!", ?!
"!!!SALE"
rstrip(str)
Removes trailing whitespace from str.
iex> String.rstrip(" line \r\n")
" line"
rstrip(str, character)
Removes trailing occurrences of character from str.
Strings and Elixir Libraries
iex> String.starts_with? "string", ["elix", "stri", "ring"]
true
slice(str, offset, len)
Returns a len character substring starting at offset (measured from the end of str if negative).
iex> String.slice "the cat on the mat", 4, 3
"cat"
iex> String.slice "the cat on the mat", -3, 3
"mat"
starts_with?(str, prefix | [ prefixes ])
True if str starts with any of the given prefixes.
Strings and Elixir Libraries
iex> String.trim "!!!SALE!!!", "!"
"SALE"
trim(str)
Trims leading and trailing whitespace from str.
iex> String.trim "\t Hello \r\n"
"Hello"
trim(str, character)
Trims leading and trailing instances of character from str.
Strings and Elixir Libraries
iex> String.split " the cat on the mat "
["the", "cat", "on", "the", "mat"]
iex> String.split "the cat on the mat", "t"
["", "he ca", " on ", "he ma", ""]
iex> String.split "the cat on the mat", ~r{[ae]}
["th", " c", "t on th", " m", "t"]
iex> String.split "the cat on the mat", ~r{[ae]}, parts: 2
["th", " cat on the mat"]
split(str, pattern \\ nil, options \\ [global: true])
Splits str into substrings delimited by pattern. If :global is false, only one split is performed. pattern can be a string, a regular expression, or nil. In the latter case, the string is split on whitespace.
Strings and Elixir Libraries
iex> String.valid_character? "∂"
true
iex> String.valid_character? "∂og"
false
upcase(str)
iex> String.upcase "José Ørstüd"
"JOSÉ ØRSTÜD"
valid_character?(str)
Returns true if str is a single-character string
containing a valid codepoint.
Binaries and Pattern Matching
The first rule of binaries is
“if in doubt, specify the type of each field.”
Available types are binary, bits, bitstring, bytes, float, integer, utf8, utf16, and utf32.
<< length::unsigned-integer-size(12), flags::bitstring-size(4) >> = data
Use hyphens to separate multiple attributes for a field:
However, unless you’re doing a lot of work with binary file or protocol formats, the most common use of all this scary stuff is to process UTF-8 strings.
String Processing with Binaries
When we process lists, we use patterns that split the head from the rest of the list.
With binaries that hold strings, we can do the same kind of trick. We have to specify the type of the head (UTF-8), and make sure the tail remains a binary.
defp _each(<< head :: utf8, tail :: binary >>, func) do
func.(head)
_each(tail, func)
end
String Processing with Binaries
defmodule Utf8 do
def each(str, func) when is_binary(str), do: _each(str, func)
defp _each(<< head :: utf8, tail :: binary >>, func) do
func.(head)
_each(tail, func)
end
defp _each(<<>>, _func), do: []
end
Utf8.each "∂og", fn char -> IO.puts char end
# produces
8706
111
103
Rather than use [ head | tail ], we use
<< head::utf8, tail::binary >>. And rather than terminate when we reach the empty list, [], we look for an empty binary, <<>>.
Thank you!
Programming Elixir 1.6 Chapter 11
By Dustin McCraw
Programming Elixir 1.6 Chapter 11
- 1,179