Elixir has two kinds of string:
'single-quoted' and "double-quoted".
They differ in their internal representation
but have many things in common.
Escape sequences:
\a | BEL (0x07) | \b | BS (0x08) | \d | DEL (0x7f) |
---|---|---|---|---|---|
\e | ESC (0x1b) | \f | FF (0x0c) | \n | NL (0x0a) |
\r | CR (0x0d) | \s | SP (0x20) | \t | TAB (0x09) |
\v | VT (0x0b) | \uhhh | 1–6 hex digits | \xhh | 2 hex digits |
Interpolation #{...}
iex> name = "dave"
"dave"
iex> "Hello, #{String.capitalize name}!"
"Hello, Dave!"
The heredoc notation: Triple the string delimiter (''' or """)
IO.puts "starting"
IO.write """
my
string
"""
IO.puts "ending"
# produces:
starting
my
string
ending
Heredocs are used extensively to add documentation to functions and modules.
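As a minimal sketch of that use, the module below attaches heredoc documentation with @moduledoc and @doc. The module name GreeterSketch and its greet function are hypothetical, invented only for this illustration.

```elixir
defmodule GreeterSketch do
  @moduledoc """
  A hypothetical module, used only to show heredoc documentation.
  """

  @doc """
  Returns a greeting for `name`, capitalizing it first.
  """
  def greet(name) do
    "Hello, #{String.capitalize(name)}!"
  end
end

IO.puts GreeterSketch.greet("dave")
# prints Hello, Dave!
```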
Like Ruby, Elixir has an
alternative syntax for some literals.
A sigil starts with a tilde,
followed by an upper or lowercase letter,
some delimited content, and perhaps some options.
The delimiters can be
<…>, {…}, […], (…), |…|, /…/, "…", and '…'.
The letter determines the sigil’s type:
iex> ~C[1\n2#{1+2}]
'1\\n2\#{1+2}'
iex> ~c"1\n2#{1+2}"
'1\n23'
iex> ~S[1\n2#{1+2}]
"1\\n2\#{1+2}"
iex> ~s/1\n2#{1+2}/
"1\n23"
iex> ~W[the c#{'a'}t sat on the mat]
["the", "c\#{'a'}t", "sat", "on", "the", "mat"]
iex> ~w[the c#{'a'}t sat on the mat]
["the", "cat", "sat", "on", "the", "mat"]
iex> ~w[the c#{'a'}t sat on the mat]a
[:the, :cat, :sat, :on, :the, :mat]
iex> ~w[the c#{'a'}t sat on the mat]c
['the', 'cat', 'sat', 'on', 'the', 'mat']
iex> ~w[the c#{'a'}t sat on the mat]s
["the", "cat", "sat", "on", "the", "mat"]
The ~W and ~w sigils take an optional type specifier,
a, c, or s, which determines whether the result is a list of
atoms, character lists, or strings.
Elixir does not check the nesting of delimiters, so the sigil ~s{a{b} is the three-character string a{b.
iex> ~w"""
...> the
...> cat
...> sat
...> """
["the", "cat", "sat"]
If the opening delimiter is three single or three double quotes, the sigil is treated as a heredoc.
If you want to specify modifiers with heredoc sigils (most commonly you’d do this with ~r), add them after the trailing delimiter.
iex> ~r"""
...> hello
...> """i
~r/hello\n/i
In Elixir, the convention is that we call only
double-quoted strings “strings.”
The single-quoted form is a character list.
The single and double-quoted forms are very different, and libraries that work on strings work only on the double-quoted form.
Single-quoted strings are represented as a list of integer values, each value corresponding to a codepoint in the string. We refer to them as character lists (or char lists).
iex> str = 'wombat'
'wombat'
iex> is_list str
true
iex> length str
6
iex> Enum.reverse str
'tabmow'
This is confusing: iex says it is a list, but it shows the value as a string.
iex prints a list of integers as a string if it believes
each number in the list is a printable character.
iex> [ 67, 65, 84 ]
'CAT'
iex> str = 'wombat'
'wombat'
iex> :io.format "~w~n", [ str ]
[119,111,109,98,97,116]
:ok
iex> List.to_tuple str
{119, 111, 109, 98, 97, 116}
iex> str ++ [0]
[119, 111, 109, 98, 97, 116, 0]
The ~w in the format string forces str to be written as an Erlang term—the underlying list of integers.
The ~n is a newline.
str ++ [0] creates a new character list with a null byte at the end. iex no longer thinks all the bytes are printable, and so returns the underlying character codes.
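Another way to see the underlying integers, sketched here on the assumption of a recent Elixir release where inspect accepts the charlists: :as_lists option, is to ask for the list form explicitly. The ~c"…" sigil is used so the snippet does not depend on how a given release treats single-quoted literals.

```elixir
str = ~c"wombat"   # same value as the single-quoted 'wombat'

# Force the list form regardless of whether the integers look printable.
IO.puts inspect(str, charlists: :as_lists)
# prints [119, 111, 109, 98, 97, 116]

# The value itself is just a list of integers.
IO.puts str == [119, 111, 109, 98, 97, 116]
# prints true
```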
iex> '∂x/∂y'
[8706, 120, 47, 8706, 121]
iex> 'pole' ++ 'vault'
'polevault'
iex> 'pole' -- 'vault'
'poe'
iex> List.zip [ 'abc', '123' ]
[{97, 49}, {98, 50}, {99, 51}]
iex> [ head | tail ] = 'cat'
'cat'
iex> head
99
iex> tail
'at'
iex> [ head | tail ]
'cat'
Why is the head of 'cat' 99 and not c? A char list is just a list of integer character codes,
so each individual entry is a number.
It happens that 99 is the code for a lowercase c.
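A short sketch of going back from those integers to text: a single code can be placed in a binary with the utf8 type, and a whole char list can be converted with List.to_string. The ~c"…" sigil here produces the same list as the single-quoted form.

```elixir
[head | tail] = ~c"cat"

IO.puts head                   # prints 99
IO.puts <<head::utf8>>         # prints c
IO.puts List.to_string(tail)   # prints at
```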
defmodule Parse do
  def number([ ?- | tail ]), do: _number_digits(tail, 0) * -1
  def number([ ?+ | tail ]), do: _number_digits(tail, 0)
  def number(str), do: _number_digits(str, 0)

  defp _number_digits([], value), do: value
  defp _number_digits([ digit | tail ], value)
  when digit in '0123456789' do
    _number_digits(tail, value*10 + digit - ?0)
  end
  defp _number_digits([ non_digit | _ ], _) do
    raise "Invalid digit '#{[non_digit]}'"
  end
end
The notation ?c returns the integer code for the character c. This is often useful when employing patterns to extract information from character lists.
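A few standalone uses of the ? notation, as a sketch. The is_digit function is a hypothetical helper written for this example, not part of any library.

```elixir
IO.puts(?c)        # prints 99
IO.puts(?0)        # prints 48
IO.puts(?7 - ?0)   # prints 7

# Hypothetical helper: is this character code an ASCII digit?
is_digit = fn ch -> ch in ?0..?9 end

IO.puts is_digit.(?5)   # prints true
IO.puts is_digit.(?x)   # prints false
```

Subtracting ?0 from a digit's code is the same step Parse uses to turn a character into its numeric value.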
iex> c("parse.exs")
[Parse]
iex> Parse.number('123')
123
iex> Parse.number('-123')
-123
iex> Parse.number('+123')
123
iex> Parse.number('+9')
9
iex> Parse.number('+a')
** (RuntimeError) Invalid digit 'a'
A simple module that parses the character-list representation of an optionally signed decimal number.
iex> b = << 1, 2, 3 >>
<<1, 2, 3>>
iex> byte_size b
3
iex> bit_size b
24
# you can also specify bit size
iex> b = << 1::size(2), 1::size(3) >> # 01 001
<<9::size(5)>> # = 9 (base 10)
iex> byte_size b
1
iex> bit_size b
5
The binary type represents a sequence of bits.
A binary literal looks like << term,… >>.
The simplest term is just a number from 0 to 255. The numbers are stored as successive bytes in the binary.
iex> int = << 1 >>
<<1>>
iex> float = << 2.5 :: float >>
<<64, 4, 0, 0, 0, 0, 0, 0>>
iex> mix = << int :: binary, float :: binary >>
<<1, 64, 4, 0, 0, 0, 0, 0, 0>>
# IEEE 754 float has a sign bit, 11 bits of exponent,
# and 52 bits of mantissa.
# The exponent is biased by 1023,
# and the mantissa is a fraction with the top bit assumed to be 1
iex> << sign::size(1), exp::size(11), mantissa::size(52) >> = << 3.14159::float >>
iex> (1 + mantissa / :math.pow(2, 52)) * :math.pow(2, exp-1023)
3.14159
You can store integers, floats,
and other binaries in binaries.
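The same field types work in a pattern, so a binary can be taken apart the way it was put together. A small sketch, with values chosen only for illustration:

```elixir
# Pack an 8-bit integer, a 64-bit float, and a two-byte binary together.
packed = << 7, 2.5 :: float, "ok" :: binary >>

IO.puts byte_size(packed)    # prints 11

# Unpack with the same field types.
<< n, f :: float, rest :: binary >> = packed

IO.puts n      # prints 7
IO.puts f      # prints 2.5
IO.puts rest   # prints ok
```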
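The contents of a double-quoted string (dqs) are stored as a consecutive sequence of bytes in UTF-8 encoding. This has two implications: a character can occupy more than one byte, so the size of the binary is not necessarily the length of the string, and strings are no longer lists, so you work with them using binary syntax rather than list syntax.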
iex> dqs = "∂x/∂y"
"∂x/∂y"
iex> String.length dqs
5
iex> byte_size dqs
9
iex> String.at(dqs, 0)
"∂"
iex> String.codepoints(dqs)
["∂", "x", "/", "∂", "y"]
iex> String.split(dqs, "/")
["∂x", "∂y"]
When Elixir library documentation uses the word string (and most of the time it uses the word binary),
it means double-quoted strings.
The String module defines functions that work with double-quoted strings.
A single grapheme can consist of multiple codepoints that may be perceived as a single character by readers. For example, the “é” grapheme can be represented either as a single “e with acute” codepoint, or as the letter “e” followed by a “combining acute accent” (two codepoints).
Graphemes can be locale dependent.
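The two representations of “é” can be compared directly. This sketch assumes an Elixir version that accepts four-digit \u escapes in string literals.

```elixir
precomposed = "\u00E9"    # single codepoint: e with acute
combined    = "e\u0301"   # letter e followed by a combining acute accent

IO.puts length(String.codepoints(precomposed))   # prints 1
IO.puts length(String.codepoints(combined))      # prints 2

# Both are one grapheme, so String.length agrees.
IO.puts String.length(precomposed)   # prints 1
IO.puts String.length(combined)      # prints 1

# The underlying binaries differ, so a plain comparison is false.
IO.puts precomposed == combined      # prints false
```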
at(str, offset)
Returns the grapheme at the given offset (starting at 0). Negative offsets count from the end of the string.
iex> String.at("∂og", 0)
"∂"
iex> String.at("∂og", -1)
"g"
capitalize(str)
Converts str to lowercase,
and then capitalizes the first character.
iex> String.capitalize "école"
"École"
iex> String.capitalize "ÎÎÎÎÎ"
"Îîîîî"
codepoints(str)
Returns the codepoints in str.
iex> String.codepoints("José's ∂øg")
["J", "o", "s", "é", "'", "s", " ", "∂", "ø", "g"]
downcase(str)
Converts str to lowercase.
iex> String.downcase "ØRSteD"
"ørsted"
duplicate(str, n)
Returns a string containing n copies of str.
iex> String.duplicate "Ho! ", 3
"Ho! Ho! Ho! "
ends_with?(str, suffix | [ suffixes ])
True if str ends with any of the given suffixes.
iex> String.ends_with? "string", ["elix", "stri", "ring"]
true
first(str)
Returns the first grapheme from str.
iex> String.first "∂og"
"∂"
graphemes(str)
Returns the graphemes in the string. This is different from the codepoints function, which lists combining characters separately.
The following example uses a combining diaeresis along with the letter “e” to represent “ë”.
iex> String.codepoints "noe\u0308l"
["n", "o", "e", "¨", "l"]
iex> String.graphemes "noe\u0308l"
["n", "o", "ë", "l"]
last(str)
Returns the last grapheme from str.
iex> String.last "∂og"
"g"
length(str)
Returns the number of graphemes in str.
iex> String.length "∂x/∂y"
5
ljust(str, new_length, padding \\ " ")
Returns a new string, at least new_length characters long, containing str left-justified
and padded with padding.
iex> String.ljust("cat", 5)
"cat  "
lstrip(str)
Removes leading whitespace from str.
iex> String.lstrip "\t\f Hello\t\n"
"Hello\t\n"
lstrip(str, character)
Removes leading copies of character (an integer codepoint) from str.
iex> String.lstrip "!!!SALE!!!", ?!
"SALE!!!"
next_codepoint(str)
Splits str into its leading codepoint and the rest, or nil if str is empty. This may be used as the basis of an iterator.
iex> String.next_codepoint("∂og")
{"∂", "og"}
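One way such an iterator might look, as a sketch: the module name CodepointWalk and its to_list function are hypothetical, and the loop simply calls String.next_codepoint until it returns nil.

```elixir
defmodule CodepointWalk do
  # Hypothetical helper: collect the codepoints of str into a list
  # by calling String.next_codepoint/1 until it returns nil.
  def to_list(str) when is_binary(str) do
    walk(String.next_codepoint(str), [])
  end

  defp walk(nil, acc), do: Enum.reverse(acc)
  defp walk({codepoint, rest}, acc) do
    walk(String.next_codepoint(rest), [codepoint | acc])
  end
end

IO.inspect CodepointWalk.to_list("∂og")
# prints ["∂", "o", "g"]
```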
printable?(str)
Returns true if str contains only printable characters.
iex> String.printable? "José"
true
iex> String.printable? "\x{0000} a null"
false
replace(str, pattern, replacement, options \\ [global: true, insert_replaced: nil])
Replaces pattern with replacement in str under control of options.
If the :global option is true, all occurrences of the pattern are replaced; otherwise only the first is replaced.
If :insert_replaced is a number, the pattern is inserted into the replacement at that offset.
If the option is a list, it is inserted multiple times.
iex> String.replace "the cat on the mat", "at", "AT"
"the cAT on the mAT"
iex> String.replace "the cat on the mat", "at", "AT", global: false
"the cAT on the mat"
iex> String.replace "the cat on the mat", "at", "AT", insert_replaced: 0
"the catAT on the matAT"
iex> String.replace "the cat on the mat", "at", "AT", insert_replaced: [0,2]
"the catATat on the matATat"
reverse(str)
Reverses the graphemes in a string.
iex> String.reverse "pupils"
"slipup"
iex> String.reverse "∑ƒ÷∂"
"∂÷ƒ∑"
rjust(str, new_length, padding \\ 32)
Returns a new string, at least new_length characters long, containing str right-justified
and padded with padding.
iex> String.rjust("cat", 5, ?>)
">>cat"
rstrip(str)
Removes trailing whitespace from str.
iex> String.rstrip(" line \r\n")
" line"
rstrip(str, character)
Removes trailing occurrences of character from str.
iex> String.rstrip "!!!SALE!!!", ?!
"!!!SALE"
slice(str, offset, len)
Returns a len character substring starting at offset (measured from the end of str if negative).
iex> String.slice "the cat on the mat", 4, 3
"cat"
iex> String.slice "the cat on the mat", -3, 3
"mat"
starts_with?(str, prefix | [ prefixes ])
True if str starts with any of the given prefixes.
iex> String.starts_with? "string", ["elix", "stri", "ring"]
true
trim(str)
Trims leading and trailing whitespace from str.
iex> String.trim "\t Hello \r\n"
"Hello"
trim(str, character)
Trims leading and trailing instances of character from str.
iex> String.trim "!!!SALE!!!", "!"
"SALE"
split(str, pattern \\ nil, options \\ [global: true])
Splits str into substrings delimited by pattern. If :global is false, only one split is performed. pattern can be a string, a regular expression, or nil. In the last case, the string is split on whitespace.
iex> String.split " the cat on the mat "
["the", "cat", "on", "the", "mat"]
iex> String.split "the cat on the mat", "t"
["", "he ca", " on ", "he ma", ""]
iex> String.split "the cat on the mat", ~r{[ae]}
["th", " c", "t on th", " m", "t"]
iex> String.split "the cat on the mat", ~r{[ae]}, parts: 2
["th", " cat on the mat"]
upcase(str)
Converts str to uppercase.
iex> String.upcase "José Ørstüd"
"JOSÉ ØRSTÜD"
valid_character?(str)
Returns true if str is a single-character string
containing a valid codepoint.
iex> String.valid_character? "∂"
true
iex> String.valid_character? "∂og"
false
The first rule of binaries is
“if in doubt, specify the type of each field.”
Available types are binary, bits, bitstring, bytes, float, integer, utf8, utf16, and utf32.
Use hyphens to separate multiple attributes for a field:
<< length::unsigned-integer-size(12), flags::bitstring-size(4) >> = data
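The hyphenated form can be tried on a small made-up value. In this sketch the two-byte header, its contents, and the variable names len and flag_bits are all invented for illustration.

```elixir
# Hypothetical two-byte header: a 12-bit length followed by 4 flag bits.
data = << 300 :: unsigned-integer-size(12), 0b1010 :: size(4) >>

<< len :: unsigned-integer-size(12), flags :: bitstring-size(4) >> = data

IO.puts len                # prints 300
IO.puts bit_size(flags)    # prints 4

# The flags are still a bitstring; match again to read them as an integer.
<< flag_bits :: size(4) >> = flags
IO.puts flag_bits          # prints 10
```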
However, unless you’re doing a lot of work with binary file or protocol formats, the most common use of all this scary stuff is to process UTF-8 strings.
When we process lists, we use patterns that split the head from the rest of the list.
With binaries that hold strings, we can do the same kind of trick. We have to specify the type of the head (UTF-8), and make sure the tail remains a binary.
defp _each(<< head :: utf8, tail :: binary >>, func) do
  func.(head)
  _each(tail, func)
end

defmodule Utf8 do
  def each(str, func) when is_binary(str), do: _each(str, func)

  defp _each(<< head :: utf8, tail :: binary >>, func) do
    func.(head)
    _each(tail, func)
  end
  defp _each(<<>>, _func), do: []
end
Utf8.each "∂og", fn char -> IO.puts char end
# produces
8706
111
103
Rather than use [ head | tail ], we use
<< head::utf8, tail::binary >>. And rather than terminate when we reach the empty list, [], we look for an empty binary, <<>>.
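The same head-and-tail pattern can also build a result instead of calling a function for its side effects. The sketch below uses a hypothetical module, Utf8Sketch, that accumulates the codepoints into a list. It assumes a recent Elixir release for the charlists: :as_lists option, which keeps the list from being printed as text.

```elixir
defmodule Utf8Sketch do
  # Hypothetical variant of the traversal: accumulate the integer
  # codepoints in reverse, then flip the list at the empty binary.
  def codes(str) when is_binary(str), do: _codes(str, [])

  defp _codes(<< head :: utf8, tail :: binary >>, acc) do
    _codes(tail, [head | acc])
  end
  defp _codes(<<>>, acc), do: Enum.reverse(acc)
end

IO.inspect Utf8Sketch.codes("∂og"), charlists: :as_lists
# prints [8706, 111, 103]
```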