Elixir 1.6

Chapter 11 

Strings and Binaries

String Literals

Elixir has two kinds of string:
'single-quoted' and "double-quoted".
They differ in their internal representation
but have many things in common.

  • Strings can hold characters in UTF-8 encoding.
  • They may contain escape sequences.
  • They allow interpolation using the syntax #{...}.
  • Characters that would otherwise have special meaning can be escaped with a backslash.
  • They support heredocs.

String Literals

Escape sequences:

\a  BEL (0x07)    \b  BS  (0x08)    \d  DEL (0x7f)
\e  ESC (0x1b)    \f  FF  (0x0c)    \n  NL  (0x0a)
\r  CR  (0x0d)    \s  SP  (0x20)    \t  TAB (0x09)
\v  VT  (0x0b)    \xhh  2 hex digits
\uhhhh  4 hex digits    \u{h…}  1–6 hex digits

String Literals

Interpolation #{...}

iex> name = "dave"
"dave"
iex> "Hello, #{String.capitalize name}!"
"Hello, Dave!"

String Literals

The heredoc notation: triple the string delimiter (''' or """).

IO.puts "starting"
IO.write """
   my
   string
   """
IO.puts "ending"

# produces:
starting
my
string
ending

Heredocs are used extensively to add documentation to functions and modules.
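
As a minimal sketch of that use, the module below documents itself with heredocs. The module name Greeter and its hello/1 function are hypothetical, invented only for this illustration.

```elixir
defmodule Greeter do
  @moduledoc """
  A hypothetical module, used only to show heredoc documentation.
  """

  @doc """
  Returns a greeting for `name`, capitalizing it first.
  """
  def hello(name) do
    "Hello, #{String.capitalize(name)}!"
  end
end

IO.puts(Greeter.hello("dave"))
```

The @moduledoc and @doc attributes take ordinary strings, so the heredoc form is simply the convenient way to write multi-line text.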

Sigils

Like Ruby, Elixir has an
alternative syntax for some literals.

 

A sigil starts with a tilde,
followed by an upper or lowercase letter,
some delimited content, and perhaps some options.
The delimiters can be

<>, {}, [], (), ||, //, "", and ''.

 

Sigils

The letter determines the sigil’s type:

  • ~C   A character list with no escaping or interpolation
  • ~c   A character list, escaped and interpolated just like a single-quoted string
  • ~R   A regular expression with no escaping or interpolation
  • ~r   A regular expression, escaped and interpolated
  • ~S   A string with no escaping or interpolation
  • ~s   A string, escaped and interpolated just like a double-quoted string
  • ~W   A list of whitespace-delimited words, with no escaping or interpolation
  • ~w   A list of whitespace-delimited words, with escaping and interpolation

Sigils

iex> ~C[1\n2#{1+2}]
'1\\n2\#{1+2}'

iex> ~c"1\n2#{1+2}"
'1\n23'

iex> ~S[1\n2#{1+2}]
"1\\n2\#{1+2}"

iex> ~s/1\n2#{1+2}/
"1\n23"

iex> ~W[the c#{'a'}t sat on the mat]
["the", "c\#{'a'}t", "sat", "on", "the", "mat"]

iex> ~w[the c#{'a'}t sat on the mat]
["the", "cat", "sat", "on", "the", "mat"]

Sigils

iex> ~w[the c#{'a'}t sat on the mat]a
[:the, :cat, :sat, :on, :the, :mat]

iex> ~w[the c#{'a'}t sat on the mat]c
['the', 'cat', 'sat', 'on', 'the', 'mat']

iex> ~w[the c#{'a'}t sat on the mat]s
["the", "cat", "sat", "on", "the", "mat"]

The ~W and ~w sigils take an optional type specifier,
a, c, or s, which determines whether they return
atoms, character lists, or strings.

Elixir does not check the nesting of delimiters, so the sigil ~s{a{b} is the three-character string a{b.

Sigils

iex> ~w"""
...> the
...> cat
...> sat
...> """
["the", "cat", "sat"]

If the opening delimiter is three single or three double quotes, the sigil is treated as a heredoc.

If you want to specify modifiers with heredoc sigils (most commonly you’d do this with ~r), add them after the trailing delimiter.

iex> ~r"""
...> hello
...> """i
~r/hello\n/i

The Name “strings”

In Elixir, the convention is that we call only
double-quoted strings “strings.” 
The single-quoted form is a character list.

 

The single and double-quoted forms are very different, and libraries that work on strings work only on the double-quoted form.
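
A short sketch of the difference, and of converting between the two forms. The variable names dqs and sqs are only illustrative; to_string/1 and String.to_charlist/1 are the standard conversions.

```elixir
dqs = "wombat"   # double-quoted: a string (a binary)
sqs = 'wombat'   # single-quoted: a character list

IO.inspect(is_binary(dqs))                  # true
IO.inspect(is_list(sqs))                    # true
IO.inspect(dqs == sqs)                      # false: different representations

# Convert before handing a character list to the String module.
IO.inspect(String.upcase(to_string(sqs)))   # "WOMBAT"
IO.inspect(String.to_charlist(dqs) == sqs)  # true
```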

Single-Quoted Strings—
Lists of Character Codes

Single-quoted strings are represented as a list of integer values, each value corresponding to a codepoint in the string. We refer to them as character lists (or char lists).

iex> str = 'wombat'
'wombat'
iex> is_list str
true
iex> length str
6
iex> Enum.reverse str
'tabmow'

This is confusing: iex says it is a list, but it shows the value as a string.
iex prints a list of integers as a string if it believes
each number in the list is a printable character.

Single-Quoted Strings—
Lists of Character Codes

​iex>​ [ 67, 65, 84 ]
'CAT'
iex> str = 'wombat'
'wombat'
iex> :io.format "~w~n", [ str ]
[119,111,109,98,97,116]
:ok
iex> List.to_tuple str
{119, 111, 109, 98, 97, 116}
iex> str ++ [0]
[119, 111, 109, 98, 97, 116, 0]

The ~w in the format string forces str to be written as an Erlang term—the underlying list of integers.
The ~n is a newline.

str ++ [0] creates a new character list with a null byte at the end. iex no longer thinks all the bytes are printable, and so returns the underlying character codes.
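
Another way to see the underlying codes, sketched here on the assumption of Elixir 1.6 or later, is to pass the charlists: :as_lists option to inspect/2. List.ascii_printable?/1 performs a printability check similar to the one described above.

```elixir
str = 'wombat'

# Ask inspect to skip the "looks printable" heuristic.
IO.puts(inspect(str, charlists: :as_lists))
# prints [119, 111, 109, 98, 97, 116]

IO.inspect(List.ascii_printable?(str))          # true
IO.inspect(List.ascii_printable?(str ++ [0]))   # false: 0 is not printable
```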

Single-Quoted Strings—
Lists of Character Codes

iex> '∂x/∂y'
[8706, 120, 47, 8706, 121]
iex> 'pole' ++ 'vault'
'polevault'
iex> 'pole' -- 'vault'
'poe'
iex> List.zip [ 'abc', '123' ]
[{97, 49}, {98, 50}, {99, 51}]
iex> [ head | tail ] = 'cat'
'cat'
iex> head
99
iex> tail
'at'
iex> [ head | tail ]
'cat'

Why is the head of 'cat' 99 and not c? A char list is just a list of integer character codes,
so each individual entry is a number.
It happens that 99 is the code for a lowercase c.

Single-Quoted Strings—
Lists of Character Codes

defmodule Parse do

  def number([ ?- | tail ]), do: _number_digits(tail, 0) * -1
  def number([ ?+ | tail ]), do: _number_digits(tail, 0)
  def number(str),           do: _number_digits(str,  0)

  defp _number_digits([], value), do: value
  defp _number_digits([ digit | tail ], value)
  when digit in '0123456789' do
    _number_digits(tail, value*10 + digit - ?0)
  end
  defp _number_digits([ non_digit | _ ], _) do
    raise "Invalid digit '#{[non_digit]}'"
  end
end

The notation ?c returns the integer code for the character c. This is often useful when employing patterns to extract information from character lists. 
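
A small sketch of ?c used in guards. The CharCodes module and its function names are hypothetical helpers, not part of the Parse module above.

```elixir
defmodule CharCodes do
  # True when the integer code is an ASCII decimal digit.
  def digit?(c) when c in ?0..?9, do: true
  def digit?(_), do: false

  # Converts a digit's character code to its numeric value.
  def digit_value(c) when c in ?0..?9, do: c - ?0
end

IO.inspect(?a)                          # 97
IO.inspect(CharCodes.digit?(?7))        # true
IO.inspect(CharCodes.digit_value(?7))   # 7

# Keep only the digit codes from a character list.
digits = Enum.filter('a1b2c3', &CharCodes.digit?/1)
IO.inspect(digits == '123')             # true
```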

Single-Quoted Strings—
Lists of Character Codes

iex> c("parse.exs")
[Parse]
iex> Parse.number('123')
123
iex> Parse.number('-123')
-123
iex> Parse.number('+123')
123
iex> Parse.number('+9')
9
iex> Parse.number('+a')
** (RuntimeError) Invalid digit 'a'

A simple module that parses the character-list representation of an optionally signed decimal number.

Binaries

iex> b = << 1, 2, 3 >>
<<1, 2, 3>>
iex> byte_size b
3
iex> bit_size b
24
# you can also specify bit size
iex> b = << 1::size(2), 1::size(3) >>    # 01 001
<<9::size(5)>>                           # = 9 (base 10)
iex> byte_size b
1
iex> bit_size b
5

The binary type represents a sequence of bits.

A binary literal looks like << term,… >>.

The simplest term is just a number from 0 to 255. The numbers are stored as successive bytes in the binary.
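
A brief sketch of how byte-sized terms behave. The values are chosen only for illustration.

```elixir
b = <<1, 2, 3>>

# Binaries concatenate with <>.
IO.inspect(b == <<1, 2>> <> <<3>>)       # true

# Each term is one byte; convert to a list to see them.
IO.inspect(:binary.bin_to_list(b))       # [1, 2, 3]

# A term defaults to 8 bits, so larger integers keep only their low byte.
IO.inspect(<<256>> == <<0>>)             # true

# Widen the field to hold the full value (big-endian by default).
IO.inspect(<<256::size(16)>>)            # <<1, 0>>
```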

Binaries

iex> int = << 1 >>
<<1>>
iex> float = << 2.5 :: float >>
<<64, 4, 0, 0, 0, 0, 0, 0>>
iex> mix = << int :: binary, float :: binary >>
<<1, 64, 4, 0, 0, 0, 0, 0, 0>>

# IEEE 754 float has a sign bit, 11 bits of exponent, 
# and 52 bits of mantissa. 
# The exponent is biased by 1023, 
# and the mantissa is a fraction with the top bit assumed to be 1

iex> << sign::size(1), exp::size(11), mantissa::size(52) >> = << 3.14159::float >>
iex> (1 + mantissa / :math.pow(2, 52)) * :math.pow(2, exp-1023)
3.14159

You can store integers, floats,
and other binaries in binaries.
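
The float decoding shown in the iex session can be wrapped in a function. This is a sketch only: the Ieee754 module name is hypothetical, and it assumes a normal value (not zero, subnormal, infinity, or NaN). Unlike the session above, it also applies the sign bit.

```elixir
defmodule Ieee754 do
  # Rebuilds a float from its sign, exponent, and mantissa fields.
  # Assumes a normal 64-bit float; special values are not handled.
  def decode(float) when is_float(float) do
    <<sign::size(1), exp::size(11), mantissa::size(52)>> = <<float::float>>

    # Implicit leading 1, then scale by the unbiased exponent.
    magnitude = (1 + mantissa / :math.pow(2, 52)) * :math.pow(2, exp - 1023)

    if sign == 1, do: -magnitude, else: magnitude
  end
end

IO.inspect(Ieee754.decode(3.14159))
IO.inspect(Ieee754.decode(-2.5))
```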

Double-Quoted Strings Are Binaries

The contents of a double-quoted string (dqs) are stored as a consecutive sequence of bytes in UTF-8 encoding. This has two implications.

  1. Because UTF-8 characters can take more than a single byte to represent, the size of the binary is not necessarily the length of the string.
  2. Because you're no longer using lists, you need to learn and work with the binary syntax alongside the list syntax in your code.

Double-Quoted Strings Are Binaries

iex> dqs = "∂x/∂y"
"∂x/∂y"
iex> String.length dqs
5
iex> byte_size dqs
9
iex> String.at(dqs, 0)
"∂"
iex> String.codepoints(dqs)
["∂", "x", "/", "∂", "y"]
iex> String.split(dqs, "/")
["∂x", "∂y"]

Strings and Elixir Libraries

When Elixir library documentation uses the word string (and most of the time it uses the word binary),
it means double-quoted strings.

The String module defines functions that work with double-quoted strings.

 

A single grapheme can consist of multiple codepoints that readers perceive as a single character. For example, the “é” grapheme can be represented either as a single “e with acute” codepoint, or as the letter “e” followed by a “combining acute accent” (two codepoints).
Graphemes can be locale dependent.
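
A sketch of the two representations of “é”. The variable names are illustrative; String.normalize/2 is used on the assumption that NFC composition is the desired canonical form.

```elixir
composed   = "\u00E9"     # single "e with acute" codepoint
decomposed = "e\u0301"    # "e" followed by a combining acute accent

# Both are one grapheme...
IO.inspect(String.length(composed))               # 1
IO.inspect(String.length(decomposed))             # 1

# ...but they differ in codepoints and bytes.
IO.inspect(length(String.codepoints(decomposed))) # 2
IO.inspect(byte_size(composed))                   # 2
IO.inspect(byte_size(decomposed))                 # 3

# They compare unequal until normalized to the same form.
IO.inspect(composed == decomposed)                          # false
IO.inspect(String.normalize(decomposed, :nfc) == composed)  # true
```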

Strings and Elixir Libraries

at(str, offset)
Returns the grapheme at the given offset (starting at 0). Negative offsets count from the end of the string.

iex> String.at("∂og", 0)
"∂"
iex> String.at("∂og", -1)
"g"

capitalize(str)
Converts str to lowercase, and then capitalizes the first character.

iex> String.capitalize "école"
"École"
iex> String.capitalize "ÎÎÎÎÎ"
"Îîîîî"

Strings and Elixir Libraries

codepoints(str)
Returns the codepoints in str.

iex> String.codepoints("José's ∂øg")
["J", "o", "s", "é", "'", "s", " ", "∂", "ø", "g"]

downcase(str)
Converts str to lowercase.

iex> String.downcase "ØRSteD"
"ørsted"

Strings and Elixir Libraries

duplicate(str, n)
Returns a string containing n copies of str.

iex> String.duplicate "Ho! ", 3
"Ho! Ho! Ho! "

ends_with?(str, suffix | [ suffixes ])
True if str ends with any of the given suffixes.

iex> String.ends_with? "string", ["elix", "stri", "ring"]
true

Strings and Elixir Libraries

first(str)
Returns the first grapheme from str.

iex> String.first "∂og"
"∂"

Strings and Elixir Libraries

graphemes(str)
Returns the graphemes in the string. This is different from the codepoints function, which lists combining characters separately.
The following example uses a combining diaeresis along with the letter “e” to represent “ë”.

iex> String.codepoints "noe\u0308l"
["n", "o", "e", "¨", "l"]
iex> String.graphemes "noe\u0308l"
["n", "o", "ë", "l"]

Strings and Elixir Libraries

last(str)
Returns the last grapheme from str.

iex> String.last "∂og"
"g"

length(str)
Returns the number of graphemes in str.

iex> String.length "∂x/∂y"
5

Strings and Elixir Libraries

ljust(str, new_length, padding \\ 32)
Returns a new string, at least new_length characters long, containing str left-justified
and padded with padding (an integer codepoint). Deprecated since Elixir 1.3 in favor of String.pad_trailing/3.

iex> String.ljust("cat", 5)
"cat  "

lstrip(str)
Removes leading whitespace from str. Deprecated since Elixir 1.3 in favor of String.trim_leading/1.

iex> String.lstrip "\t\f     Hello\t\n"
"Hello\t\n"

Strings and Elixir Libraries

lstrip(str, character)
Removes leading copies of character (an integer codepoint) from str. Deprecated since Elixir 1.3 in favor of String.trim_leading/2, which takes the character as a string.

iex> String.lstrip "!!!SALE!!!", ?!
"SALE!!!"

next_codepoint(str)
Splits str into its leading codepoint and the rest, or returns nil if str is empty. This may be used as the basis of an iterator.

iex> String.next_codepoint("∂og")
{"∂", "og"}
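
A minimal sketch of such an iterator, counting codepoints one at a time. The CodepointWalk module is hypothetical.

```elixir
defmodule CodepointWalk do
  # Counts the codepoints in str by repeatedly taking the leading one.
  def count(str) when is_binary(str) do
    walk(String.next_codepoint(str), 0)
  end

  # nil means the string is exhausted.
  defp walk(nil, n), do: n
  defp walk({_codepoint, rest}, n), do: walk(String.next_codepoint(rest), n + 1)
end

IO.inspect(CodepointWalk.count("∂og"))   # 3
IO.inspect(CodepointWalk.count(""))      # 0
```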

Strings and Elixir Libraries

printable?(str)
Returns true if str contains only printable characters.

iex> String.printable? "José"
true
iex> String.printable? "\x{0000} a null"
false

Strings and Elixir Libraries

replace(str, pattern, replacement,
options \\ [global: true, insert_replaced: nil])

Replaces pattern with replacement
in str under control of options.

If the :global option is true, all occurrences of the pattern are replaced; otherwise only the first is replaced.

If :insert_replaced is a number, the pattern is inserted into the replacement at that offset.
If the option is a list, it is inserted multiple times.

Strings and Elixir Libraries

replace(str, pattern, replacement, options \\ [global: true, insert_replaced: nil])

iex> String.replace "the cat on the mat", "at", "AT"
"the cAT on the mAT"
iex> String.replace "the cat on the mat", "at", "AT", global: false
"the cAT on the mat"
iex> String.replace "the cat on the mat", "at", "AT", insert_replaced: 0
"the catAT on the matAT"
iex> String.replace "the cat on the mat", "at", "AT", insert_replaced: [0,2]
"the catATat on the matATat"

Strings and Elixir Libraries

reverse(str)
Reverses the graphemes in a string.

iex> String.reverse "pupils"
"slipup"
iex> String.reverse "∑ƒ÷∂"
"∂÷ƒ∑"

rjust(str, new_length, padding \\ 32)
Returns a new string, at least new_length characters long, containing str right-justified
and padded with padding (an integer codepoint). Deprecated since Elixir 1.3 in favor of String.pad_leading/3.

iex> String.rjust("cat", 5, ?>)
">>cat"

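Strings and Elixir Libraries

rstrip(str)
Removes trailing whitespace from str. Deprecated since Elixir 1.3 in favor of String.trim_trailing/1.

iex> String.rstrip(" line \r\n")
" line"

rstrip(str, character)
Removes trailing occurrences of character (an integer codepoint) from str. Deprecated since Elixir 1.3 in favor of String.trim_trailing/2.

iex> String.rstrip "!!!SALE!!!", ?!
"!!!SALE"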

Strings and Elixir Libraries

slice(str, offset, len)
Returns a len-character substring starting at offset (measured from the end of str if negative).

iex> String.slice "the cat on the mat", 4, 3
"cat"
iex> String.slice "the cat on the mat", -3, 3
"mat"

starts_with?(str, prefix | [ prefixes ])
True if str starts with any of the given prefixes.

iex> String.starts_with? "string", ["elix", "stri", "ring"]
true

Strings and Elixir Libraries

trim(str)
Trims leading and trailing whitespace from str.

iex> String.trim "\t  Hello   \r\n"
"Hello"

trim(str, character)
Trims leading and trailing instances of character (given as a string) from str.

iex> String.trim "!!!SALE!!!", "!"
"SALE"

Strings and Elixir Libraries

split(str, pattern, options \\ [])
Splits str into substrings delimited by pattern, which can be a string or a regular expression. When called with only str, the string is split on whitespace. The :parts option limits the number of substrings returned.

iex> String.split "   the cat on the mat   "
["the", "cat", "on", "the", "mat"]
iex> String.split "the cat on the mat", "t"
["", "he ca", " on ", "he ma", ""]
iex> String.split "the cat on the mat", ~r{[ae]}
["th", " c", "t on th", " m", "t"]
iex> String.split "the cat on the mat", ~r{[ae]}, parts: 2
["th", " cat on the mat"]

Strings and Elixir Libraries

upcase(str)
Converts str to uppercase.

iex> String.upcase "José Ørstüd"
"JOSÉ ØRSTÜD"

valid_character?(str)
Returns true if str is a single-character string
containing a valid codepoint. Deprecated in favor of String.valid?/1.

iex> String.valid_character? "∂"
true
iex> String.valid_character? "∂og"
false

Binaries and Pattern Matching

The first rule of binaries is
“if in doubt, specify the type of each field.”
Available types are binary, bits, bitstring, bytes, float, integer, utf8, utf16, and utf32.

Use hyphens to separate multiple attributes for a field:

<< length::unsigned-integer-size(12), flags::bitstring-size(4) >> = data

However, unless you’re doing a lot of work with binary file or protocol formats, the most common use of all this scary stuff is to process UTF-8 strings.
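
For the binary-format case, the length-and-flags pattern can be tried on a concrete value. This is a sketch under assumptions: the Header module and the 16-bit layout (a 12-bit length followed by 4 flag bits) are hypothetical, chosen only to match that pattern.

```elixir
defmodule Header do
  # Splits a 16-bit header into a 12-bit unsigned length and 4 flag bits.
  def parse(<<length::unsigned-integer-size(12), flags::bitstring-size(4)>>) do
    {length, flags}
  end
end

# 0x0A 0xBC is 0000 1010 1011 1100 in binary:
# the first 12 bits are 0x0AB (171), the last 4 bits are 1100.
{len, flags} = Header.parse(<<0x0A, 0xBC>>)
IO.inspect(len)     # 171
IO.inspect(flags)   # <<12::size(4)>>

# The flags field is itself a bitstring and can be matched further.
<<a::size(1), b::size(1), c::size(1), d::size(1)>> = flags
IO.inspect({a, b, c, d})   # {1, 1, 0, 0}
```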

String Processing with Binaries

When we process lists, we use patterns that split the head from the rest of the list.
With binaries that hold strings, we can do the same kind of trick. We have to specify the type of the head (UTF-8), and make sure the tail remains a binary.

 

defp _each(<< head :: utf8, tail :: binary >>, func) do
  func.(head)
  _each(tail, func)
end

String Processing with Binaries

defmodule Utf8 do
  def each(str, func) when is_binary(str), do: _each(str, func)

  defp _each(<< head :: utf8, tail :: binary >>, func) do
    func.(head)
    _each(tail, func)
  end

  defp _each(<<>>, _func), do: []
end

Utf8.each "∂og", fn char -> IO.puts char end

# produces
8706
111
103

Rather than use [ head | tail ], we use
<< head::utf8, tail::binary >>. And rather than terminate when we reach the empty list, [], we look for an empty binary, <<>>.
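
The same head-and-tail pattern can build a result instead of calling a function for its side effects. The sketch below collects the codepoints into a list. The Utf8Sketch module name is hypothetical, and a binary that is not valid UTF-8 would match neither clause and raise a FunctionClauseError.

```elixir
defmodule Utf8Sketch do
  # Returns the integer codepoints of a UTF-8 string, in order.
  def codepoints(str) when is_binary(str), do: do_codepoints(str, [])

  # Peel off one UTF-8 codepoint and recurse on the remaining binary.
  defp do_codepoints(<<head::utf8, tail::binary>>, acc) do
    do_codepoints(tail, [head | acc])
  end

  # The empty binary ends the recursion; the accumulator was built backwards.
  defp do_codepoints(<<>>, acc), do: Enum.reverse(acc)
end

IO.inspect(Utf8Sketch.codepoints("∂og"), charlists: :as_lists)
# prints [8706, 111, 103]
```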

Thank you!

Programming Elixir 1.6 Chapter 11

By Dustin McCraw
