Performing concurrent fuzzy matches with Postgres and GenStage
Maryna Shabalina, Edenlab
marinakrasnovatr81@gmail.com
Elixir/Erlang developer at Edenlab
Problem
- There can be many similar records in the database that identify the same person's account
- Each account record costs money
Problem
- No additional infrastructure allowed
- 30 million records
- Production database performance must not be degraded during the calculations
Шварценеґґер, 5433224760, +38093000000,
ЕН617561, Таль
Швартсніггер, 5433224760, +380930000000,
ЕН617567; м. Таль
Шфортцнєґер, 5433224769, NULL,
67561, Thal
PERSON CLONES
Same person, several records
More prosaic examples
Мельнік ― Мельник
Новохатня ― Новокхатняя
І-239407 ― 1-239407, 239407
- Numbers on old paper documents
1 ― 7
0 ― 9
В ― 8
- Birth date input forms are often inconvenient
Why will the usual methods for finding duplicates/fraud not work?
- activity analysis: the accounts do nothing
- the accounts are created via asynchronous 3rd-party software
Person / presumable clone
similarity score for the pair: 0..1 (e.g. = 0.8783456)
Дзигановський Юрій, 074338611, +380..9,
2016-04-05, ...
Дзекановський Юра, 1074338611, +380..9,
2019-04-05, ...
Model: each unverified Record#N is compared against a verified✓ record and marked as "clone, deactivate" or "similar".
Deduplication process for one record
Elixir + Postgres production database
30,000,000 records with duplicates
A few days to find and deactivate the duplicated records
Processing a record:
unverified record -> get candidates -> count similarity -> store results if necessary -> mark record as counted✓

candidates
|> Task.async_stream(&count_pair_weight(person, &1))
|> async_stream_filter()
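A minimal Elixir sketch of this per-record step; `Dedup.RecordProcessor`, the 0.7 threshold and the stubbed helpers are assumptions for illustration, not the production code:

defmodule Dedup.RecordProcessor do
  # Hypothetical threshold; the real model returns a similarity score in 0..1.
  @score_threshold 0.7

  # Process one unverified record: get candidates, score each pair
  # concurrently, keep only the pairs above the threshold, store them.
  def process(person) do
    person
    |> get_candidates()
    |> Task.async_stream(&count_pair_weight(person, &1), max_concurrency: 20)
    |> Stream.filter(fn
      {:ok, {_candidate, score}} -> score >= @score_threshold
      _error -> false
    end)
    |> Enum.map(fn {:ok, pair} -> pair end)
    |> store_results(person)
  end

  # Stubs standing in for the real candidate search, scoring model and storage.
  defp get_candidates(_person), do: []
  defp count_pair_weight(_person, candidate), do: {candidate, 0.0}
  defp store_results(pairs, _person), do: pairs
end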
Getting candidates
- all to all: ~900,000,000,000,000 pairs to process
- with 0.02 seconds to calculate the score for one pair:
  in 1 process: ~570,776 years (900,000,000,000,000 * 0.02 / (60*60*24*365))
  in 20,000 parallel processes: ~28.5 years
All to all
Waiting until an all-to-all total check for duplicates is ready is not an option:
this step should take milliseconds.
Get candidates for a record:
- exact match on a few key fields
- fuzzy match by key identifiers
Get only the records that are guaranteed to be similar in advance.
Exact match
For {tax_id: 1234567890, documents: [{number: 1111}]} the candidates are selected from PERSONS by the same tax_id and from DOCUMENTS by the same number, joined through person_id/id.
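A SQL sketch of this exact-match candidate lookup; the person_documents table name follows the later slides, and the concrete values are just the example data above:

-- Candidates whose tax_id or document number exactly matches the checked record
SELECT DISTINCT p.*
FROM persons p
LEFT JOIN person_documents d ON d.person_id = p.id
WHERE p.tax_id = '1234567890'
   OR d.number = '1111';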
Fuzzy match by key identifiers
- approximately equal name sound (Ukrainian double metaphone)
- "typo" mistakes
- distance of 1-2 symbols in key fields
Double metaphone
The algorithm produces a sign of a word by normalizing similar-sounding letters:
С, Ц, ТС, ТЦ -> С (S)
З + vowel -> С (S)
Д + vowel -> Т (T)
В + vowel -> Ф (F)
Б + vowel -> П (P)
КХ + vowel -> Х (KH)
Popular last name endings -> symbol
Double metaphone examples
КВРТ@ Ковратенко, Гавритенко, Гаврутенко
ПК# Паковська, Бугаєвська, Бегійовська, Буяковська
КЛН& Кліницький, Гуляницький, Каленський, Каленські
РПК Рибяк, Рибак, Рип'як, Арабок, Ріпак
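A toy Elixir sketch of such a sign function; the rules below are a tiny illustrative subset and the module name is hypothetical, not the real Ukrainian double metaphone implementation:

defmodule Dedup.Metaphone do
  # Illustrative subset of the normalization rules shown above.
  @pairs [{"б", "п"}, {"в", "ф"}, {"д", "т"}, {"з", "с"}, {"ц", "с"}, {"кх", "х"}]

  def sign(word) do
    word
    |> String.downcase()
    |> normalize_consonants()
    # drop vowels, apostrophes and hyphens, keep the consonant skeleton
    |> String.replace(~r/[аеєиіїоуюя'’-]/u, "")
    |> String.upcase()
  end

  defp normalize_consonants(word),
    do: Enum.reduce(@pairs, word, fn {from, to}, acc -> String.replace(acc, from, to) end)
end

# Dedup.Metaphone.sign("Рибак")  #=> "РПК"
# Dedup.Metaphone.sign("Рип'як") #=> "РПК"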
GenStage processes
Producer: gets a batch of unverified records, marks the records as processing and emits them to the consumers.
Consumer 1..N: consumes a record and processes it:
unverified record -> get candidates -> count similarity -> store results if necessary -> mark record as counted✓
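A minimal GenStage sketch of this pipeline; the module names, batch stub and demand values are assumptions for illustration, not the talk's production code:

defmodule Dedup.Producer do
  use GenStage

  def start_link(opts), do: GenStage.start_link(__MODULE__, opts, name: __MODULE__)

  def init(_opts), do: {:producer, :no_state}

  # React to consumer demand: fetch a batch of unverified records
  # and mark them as processing in the same step.
  def handle_demand(demand, state) when demand > 0 do
    records = fetch_and_mark_as_processing(demand)
    {:noreply, records, state}
  end

  # Placeholder for the real DB query that also stores the last
  # processed attribute/offset outside the table.
  defp fetch_and_mark_as_processing(limit), do: Enum.map(1..limit, &%{id: &1})
end

defmodule Dedup.Consumer do
  use GenStage

  def start_link(opts), do: GenStage.start_link(__MODULE__, opts)

  def init(_opts), do: {:consumer, :no_state, subscribe_to: [{Dedup.Producer, max_demand: 10}]}

  # Each record goes through: get candidates -> count similarity ->
  # store results if necessary -> mark record as counted.
  def handle_events(records, _from, state) do
    Enum.each(records, fn _record -> :ok end)
    {:noreply, [], state}
  end
end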
PostgreSQL database performance tips for deduplication
- Never store the processing mark status as a column
- Store the last processed record attribute and the offset (if the attribute is not unique) separately, on disk with fast access
Use the index, Luke!
SELECT * FROM persons
WHERE updated_at + interval '7 days' < now()
-- to the planner the left-hand side is an opaque expression, something like:
-- ∫(μ(updated_at)y('7 days'))′dt < now()
"persons_updated_at_index" btree (updated_at)
The index will not be used.
Tips for using simple indexes
SELECT * FROM persons
WHERE updated_at < now() - interval '7 days'
... OFFSET ..
"persons_updated_at_index" btree (updated_at)
SELECT * FROM person_documents
WHERE
regexp_replace(
number, '[^[:digit:]]', '', 'g') = '697569'
"documents_number_index" btree
(regexp_replace(number, '[^[:digit:]]', '', 'g'))
SELECT * FROM person_documents
WHERE number = '697569'
Using indexes built with a function
The left-hand side of the condition on the indexed field must be exactly the expression the index was built with.
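For reference, a sketch of how such an expression index could be created (the name and table follow the slides above):

-- The expression in the index definition must match the query's
-- left-hand side exactly for the planner to use it.
CREATE INDEX documents_number_index
    ON person_documents (regexp_replace(number, '[^[:digit:]]', '', 'g'));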
Partial indexes
CREATE INDEX pd_passport_indx ON person_documents (lower(number))
WHERE type = 'PASSPORT';
CREATE INDEX pd_national_id_indx ON person_documents (lower(number))
WHERE type = 'NATIONAL_ID';
... 5 more types ...
person_documents | person_documents_number_index | 1156 MB
person_documents | birth_cert_l_number_index | 131 MB
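A query sketch that can use such a partial index; both the lower(number) expression and the WHERE type = ... predicate have to match the index definition (the value is illustrative):

-- Matches pd_passport_indx: the same expression and the same partial predicate
SELECT *
FROM person_documents
WHERE type = 'PASSPORT'
  AND lower(number) = 'ен617561';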
Simplest case: 1 to N relation
persons 1 -> N person_documents (one person has many documents)
INDEXES:
persons (birth_date)
person_documents (number)
Make Postgres choose the best query plan
- Use 'JOIN' conditions instead of 'WHERE' conditions
Make Postgres choose the best query plan
explain analyze SELECT p0.*
FROM "persons" AS p0
INNER JOIN "person_documents" AS p1
ON p1."person_id" = p0."id"
WHERE
(p0."birth_date" = '1991-08-24') AND (p1."type" = 'PASSPORT') AND (lower(p1."number")) = 'XXXXXX'
GROUP BY p0."id" ORDER BY p0."inserted_at" DESC LIMIT 11 OFFSET 0;
Limit (cost=11598.42..11598.45 rows=11 width=703) (actual time=5680.152..5680.152 rows=1 loops=1)
-> Sort (cost=11598.42..11598.45 rows=11 width=703) (actual time=5680.150..5680.150 rows=1 loops=1)
Sort Key: p0.inserted_at DESC
Sort Method: quicksort Memory: 26kB
-> Group (cost=11598.18..11598.23 rows=11 width=703) (actual time=5680.099..5680.101 rows=1 loops=1)
Group Key: p0.id
-> Sort (cost=11598.18..11598.21 rows=11 width=703) (actual time=5680.089..5680.089 rows=1 loops=1)
Sort Key: p0.id
Sort Method: quicksort Memory: 26kB
-> Nested Loop (cost=1.12..11597.99 rows=11 width=703) (actual time=2475.272..5680.054 rows=1 loops=1)
-> Index Scan using persons_birth_date_index on persons p0 (cost=0.56..3325.90 rows=2966 width=703) (actual time=0.657..2162.997 rows=3402 loops=1)
Index Cond: (birth_date = '1940-01-01'::date)
Filter: is_active
-> Index Scan using person_documents_person_id_index on person_documents p1 (cost=0.56..2.78 rows=1 width=16) (actual time=1.032..1.032 rows=0 loops=3402)
Index Cond: (person_id = p0.id)
Filter: ((type = 'PASSPORT'::text) AND ((lower(p1."number")) = 'XXXXXX'::text))
Rows Removed by Filter: 1
Planning time: 61.766 ms
Execution time: 5680.474 ms
With JOIN on ids and WHERE on the search conditions the query is very slow even with correct indexes:
too many records match the frequently distributed birth_date condition.
Make Postgres choose the best query plan
explain analyze SELECT p0.*
FROM persons AS p0
INNER JOIN person_documents AS p1
ON p1.person_id = p0.id
AND p1.type = 'BIRTH_CERTIFICATE' AND regexp_replace(number,'[^[:digit:]]','','g') = '1'
WHERE p0."birth_date" = '2012-03-14' ORDER BY p0."inserted_at" DESC LIMIT 11 OFFSET 0;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=58.18..58.18 rows=1 width=540) (actual time=4.605..4.605 rows=0 loops=1)
-> Sort (cost=58.18..58.18 rows=1 width=540) (actual time=4.603..4.603 rows=0 loops=1)
Sort Key: p0.inserted_at DESC
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=1.12..58.17 rows=1 width=540) (actual time=4.594..4.594 rows=0 loops=1)
-> Index Scan using person_documents_number_index on person_documents p1 (cost=0.56..38.62 rows=7 width=16) (actual time=4.592..4.592 rows=0 loops=1)
Index Cond: (regexp_replace(number, '[^[:digit:]]'::text, ''::text, 'g'::text) = '1'::text)
Filter: (type = 'BIRTH_CERTIFICATE'::text)
-> Index Scan using persons_tmp_pkey on persons p0 (cost=0.56..2.78 rows=1 width=540) (never executed)
Index Cond: (id = p1.person_id)
Filter: (birth_date = '2012-03-14'::date)
Planning time: 1.161 ms
Execution time: 4.679 ms
(13 rows)
With the join on the search condition:
the index built with the search expression (regexp) on documents, whose values have a rare distribution, was used.
Make Postgres choose the best query plan
explain analyze SELECT p0.*
FROM "persons" AS p0 INNER JOIN
(SELECT DISTINCT p0."id" AS "id"
FROM "persons" AS p0
INNER JOIN "person_documents" AS p1 ON
(p1."person_id" = p0."id") AND (((p1."type" = 'NATIONAL_ID')
AND lower(p1.number) = '20120314'))) AS s1 ON p0."id" = s1."id"
WHERE ((p0."birth_date" = '2012-03-14') AND (p0."status" = 'active')) AND (lower(p0."last_name") = 'селедцов') AND (p0."is_active")
ORDER BY p0."inserted_at" DESC LIMIT 11 OFFSET 0;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=19263.84..19263.84 rows=1 width=540) (actual time=0.046..0.046 rows=0 loops=1)
-> Sort (cost=19263.84..19263.84 rows=1 width=540) (actual time=0.045..0.045 rows=0 loops=1)
Sort Key: p0.inserted_at DESC
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=19040.51..19263.83 rows=1 width=540) (actual time=0.030..0.030 rows=0 loops=1)
Join Filter: (p0.id = p0_1.id)
-> Index Scan using auth_method_search_idx on persons p0 (cost=0.69..2.91 rows=1 width=540) (actual time=0.030..0.030 rows=0 loops=1)
Index Cond: ((birth_date = '2012-03-14'::date) AND (lower((last_name)::text) = 'селедцов'::text))
Filter: (is_active AND ((status)::text = 'active'::text))
-> HashAggregate (cost=19039.82..19107.85 rows=6803 width=16) (never executed)
Group Key: p0_1.id
-> Nested Loop (cost=0.99..19022.81 rows=6803 width=16) (never executed)
-> Index Scan using national_id_l_number_index on person_documents p1 (cost=0.43..7590.64 rows=6803 width=16) (never executed)
Index Cond: (lower(number) = '20120314'::text)
-> Index Only Scan using persons_tmp_pkey on persons p0_1 (cost=0.56..1.67 rows=1 width=16) (never executed)
Index Cond: (id = p1.person_id)
Heap Fetches: 0
Planning time: 2.402 ms
Execution time: 0.307 ms
(19 rows)
With a join to a subquery on the search condition:
the subquery is an exact match for the best index, so there is no way for Postgres to pick another query plan :)
Multicolumn indexes
- The order of fields when building the index matters
- The field with the rarest distribution of repeats goes first
CREATE INDEX address_settlement_dm_indx
    ON person_addresses (dm_last_name, settlement_id);
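A query sketch that benefits from this column order, assuming dm_last_name (the double metaphone sign) is far more selective than settlement_id:

-- The rarer column goes first, so the lookup touches a small part of the index
SELECT *
FROM person_addresses
WHERE dm_last_name = 'РПК'
  AND settlement_id = 42;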
BAD index example
index on phones (type, number)
The type column has only a few values (MOBILE and a few other types), so e.g. 20,000,000 rows share type = 'MOBILE' with numbers 0500000000 .. 0999999999.
Almost the whole index file will be checked, and `random_page_cost` makes those index page reads even more expensive.
Why is the previous example not effective?
A few things to remember before building multicolumn indexes:
- there is no "silver bullet" for multicolumn indexes
- the distribution of repeats for each field
- the order of the result record set satisfying the search conditions
Making Postgres select records faster when the indexes are already inflated:
PARTITIONS
When is it time to use partitions?
- Table(s) are too big
- Indexes are too big
- No other optimization would help: too many pages will be scanned
- Most of the accessed rows can be stored in a single partition or in a few partitions
Example
persons (large table) joined to person_addresses (a table twice as large),
searched by widespread values (settlement, first name):
.. WHERE p.first_name = .. AND a.settlement_id = ..
-> extremely slow select
Partitions example
person_addresses partitioned by settlement_id into child tables, persons (large table) unchanged:
.. WHERE p.first_name = .. AND a.settlement_id = ..
-> fast select
Partitions example
person_addresses partitioned by settlement_id into child tables, or even persons also partitioned by first_name into child tables:
.. WHERE p.first_name = .. AND a.settlement_id = ..
-> very fast select
Partitions
- INSERT/UPDATE/SELECT performance is improved
- Additional logic has to be described for the inherited tables (rules, indexes); a partitioning sketch follows below
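A hedged sketch of the partitioning idea using declarative partitioning (available since PostgreSQL 10); the talk mentions inherited tables with rules and indexes, so treat the exact mechanism and names here as assumptions:

-- person_addresses split into child tables by settlement_id,
-- so a lookup for one settlement scans only one small partition
CREATE TABLE person_addresses (
    person_id     bigint NOT NULL,
    settlement_id bigint NOT NULL,
    address       text
) PARTITION BY LIST (settlement_id);

CREATE TABLE person_addresses_s1
    PARTITION OF person_addresses FOR VALUES IN (1);
CREATE TABLE person_addresses_s2
    PARTITION OF person_addresses FOR VALUES IN (2);

-- Indexes (and any extra logic) are declared per child table
CREATE INDEX ON person_addresses_s1 (person_id);
CREATE INDEX ON person_addresses_s2 (person_id);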
When building indexes/partitions costs too much or makes no sense because of the value distribution:
DENORMALIZATION
When denormalization makes sense
- Index(es) are built with a long-running function*
- Index(es) cost too much because the index has to be rebuilt on each update*
- Partitions would not help
* indexes are rebuilt for every update, even when only non-indexed fields change
The pre-counted/aggregated/joined data you need is kept in a denormalized table with its own indexes, ready for fast selects.
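A sketch of what such a denormalized table could look like; the column set and names are assumptions:

-- Pre-joined / pre-computed data kept ready for fast candidate selects,
-- maintained asynchronously instead of rebuilding heavy indexes on every write
CREATE TABLE person_search (
    person_id       bigint PRIMARY KEY,
    birth_date      date,
    dm_last_name    text,   -- double metaphone sign, pre-computed
    document_number text,   -- already normalized with regexp_replace
    settlement_id   bigint
);

CREATE INDEX person_search_dm_indx ON person_search (dm_last_name, birth_date);
CREATE INDEX person_search_doc_indx ON person_search (document_number);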
Clusters: fuzzy, accurate
- accurate documents match
- accurate authentication phones match
- double metaphone last_name plus an accurate match on one of a few important person attributes
- double metaphone last_name plus a fuzzy match on one of a few important person attributes
Optimization tips for queries with a lot of conditions
Putting it all together:
SELECT p.* FROM persons p
LEFT JOIN person_addresses pa ON pa.person_id = p.id
  AND pa.settlement_id = .. AND (p.birth_date = .. OR p... = .. OR ..)
LEFT JOIN person_documents pd ON pd.person_id = p.id
  AND (pd.number = ... OR pd.. = .. OR ..)
LEFT JOIN person_phones ...
.. 3 more joins ..
JOIN person_double_methaphone_names ...
WHERE p... = OR ...
ORDER BY ..
LIMIT .. OFFSET ..

A query with a lot of joins and conditions will run very slowly:
most likely the planner will check all the rows in the main table.
Optimization tips for queries with a lot of conditions
UNION makes the planner run the selects one by one, using your optimizations for each search cluster:
SELECT p.* FROM (
  (SELECT DISTINCT person_id FROM person_auth WHERE ..)
  UNION
  (SELECT id FROM persons WHERE ..)
  UNION
  (SELECT DISTINCT person_id FROM denormalized_table WHERE ..)
  UNION
  .. 3 MORE SUBSELECTS ..
) sub_union
LEFT JOIN ...
...
ORDER BY ..
LIMIT .. OFFSET ..
JSON/JSONB
- JSON(B) columns, even indexed, are not good for search
- Big JSON is better stored in a separate table
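A sketch of that approach; the table and column names are illustrative:

-- Keep the searchable columns in the main table and move the big,
-- rarely searched JSON payload into a separate 1:1 table
CREATE TABLE person_payloads (
    person_id bigint PRIMARY KEY REFERENCES persons (id),
    payload   jsonb NOT NULL
);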
Tips for starting GenStage
- Start GenStage under a separate supervisor with the :one_for_all strategy
- Store the 'subscription_tag' returned by 'sync_subscribe'
children = [
  worker(Producer, [%{id: @producer}], id: @producer, name: @producer)
  | Enum.map(consumer_ids(), fn consumer_id ->
      worker(Consumer, [%{id: consumer_id}], id: consumer_id, name: consumer_id)
    end)
]
..
{:ok, _} =
  Supervisor.start_link(children, strategy: :one_for_all, name: Deduplication.Supervisor.GenStage)
..
Enum.each(consumers, fn consumer ->
  with {:ok, subscription_tag} <-
         GenStage.sync_subscribe(consumer, to: @producer, max_demand: demand) do
    PersistentTerm.store_subscriptions(...)
  end
end)
..
Enum.each(consumers, fn consumer ->
  with {:ok, new_subscription_tag} <-
         GenStage.sync_resubscribe(consumer, subscription_tag, :normal, to: @producer, max_demand: demand) do
    PersistentTerm.store_subscriptions(...)
  end
end)
Log sample: each consumer's counter grows at its own pace
Consumer_4: 19  Consumer_7: 13  Consumer_1: 14  Consumer_11: 18  Consumer_19: 16
Consumer_11: 19  Consumer_2: 11  Consumer_1: 15  Consumer_4: 20  Consumer_12: 18
Consumer_7: 14  Consumer_19: 17  Consumer_2: 12  Consumer_11: 20  Consumer_14: 18
Consumer_1: 16  Consumer_13: 17  Consumer_12: 19  Consumer_19: 18  Consumer_14: 19
Consumer_12: 20
Why GenStage?
- Scoring each person takes a different time in each consumer because of the different number of possible clones
- Data is produced automatically, driven by consumer demand
- Very easy to scale consumers
- You don't have to implement queues/handlers for getting the data: less code, less time, same results
What to improve?
- Use Bayesian statistics to handle consonant pollution and to determine matches by the double metaphone sign
- Extend the current model
- Add a fuzzy documents cluster
Thank You!
ElixirClub Kyiv September 2019
By marinakr