Performing concurrent fuzzy matches with Postgres and GenStage

Maryna Shabalina, Edenlab

marinakrasnovatr81@gmail.com

Elixir/Erlang developer at Edenlab

Problem

  • There can be many similar records in the database that identify the same person account
  • Each account record costs money

Problem

  • No additional infrastructure allowed
  • 30 million records
  • Production database performance must not be degraded during the calculations

 Шварценеґґер, 5433224760, +38093000000,
ЕН617561, Таль

Швартсніггер, 5433224760, +380930000000,
ЕН617567; м. Таль

Шфортцнєґер, 5433224769, NULL,
67561, Thal

PERSON

CLONES

Same person, several records

More prosaic examples

Мельнік Мельник

Новохатня  Новокхатняя

І-239407  1-239407, 239407

  • Numbers on old paper documents: easily confused characters such as 1 ↔ 7, 0 ↔ 9, В ↔ 8
  • Birth date input forms are often inconvenient

Why will the usual methods for finding duplicates/fraud not work?

  • activity analysis: the accounts do nothing
  • accounts are created asynchronously via 3rd-party software

Person + presumable clone → Python ML model → similarity score for the pair, 0..1 (e.g. 0.8783456)

Дзигановський Юрій, 074338611, +380..9, 2016-04-05, ...
Дзекановський Юра, 1074338611, +380..9, 2019-04-05, ...

Deduplication process for one record: Record #N is scored against each candidate by the Python ML model; a candidate scored as a clone is deactivated, candidates scored only as similar are kept, and Record #N is marked as verified.

Elixir

Postgres production database 

30,000,000 records with duplicates

A few days to find and deactivate the duplicated records

Processing record

unverified record → get candidates → count similarity → store results if necessary → mark record as counted

candidates
|> Task.async_stream(&count_pair_weight(person, &1))
|> async_stream_filter()
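For context, a minimal sketch of what a helper like async_stream_filter/1 could look like (an assumption; the real implementation is not shown on the slides): keep only task results that completed successfully and produced a score.

# Assumed helper, not from the slides: keep successful Task.async_stream
# results, drop crashed tasks and pairs without a usable score.
defp async_stream_filter(stream) do
  Enum.flat_map(stream, fn
    {:ok, nil} -> []
    {:ok, result} -> [result]
    {:exit, _reason} -> []
  end)
end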

Getting candidates

  • all to all:

~900,000,000,000,000 pairs to process

with 0.02 seconds to calculate the score for one pair:

~570,776 years in 1 process

~28.5 years in 20,000 parallel processes

900,000,000,000,000 * 0.02 / (60*60*24*365)

All to All

Waiting for the full all-to-all duplicate check to finish is not an option

This step should take milliseconds

get candidates for a record:

  • exact match on a few consequential fields
  • fuzzy match on consequential identifiers

Get only the records that are guaranteed in advance to be similar

Exact match

{tax_id: 1234567890, documents: [{number: 1111}]}

PERSONS: match on tax_id = 1234567890
DOCUMENTS: join on person_id = persons.id, match on number = 1111

→ Candidates
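A sketch of the exact-match candidate lookup implied by the diagram (table and column names are assumptions based on the slides):

-- Sketch only: exact-match candidates by tax_id or document number
SELECT id AS person_id FROM persons WHERE tax_id = '1234567890'
UNION
SELECT person_id FROM person_documents WHERE number = '1111';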

Fuzzy match on consequential identifiers (see the sketch below)
  • approximately equal name sound (using a Ukrainian double metaphone)
  • "typo" mistakes
  • a distance of 1-2 symbols in the relevant fields
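A sketch of a fuzzy-match lookup in that spirit, assuming a precomputed dm_last_name column and the fuzzystrmatch extension's levenshtein/2; the production query is not shown on the slides.

-- Sketch only: same double metaphone signature plus a 1-2 symbol
-- distance on an identifier (assumes fuzzystrmatch is installed)
SELECT p.id
  FROM persons p
 WHERE p.dm_last_name = 'РПК'
   AND levenshtein(p.tax_id, '5433224760') <= 2;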

Double metaphone

The algorithm builds a signature of a word by normalizing silent and similar-sounding letters:

С, Ц, ТС, ТЦ -> С (S)

З + vowel -> С (S)

Д + vowel -> Т (T)

В + vowel -> Ф (F)

Б + vowel -> П (P)

КХ + vowel -> Х (KH)

popular last-name endings -> a single symbol

Double metaphone examples

КВРТ@    Ковратенко, Гавритенко, Гаврутенко

ПК#    Паковська, Бугаєвська, Бегійовська, Буяковська

КЛН&   Кліницький, Гуляницький, Каленський, Каленські

РПК    Рибяк, Рибак, Рип'як, Арабок, Ріпак
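A toy Elixir sketch of the normalization idea (heavily simplified: it covers only two of the rules above and none of the ending rules, so it is not the production Ukrainian double metaphone):

defmodule ToyMetaphone do
  # Toy sketch: a couple of rules in the spirit of the slide
  @rules [
    {~r/тс|тц|ц/u, "с"},             # С, Ц, ТС, ТЦ -> С
    {~r/б(?=[аеєиіїоуюя])/u, "п"}    # Б + vowel -> П
  ]

  def signature(word) do
    normalized =
      Enum.reduce(@rules, String.downcase(word), fn {re, sub}, acc ->
        Regex.replace(re, acc, sub)
      end)

    normalized
    |> String.replace(~r/[аеєиіїоуюяь'’-]/u, "")  # drop vowels, soft sign, apostrophes
    |> String.upcase()
  end
end

# ToyMetaphone.signature("Рибак") == ToyMetaphone.signature("Рип'як") == "РПК"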

GenStage processes

Producer: gets a batch of unverified records and marks them as processing

Consumer 1 .. Consumer N: each consumer processes its records; when the last record of a batch is consumed, new records are demanded from the Producer

Inside each consumer the per-record flow from before is used: unverified record → get candidates → count similarity → store results if necessary → mark record as counted
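A minimal, self-contained sketch of such a producer/consumer pair (module names and the integer "records" are placeholders; the real stages read from and write to Postgres):

defmodule Dedup.Producer do
  use GenStage

  def start_link(_), do: GenStage.start_link(__MODULE__, 0, name: __MODULE__)

  def init(last_id), do: {:producer, last_id}

  # In the real pipeline: select a batch of unverified records
  # and mark them as processing before emitting them.
  def handle_demand(demand, last_id) when demand > 0 do
    records = Enum.to_list((last_id + 1)..(last_id + demand))
    {:noreply, records, last_id + demand}
  end
end

defmodule Dedup.Consumer do
  use GenStage

  def start_link(_), do: GenStage.start_link(__MODULE__, :ok)

  def init(:ok), do: {:consumer, :ok}

  # In the real pipeline: get candidates -> count similarity ->
  # store results if necessary -> mark the record as counted.
  def handle_events(records, _from, state) do
    Enum.each(records, &IO.inspect(&1, label: "processing record"))
    {:noreply, [], state}
  end
end

# {:ok, producer} = Dedup.Producer.start_link([])
# {:ok, consumer} = Dedup.Consumer.start_link([])
# {:ok, _tag} = GenStage.sync_subscribe(consumer, to: producer, max_demand: 20)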

PostgreSQL database performance tips for deduplication

  • Never store the processing mark/status as a column
  • Store the last processed record attribute and an offset (if the attribute is not unique) separately, on disk with fast access (see the sketch below)
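A sketch of that idea (file path and term shape are assumptions): persist the last processed attribute and offset in a small local file instead of flagging rows in the big table.

# Assumed sketch: keep the deduplication "cursor" outside the big table
defmodule Dedup.Checkpoint do
  @path "dedup_checkpoint.bin"  # placeholder location

  def save(last_updated_at, offset) do
    File.write!(@path, :erlang.term_to_binary({last_updated_at, offset}))
  end

  def load do
    case File.read(@path) do
      {:ok, bin} -> :erlang.binary_to_term(bin)
      {:error, _} -> {nil, 0}
    end
  end
end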

Use the index, Luke!

SELECT * FROM persons 
WHERE updated_at + interval '7 days' < now()

-- or any other expression wrapped around the indexed column:
SELECT * FROM persons 
WHERE ∫(μ(updated_at)y('7 days'))′dt < now()

"persons_updated_at_index" btree (updated_at)

the index will not be used

Tips for using simple indexes

SELECT * FROM persons 
WHERE updated_at < now() - interval '7 days'
 ... OFFSET ..

"persons_updated_at_index" btree (updated_at)

SELECT * FROM person_documents 
WHERE 
  regexp_replace(
    number, '[^[:digit:]]', '', 'g') = '697569'

"documents_number_index" btree

(regexp_replace(number, '[^[:digit:]]', '', 'g'))

SELECT * FROM person_documents 
WHERE number = '697569'

Using indexes built with a function

The left side of a condition on the indexed field must be exactly the expression the index was built with (see the sketch below)
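For reference, a presumed definition of that expression index (name and expression taken from the slides):

-- presumed DDL for the expression index shown above
CREATE INDEX documents_number_index
    ON person_documents (regexp_replace(number, '[^[:digit:]]', '', 'g'));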

Using simple indexes

Partial indexes

CREATE INDEX pd_passport_indx ON person_documents (lower(number))
WHERE type = 'PASSPORT';

CREATE INDEX pd_national_id_indx ON person_documents (lower(number))
WHERE type = 'NATIONAL_ID';

... 5 more types ...

 person_documents | person_documents_number_index  | 1156 MB
 person_documents | birth_cert_l_number_index      | 131 MB
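One possible way to produce a size listing like the one above (an assumption, not from the slides):

-- index sizes for one table
SELECT tablename, indexname,
       pg_size_pretty(pg_relation_size(indexname::regclass)) AS size
  FROM pg_indexes
 WHERE tablename = 'person_documents';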

Simplest case: 1 to N relation

 persons              person_documents
---------   1 -> N   ------------------
 person               document
                      document
                      document

INDEXES:
persons (birth_date)
person_documents (number)

Making Postgres choose the best query plan

Put the search conditions into 'JOIN' clauses instead of 'WHERE' conditions

Making Postgres choose the best query plan

explain analyze SELECT p0.*
  FROM "persons" AS p0 
  INNER JOIN "person_documents" AS p1 
  ON p1."person_id" = p0."id" 
  WHERE 
    (p0."birth_date" = '1991-08-24') AND (p1."type" = 'PASSPORT') AND (lower(p1."number")) = 'XXXXXX'
  GROUP BY p0."id" ORDER BY p0."inserted_at" DESC LIMIT 11 OFFSET 0;

Limit  (cost=11598.42..11598.45 rows=11 width=703) (actual time=5680.152..5680.152 rows=1 loops=1)
   ->  Sort  (cost=11598.42..11598.45 rows=11 width=703) (actual time=5680.150..5680.150 rows=1 loops=1)
         Sort Key: p0.inserted_at DESC
         Sort Method: quicksort  Memory: 26kB
         ->  Group  (cost=11598.18..11598.23 rows=11 width=703) (actual time=5680.099..5680.101 rows=1 loops=1)
               Group Key: p0.id
               ->  Sort  (cost=11598.18..11598.21 rows=11 width=703) (actual time=5680.089..5680.089 rows=1 loops=1)
                     Sort Key: p0.id
                     Sort Method: quicksort  Memory: 26kB
                     ->  Nested Loop  (cost=1.12..11597.99 rows=11 width=703) (actual time=2475.272..5680.054 rows=1 loops=1)
                           ->  Index Scan using persons_birth_date_index on persons p0  (cost=0.56..3325.90 rows=2966 width=703) (actual time=0.657..2162.997 rows=3402 loops=1)
                                 Index Cond: (birth_date = '1940-01-01'::date)
                                 Filter: is_active
                           ->  Index Scan using person_documents_person_id_index on person_documents p1  (cost=0.56..2.78 rows=1 width=16) (actual time=1.032..1.032 rows=0 loops=3402)
                                 Index Cond: (person_id = p0.id)
                                 Filter: ((type = 'PASSPORT'::text) AND ((lower(p1."number")) = 'XXXXXX'::text))
                                 Rows Removed by Filter: 1
 Planning time: 61.766 ms
 Execution time: 5680.474 ms

With the JOIN on ids only and the search conditions in WHERE: very slow even with correct indexes

Too many records match the frequently repeated birth_date condition that was used

Making Postgres choose the best query plan

explain analyze SELECT p0.* 
  FROM persons AS p0 
  INNER JOIN person_documents AS p1 
  ON p1.person_id = p0.id 
  AND p1.type = 'BIRTH_CERTIFICATE' AND regexp_replace(number,'[^[:digit:]]','','g') = '1' 
  WHERE p0."birth_date" = '2012-03-14' ORDER BY p0."inserted_at" DESC LIMIT 11 OFFSET 0;
  
  
                                                                              QUERY PLAN                                                                               
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=58.18..58.18 rows=1 width=540) (actual time=4.605..4.605 rows=0 loops=1)
   ->  Sort  (cost=58.18..58.18 rows=1 width=540) (actual time=4.603..4.603 rows=0 loops=1)
         Sort Key: p0.inserted_at DESC
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop  (cost=1.12..58.17 rows=1 width=540) (actual time=4.594..4.594 rows=0 loops=1)
               ->  Index Scan using person_documents_number_index on person_documents p1  (cost=0.56..38.62 rows=7 width=16) (actual time=4.592..4.592 rows=0 loops=1)
                     Index Cond: (regexp_replace(number, '[^[:digit:]]'::text, ''::text, 'g'::text) = '1'::text)
                     Filter: (type = 'BIRTH_CERTIFICATE'::text)
               ->  Index Scan using persons_tmp_pkey on persons p0  (cost=0.56..2.78 rows=1 width=540) (never executed)
                     Index Cond: (id = p1.person_id)
                     Filter: (birth_date = '2012-03-14'::date)
 Planning time: 1.161 ms
 Execution time: 4.679 ms
(13 rows)

With the search condition in the JOIN

The index built with the search expression (regexp) on documents, whose numbers have a rare value distribution, was used

Making Postgres choose the best query plan

explain analyze SELECT p0.*
  FROM "persons" AS p0 INNER JOIN 
  (SELECT DISTINCT p0."id" AS "id" 
    FROM "persons" AS p0 
    INNER JOIN "person_documents" AS p1 ON 
    (p1."person_id" = p0."id") AND (((p1."type" = 'NATIONAL_ID') 
    AND lower(p1.number) = '20120314'))) AS s1 ON p0."id" = s1."id" 
   WHERE ((p0."birth_date" = '2012-03-14') AND (p0."status" = 'active')) AND (lower(p0."last_name") = 'селедцов') AND (p0."is_active") 
   ORDER BY p0."inserted_at" DESC LIMIT 11 OFFSET 0;
                                                                         QUERY PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=19263.84..19263.84 rows=1 width=540) (actual time=0.046..0.046 rows=0 loops=1)
   ->  Sort  (cost=19263.84..19263.84 rows=1 width=540) (actual time=0.045..0.045 rows=0 loops=1)
         Sort Key: p0.inserted_at DESC
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop  (cost=19040.51..19263.83 rows=1 width=540) (actual time=0.030..0.030 rows=0 loops=1)
               Join Filter: (p0.id = p0_1.id)
               ->  Index Scan using auth_method_search_idx on persons p0  (cost=0.69..2.91 rows=1 width=540) (actual time=0.030..0.030 rows=0 loops=1)
                     Index Cond: ((birth_date = '2012-03-14'::date) AND (lower((last_name)::text) = 'селедцов'::text))
                     Filter: (is_active AND ((status)::text = 'active'::text))
               ->  HashAggregate  (cost=19039.82..19107.85 rows=6803 width=16) (never executed)
                     Group Key: p0_1.id
                     ->  Nested Loop  (cost=0.99..19022.81 rows=6803 width=16) (never executed)
                           ->  Index Scan using national_id_l_number_index on person_documents p1  (cost=0.43..7590.64 rows=6803 width=16) (never executed)
                                 Index Cond: (lower(number) = '20120314'::text)
                           ->  Index Only Scan using persons_tmp_pkey on persons p0_1  (cost=0.56..1.67 rows=1 width=16) (never executed)
                                 Index Cond: (id = p1.person_id)
                                 Heap Fetches: 0
 Planning time: 2.402 ms
 Execution time: 0.307 ms
(19 rows)

With a join to a subquery on the search condition

Joining to a subquery that exactly matches the best index leaves Postgres no way to choose another query plan :)

Multicolumn Indexes

  • The order of fields when building an index matters
  • The field with the rarest distribution of repeated values goes first
CREATE INDEX address_settlement_dm_indx ON person_addresses
  (dm_last_name, settlement_id);

BAD index example

index on phones (type, number)

The type column is almost always 'Mobile' (e.g. 20,000,000 rows) with only a few other types, while number ranges over 0500000000, 0500000001, ..., 09999999999

Almost the whole index file will be checked

`random_page_cost` is higher

Why is the previous example not effective?

There is no "silver bullet" for multicolumn indexes

A few things to remember before building multicolumn indexes:

  • the distribution of repeated values in each field
  • the order of the result set that satisfies the search conditions

Making Postgres select records faster when the indexes are already inflated

PARTITIONS

When is it time to use partitions?

Table(s) are too big

Indexes are too big

No other optimization helps - too many pages would be scanned

Most of the accessed rows can be stored in a single partition or a few partitions

Example

persons (large table) with first_name, joined to person_addresses (a table twice as large) with settlement_id

.. where p.first_name = .. and a.settlement_id = ..

Both conditions are on widespread values (settlement, first name) → extremely slow select

Partitions example

person_addresses is partitioned by settlement_id into child tables; persons (large table) stays as is

.. where p.first_name = .. and a.settlement_id = ..

Same widespread values (settlement, first name), but now → fast select

Partitions example

person_addresses is partitioned by settlement_id into child tables, or even persons is partitioned by first_name into child tables

.. where p.first_name = .. and a.settlement_id = ..

Widespread values (settlement, first name) → very fast select

Partitions

INSERT/UPDATE/SELECT performance is improved

Additional logic has to be defined for the inherited tables (rules, indexes)
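A sketch of the idea with declarative list partitioning (PostgreSQL 10+; column types and names are assumptions, and since the slides mention inherited tables the production setup likely used inheritance with rules/triggers instead):

-- Sketch: person_addresses partitioned by settlement_id
CREATE TABLE person_addresses (
    person_id     bigint  NOT NULL,
    settlement_id integer NOT NULL,
    address       text
) PARTITION BY LIST (settlement_id);

-- one child table per settlement (or per group of settlements)
CREATE TABLE person_addresses_s1 PARTITION OF person_addresses
    FOR VALUES IN (1);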

When building indexes/partitions costs too much or makes no sense because of the value distribution →

DENORMALIZATION

When denormalization makes sense

Index(es) are built with a long-running function*

Index(es) cost too much to rebuild on each update*

Partitions would not help

* indexes are rebuilt on every update, even when only non-indexed fields change

The pre-counted/aggregated/joined data you need is ready for fast selects

denormalized table with its own indexes:
  index .... (..)
  index .... (..., ...)
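A hypothetical shape for such a denormalized table (the columns are guesses based on the clusters described next):

-- hypothetical denormalized search table with pre-computed attributes
CREATE TABLE person_search (
    person_id     bigint PRIMARY KEY,
    dm_last_name  text,      -- pre-computed double metaphone signature
    birth_date    date,
    settlement_id integer
);

CREATE INDEX person_search_dm_bd_indx
    ON person_search (dm_last_name, birth_date);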

Clusters: fuzzy, accurate

  • accurate documents match
  • accurate authentication phones match
  • double metaphone of last_name plus an accurate match on one of a few important person attributes
  • double metaphone of last_name plus a fuzzy match on one of a few important person attributes

Optimization tips for queries with a lot of conditions

Putting it all together

SELECT p.* from persons p 
LEFT JOIN person_addresses pa ON pa.person_id = p.id 
     AND pa.settlement_id = ..  AND (p.birth_date = .. OR p... = .. OR ..)
LEFT JOIN person_documents pd ON pd.person_id = p.id 
     AND (pd.number = ... OR pd.. = .. OR ..)
LEFT JOIN person_phones ...
.
..3 more..
.
JOIN person_double_methaphone_names ...
.
.
WHERE p... =  OR ...
ORDER BY ..
LIMIT 
OFFSET

A query with a lot of joins and conditions will run very slowly -

the planner will probably check all the rows of the main table

Optimization tips for queries with a lot of conditions

UNION makes the planner run the selects you need one by one, using your optimizations for each search cluster

SELECT p.* from (
  (select distinct person_id from person_auth WHERE ..)
  UNION
  (select id from persons WHERE ..)
  UNION
  (select distinct person_id from denormalized_table WHERE ..)
  UNION
  .. 3 more subselects ..
) sub_union
left join ...
...
ORDER BY ..
LIMIT 
OFFSET

JSON/JSONB

JSON(B) columns, even indexed, are not good for searching

Big JSON is better stored in a separate table

Tips for starting GenStage

  • Start GenStage under a separate supervisor with the :one_for_all strategy
  • Store the 'subscription_tag' after 'sync_subscribe'
[
  worker(Producer, [%{id: @producer}], id: @producer, name: @producer)
    | Enum.map(consumer_ids(), fn consumer_id ->
       worker(Consumer, [%{id: consumer_id}], id: consumer_id, name: consumer_id)
      end)
]

  
..

{:ok, _} =  
  Supervisor.start_link(children, strategy: :one_for_all, name: Deduplication.Supervisor.GenStage)

..

Enum.each(consumers, fn consumer ->
  with {:ok, subscription_tag} <- GenStage.sync_subscribe(consumer, to: @producer, max_demand: demand) do
    PersistentTerm.store_subscriptions(...)
  end
end)

..

Enum.each(consumers, fn consumer ->
  with {:ok, new_subscription_tag} <-
         GenStage.sync_resubscribe(consumer, subscription_tag, :normal, to: @producer, max_demand: demand) do
    PersistentTerm.store_subscriptions(...)
  end
end)

Consumer_4: 19
Consumer_7: 13
Consumer_1: 14
Consumer_11: 18
Consumer_19: 16
Consumer_11: 19
Consumer_2: 11
Consumer_1: 15
Consumer_4: 20
Consumer_12: 18
Consumer_7: 14
Consumer_19: 17
Consumer_2: 12
Consumer_11: 20
Consumer_14: 18
Consumer_1: 16
Consumer_13: 17
Consumer_12: 19
Consumer_19: 18
Consumer_14: 19
Consumer_12: 20

Why GenStage?

Scoring a person in each consumer takes a different amount of time because the number of possible clones differs, and GenStage produces data for the consumers automatically

Very easy to scale the consumers

You don't have to implement queues/handlers for getting data - less code, less time, same results

What to improve?

Use Bayesian statistics to handle consonant pollution and to determine matches from the double metaphone signature

Extend current model

Fuzzy documents cluster

Thank You!

ElixirClub Kyiv September 2019

By marinakr
