The Usual Suspects
Bibliographic De-duplication in Evergreen
Rogan Hamby, Equinox Open Library Initiative
Evergreen International Conference 2017
Housekeeping
Questions for Me
An Interactive Experience
Be Kind to Each Other, Keep me From Yodeling.
Seriously, I sound like a wounded llama with serious mental problems.
Self Questioning
During the course of a de-duplication you will ask yourself a lot of questions. You will probably get sick of talking to yourself. So, bring plenty of people into the project so that you don't end up talking to a dog.
A Question I Get
What is the correct way to do a de-dupe?
The Answer I Give
There are those who believe there is one way to do a de-duplication. I believe there are at least one hundred.
A Question For You
Where on the spectrum do you see yourself?
Perfect Descriptive Granularity
Extreme Ease of Search
Let's Get to Know Each Other
1997 p.307 Hardcover
1998 p.320 Softcover
1998 p.320 Softcover
1999 p.320 Softcover
ACHTUNG!!!!!!!!!
How to Build a Deduplication
- What records are you de-duping? Against or Within?
- Do you have any absolute match criteria, i.e. A & B must match?
- What will your optional match points be?
- How will you score the optional match points?
- How will you determine the lead record?
What records are you de-duping? Against or Within?
How many.
VS
Do you have any absolute match criteria?
Do you have any criteria in your data reliable enough to be a sure match?
Start with rounding up the usual suspects.
035
020
022
024
title
author
pub
Do Not Trust Your Own Intuition - LOOK AT YOUR DATA
5 | 1575664100
1 | 0396085903
1 | 0688147089
1 | 0439270553
1 | 9781433992155 (6-pack)
1 | 0252008790
1 | 0811645118
1 | 9780892047222
1 | 0253166756
1 | 0695400894
1 | 0517385546
1 | 0743270495
1 | 0345340426 (boxed set)
1 | 0307456242 (pbk.)
1 | 9780544115897
1 | 1607060760
1 | 0961899549
1 | 0516204688 (hc : lib. bdg.)
1 | 0136487416 (pbk.)
1 | 1570754047 (pbk.)
1 | 1568304862
1 | 1436174546
1 | 0783268580
1 | 9780452289253 (pbk.)
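A report like the one above can be produced with a few lines of code. The following is a minimal sketch, assuming the 020 values have already been exported as plain strings; `normalize_isbn` and `isbn_report` are hypothetical helper names, and the normalization only handles simple cases such as trailing qualifiers like "(pbk.)".

```python
import re
from collections import Counter

# A bare ISBN-10 (last character may be X) or ISBN-13.
ISBN_RE = re.compile(r"^(?:\d{13}|\d{9}[\dX])$")

def normalize_isbn(raw):
    """Return the bare ISBN from a raw 020 value, or None if none is found."""
    # Drop any qualifier such as "(pbk.)" or "(6-pack)".
    token = raw.strip().split("(")[0].strip()
    # Drop hyphens and spaces inside the number.
    token = token.replace("-", "").replace(" ", "").upper()
    return token if ISBN_RE.match(token) else None

def isbn_report(values):
    """Count how often each normalized ISBN appears, most frequent first."""
    counts = Counter()
    for raw in values:
        isbn = normalize_isbn(raw)
        counts[isbn if isbn else "(unparsed)"] += 1
    return counts.most_common()

if __name__ == "__main__":
    sample = ["0307456242 (pbk.)", "0307456242", "9781433992155 (6-pack)", "n/a"]
    for isbn, count in isbn_report(sample):
        print(f"{count:>5} | {isbn}")
```

Counting the "(unparsed)" bucket matters as much as the matches: it shows how much of the field cannot be trusted as a match point.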
You don't need more than basic statistics to do a successful de-duplication. But you do need the statistics, at all stages.
Are you willing to be wrong?
We don't write programs to do de-duping because they do a better job than catalogers. We write them because code scales where people can't. That means we accept some limitations, and it means accepting that there will always be work for catalogers to do.
http://www.infotoday.com/cilmag/may12/Hamby-A-Practical-Approach-to-Collection-Deduping.shtml
https://goo.gl/cjqJ63
10% Wrong for 90% Done: A Practical Approach to Collection Deduping
Computers in Libraries, May 2012
Determining Optional Match Points and Scoring
Start With the Usual Suspects
- 020
- 022
- 024
- 035
- author
- pub
- title
The same reports you ran to look at values as requirements can inform optional match point decisions.
Beyond that ... the unusual suspects.
Non OCLC 035s.
Pull illustrators from 245$c.
... could use 650s (well, you could).
010
You could index proper nouns in 520s
(which seems a bit mad but ....)
In the end results matter.
Should these two cats become one? Cat/MARC analogies can only go so far. Remember, at this stage pairing is a team score.
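Absolute criteria plus scored optional match points can be sketched as follows. This is an illustration only: the weights, the threshold, the choice of title as a required field, and the record layout (a dict of field name to a set of already-normalized values) are all assumptions, not values from the talk.

```python
# Hypothetical absolute match criteria: every field here must agree.
REQUIRED = ("title",)

# Hypothetical weights for optional match points; tune against your own reports.
OPTIONAL_WEIGHTS = {
    "oclc": 50,       # 035 (OCLC numbers)
    "isbn": 40,       # 020
    "issn": 40,       # 022
    "upc": 30,        # 024
    "author": 10,
    "publisher": 5,
}

# Hypothetical cutoff for treating a pair as duplicates.
THRESHOLD = 50

def shares_value(a, b, field):
    """True when both records have the field and at least one value in common."""
    return bool(a.get(field, set()) & b.get(field, set()))

def pair_score(a, b):
    """Score a pair of records: 0 if a required field fails, else the summed weights."""
    if not all(shares_value(a, b, field) for field in REQUIRED):
        return 0
    return sum(w for field, w in OPTIONAL_WEIGHTS.items() if shares_value(a, b, field))

def is_match(a, b):
    return pair_score(a, b) >= THRESHOLD
```

The score belongs to the pair, not to either record, which is why it is computed from what the two records share.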
How will you determine the lead record?
Lead record scoring is king of the hill.
Remember when I made fun of using 650s a few slides back?
Look at everything that adds to the quality of a record: 500s, 600s, 700s. Look for false positives, look at field lengths, number of entries, and so on. You can also revisit the criteria you looked at for matching (and should).
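Unlike pair scoring, lead record scoring rates each record on its own and the highest score wins. A minimal sketch, assuming each record is a dict with an `id` and a list of `(tag, value)` pairs; the point values and the lowest-id tie-break are hypothetical choices for illustration.

```python
def quality_score(fields):
    """Rate one record: a point per 5XX/6XX/7XX field, plus a point per 100 characters."""
    score = 0
    total_length = 0
    for tag, value in fields:
        if tag.startswith(("5", "6", "7")):
            score += 1  # notes, subjects, and added entries add to quality
        total_length += len(value)
    return score + total_length // 100

def pick_lead(group):
    """Return the record that wins king of the hill within a group of duplicates."""
    # Ties go to the lowest record id (an arbitrary but repeatable choice).
    return max(group, key=lambda rec: (quality_score(rec["fields"]), -rec["id"]))
```

Raw counts reward padded records as readily as rich ones, so the winners still need to be checked for false positives.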
I'm done, right?
No.
Model your results through reports or a test server.
Have representative samples checked.
Tweak.
Re-run.
Repeat as needed until you're ready.
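One way to pull representative samples for checking is to draw a few proposed merges from each score band, so that borderline matches get reviewed alongside the obvious ones. A minimal sketch, assuming each proposed merge group is a dict with a numeric `score`; the band width, sample size, and seed are arbitrary illustrative values.

```python
import random

def review_sample(groups, per_band=2, band_width=10, seed=2017):
    """Pick up to per_band merge groups from each score band for human review."""
    rng = random.Random(seed)  # fixed seed so a re-run returns the same sample
    bands = {}
    for group in groups:
        bands.setdefault(group["score"] // band_width, []).append(group)
    sample = []
    for band in sorted(bands):
        members = bands[band]
        sample.extend(rng.sample(members, min(per_band, len(members))))
    return sample
```

After each tweak, re-running with the same seed and comparing the sampled groups gives the reviewers a stable basis for saying whether things got better.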
Sifting Pebbles - An Interactive Discussion of Bibliographic De-duplication in Evergreen
By Rogan Hamby