The Usual Suspects

Bibliographic De-duplication in Evergreen

Rogan Hamby, Equinox Open Library Initiative

Evergreen International Conference 2017

Housekeeping

Questions for Me

 

An Interactive Experience

Be Kind to Each Other, Keep me From Yodeling.  

 

Seriously, I sound like a wounded llama with serious mental problems.

Self Questioning

During the course of a de-duplication you will ask yourself a lot of questions.  You will probably get sick of talking to yourself.  So, bring plenty of people into the project so that you don't end up talking to a dog.  

A Question I Get

What is the correct way to do a de-dupe?

The Answer I Give

There are those who believe there is one way to do a de-duplication.  I believe there are at least one hundred.  

Where on the spectrum do you see yourself?  

A Question For You

Perfect Descriptive Granularity

Extreme Ease of Search

Let's Get to Know Each Other

1997 p.307 Hardcover

1998 p.320 Softcover

1998 p.320 Softcover

1999 p.320 Softcover

ACHTUNG!!!!!!!!!

How to Build a Deduplication

  • What records are you de-duping? Against or Within?
  • Do you have any absolute match criteria, e.g. A & B must match?
  • What will your optional match points be?
  • How will you score the optional match points?
  • How will you determine the lead record?

What records are you de-duping? Against or Within?

How many.

VS
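Why "how many" matters: the number of candidate comparisons grows very differently in the two cases. The sketch below is only an illustration of that arithmetic. The function names are made up for this example and are not part of Evergreen.

```python
from itertools import combinations, product


def pairs_within(record_ids):
    """Every unordered pair inside one set of records (de-duping within)."""
    return list(combinations(record_ids, 2))


def pairs_against(incoming_ids, existing_ids):
    """Every incoming record paired with every existing record (de-duping against)."""
    return list(product(incoming_ids, existing_ids))


def count_within(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def count_against(n_incoming, n_existing):
    """Number of pairs when each incoming record is compared to each existing one."""
    return n_incoming * n_existing
```

Nobody compares every pair at real catalog sizes, which is one reason the absolute match criteria matter: they narrow the candidates before any scoring happens.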

Do you have any absolute match criteria?

Do you have any criteria in your data reliable enough to be a sure match?

 

 


 

Start with rounding up the usual suspects.

  • 035
  • 020
  • 022
  • 024
  • title
  • author
  • pub
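A minimal sketch of what "A & B must match" can look like in practice: group records on the required values and keep only the groups with more than one member. The record layout (plain dicts with an "id" key) and the field names are assumptions for illustration, not how Evergreen stores bibs.

```python
from collections import defaultdict


def group_by_required_key(records, required_fields):
    """Group record ids by the values that MUST match.

    records: list of dicts, each with an "id" plus the required fields
             (a hypothetical layout for this example).
    Returns only the groups that contain more than one record.
    """
    groups = defaultdict(list)
    for rec in records:
        values = [rec.get(field) for field in required_fields]
        # A record missing a required value cannot be a sure match.
        if any(v is None or v == "" for v in values):
            continue
        # Light normalization only: trim whitespace and ignore case.
        key = tuple(str(v).strip().lower() for v in values)
        groups[key].append(rec["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Each surviving group is a set of candidates only. Whether they actually merge is decided later by the optional match points.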

Do Not Trust Your Own Intuition - LOOK AT YOUR DATA

     5 | 1575664100
     1 | 0396085903
     1 | 0688147089
     1 | 0439270553
     1 | 9781433992155 (6-pack)
     1 | 0252008790
     1 | 0811645118
     1 | 9780892047222
     1 | 0253166756
     1 | 0695400894
     1 | 0517385546
     1 | 0743270495
     1 | 0345340426 (boxed set)
     1 | 0307456242 (pbk.)
     1 | 9780544115897
     1 | 1607060760
     1 | 0961899549
     1 | 0516204688 (hc : lib. bdg.)
     1 | 0136487416 (pbk.)
     1 | 1570754047 (pbk.)
     1 | 1568304862
     1 | 1436174546
     1 | 0783268580
     1 | 9780452289253 (pbk.)

You don't need more than basic statistics to do a successful de-duplication.  But you do need the statistics, at all stages.
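The basic statistics can be as simple as counting values before and after cleanup. The sketch below assumes the values are ISBN-like strings that may carry qualifiers such as "(pbk.)", as in the report above. The regular expression and function names are illustrative assumptions, and the pattern does not validate check digits.

```python
import re
from collections import Counter

# Leading 10-character ISBN (last character may be X) or 13-digit ISBN,
# followed by a word boundary so qualifiers like "(pbk.)" are ignored.
ISBN_RE = re.compile(r"^\s*([0-9]{9}[0-9Xx]|[0-9]{13})\b")


def normalize_isbn(raw):
    """Return the bare ISBN from a raw value, or None if none is found."""
    match = ISBN_RE.match(raw.replace("-", ""))
    return match.group(1).upper() if match else None


def value_report(raw_values):
    """Count raw values and normalized values so the two can be compared."""
    raw_counts = Counter(raw_values)
    normalized_counts = Counter(
        n for n in (normalize_isbn(v) for v in raw_values) if n is not None
    )
    return raw_counts, normalized_counts
```

Comparing the two counts shows how many matches the qualifiers are hiding, and `most_common()` on either counter gives a report shaped like the one above.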

Are you willing to be wrong?

We don't write programs to do de-duping because they do a better job than catalogers.  We write them because code scales where people can't.  That means we accept some limitations, and it means accepting that there will always be work for catalogers to do.

http://www.infotoday.com/cilmag/may12/Hamby-A-Practical-Approach-to-Collection-Deduping.shtml

https://goo.gl/cjqJ63

10% Wrong for 90% Done: A Practical Approach to Collection Deduping

 

Computers in Libraries, May 2012

Determining Optional Match Points and Scoring

Start With the Usual Suspects

  • 020
  • 022
  • 024
  • 035
  • author
  • pub
  • title

The same reports you ran to look at values as requirements can inform optional match point decisions.
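One way to score optional match points is to give each one a weight and add up the weights of the points that agree. The weights, threshold and field names below are placeholders invented for this sketch. Real values should come out of your own reports.

```python
# Illustrative weights only; tune them against your own data.
WEIGHTS = {"isbn": 5, "title": 3, "author": 2, "pub": 1}
THRESHOLD = 6  # hypothetical minimum score for treating a pair as a match


def match_score(a, b, weights=WEIGHTS):
    """Sum the weights of every optional match point where both records agree."""
    score = 0
    for field, weight in weights.items():
        value_a, value_b = a.get(field), b.get(field)
        # Missing or empty values never count as agreement.
        if value_a and value_b and value_a == value_b:
            score += weight
    return score


def is_match(a, b, threshold=THRESHOLD, weights=WEIGHTS):
    """A pair matches when its combined score reaches the threshold."""
    return match_score(a, b, weights) >= threshold
```

With these placeholder numbers a title and author match alone (3 + 2) falls short, while an ISBN plus a title (5 + 3) clears the bar.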

Beyond that ... the unusual suspects.

Non OCLC 035s.

 

Pull illustrators from 245$c.

 

... could use 650s (well, you could).

 

010

 

You could index proper nouns in 520s 

(which seems a bit mad but ....)
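As an example of an unusual suspect, here is a rough sketch of pulling illustrators out of a 245$c statement of responsibility. The phrases it looks for are assumptions, and real cataloging data will have many more variations. The names in the usage note are invented.

```python
import re

# Hypothetical list of lead-in phrases; extend it after looking at your data.
ILLUSTRATOR_RE = re.compile(
    r"(?:illustrated by|illustrations by|pictures by)\s+([^;/]+)",
    re.IGNORECASE,
)


def illustrators_from_245c(subfield_c):
    """Return the names that follow an illustrator phrase in a 245$c string."""
    names = []
    for match in ILLUSTRATOR_RE.finditer(subfield_c):
        # Trim spacing and the trailing period that usually ends the subfield.
        name = match.group(1).strip().rstrip(".").strip()
        if name:
            names.append(name)
    return names
```

For instance, "by Jane Doe ; illustrated by John Smith." yields one name, John Smith, which can then serve as one more optional match point.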

In the end results matter.

Should these two cats become one?  Cat/MARC analogies can only go so far.  Remember, at this stage pairing is a team score.  

How will you determine the lead record?

Lead record scoring is king of the hill.

Remember when I made fun of using 650s a few slides back?

 

 

 

Look at everything that adds to the quality of a record: 500s, 600s, 700s.  Look for false positives, and look at lengths, number of entries and so on.  You can also revisit the criteria you looked at for matching (and should).
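A minimal sketch of lead record scoring, reading "king of the hill" as: the current lead stays on top unless a challenger scores strictly higher. The record layout (a dict with an "id" and a list of (tag, value) pairs) and the scoring rule are assumptions for illustration only.

```python
def quality_score(record):
    """Score a record by its 5xx/6xx/7xx field count, then by total text length.

    record["fields"] is a list of (tag, value) pairs, a hypothetical layout.
    Returns a tuple so the field count is compared first and length breaks ties.
    """
    fields = record["fields"]
    enrichment_count = sum(1 for tag, _ in fields if tag[:1] in ("5", "6", "7"))
    total_length = sum(len(value) for _, value in fields)
    return (enrichment_count, total_length)


def pick_lead(records):
    """King of the hill: a challenger replaces the lead only by scoring higher."""
    if not records:
        raise ValueError("need at least one record to pick a lead")
    lead = records[0]
    for challenger in records[1:]:
        if quality_score(challenger) > quality_score(lead):
            lead = challenger
    return lead
```

Counting fields alone rewards padded records, which is why the false positives need checking before the scores are trusted.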

I'm done, right?

No.

Model your results through reports or a test server.

 

Have representative samples checked.

 

Tweak.

 

Re-run.

 

Repeat as needed until you're ready.
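The check-and-tweak loop can be supported with a small amount of code: draw a repeatable sample of proposed merges for people to review, then turn their verdicts into an error rate you can track from run to run. The function names and the fixed seed are assumptions made for this sketch.

```python
import random


def review_sample(merge_groups, sample_size, seed=0):
    """Pick a repeatable random sample of proposed merges for human review."""
    if sample_size >= len(merge_groups):
        return list(merge_groups)
    rng = random.Random(seed)  # fixed seed so reviewers can be sent the same sample
    return rng.sample(merge_groups, sample_size)


def error_rate(verdicts):
    """verdicts: list of booleans, True when a reviewer judged the merge correct.

    Returns the fraction judged wrong, or None when nothing was reviewed.
    """
    if not verdicts:
        return None
    wrong = sum(1 for ok in verdicts if not ok)
    return wrong / len(verdicts)
```

A simple random sample is the easiest starting point. If some match rules are riskier than others, sampling each rule separately may be more representative.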
