Using Data Mining for Citation Analysis

Phil White

Earth Sciences & Environment Librarian, University of Colorado Boulder

Outline

1. Introduction

2. Background

3. Methods

4. Results & Discussion

Introduction

Citation analysis has many applications

  • Trends in journal usage
  • Faculty research productivity and impact
  • Identify seminal works
  • Library patron studies

Collection assessment and development

Introduction

Research Question:

What library materials are Geosciences faculty using; do we have them?

Introduction

Problems

  • Citation analyses are very time consuming
  • Scientists publish A LOT!!
  • Many citations per paper, 100+ not unusual
  • Methods are typically manual or semi-manual

High Volume + Time Intensive + Manual

= No Thanks!

Introduction

Dual Objectives:

  1. Discover bibliometric trends in Geological Sciences publications
  2. Detect gaps in the Earth Sciences collection

Collection assessment

Methodological advances

  1. Advance a streamlined, programmatic data collection method
  2. Develop an automated or semi-automated process for matching citation data to library holdings

Background

  • Prior studies are few and far between
  • Zipp (1996) conducted local analysis at U of Iowa
  • Journal of Paleontology, GSA Bulletin most cited
  • 20% of collection receives 80% of use (Nisonger 2008)
  • Studies have both confirmed and refuted this rule

The 80/20 Rule

Citation analysis in Geosciences for collection development

Background

  • 2 approaches to overcoming time barriers:                          
    1. Sampling (e.g., 1 out of 5 citations)
    2. Use of proxy data (such as dissertations)
  • Data collection:
    • Get citations into Excel—somehow!
  • Lack of standardized methodology, often vague
  • See Hoffmann & Doucette 2012

Past Methodology

Background

No published studies have used the Web of Science API to collect citation data.

Methods

Faculty publication data:

  • Elements (aka CUBE)
  • Downloaded list of all publications of Geoscience faculty, 2012-2016 (csv file)
  • Contained typical bibliographic data, including Web of Science accession numbers (important)

Web of Science API:

  • Commonly used for faculty info systems (such as Elements)
  • SOAP-based API that runs on XML; no interface, just a URL!
  • Send the API URL an XML message, and it will send an XML message in return
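
As a rough illustration of the first step, the sketch below reads a publication export and collects the Web of Science accession numbers that later drive the API requests. The column name `wos_accession` and the sample rows are assumptions made for illustration; the actual Elements export may label its columns differently.

```python
import csv
import io

# Hypothetical sample of an exported publication list. A real export would be
# opened from disk, and its column names may differ from the ones assumed here.
SAMPLE_CSV = """title,year,wos_accession
Paper A,2012,WOS:000000000000001
Paper B,2013,
Paper C,2014,WOS:000000000000003
"""

def read_accession_numbers(handle, column="wos_accession"):
    """Return the non-empty accession numbers found in a CSV file handle."""
    numbers = []
    for row in csv.DictReader(handle):
        value = (row.get(column) or "").strip()
        if value:  # skip rows that carry no accession number
            numbers.append(value)
    return numbers

accessions = read_accession_numbers(io.StringIO(SAMPLE_CSV))
print(accessions)
```

Rows without an accession number are skipped, since the later lookup relies on that identifier.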

Methods

Data extraction:

  • How do I interact with the API if there is no interface?
  • Python! An open source scripting language
  • Python developers have created tools for communicating with SOAP APIs
  • Automatically generates XML message to your specifications and sends it
  • Relies on WOS accession number
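
To make the "XML message in, XML message out" idea concrete, here is a minimal sketch that builds a SOAP 1.1 envelope and parses a reply using only Python's standard library. The operation name `lookupRecord`, the `uid` and `citedTitle` elements, and the sample response are hypothetical placeholders, not the actual Web of Science message format. No network request is made.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
ET.register_namespace("soap", SOAP_NS)

def build_request(accession_number):
    """Build a SOAP envelope asking for one record (element names are hypothetical)."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    operation = ET.SubElement(body, "lookupRecord")  # hypothetical operation name
    ET.SubElement(operation, "uid").text = accession_number
    return ET.tostring(envelope, encoding="unicode")

def parse_cited_titles(response_xml):
    """Collect the text of every <citedTitle> element (hypothetical response schema)."""
    root = ET.fromstring(response_xml)
    return [element.text for element in root.iter("citedTitle")]

# Made-up response used only to exercise the parser.
SAMPLE_RESPONSE = f"""<soap:Envelope xmlns:soap="{SOAP_NS}">
  <soap:Body>
    <lookupRecordResponse>
      <citedTitle>Geology</citedTitle>
      <citedTitle>Science</citedTitle>
    </lookupRecordResponse>
  </soap:Body>
</soap:Envelope>"""

print(build_request("WOS:000000000000001"))
print(parse_cited_titles(SAMPLE_RESPONSE))
```

SOAP client libraries such as zeep or suds generate and send envelopes like this from the service description; the slides do not say which tool was used.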

Methods

Challenge: Learn to code in Python

3 months later:

Methods

Data Cleaning:

  • Title standardization

  • J. of Geophys. Res. A. OR Journ. Geophys Research: Solid Earth

  • OpenRefine!

    • Semi-automated clustering technology

    • Iterative process

Methods

OpenRefine Clustering Technology
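
The idea behind key-collision clustering can be sketched in a few lines: reduce each title to a normalized key and group titles that share a key. This is a simplified reimplementation in the spirit of OpenRefine's "fingerprint" method, not OpenRefine's own code, and the sample titles are made up.

```python
import string
from collections import defaultdict

def fingerprint(title):
    """Normalize a title: lowercase, strip punctuation, sort the unique tokens."""
    cleaned = title.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

def cluster_titles(titles):
    """Group title variants that collapse to the same fingerprint."""
    groups = defaultdict(list)
    for title in titles:
        groups[fingerprint(title)].append(title)
    # Only groups with more than one distinct spelling need review.
    return [variants for variants in groups.values() if len(set(variants)) > 1]

titles = [
    "Geophysical Research Letters",
    "GEOPHYSICAL RESEARCH LETTERS",
    "Geophysical research letters.",
    "Letters, Geophysical Research",
    "Quaternary Science Reviews",
]
clusters = cluster_titles(titles)
print(clusters)
```

A key of this kind does not catch abbreviations such as "J. Geophys. Res.", which is one reason the process stays semi-automated and iterative.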

Methods

Matching Citation Data to Library Holdings Data

  • Obtained mix of Serials Solutions & Sierra data (Thanks Gabby & Laura!)
  • OpenRefine Reconciliation tools to match data sets based on titles
  • "Supervised" process
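
A title-based, supervised match can be approximated as follows: propose the closest holdings title for each cited title and leave uncertain cases for a person to review. Python's `difflib` stands in here for OpenRefine's reconciliation tools; the cutoff value and the sample titles are assumptions for illustration.

```python
import difflib

def normalize(title):
    """Lowercase a title and drop periods and colons so variants compare cleanly."""
    return " ".join(title.lower().replace(".", " ").replace(":", " ").split())

def propose_matches(cited_titles, holdings_titles, cutoff=0.85):
    """Map each cited title to its best holdings candidate, or None if none is close."""
    lookup = {normalize(title): title for title in holdings_titles}
    proposals = {}
    for cited in cited_titles:
        hits = difflib.get_close_matches(
            normalize(cited), list(lookup), n=1, cutoff=cutoff
        )
        proposals[cited] = lookup[hits[0]] if hits else None
    return proposals

# Made-up holdings and cited titles.
holdings = ["Geophysical Research Letters", "Quaternary Science Reviews", "Geology"]
cited = ["Geophysical research letters.", "Quaternary Sci. Reviews", "Jokull"]
proposals = propose_matches(cited, holdings)
for cited_title, candidate in proposals.items():
    print(cited_title, "->", candidate)
```

Titles that come back as `None` are candidates for the "not in library" list, but each proposal, matched or not, would still need a human check.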

Calculated Bibliometrics

  • Basic metrics of faculty publications
  • Rankings of most cited publications
  • Citation age at time of citing
  • Dispersion of citations to publications

Results & Discussion


Year    Pubs   Total citations in all pubs   Avg citations per pub
2012      78                          4808                    61.6
2013      65                          3621                    55.7
2014      69                          3572                    51.8
2015     121                          6324                    52.3
2016      98                          6123                    62.5
Total    431                         24448                    56.7

Summary of Publications and Citations Used in Study
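
The averages in the summary table follow directly from the yearly counts. The short check below recomputes them from the publication and citation totals copied from the table; nothing else is assumed.

```python
# (publications, total citations) per year, copied from the summary table.
yearly = {
    2012: (78, 4808),
    2013: (65, 3621),
    2014: (69, 3572),
    2015: (121, 6324),
    2016: (98, 6123),
}

# Average citations per publication for each year, rounded to one decimal.
averages = {year: round(cites / pubs, 1) for year, (pubs, cites) in yearly.items()}

total_pubs = sum(pubs for pubs, _ in yearly.values())
total_cites = sum(cites for _, cites in yearly.values())
overall_average = round(total_cites / total_pubs, 1)

print(averages)
print(total_pubs, total_cites, overall_average)
```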


Rank   Journal                                           Times published in
1      Geophysical Research Letters                      64
2      Earth and Planetary Science Letters               34
3      Geology                                           19
4      Science                                           17
5      Journal of Geophysical Research. Earth Surface    16
5      Geosphere                                         16
7      Quaternary Science Reviews                        14
7      American Mineralogist                             14
9      Pure and Applied Geophysics                       13
10     Journal of Geophysical Research. Space Physics    11

Journals Most Often Published In

Rank   Journal                                           Times cited
1      Geophysical Research Letters                      1074
2      Science                                           933
3      Nature                                            734
4      Earth and Planetary Science Letters               638
5      Geology                                           559
6      Journal of Geophysical Research: Solid Earth      546
7      Quaternary Science Reviews                        461
8      Journal of Geophysical Research: Space Physics    456
9      Geochimica et Cosmochimica Acta                   412
10     Space Science Reviews                             374

Most Frequently Cited Journals

Citation Age

[Figure. Axes: Years, Citations. Median: 9; Mode: 3]
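
Citation age is the difference between the year of the citing paper and the year of the cited work. The sketch below computes the median and mode for a made-up set of cited years; the sample data are hypothetical and do not reproduce the study's figures.

```python
from statistics import median, mode

def citation_ages(citing_year, cited_years):
    """Age of each cited work at the time of citing, in years."""
    # Cited years later than the citing year (e.g. in-press items) are dropped.
    return [citing_year - year for year in cited_years if year <= citing_year]

# Hypothetical reference list of a paper published in 2016.
ages = citation_ages(2016, [2015, 2013, 2013, 2010, 2007, 2001, 1990])

print(sorted(ages))
print("median:", median(ages), "mode:", mode(ages))
```

In a right-skewed distribution like this one, a few very old references pull the median above the mode.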

Distribution of Citations to Journal Titles (80/20 Rule)

[Figure. Axes: % Titles cited (n = 3,961), % Total citations (n = 23,944). Annotation: 10% of Titles]
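
A concentration check of the 80/20 kind can be sketched like this: rank titles by citation count and ask what share of all citations the top fraction of titles receives. The counts below are invented for illustration.

```python
def share_of_citations(counts, top_fraction):
    """Fraction of all citations received by the most-cited share of titles."""
    ranked = sorted(counts, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))  # always include one title
    return sum(ranked[:n_top]) / sum(ranked)

# Hypothetical citation counts for ten journal titles (1,000 citations in total).
counts = [400, 200, 100, 50, 50, 40, 40, 40, 40, 40]

top_20 = share_of_citations(counts, 0.20)
top_10 = share_of_citations(counts, 0.10)
print(f"top 20% of titles: {top_20:.0%} of citations")
print(f"top 10% of titles: {top_10:.0%} of citations")
```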

Proportions of Cited Titles in Library Collection

Titles cited   n     In Library   Not in Library
20+ times      151   97%          3%
10+ times      241   95%          5%
5+ times       429   92%          8%
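
Proportions like these can be derived once each cited title has a citation count and a held or not-held flag. The following sketch shows one way to compute them; the titles, counts, and holdings are hypothetical.

```python
def coverage_by_threshold(citation_counts, held_titles, thresholds=(20, 10, 5)):
    """For each threshold, return (titles cited at least that often, % held)."""
    results = {}
    for threshold in thresholds:
        group = [t for t, n in citation_counts.items() if n >= threshold]
        held = sum(1 for title in group if title in held_titles)
        percent = round(100 * held / len(group)) if group else None
        results[threshold] = (len(group), percent)
    return results

# Hypothetical citation counts per title and the set of titles the library holds.
citation_counts = {"A": 30, "B": 25, "C": 12, "D": 10, "E": 6, "F": 5, "G": 2}
held_titles = {"A", "B", "C", "E"}

coverage = coverage_by_threshold(citation_counts, held_titles)
for threshold, (n, percent) in coverage.items():
    print(f"cited {threshold}+ times: n = {n}, {percent}% in library")
```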

Journal                                                                           Times Cited
Quaternary Research                                                               110
Soil Science Society of America Journal                                           28
Anales Del Instituto de La Patagonia, Serie Ciencias Humanas                      27
Jokull                                                                            20
Contra Viento y Marea. Arqueologia de Patagonia                                   19
Anales del Instituto de La Patagonia Serie Ciencias Sociales                      15
Arctic                                                                            13
Soplando en el Viento. Actas de las III Jornadas de Arqueologia de la Patagonia   11
Photogrammetric Engineering and Remote Sensing                                    10
Arqueologia de Patagonia: Una Mirada Desde el Ultimo Confin                       10

Most Frequently Cited Journals Not in Library

Results & Discussion

Summary

  • Recently published materials are very important

  • 80% of citations went to just 10% of titles
  • We don't have Quaternary Research?!
  • Faculty most often cited works 3 years of age (often less)
  • About 50% of all citations went to top 1% of titles
  • Results underscore the importance of a relatively small number of journals
  • Good coverage, but found several serials that should be added to collection

Results & Discussion

Methodological Implications

A methodological revolution for citation studies?

What is next?

  • This study opened a door...
  • Future studies could use this method to analyze a huge volume of citations
  • All of the sciences at CU?
  • Speed: these methods could be applied to conduct "just in time" citation studies
  • Cross-institutional comparison?

Conclusion

1. Study produced actionable results

2. Study developed new method for collecting data

Conclusion

Thank you!

Notes:

Hoffmann, K., & Doucette, L. (2012). A Review of Citation Analysis Methodologies for Collection Management. College & Research Libraries, 73(4), 321–335. https://doi.org/10.5860/crl-254
Kellsey, C., & Knievel, J. (2012). Overlap between Humanities Faculty Citation and Library Monograph Collections, 2004-2009. College & Research Libraries, 73(6), 569–583.
Nisonger, T. E. (2008). The 80/20 Rule and Core Journals. The Serials Librarian, 55(1–2), 62–84. https://doi.org/10.1080/03615260801970774
Zipp, L. S. (1996). Thesis and Dissertation Citations as Indicators of Faculty Research Use of University Library Journal Collections. Library Resources & Technical Services, 40(4), 335–342. https://doi.org/10.5860/lrts.40n4.335