Using Data Mining for Citation Analysis
Phil White
Earth Sciences & Environment Librarian, University of Colorado Boulder
Outline
1. Introduction
2. Background
3. Methods
4. Results & Discussion
Introduction
Citation analysis has many applications
- Trends in journal usage
- Faculty research productivity and impact
- Identify seminal works
- Library patron studies
- Collection assessment and development
Introduction
Research Question:
What library materials are Geosciences faculty using, and does the library have them?
Introduction
Problems
- Citation analyses are very time consuming
- Scientists publish A LOT!!
- Many citations per paper, 100+ not unusual
- Methods are typically manual or semi-manual
High Volume + Time Intensive + Manual = No Thanks!
Introduction
Dual Objectives:
Collection assessment
- Discover bibliometric trends in Geological Sciences publications
- Detect gaps in the Earth Sciences collection
Methodological advances
- Advance a streamlined, programmatic data collection method
- Develop an automated or semi-automated process for matching citation data to the library holdings
Background
Citation analysis in Geosciences for collection development
- Such studies are few and far between
- Zipp (1996) conducted local analysis at U of Iowa
- Journal of Paleontology, GSA Bulletin most cited
The 80/20 Rule
- 20% of collection receives 80% of use (Nisonger 2008)
- Studies have both confirmed and refuted this rule
Background
Past Methodology
- 2 approaches to overcoming time barriers:
- Sampling (e.g., 1 out of 5 citations)
- Use of proxy data (such as dissertations)
- Data collection:
- Get citations into Excel, somehow!
- Methodology is not standardized and is often vaguely described
- See Hoffman & Doucette 2012
Background
No published studies have used the Web of Science API to collect citation data.
Methods
Geoscience Department publication data:
- Elements (aka CUBE)
- Downloaded list of all publications of Geoscience faculty, 2012-2016 (csv file)
- Contained typical bibliographic data, including Web of Science accession numbers (important)
Web of Science API
- Commonly used for faculty info systems (such as Elements)
- SOAP-based API: runs on XML, no interface, just a URL!
- Send the API URL an XML message, and it will send an XML message in return
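To make the "send XML, get XML back" idea concrete, here is a minimal sketch that builds a SOAP envelope by hand with Python's standard library. The operation name `citedReferences`, the `uid` element, the service namespace, and the sample accession number are all assumptions for illustration and are not taken from the slides. The message is only constructed here, not sent.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace, used only for illustration.
SERVICE_NS = "http://example.org/citation-service"


def build_request(accession_number: str) -> bytes:
    """Return a SOAP envelope asking for the references cited by one record."""
    ET.register_namespace("soapenv", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # Hypothetical operation and parameter names.
    operation = ET.SubElement(body, f"{{{SERVICE_NS}}}citedReferences")
    uid = ET.SubElement(operation, "uid")
    uid.text = accession_number
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


message = build_request("WOS:000000000000001")  # made-up accession number
print(message.decode("utf-8"))
```

In practice a SOAP client library would generate this envelope from the service description, so the XML rarely needs to be written by hand.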
Methods
Data extraction:
- How do I interact with the API if there is no interface?
- Python! An open source scripting language
- Python developers have created tools for communicating with SOAP APIs
- Automatically generates XML message to your specifications and sends it
- Relies on WOS accession number
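The extraction loop can be sketched as follows: read the accession numbers from the exported publication list, then request the cited references for each one. This is an illustrative outline only. The column name `wos_accession`, the sample rows, and the `fetch` callable (standing in for the real API call) are assumptions.

```python
import csv
import io


def accession_numbers(csv_text: str, column: str = "wos_accession") -> list[str]:
    """Return the distinct, non-empty accession numbers in file order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    found = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if value and value not in found:
            found.append(value)
    return found


def collect_citations(csv_text: str, fetch) -> dict:
    """Call fetch(accession_number) once per publication and keep the results."""
    return {acc: fetch(acc) for acc in accession_numbers(csv_text)}


# Made-up sample standing in for the exported publication list.
sample = """title,year,wos_accession
Paper A,2012,WOS:0001
Paper B,2013,
Paper C,2014,WOS:0003
"""


def fake_fetch(accession_number):
    # Placeholder for the real API request.
    return [f"cited work for {accession_number}"]


results = collect_citations(sample, fake_fetch)
print(results)
```

Publications without an accession number (Paper B here) are skipped, which is why the accession number matters for this approach.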
Methods
Challenge: Learn to code in Python
3 months later:
Methods
Data Cleaning:
- Title standardization
- J. of Geophys. Res. B. OR Journ. Geophys Research: Solid Earth
- OpenRefine!
- Semi-automated clustering technology
- Iterative process
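The clustering idea can be illustrated with a simple key-collision approach: reduce each title to a normalized "fingerprint" and group titles that share one. This is a sketch in the spirit of fingerprint clustering, not OpenRefine's own code, and the sample titles are chosen for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(title: str) -> str:
    """Lowercase, strip accents and punctuation, then sort the unique words."""
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = re.sub(r"[^\w\s]", " ", ascii_title.lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster_titles(titles):
    """Group titles by fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for title in titles:
        groups[fingerprint(title)].append(title)
    return [group for group in groups.values() if len(set(group)) > 1]


titles = [
    "Journal Of Geophysical Research. Space Physics",
    "Journal of Geophysical Research: Space Physics",
    "Geology",
]
for group in cluster_titles(titles):
    print(group)
```

Abbreviated variants such as "J. of Geophys. Res. B." will not collide under this key, so they still need a fuzzier method or manual review, which is one reason the process is iterative.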
Methods
Matching Citation Data to Library Holdings Data
- Obtained library's journal holdings data
- OpenRefine Reconciliation tools to match data sets based on titles
- "Supervised" process
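As a simplified stand-in for the reconciliation step, the sketch below matches cited titles to holdings on a normalized title key and sets aside anything unmatched for human review. It uses exact matching on the normalized key rather than OpenRefine's reconciliation tools, and the holdings list is hypothetical.

```python
import re


def normalize(title: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", title.lower()).split())


def match_to_holdings(cited_counts: dict, holdings: list):
    """Split cited titles into those found in holdings and those not found."""
    held = {normalize(title) for title in holdings}
    matched, unmatched = {}, {}
    for title, count in cited_counts.items():
        target = matched if normalize(title) in held else unmatched
        target[title] = count
    return matched, unmatched


# Citation counts taken from the results tables; holdings list is hypothetical.
cited = {
    "Geophysical Research Letters": 1074,
    "Geology": 559,
    "Quaternary Research": 110,
}
holdings = ["GEOPHYSICAL RESEARCH LETTERS", "Geology"]

matched, unmatched = match_to_holdings(cited, holdings)
print("Needs review:", unmatched)
```

The unmatched set is where the "supervised" part comes in: a person confirms whether each title is truly missing or just spelled differently in the holdings data.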
Calculated Bibliometrics
- Basic metrics of faculty publications
- Rankings of most cited publications
- Citation age at time of citing
- Dispersion of citations to publications
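Citation age here means the citing paper's publication year minus the cited work's year. A minimal sketch of that calculation with the standard library follows. The year pairs are made up for illustration and are not study data.

```python
from statistics import median, multimode


def citation_ages(year_pairs):
    """Return citing_year - cited_year for each pair, skipping negative ages."""
    return [citing - cited for citing, cited in year_pairs if citing >= cited]


# (citing year, cited year) pairs, invented for this example.
pairs = [
    (2015, 2012),
    (2015, 2012),
    (2016, 2013),
    (2014, 2005),
    (2016, 1996),
    (2013, 2012),
]

ages = citation_ages(pairs)
print("Ages:", ages)
print("Median age:", median(ages))
print("Most common age(s):", multimode(ages))
```

Negative ages (for example, a citation to a work dated after the citing paper) are dropped in this sketch. Whether to keep, fix, or drop them is a data-cleaning decision.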
Results & Discussion
| Year | Pubs | Total citations in all pubs | Avg citations per pub |
|---|---|---|---|
| 2012 | 78 | 4808 | 61.6 |
| 2013 | 65 | 3621 | 55.7 |
| 2014 | 69 | 3572 | 51.8 |
| 2015 | 121 | 6324 | 52.3 |
| 2016 | 98 | 6123 | 62.5 |
| Total | 431 | 24448 | 56.7 |
Summary of Publications and Citations Used in Study
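The averages in the summary table follow directly from the counts. A short sketch that recomputes them from the table's own figures:

```python
# Year -> (publications, total citations), copied from the summary table.
rows = {
    2012: (78, 4808),
    2013: (65, 3621),
    2014: (69, 3572),
    2015: (121, 6324),
    2016: (98, 6123),
}

averages = {year: round(cites / pubs, 1) for year, (pubs, cites) in rows.items()}

total_pubs = sum(pubs for pubs, _ in rows.values())
total_cites = sum(cites for _, cites in rows.values())
overall_average = round(total_cites / total_pubs, 1)

for year, avg in averages.items():
    print(year, avg)
print("Total", total_pubs, total_cites, overall_average)
```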
| Rank | Journal | Times published in |
|---|---|---|
| 1 | Geophysical Research Letters | 64 |
| 2 | Earth and Planetary Science Letters | 34 |
| 3 | Geology | 19 |
| 4 | Science | 17 |
| 5 | Journal of Geophysical Research. Earth Surface | 16 |
| 5 | Geosphere | 16 |
| 7 | Quaternary Science Reviews | 14 |
| 7 | American Mineralogist | 14 |
| 9 | Pure and Applied Geophysics | 13 |
| 10 | Journal of Geophysical Research. Space Physics | 11 |
Journals Most Often Published In
| Rank | Journal | Times cited |
|---|---|---|
| 1 | Geophysical Research Letters | 1074 |
| 2 | Science | 933 |
| 3 | Nature | 734 |
| 4 | Earth and Planetary Science Letters | 638 |
| 5 | Geology | 559 |
| 6 | Journal of Geophysical Research: Solid Earth | 546 |
| 7 | Quaternary Science Reviews | 461 |
| 8 | Journal of Geophysical Research: Space Physics | 456 |
| 9 | Geochimica et Cosmochimica Acta | 412 |
| 10 | Space Science Reviews | 374 |
Most Frequently Cited Journals

Figure: Citation Age. Axes: Years and Citations. Median: 9; Mode: 3.
Figure: Distribution of Citations to Journal Titles (80/20 Rule). Axes: % Titles cited (n = 3,961) and % Total citations (n = 23,944). Annotation: 10% of Titles.
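The dispersion measure behind this chart is the share of all citations captured by the most-cited fraction of titles. A minimal sketch follows. The per-title counts are invented for illustration and are not the study data.

```python
def share_of_citations(counts, top_fraction):
    """Fraction of all citations that go to the top `top_fraction` of titles."""
    ordered = sorted(counts, reverse=True)
    # Always include at least one title, even for very small fractions.
    k = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)


# Citations per title for ten hypothetical titles (100 citations in total).
counts = [50, 20, 10, 5, 5, 3, 3, 2, 1, 1]

for fraction in (0.1, 0.2):
    share = share_of_citations(counts, fraction)
    print(f"Top {fraction:.0%} of titles receive {share:.0%} of citations")
```

Running the same function over the real per-title counts at several fractions gives the points of the cumulative curve.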
Proportions of Cited Titles in Library Collection
| Titles | n | In Library | Not in Library |
|---|---|---|---|
| Cited 20+ times | 151 | 97% | 3% |
| Cited 10+ times | 241 | 95% | 5% |
| Cited 5+ times | 429 | 92% | 8% |
| Journal | Times Cited |
|---|---|
| Quaternary Research | 110 |
| Soil Science Society of America Journal | 28 |
| Anales Del Instituto de La Patagonia, Serie Ciencias Humanas | 27 |
| Jokull | 20 |
| Contra Viento y Marea. Arqueologia de Patagonia | 19 |
| Anales del Instituto de La Patagonia Serie Ciencias Sociales | 15 |
| Arctic | 13 |
| Soplando en el Viento. Actas de las III Jornadas de Arqueologia de la Patagonia | 11 |
| Photogrammetric Engineering and Remote Sensing | 10 |
| Arqueologia de Patagonia: Una Mirada Desde el Ultimo Confin | 10 |
Most Frequently Cited Journals Not in Library
Results & Discussion
Summary
- Recently published materials are very important
- Geosciences faculty most often cited works 3 years old (often less)
- 80% of citations went to just 10% of titles
- About 50% of all citations went to top 1% of titles
- Results underscore the importance of a relatively small number of journals
- Good coverage, but found several serials that should be added to the collection
- We don't have Quaternary Research?!
Results & Discussion
A methodological revolution for citation studies...?
What is next?
- All of the sciences at CU?
- Cross-institutional comparison?
If we don't have it, where did you get it?
Conclusion
1. Study produced actionable results
2. Study developed new method for collecting data
Conclusion
Thank you!