Intro to Text Analysis

From Ngrams to NLP

Outline

modeling language

common NLP research tasks

scattertext

Bulk

Lexical Features

Word Frequency
Density
Average Words Per Sentence
Contexts
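
The first three features above can be computed with nothing more than the Python standard library. The sketch below is a minimal illustration, not a reference implementation: the function name `lexical_features` is hypothetical, "density" is interpreted here as a type-token ratio (unique words divided by total words), which is only one of several common definitions, and the sentence splitter is deliberately naive.

```python
import re
from collections import Counter


def lexical_features(text):
    """Compute a few simple lexical features for a piece of text."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Lowercased word tokens made of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    frequency = Counter(words)
    return {
        "word_frequency": frequency,
        # Assumption: "density" measured as unique words / total words.
        "type_token_ratio": len(frequency) / len(words) if words else 0.0,
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }


sample = "The duck swam. The duck quacked at the other duck."
features = lexical_features(sample)
print(features["word_frequency"].most_common(2))
print(features["type_token_ratio"])
print(features["avg_words_per_sentence"])
```

Counts like these describe the surface of a text. They say nothing about which sense of "duck" is meant, which is where context comes in.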

Duck

noun: a bird that lives by water and has webbed feet

verb: to move your head or the top part of your body quickly down, especially to avoid being hit

Negation

This film isn't about ducks, it's called "Mighty Ducks," and it's about hockey.
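
A plain keyword count would find "ducks" twice in that sentence and conclude the film is about ducks. One common workaround, sketched below under simplifying assumptions, is to prefix every token that follows a negation word with `NOT_` until the next punctuation mark. The `NEGATIONS` set and the `mark_negation` helper are illustrative and hand-rolled here; they are not taken from any particular library.

```python
import re

# Assumption: a tiny, hand-picked set of negation cues for illustration.
NEGATIONS = {"not", "no", "never"}
PUNCTUATION = set(".,;:!?")


def mark_negation(text):
    """Prefix tokens inside a negation scope with NOT_.

    The scope opens at a negation cue and closes at the next punctuation mark.
    """
    tokens = re.findall(r"[\w']+|[.,;:!?]", text.lower())
    marked, negating = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negating = False          # punctuation closes the scope
            marked.append(tok)
        elif tok in NEGATIONS or tok.endswith("n't"):
            negating = True           # cue opens the scope
            marked.append(tok)
        else:
            marked.append("NOT_" + tok if negating else tok)
    return marked


sentence = "This film isn't about ducks, it's called \"Mighty Ducks,\" and it's about hockey."
marked = mark_negation(sentence)
print(marked)
print(marked.count("ducks"), marked.count("NOT_ducks"))
```

The first "ducks" is now marked as negated. The second one survives because it is part of a title, and recognizing it as a name rather than a bird is a job for entity recognition rather than negation handling.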

We shared that blunt with 20 people, and by the time it got to me; All I got was the duck.

– by theothers August 22, 2012

Context

Natural Language Processing

NLP uses machine learning to model human language and to predict the linguistic and semantic attributes of text.

Common NLP Research Tasks

from raw text to structured data

Entity Extraction

Entity Linking

# requires: pip install spacy-dbpedia-spotlight
import spacy
# load any installed spaCy model
nlp = spacy.load('en_core_web_lg')
# add the DBpedia Spotlight entity-linking stage to the pipeline
nlp.add_pipe('dbpedia_spotlight')

doc = nlp('Yale English is hiring in race, diaspora, and/or indigeneity, with particular interest in scholars of Latinx literature, Asian American literature, Native American and/or Global Indigenous literature, or Caribbean literature')
for ent in doc.ents:
    print(ent.text, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore'])

# OUTPUT:
# Yale http://dbpedia.org/resource/Yale_University 0.9988926828857767
# English http://dbpedia.org/resource/English_language 0.8806620156671483
# diaspora http://dbpedia.org/resource/Diaspora 0.940470180380478
# Latinx http://dbpedia.org/resource/Latinx 0.9994470717639963
# Asian American literature http://dbpedia.org/resource/Asian_American_literature 1.0
# Native American http://dbpedia.org/resource/Race_and_ethnicity_in_the_United_States_Census 0.9480969026168182
# Caribbean literature http://dbpedia.org/resource/Caribbean_literature 1.0

https://github.com/MartinoMensio/spacy-dbpedia-spotlight
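
Linked entities usually need some filtering before they are used as structured data. The sketch below reuses the text, URI, and similarity-score values printed in the output above and keeps only links at or above a cutoff. The `confident_links` helper and the 0.9 threshold are assumptions made for illustration; a sensible cutoff depends on the corpus.

```python
# (text, DBpedia URI, similarity score) triples copied from the output above.
linked = [
    ("Yale", "http://dbpedia.org/resource/Yale_University", 0.9988926828857767),
    ("English", "http://dbpedia.org/resource/English_language", 0.8806620156671483),
    ("diaspora", "http://dbpedia.org/resource/Diaspora", 0.940470180380478),
    ("Latinx", "http://dbpedia.org/resource/Latinx", 0.9994470717639963),
    ("Asian American literature", "http://dbpedia.org/resource/Asian_American_literature", 1.0),
    ("Native American", "http://dbpedia.org/resource/Race_and_ethnicity_in_the_United_States_Census", 0.9480969026168182),
    ("Caribbean literature", "http://dbpedia.org/resource/Caribbean_literature", 1.0),
]


def confident_links(entities, threshold=0.9):
    """Keep (text, uri) pairs whose similarity score meets the threshold."""
    return [(text, uri) for text, uri, score in entities if score >= threshold]


kept = confident_links(linked)
dropped = [text for text, _, score in linked if score < 0.9]
print(len(kept), "kept; dropped:", dropped)
```

With this cutoff only "English" is dropped. "Native American" stays in, even though it resolves to a census article rather than to a page about Native American people or literature, so a score threshold alone is not a substitute for reading the links.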

Dependency Parsing

Holy NLP! Understanding Part of Speech Tags, Dependency Parsing, and Named Entity Recognition

Categorization

Multilingual Models

Language

Domain

🤗 huggingface.co/models

BookNLP

BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including:

Part-of-speech tagging
Dependency parsing
Entity recognition
Character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> TOM_SAWYER) and coreference resolution
Quotation speaker identification
Supersense tagging (e.g., "animal", "artifact", "body", "cognition", etc.)
Event tagging
Referential gender inference (TOM_SAWYER -> he/him/his)

https://github.com/booknlp/booknlp
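
To make the idea of character name clustering concrete, here is a toy sketch that groups the four mentions from the list above under one label. It is not how BookNLP works internally. The `HONORIFICS` set, the `NICKNAMES` table, and both helper functions are hypothetical, and the subset-matching rule would break down as soon as two characters share a surname.

```python
# Assumptions for illustration only: tiny hand-written lookup tables.
HONORIFICS = {"mr", "mrs", "ms", "miss", "dr"}
NICKNAMES = {"thomas": "tom"}  # map formal first names to a short form


def normalize(mention):
    """Lowercase, drop honorifics, and collapse known name variants."""
    tokens = [t.strip(".").lower() for t in mention.split()]
    return tuple(NICKNAMES.get(t, t) for t in tokens if t not in HONORIFICS)


def cluster_mentions(mentions):
    """Assign each mention to the longest name that contains all its tokens."""
    normalized = {m: normalize(m) for m in mentions}
    # Longest normalized names first, so partial names attach to full ones.
    candidates = sorted(set(normalized.values()), key=len, reverse=True)
    clusters = {}
    for mention, tokens in normalized.items():
        target = next(c for c in candidates if set(tokens) <= set(c))
        label = "_".join(target).upper()
        clusters.setdefault(label, []).append(mention)
    return clusters


mentions = ["Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer"]
clusters = cluster_mentions(mentions)
print(clusters)
```

A real pipeline also has to decide which "Mr. Sawyer" is meant when several candidates fit, and it does so with coreference resolution over the surrounding text rather than string matching.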

Intro to NLP

By Andrew Janco
