Intro to Text Analysis

From Ngrams to NLP

Outline

Lexical Features

  • Word Frequency
  • Density
  • Average Words Per Sentence
  • Contexts
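
The lexical features listed above can be computed with the Python standard library alone. The following is a minimal sketch, not a definitive implementation. It makes several assumptions: "density" is read as lexical density (the share of tokens that are not stopwords), "contexts" as a keyword-in-context listing, and the tiny `STOPWORDS` set, the function names, and the second sample sentence are illustrative.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real projects use a fuller one).
STOPWORDS = {"a", "about", "and", "by", "has", "is", "isn't", "it's", "that", "the", "this", "to"}


def tokenize(text):
    """Lowercase the text and return word tokens, keeping contractions whole."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_frequency(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def lexical_density(tokens, stopwords=STOPWORDS):
    """Share of tokens that are not stopwords (one reading of 'density')."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in stopwords) / len(tokens)


def average_words_per_sentence(text):
    """Split on sentence-final punctuation and average the token counts."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def keyword_in_context(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for every exact match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


text = (
    "This film isn't about ducks, it's called \"Mighty Ducks,\" and it's about hockey. "
    "A duck is a bird that lives by water and has webbed feet."
)
tokens = tokenize(text)

print(word_frequency(tokens).most_common(5))
print(round(lexical_density(tokens), 2))
print(average_words_per_sentence(text))
for left, kw, right in keyword_in_context(tokens, "ducks"):
    print(f"{left} [{kw}] {right}")
```

Note that exact matching treats "duck" and "ducks" as different words, and none of these counts can tell the bird from the hockey team.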

Duck

  • noun
  • a bird that lives by water and has webbed feet

Duck

  • verb
  • to move your head or the top part of your body quickly down, especially to avoid being hit

Negation

This film isn't about ducks, it's called "Mighty Ducks," and it's about hockey.
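
One simple heuristic for making negation visible to a word counter is to prefix every token that follows a negator with `NOT_` until the next punctuation mark. The sketch below is only an illustration of that heuristic, assuming a small hand-picked `NEGATORS` set; it is not a full treatment of negation scope.

```python
import re
from collections import Counter

# Small illustrative set of negators (an assumption); words ending in "n't" are also caught.
NEGATORS = {"not", "no", "never"}
PUNCTUATION = {".", ",", ";", ":", "!", "?"}


def mark_negation(text):
    """Prefix tokens that follow a negator with NOT_ until the next punctuation mark."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?|[.,;:!?]", text.lower())
    marked = []
    negated = False
    for tok in tokens:
        if tok in PUNCTUATION:
            negated = False  # punctuation closes the negation scope
            continue
        marked.append("NOT_" + tok if negated else tok)
        if tok in NEGATORS or tok.endswith("n't"):
            negated = True
    return marked


sentence = "This film isn't about ducks, it's called \"Mighty Ducks,\" and it's about hockey."
marked = mark_negation(sentence)
print(marked)
print(Counter(marked)["ducks"], Counter(marked)["NOT_ducks"])
```

The negated "ducks" is now counted separately, but the remaining "ducks" is still the hockey team, which the heuristic cannot detect.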

We shared that blunt with 20 people, and by the time it got to me; All I got was the duck.

by theothers August 22, 2012

Context

Natural Language Processing

NLP uses machine learning to model human language and to predict the linguistic and semantic attributes of text.

Common NLP Research Tasks

from raw text to structured data

Entity Extraction

Entity Linking

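import spacy

# Requires the spacy-dbpedia-spotlight package (pip install spacy-dbpedia-spotlight),
# which registers the 'dbpedia_spotlight' pipeline component with spaCy.
# Any existing spaCy model works here.
nlp = spacy.load('en_core_web_lg')
# Add the entity-linking stage to the pipeline.
nlp.add_pipe('dbpedia_spotlight')

doc = nlp('Yale English is hiring in race, diaspora, and/or indigeneity, with particular interest in scholars of Latinx literature, Asian American literature, Native American and/or Global Indigenous literature, or Caribbean literature')
# Print each entity's text, its DBpedia URI, and the linker's similarity score.
for ent in doc.ents:
    print(ent.text, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore'])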
# OUTPUT:
# Yale http://dbpedia.org/resource/Yale_University 0.9988926828857767
# English http://dbpedia.org/resource/English_language 0.8806620156671483
# diaspora http://dbpedia.org/resource/Diaspora 0.940470180380478
# Latinx http://dbpedia.org/resource/Latinx 0.9994470717639963
# Asian American literature http://dbpedia.org/resource/Asian_American_literature 1.0
# Native American http://dbpedia.org/resource/Race_and_ethnicity_in_the_United_States_Census 0.9480969026168182
# Caribbean literature http://dbpedia.org/resource/Caribbean_literature 1.0
https://github.com/MartinoMensio/spacy-dbpedia-spotlight
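
To work with the linked entities as structured data, the printed output above can be parsed into records and filtered by similarity score. This is a minimal sketch using only the standard library. The function names and the 0.95 threshold are illustrative assumptions, and the data is copied from the output shown above.

```python
# Output lines copied from the run above: entity text, DBpedia URI, similarity score.
RAW_OUTPUT = """\
Yale http://dbpedia.org/resource/Yale_University 0.9988926828857767
English http://dbpedia.org/resource/English_language 0.8806620156671483
diaspora http://dbpedia.org/resource/Diaspora 0.940470180380478
Latinx http://dbpedia.org/resource/Latinx 0.9994470717639963
Asian American literature http://dbpedia.org/resource/Asian_American_literature 1.0
Native American http://dbpedia.org/resource/Race_and_ethnicity_in_the_United_States_Census 0.9480969026168182
Caribbean literature http://dbpedia.org/resource/Caribbean_literature 1.0
"""


def parse_links(raw):
    """Turn each 'text uri score' line into a dict; the text may contain spaces."""
    records = []
    for line in raw.strip().splitlines():
        text, uri, score = line.rsplit(" ", 2)
        records.append({"text": text, "uri": uri, "score": float(score)})
    return records


def filter_by_score(records, min_score):
    """Keep only the links whose similarity score reaches the threshold."""
    return [r for r in records if r["score"] >= min_score]


links = parse_links(RAW_OUTPUT)
for record in filter_by_score(links, 0.95):
    print(record["text"], "->", record["uri"])
```

A score threshold is only a rough filter. In the output above, "Native American" resolves to a page about the United States Census with a score near 0.948, which may not be the link a reader expects, so spot checks remain useful.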

Dependency Parsing

Categorization

Multilingual Models

Language

Domain

🤗  huggingface.co/models

BookNLP

BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including:

  • Part-of-speech tagging
  • Dependency parsing
  • Entity recognition
  • Character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> TOM_SAWYER) and coreference resolution
  • Quotation speaker identification
  • Supersense tagging (e.g., "animal", "artifact", "body", "cognition", etc.)
  • Event tagging
  • Referential gender inference (TOM_SAWYER -> he/him/his)

https://github.com/booknlp/booknlp
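
Running the pipeline might look like the sketch below, which follows the usage pattern in the project's README as best it can be reconstructed here; check the README before relying on it. The helper names `booknlp_params` and `run_booknlp` and the file paths are hypothetical, and the `booknlp` import is deferred so the parameter helper can be used without the package installed.

```python
def booknlp_params(pipeline=("entity", "quote", "supersense", "event", "coref"), model="small"):
    """Build the parameter dict: a comma-separated stage list plus a model size."""
    return {"pipeline": ",".join(pipeline), "model": model}


def run_booknlp(input_file, output_dir, book_id, params=None):
    """Process one English text; output files are named after book_id."""
    # Imported lazily: requires the booknlp package and its downloaded models.
    from booknlp.booknlp import BookNLP

    booknlp = BookNLP("en", params or booknlp_params())
    booknlp.process(input_file, output_dir, book_id)


print(booknlp_params())

# Hypothetical paths, for illustration only:
# run_booknlp("input_dir/tom_sawyer.txt", "output_dir/tom_sawyer/", "tom_sawyer")
```

Restricting the `pipeline` tuple to the stages a project needs is one way to keep long books manageable.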