foundations of data science for everyone IX

 
dr. federica bianco | fbb.space | fedhere | fedhere

Natural Language Processing

 

this slide deck:

 

1

what is

Natural Language Processing

1. Computers only know numbers, not words

2. Language's constituent elements are words

3. Meaning depends on words, how they are combined, and on the context

That is great!

That is not great...
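
The three points above can be made concrete with a minimal sketch, assuming the simplest possible encoding: each distinct word gets an integer id, and a sentence becomes a vector of word counts. The function names `build_vocabulary` and `bag_of_words` and the example sentences are hypothetical, chosen only for illustration.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Assign an integer id to each distinct lowercase word, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]


# Hypothetical example sentences: same words, different meaning
sentences = ["the dog chased the cat", "the cat chased the dog"]
vocab = build_vocabulary(sentences)
vectors = [bag_of_words(s, vocab) for s in sentences]

print(vocab)    # {'the': 0, 'dog': 1, 'chased': 2, 'cat': 3}
print(vectors)  # [[2, 1, 1, 1], [2, 1, 1, 1]]
```

Words map to numbers easily, but the two sentences get identical vectors even though they mean different things: word counts alone discard how the words are combined.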


2

NLP preprocessing

tokenization and parsing: splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

** we will see how it's done
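
A minimal sketch of tokenization, assuming English text and using only regular expressions from the Python standard library. The pattern and the helper names `tokenize` and `split_sentences` are assumptions made for this example; dedicated NLP libraries handle many more cases (abbreviations, URLs, other scripts).

```python
import re

# A word (optionally with an apostrophe part, as in "dog's") or a single punctuation mark
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def split_sentences(text):
    """Split text into sentences at whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


text = "The dog's bowl is empty. Are the dogs hungry?"
print(split_sentences(text))
print(tokenize(text))
```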

lemmatization/stemming: reducing inflectional forms, and sometimes derivationally related forms, of a word to a common base form.

 

am, are, is --> be

dog, dogs, dog's --> dog
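
The two examples above can be reproduced with a toy sketch, assuming a tiny hand-written lookup table for irregular forms and a crude suffix-stripping rule for the rest. `IRREGULAR_LEMMAS`, `naive_stem`, and `naive_lemmatize` are hypothetical names; real stemmers and lemmatizers rely on much larger rule sets and dictionaries.

```python
# Hand-written lookup for irregular forms (only the examples from the slide)
IRREGULAR_LEMMAS = {"am": "be", "are": "be", "is": "be"}


def naive_stem(word):
    """Strip a trailing "'s" or "s" if at least two characters remain."""
    word = word.lower()
    for suffix in ("'s", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def naive_lemmatize(word):
    """Use the lookup table when possible, otherwise fall back to suffix stripping."""
    word = word.lower()
    return IRREGULAR_LEMMAS.get(word, naive_stem(word))


for w in ["am", "are", "is", "dog", "dogs", "dog's"]:
    print(w, "-->", naive_lemmatize(w))
```

The suffix rule is deliberately naive: it turns "bus" into "bu", which is why practical stemmers need more careful rules.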

part-of-speech tagging: marking up each word in a text (corpus) as corresponding to a particular part of speech, such as noun, verb, or adjective

 

language detection: automatically detecting which language a text is written in
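
One simple way to sketch language detection, assuming that counting very common function words is enough to separate a few languages. The word lists in `STOPWORDS` are short hand-picked samples and `detect_language` is a hypothetical helper; production detectors typically use character n-gram statistics over many more languages.

```python
import re

# Short, hand-picked samples of very common words (an assumption for this sketch)
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "italian": {"il", "la", "e", "di", "che", "è", "un", "per"},
    "spanish": {"el", "la", "y", "de", "que", "es", "un", "por"},
}


def detect_language(text):
    """Return the language whose common words overlap most with the text, or None."""
    words = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(detect_language("The meaning of the sentence is in the context"))
print(detect_language("Il significato della frase è nel contesto"))
print(detect_language("El perro es de la casa y el gato"))
```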

3

NLP

descriptive tasks

Statistical properties of the "corpus"

how many characters

how many words

how many sentences

how many paragraphs

how many proper names

how often each proper name appears
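
The counts listed above can be sketched with the standard library alone. This is a rough approximation under stated assumptions: paragraphs are separated by blank lines, sentences end with '.', '!' or '?', and a "proper name" is any capitalized word that does not start a sentence. The function `corpus_statistics` and the sample text are hypothetical.

```python
import re
from collections import Counter


def corpus_statistics(text):
    """Return simple descriptive counts for a text."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z]+(?:'[a-z]+)?", text)

    # Crude proxy for proper names: capitalized words that are not sentence-initial
    proper_names = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[A-Za-z]+", sentence)
        proper_names.update(t for t in tokens[1:] if t[0].isupper())

    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "paragraphs": len(paragraphs),
        "proper_names": proper_names,
    }


text = (
    "Yesterday Ada met Grace in Paris. Grace showed Ada a new machine.\n\n"
    "The machine impressed Ada."
)
stats = corpus_statistics(text)
print(stats)
```

The proxy misses names that open a sentence (the second "Grace" here is not counted), which is one reason real pipelines use named-entity recognition instead.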

 

Content categorization

search and indexing

content alerts and duplication detection

plagiarism detection
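
A common building block for duplication and plagiarism detection is to compare the sets of overlapping word sequences ("shingles") in two texts. The sketch below assumes word trigrams and the Jaccard similarity (size of the intersection divided by size of the union); the names `shingles` and `jaccard_similarity` and the example sentences are illustrative only.

```python
import re


def shingles(text, n=3):
    """Return the set of overlapping n-word sequences in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(a, b, n=3):
    """Jaccard similarity of the two texts' shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


original = "meaning depends on words and how they are combined"
copy = "Meaning depends on words, and how they are combined!"
partial = "meaning depends on words and on the context"
unrelated = "computers only know numbers not words"

print(jaccard_similarity(original, copy))       # 1.0
print(jaccard_similarity(original, partial))
print(jaccard_similarity(original, unrelated))  # 0.0
```

Because case and punctuation are ignored, the lightly edited copy scores as an exact duplicate, while the partially overlapping sentence lands between the two extremes.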

Sentiment analysis

Is the sentiment positive, negative, or neutral?

Applications? 

Social media monitoring

detection of hate speech, measuring the health of a conversation

Customer support ticket analysis 

Voice of Customer (VoC) and Voice of Employee analysis

Brand monitoring and reputation management

 

** we will see how it's done
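
As a first, deliberately simple illustration of sentiment analysis, the sketch below assumes a lexicon-based approach: count words from a positive list and a negative list and compare. The word lists `POSITIVE` and `NEGATIVE` and the function `sentiment` are made up for this example and are not a real sentiment lexicon.

```python
import re

# Tiny hand-made word lists (an assumption for this sketch, not a real lexicon)
POSITIVE = {"good", "great", "love", "excellent", "happy", "helpful"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "slow", "broken"}


def sentiment(text):
    """Classify text as positive, negative, or neutral by counting lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("The support team was great and very helpful"))  # positive
print(sentiment("The app is slow and the login is broken"))      # negative
print(sentiment("The ticket was opened on Monday"))              # neutral
```

Counting words ignores context: `sentiment("This is not great")` returns "positive" because the negation is not modeled.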

3

NLP

Machine Learning  tasks

AI and ML supported NLP tasks

  • Topic discovery and modeling. Capturing the meaning and themes in text collections (associated tasks: optimization and forecasting).
  • Contextual extraction. Automatically pulling structured information from text-based sources.
  • Sentiment analysis. Identifying the mood or subjective opinions within large amounts of text, including average sentiment and opinion mining.
  • Speech-to-text and text-to-speech conversion. Transforming voice commands into written text, and vice versa.
  • Document summarization. Automatically generating synopses of large bodies of text and detecting the languages represented in multilingual corpora (documents).
  • Machine translation. Automatic translation of text or speech from one language to another.
  • Text generation, for example automatic captioning.
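
To give a feel for the last item, text generation, here is a toy sketch assuming a bigram (first-order Markov) model: it records which word follows which in a training text and then samples one word at a time. This is far simpler than large neural language models such as BERT or GPT; the names `train_bigram_model` and `generate` and the training sentence are hypothetical.

```python
import random
from collections import defaultdict


def train_bigram_model(text):
    """Map each word to the list of words that follow it in the training text."""
    words = text.lower().split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model


def generate(model, start, length=8, seed=0):
    """Generate up to `length` words by repeatedly sampling an observed next word."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = model.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


training_text = "the dog chased the cat and the cat chased the mouse"
model = train_bigram_model(training_text)
print(generate(model, "the", length=8, seed=1))
```

Every pair of adjacent words in the output was seen in the training text, so the model can only recombine what it has already observed.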

Text generation

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell 

The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.

reading