federica bianco
astro | data science | data for good
dr.federica bianco | fbb.space | fedhere | fedhere
Natural Language Processing
1. Computers only know numbers, not words
2. Language's constituent elements are words
3. Meaning depends on words, on how they are combined, and on the context
That is great!
That is not great...
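** a minimal sketch of point 1: turning text into numbers with a bag-of-words count matrix (scikit-learn and the toy sentences are illustrative assumptions, not part of the slides)

# represent each sentence as a row of word counts
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["That is great!", "That is not great..."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # ['great' 'is' 'not' 'that']
print(X.toarray())                          # [[1 1 0 1]
                                            #  [1 1 1 1]]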
tokenization and parsing: splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
** we will see how it's done
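** as a first taste, a minimal tokenization sketch; NLTK is an assumed library choice here, and resource names can differ slightly across NLTK versions

# split text into sentences and word tokens with NLTK
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Meaning depends on words. That is great!"
print(sent_tokenize(text))   # ['Meaning depends on words.', 'That is great!']
print(word_tokenize(text))   # ['Meaning', 'depends', 'on', 'words', '.', 'That', 'is', 'great', '!']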
lemmatization/stemming: reducing inflectional forms and sometimes derivationally related forms of a word to a common base form.
am, are, is --> be
dog, dogs, dog's --> dog
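** a minimal lemmatization vs. stemming sketch; the NLTK lemmatizer and Porter stemmer are assumed choices

# lemmatization maps words to a dictionary base form; stemming just chops suffixes
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

print(lemmatizer.lemmatize("dogs"))           # 'dog'
print(lemmatizer.lemmatize("are", pos="v"))   # 'be'  (the part of speech is given as a hint)
print(stemmer.stem("dogs"))                   # 'dog'
print(stemmer.stem("studies"))                # 'studi' -- stems need not be real words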
part-of-speech tagging: marking up a word in a text (corpus) as corresponding to a particular part of speech
language detection: automatically detecting which language is used
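** minimal sketches of part-of-speech tagging and language detection; NLTK's default tagger and the langdetect package are assumed choices

# part-of-speech tagging with NLTK's default (averaged perceptron) tagger
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("The dog barks loudly")))
# each token gets a (word, tag) pair, e.g. ('The', 'DT'), ('dog', 'NN')

# language detection with the langdetect package
from langdetect import detect
print(detect("Il significato dipende dal contesto"))   # typically 'it' (Italian)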
how many characters
how many words
how many sentences
how many paragraphs
how many proper names
how often each proper name appears
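** a minimal sketch of these counts on a plain string; the sample text and the naive proper-name heuristic are illustrative assumptions

# simple corpus statistics with the standard library only
from collections import Counter
import re

text = "Ada Lovelace wrote notes. Ada met Charles Babbage.\n\nBabbage built machines."

n_characters = len(text)
n_words = len(re.findall(r"\w+", text))
n_sentences = len(re.findall(r"[.!?]+", text))
n_paragraphs = len([p for p in text.split("\n\n") if p.strip()])

# very naive proper-name heuristic: count capitalized tokens
# (over-counts sentence-initial words; a real pipeline would use named-entity recognition)
name_counts = Counter(re.findall(r"\b[A-Z][a-z]+\b", text))

print(n_characters, n_words, n_sentences, n_paragraphs)
print(name_counts)   # e.g. Counter({'Ada': 2, 'Babbage': 2, 'Lovelace': 1, 'Charles': 1})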
Content categorization:
search and indexing
content alerts and duplication detection
plagiarism detection
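** one minimal route to duplication and plagiarism detection: compare documents as TF-IDF vectors with cosine similarity (scikit-learn and the toy documents are illustrative assumptions)

# near-duplicate documents get a cosine similarity close to 1
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Meaning depends on words and on how they are combined.",
    "Meaning depends on the words and on how they are combined.",
    "Computers only know numbers, not words.",
]

X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X)          # pairwise similarity matrix, values in [0, 1]
print(sim.round(2))                 # flag pairs above some chosen threshold as potential duplicates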
Is the sentiment positive, negative, or neutral?
Applications?
detection of hate speech, measuring the health of a conversation
Customer support ticket analysis
VoC (Voice of Customer) / Voice of Employee
Brand monitoring and reputation management
** we will see how it's done
Identifying the mood or subjective opinions within large amounts of text, including average sentiment and opinion mining.
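** a minimal sentiment sketch; NLTK's VADER lexicon and the compound-score thresholds are assumed choices, not the method prescribed by the slides

# score each sentence and map the compound score to positive / negative / neutral
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
for sentence in ["That is great!", "That is not great...", "That is a sentence."]:
    scores = sia.polarity_scores(sentence)       # neg / neu / pos fractions plus a compound score
    label = ("positive" if scores["compound"] > 0.05
             else "negative" if scores["compound"] < -0.05
             else "neutral")
    print(sentence, scores["compound"], label)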
Text generation
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell, "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" (FAccT 2021)
The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.
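** a minimal text-generation sketch with a small pretrained language model; the Hugging Face transformers library and GPT-2 are assumed choices, and the continuation changes from run to run

# generate a continuation of a prompt with a pretrained GPT-2 model
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Natural language processing lets computers",
                max_new_tokens=20, num_return_sequences=1)
print(out[0]["generated_text"])     # the prompt plus a model-generated continuation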