https://slides.com/d/2Ouudnw/live
https://slides.com/didaskalos/ai-humanities-inkcode
1x Neo: https://www.1x.tech/
AI = Generative AI?
(above) "Pelican on a bike" by gpt-4o (we'll come back to this....)
🤔 ...
built in (sort of) to Google products;
end-to-end: v0, Lovable, and many others
see also: chaos coding with https://yoheinakajima.com/blog/
Nov. 2022
Ridiculous virality
A Bet on Scaling
Emergent Abilities in Large Language Models: A Survey, https://arxiv.org/html/2503.05788v1
The sharp increase disappears (bottom) when a linear metric is used
1. AI: What is it good for?
First instincts
Based on what you have seen and been thinking about this week with your project, what is the first thing that you would consider doing with these tools?
It is strongly recommended that you use a disposable or non-primary Google account for signing up or engaging with these services
Setup
Chat is not the default interface for AI
Chat is not the data type of AI
1. Get Data
2. Clean Data
3. Explore Data
4. Model | Analyze | Visualize Data
5. Communicate about the Data
Label data
Enrich data
Form research question
https://www.datascience-pm.com/data-science-life-cycle/
Good software turns best practices into tools
🤔 ...
See also: F. Moretti, Graphs, Maps, Trees: Abstract Models for a Literary History: https://www.versobooks.com/products/1939-graphs-maps-trees
Some Varieties of Data
Unstructured: free-form text, images, audio, video
Semi-structured: JSON: {"key": "value", "list": [a, b, c]} | XML/TEI: <stuff>blah</stuff> | graphs | trees | maps/dictionaries
Structured: tabular: CSV, Parquet, SQL tables | simple lists: [a, b, c] | fixed-format records
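To make this concrete, here is a minimal sketch showing one invented toy record in all three shapes (the record, field names, and tag names are made up for illustration):

```python
import csv, io, json
import xml.etree.ElementTree as ET

# Structured: a tabular record (CSV)
csv_text = "title,year\nExample Chronicle,1750\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: the same record as JSON, with a nested list
json_text = '{"title": "Example Chronicle", "year": 1750, "keywords": ["toy", "example"]}'
record = json.loads(json_text)

# Semi-structured: the same record as a TEI-flavored XML fragment
xml_text = "<text><title>Example Chronicle</title><date when='1750'/></text>"
root = ET.fromstring(xml_text)

print(rows[0]["title"], record["keywords"], root.find("title").text)
```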
AI Data Types: Tokens | Vector Embeddings | Models
Vary it
Different languages, characters, anything at all
Assess
What do you observe?
Exercise: Tokenization
Click!
Explore the data. Anything surprising? What is this showing?
Analysis?
How does this kind of data compare to other forms of data you have worked with?
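If you want to repeat the experiment in code rather than in the browser, here is a minimal sketch using the tiktoken library (an assumption: the in-class tool may be a web tokenizer, and any tokenizer would do; the sample sentences are placeholders):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI models

samples = [
    "The pelican rides a bicycle.",      # English
    "Il pellicano va in bicicletta.",    # Italian
    "ペリカンが自転車に乗る。",            # Japanese
]
for text in samples:
    tokens = enc.encode(text)
    # Print the token count and the text each token maps back to
    print(len(tokens), [enc.decode([t]) for t in tokens])
```

Compare how many tokens the "same" sentence costs in different languages and scripts.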
Embeddings Visualization
Era | Models / Approaches | Typical Uses | Tools | Notes
Pre-2010s | Vector Space Models (LSA, LDA) | Topic modeling, semantic search | Gensim, scikit-learn | Sparse, interpretable vectors based on word/document co-occurrence.
2013–2014 | Word2Vec (Mikolov et al.), GloVe | Semantic similarity, analogies | Gensim, spaCy | Dense, static word vectors. Each word has a single vector regardless of context.
2015–2017 | FastText (Facebook) | Morphologically rich languages | fastText, Gensim | Adds subword info (character n-grams) for OOV words and generalization.
2018–2019 | ELMo, BERT (contextual embeddings) | Context-aware similarity, QA | Hugging Face Transformers | Dynamic vectors: the same word gets different embeddings in different contexts.
2019+ | Sentence Transformers (SBERT) | Semantic search, clustering, retrieval | Sentence Transformers (SBERT) | Builds on Hugging Face Transformers, adding pooling layers and fine-tuning for sentence-level meaning.
2021+ | CLIP, ALIGN, multimodal embeddings | Cross-modal retrieval, image-text alignment | OpenAI CLIP, LLaVA, Hugging Face | Joint embedding spaces for text and images; enables text-to-image retrieval and vice versa.
2022+ | Domain-specific / instruction-tuned embeddings | Specialized DH tasks, RAG, QA | SPECTER, E5, Ada, etc. | Embeddings fine-tuned for particular tasks (e.g. academic papers, legal documents, retrieval-augmented generation).
Run the notebook
(it will take a few minutes)
Observations?
What do you notice? What patterns do you see?
Optional: Experiment with other embedding models
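For reference, a minimal sketch of what the notebook is doing under the hood, assuming the sentence-transformers and scikit-learn libraries (the model name and example sentences are placeholders; any embedding model would work):

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

sentences = [
    "A letter from a merchant in Livorno.",
    "A papal bull concerning monastic property.",
    "A recipe for preserving lemons.",
    "A shipping manifest listing olive oil and wool.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose model
embeddings = model.encode(sentences)             # one dense vector per sentence

# Project the high-dimensional vectors down to 2D for inspection
coords = PCA(n_components=2).fit_transform(embeddings)
for sent, (x, y) in zip(sentences, coords):
    print(f"{x:6.2f} {y:6.2f}  {sent}")
```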
Do data types shape research questions?
Impact of Format Restrictions on Large Language Models: https://arxiv.org/pdf/2408.02442
Choose a task relevant to your data or project
Vary the output structure
"Output as JSON" vs. "Output as xml" vs. no instructions vs. more specific and detailed instructions
Differences? No Change?
Exercise: Structured Outputs
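One way to run this comparison programmatically; a minimal sketch assuming the OpenAI Python client (the model name and the example task are placeholders, and you could swap in any provider):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

task = ("Extract the people and places mentioned in this sentence: "
        "'In 1786 Goethe left Karlsbad for Italy.'")

format_instructions = [
    "",                                   # no instructions
    "Output as JSON.",
    "Output as XML.",
    "Output as JSON with keys 'people' and 'places', each a list of strings.",
]

for instr in format_instructions:
    response = client.chat.completions.create(
        model="gpt-4o-mini",              # placeholder model
        messages=[{"role": "user", "content": f"{task}\n{instr}".strip()}],
    )
    print("---", repr(instr))
    print(response.choices[0].message.content)
```

Look both at whether the structure is followed and at whether the extracted content itself changes.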
DH as discovery within Latent Space
Traversing AI "circuits"
see also: Universal Natural Language Processing: https://arxiv.org/abs/2010.02584
Choose a text, image, object related to your dataset
Choose a task
Classification, Entity Extraction, etc.
https://medium.com/nlplanet/two-minutes-nlp-33-important-nlp-tasks-explained-31e2caad2b1b
Results?
Exercise: NLP tasks with LLMs
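A minimal sketch of one such task (classification) done by prompting an LLM, here assuming the Anthropic Python client; the passage, labels, and model name are all invented placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

passage = "Item: one barrel of salted herring, delivered to the widow Bianchi, 12 soldi."
labels = ["legal record", "account book entry", "personal letter", "sermon"]

message = client.messages.create(
    model="claude-3-5-haiku-latest",   # placeholder; any model works
    max_tokens=100,
    messages=[{
        "role": "user",
        "content": (f"Classify the following passage as one of {labels}. "
                    f"Answer with the label only.\n\n{passage}"),
    }],
)
print(message.content[0].text)
```

The same pattern works for entity extraction, summarization, translation, and most of the tasks in the linked list: change the instruction, keep the scaffold.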
see also: https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/
AI Engineers World Fair Talk: https://www.youtube.com/live/z4zXicOAF28?feature=shared&t=5090
Prompt: Generate an SVG of a pelican riding a bicycle
OpenAI gpt-3.5
OpenAI gpt-4o-mini
OpenAI gpt-4o
OpenAI o1 ("reasoning" model)
Google Gemini 1.5 Flash
Google Gemini 1.5 Pro
Google Gemini exp 1206
Claude Haiku (smallest Anthropic model)
Claude Sonnet (medium Anthropic model)
Claude Opus (largest Anthropic model)
Pull up any model or LLM of your choosing
ChatGPT, the OpenAI API, or Ollama running locally
Create an eval (or set of related evals)
Make it hard. Make it clever. Make it not about pelicans.
Discuss and Iterate
What is your eval actually assessing? What good is it?
Exercise: Create an AI evaluation
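A minimal sketch of what an eval can look like in code; the tasks, the checks, and the ask_model helper are all invented for illustration and would be wired to whatever model you are testing:

```python
def ask_model(prompt: str) -> str:
    """Hypothetical helper: send the prompt to the model under test."""
    raise NotImplementedError  # wire this to your chosen API or local model

# Each case pairs a prompt with a programmatic check of the answer.
EVAL_CASES = [
    ("Give the year Vesuvius buried Pompeii. Answer with the year only.",
     lambda out: "79" in out),
    ("Convert 'MDCCXLVIII' to Arabic numerals. Answer with digits only.",
     lambda out: out.strip() == "1748"),
]

def run_eval():
    passed = 0
    for prompt, check in EVAL_CASES:
        output = ask_model(prompt)
        ok = check(output)
        passed += ok
        print("PASS" if ok else "FAIL", "-", prompt)
    print(f"{passed}/{len(EVAL_CASES)} cases passed")
```

Most of the difficulty is in the check functions: deciding, programmatically, what counts as correct.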
Evaluation is a HARD problem
Towards Understanding Sycophancy in Language Models: https://arxiv.org/pdf/2310.13548
See also: tracing "thoughts": https://www.anthropic.com/research/tracing-thoughts-language-model, alignment faking: https://www.anthropic.com/research/alignment-faking, Anthropomorphism in Large Language Models: https://arxiv.org/abs/2405.06079 and https://openaccess.cms-conferences.org/publications/book/978-1-958651-95-7/article/978-1-958651-95-7_46
Opportunity for Humanists?
Diverse, domain-specific evaluations and benchmarks as a direct and actionable way to impact the development of AI
Based on the task from before, how would you judge that the output was correct?
LLM as Judge
Use an LLM to evaluate the output of another LLM
How good is your judge?
How would you improve the ability of the second LLM to judge the output of the first?
Multipurpose evaluator: LLM as Judge
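A minimal sketch of the pattern; the rubric, prompt wording, and call_llm helper are illustrative rather than a fixed recipe:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical helper: call whichever model plays the judge."""
    raise NotImplementedError

JUDGE_PROMPT = """You are grading another model's answer.
Task given to the model: {task}
Model's answer: {answer}

Score the answer from 1 (useless) to 5 (excellent) for accuracy and completeness.
Reply as JSON: {{"score": <int>, "reason": "<one sentence>"}}"""

def judge(task: str, answer: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(task=task, answer=answer))
    return json.loads(raw)  # in practice, also handle malformed JSON
```

Improving the judge usually means tightening the rubric, adding examples of good and bad answers, or giving it reference material to check against.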
In-context learning, a.k.a. few-shot prompting
Give the LLM a task
Think of a task that you want the LLM to do. Ask the LLM the question in the most minimal form possible.
e.g. LLM-as-judge tasks, NLP tasks
Context
Add examples of the task done well and done poorly.
Try to improve performance
Does it make a difference if you use obvious examples or unobvious examples? What kinds of examples? How do you get examples?
Exercise: Basic In-context learning
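A minimal sketch of a zero-shot prompt versus a few-shot prompt for the same task; the task and the worked examples are invented, so swap in your own:

```python
task = "Normalize the date to ISO format (YYYY-MM-DD)."

zero_shot = f"{task}\nInput: 'the 3rd day of May, 1787'\nOutput:"

few_shot = f"""{task}

Input: '12 Febr. 1601'
Output: 1601-02-12

Input: 'Christmas Day, 1750'
Output: 1750-12-25

Input: 'the 3rd day of May, 1787'
Output:"""

# Send both prompts to the same model and compare:
# does adding worked examples change accuracy or output format?
```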
Any LLM
Define a simulation:
e.g. you are an 18th-century resident of San Miniato, or: simulate a conversation in the fifth century BCE
Can you make it better based on your expert knowledge?
a. What criteria would you evaluate in the output?
b. Can you build an LLM judge to assess the output?
Exercise: Simulation
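One common way to set up such a simulation is a persona-style system prompt plus a running message history; a minimal sketch, assuming the OpenAI Python client (the persona text and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI()

persona = (
    "You are an 18th-century resident of San Miniato. "
    "Answer in character, drawing only on knowledge plausible for that time and place. "
    "If asked about things after your lifetime, express confusion."
)

history = [{"role": "system", "content": persona}]

def talk(user_message: str) -> str:
    """Send one turn of conversation and keep the full history for context."""
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(talk("What news from Florence this season?"))
```

The evaluation criteria you chose above (anachronisms, plausible local detail, register) can then be turned into an LLM judge over the transcript.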
A2A (Agent2Agent), MCP (Model Context Protocol)
so-called "reasoning" models (a.k.a. "lots more tokens" models)
models that game or tamper with their rewards
see also https://www.anthropic.com/research/reward-tampering
Provocations from the Humanities for Generative AI Research: