https://slides.com/d/2Ouudnw/live
https://slides.com/didaskalos/ai-humanities-inkcode
1x Neo: https://www.1x.tech/
AI = Generative AI?
(above) "Pelican on a bike" by gpt-4o (we'll come back to this....)
🤔 ...
built in (sort of) to Google products;
end-to-end: v0, Lovable, and many others
see also: chaos coding with https://yoheinakajima.com/blog/
Nov. 2022
Ridiculous virality
A Bet on Scaling
Emergent Abilities in Large Language Models: A Survey, https://arxiv.org/html/2503.05788v1
The sharp increase disappears (bottom) when a linear metric is used
1. AI: What is it good for?
First instincts
Based on what you have seen and been thinking about this week with your project, what is the first thing that you would consider doing with these tools?
It is strongly recommended that you use a disposable or non-primary Google account for signing up or engaging with these services
Setup
Chat is not the default interface for AI
Chat is not the data type of AI
1. Get Data
2. Clean Data
3. Explore Data
4. Model | Analyze | Visualize Data
5. Communicate about the Data
Label data
Enrich data
Form research question
https://www.datascience-pm.com/data-science-life-cycle/
Good software turns best practices into tools
🤔 ...
See also: F. Moretti, Graphs, Maps, Trees: Abstract Models for a Literary History: https://www.versobooks.com/products/1939-graphs-maps-trees
Some Varieties of Data
Unstructured: free-form text, images, audio, video
Semi-structured: JSON: {"key": "value", "list": [a, b, c]} | XML/TEI: <stuff>blah</stuff> | graphs | trees | maps/dictionaries
Structured: tabular: CSV, Parquet, SQL tables | simple lists: [a, b, c] | fixed-format records
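To make this concrete, here is a minimal sketch showing one invented toy record in all three shapes (the record, field names, and tag names are made up for illustration):

```python
import csv, io, json
import xml.etree.ElementTree as ET

# Structured: a tabular record (CSV)
csv_text = "title,year\nExample Chronicle,1750\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: the same record as JSON, with a nested list
json_text = '{"title": "Example Chronicle", "year": 1750, "keywords": ["toy", "example"]}'
record = json.loads(json_text)

# Semi-structured: the same record as a TEI-flavored XML fragment
xml_text = "<text><title>Example Chronicle</title><date when='1750'/></text>"
root = ET.fromstring(xml_text)

print(rows[0]["title"], record["keywords"], root.find("title").text)
```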
AI Data Types: Tokens | Vector Embeddings | Models
Vary it
Different languages, characters, anything at all
Assess
What do you observe?
Exercise: Tokenization
Click!
Explore the data. Anything surprising? What is this showing?
Analysis?
How does this kind of data compare to other forms of data you have worked with?
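If you want to repeat the experiment in code rather than in the browser, here is a minimal sketch using the tiktoken library (an assumption: the in-class tool may be a web tokenizer, and any tokenizer would do; the sample sentences are placeholders):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI models

samples = [
    "The pelican rides a bicycle.",      # English
    "Il pellicano va in bicicletta.",    # Italian
    "ペリカンが自転車に乗る。",            # Japanese
]
for text in samples:
    tokens = enc.encode(text)
    # Print the token count and the text each token maps back to
    print(len(tokens), [enc.decode([t]) for t in tokens])
```

Compare how many tokens the "same" sentence costs in different languages and scripts.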
Embeddings Visualization
Era | Models / Approaches | Typical Uses | Tools | Notes
Pre-2010s | Vector Space Models (LSA, LDA) | Topic modeling, semantic search | Gensim, scikit-learn | Sparse, interpretable vectors based on word/document co-occurrence.
2013–2014 | Word2Vec (Mikolov et al.), GloVe | Semantic similarity, analogies | Gensim, spaCy | Dense, static word vectors. Each word has a single vector regardless of context.
2015–2017 | FastText (Facebook) | Morphologically rich languages | fastText, Gensim | Adds subword info (character n-grams) for OOV words and generalization.
2018–2019 | ELMo, BERT (contextual embeddings) | Context-aware similarity, QA | Hugging Face Transformers | Dynamic vectors: the same word gets different embeddings in different contexts.
2019+ | Sentence Transformers (SBERT) | Semantic search, clustering, retrieval | Sentence Transformers (SBERT) | Builds on Hugging Face Transformers, adding pooling layers and fine-tuning for sentence-level meaning.
2021+ | CLIP, ALIGN, multimodal embeddings | Cross-modal retrieval, image-text alignment | OpenAI CLIP, LLaVA, Hugging Face | Joint embedding spaces for text and images; enables text-to-image retrieval and vice versa.
2022+ | Domain-specific / instruction-tuned embeddings | Specialized DH tasks, RAG, QA | SPECTER, E5, Ada, etc. | Embeddings fine-tuned for particular tasks (e.g. academic papers, legal documents, retrieval-augmented generation).
Run the notebook
(it will take a few minutes)
Observations?
What do you notice? What patterns do you see?
Optional: Experiment with other embedding models
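For reference, a minimal sketch of what the notebook is doing under the hood, assuming the sentence-transformers and scikit-learn libraries (the model name and example sentences are placeholders; any embedding model would work):

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

sentences = [
    "A letter from a merchant in Livorno.",
    "A papal bull concerning monastic property.",
    "A recipe for preserving lemons.",
    "A shipping manifest listing olive oil and wool.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose model
embeddings = model.encode(sentences)             # one dense vector per sentence

# Project the high-dimensional vectors down to 2D for inspection
coords = PCA(n_components=2).fit_transform(embeddings)
for sent, (x, y) in zip(sentences, coords):
    print(f"{x:6.2f} {y:6.2f}  {sent}")
```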
Do data types shape research questions?
Impact of Format Restrictions on Large Language Models: https://arxiv.org/pdf/2408.02442
Choose a task relevant to your data or project
Vary the output structure
"Output as JSON" vs. "Output as xml" vs. no instructions vs. more specific and detailed instructions
Differences? No Change?
Exercise: Structured Outputs
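One way to run this comparison programmatically; a minimal sketch assuming the OpenAI Python client (the model name and the example task are placeholders, and you could swap in any provider):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

task = ("Extract the people and places mentioned in this sentence: "
        "'In 1786 Goethe left Karlsbad for Italy.'")

format_instructions = [
    "",                                   # no instructions
    "Output as JSON.",
    "Output as XML.",
    "Output as JSON with keys 'people' and 'places', each a list of strings.",
]

for instr in format_instructions:
    response = client.chat.completions.create(
        model="gpt-4o-mini",              # placeholder model
        messages=[{"role": "user", "content": f"{task}\n{instr}".strip()}],
    )
    print("---", repr(instr))
    print(response.choices[0].message.content)
```

Look both at whether the structure is followed and at whether the extracted content itself changes.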
DH as discovery within Latent Space
Traversing AI "circuits"
see also: Universal Natural Language Processing: https://arxiv.org/abs/2010.02584
Choose a text, image, object related to your dataset
Choose a task
Classification, Entity Extraction, etc.
https://medium.com/nlplanet/two-minutes-nlp-33-important-nlp-tasks-explained-31e2caad2b1b
Results?
Exercise: NLP tasks with LLMs
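A minimal sketch of one such task (classification) done by prompting an LLM, here assuming the Anthropic Python client; the passage, labels, and model name are all invented placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

passage = "Item: one barrel of salted herring, delivered to the widow Bianchi, 12 soldi."
labels = ["legal record", "account book entry", "personal letter", "sermon"]

message = client.messages.create(
    model="claude-3-5-haiku-latest",   # placeholder; any model works
    max_tokens=100,
    messages=[{
        "role": "user",
        "content": (f"Classify the following passage as one of {labels}. "
                    f"Answer with the label only.\n\n{passage}"),
    }],
)
print(message.content[0].text)
```

The same pattern works for entity extraction, summarization, translation, and most of the tasks in the linked list: change the instruction, keep the scaffold.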
see also: https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/
AI Engineers World Fair Talk: https://www.youtube.com/live/z4zXicOAF28?feature=shared&t=5090
Prompt: Generate an SVG of a pelican riding a bicycle
OpenAI gpt-3.5
OpenAI gpt-4o-mini
OpenAI gpt-4o
OpenAI o1 ("reasoning" model)
Google Gemini 1.5 Flash
Google Gemini 1.5 Pro
Google Gemini exp 1206
Claude Haiku (smallest Anthropic model)
Claude Sonnet (medium Anthropic model)
Claude Opus (largest Anthropic model)
Pull up any model or LLM of your choosing
ChatGPT, the OpenAI API, or Ollama running locally
Create an eval (or set of related evals)
Make it hard. Make it clever. Make it not about pelicans.
Discuss and Iterate
What is your eval actually assessing? What good is it?
Exercise: Create an AI evaluation
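A minimal sketch of what an eval can look like in code; the tasks, the checks, and the ask_model helper are all invented for illustration and would be wired to whatever model you are testing:

```python
def ask_model(prompt: str) -> str:
    """Hypothetical helper: send the prompt to the model under test."""
    raise NotImplementedError  # wire this to your chosen API or local model

# Each case pairs a prompt with a programmatic check of the answer.
EVAL_CASES = [
    ("Give the year Vesuvius buried Pompeii. Answer with the year only.",
     lambda out: "79" in out),
    ("Convert 'MDCCXLVIII' to Arabic numerals. Answer with digits only.",
     lambda out: out.strip() == "1748"),
]

def run_eval():
    passed = 0
    for prompt, check in EVAL_CASES:
        output = ask_model(prompt)
        ok = check(output)
        passed += ok
        print("PASS" if ok else "FAIL", "-", prompt)
    print(f"{passed}/{len(EVAL_CASES)} cases passed")
```

Most of the difficulty is in the check functions: deciding, programmatically, what counts as correct.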
Evaluation is a HARD problem
Towards Understanding Sycophancy in Language Models: https://arxiv.org/pdf/2310.13548
See also: tracing "thoughts": https://www.anthropic.com/research/tracing-thoughts-language-model, alignment faking: https://www.anthropic.com/research/alignment-faking, Anthropomorphism in Large Language Models: https://arxiv.org/abs/2405.06079 and https://openaccess.cms-conferences.org/publications/book/978-1-958651-95-7/article/978-1-958651-95-7_46
Opportunity for Humanists?
Diverse, domain-specific evaluations and benchmarks as a direct and actionable way to impact the development of AI
Based on the task from before, how would you judge that the output was correct?
LLM as Judge
Use an LLM to evaluate the output of another LLM
How good is your judge?
How would you improve the ability of the second LLM to judge the output of the first?
Multipurpose evaluator: LLM as Judge
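A minimal sketch of the pattern; the rubric, prompt wording, and call_llm helper are illustrative rather than a fixed recipe:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical helper: call whichever model plays the judge."""
    raise NotImplementedError

JUDGE_PROMPT = """You are grading another model's answer.
Task given to the model: {task}
Model's answer: {answer}

Score the answer from 1 (useless) to 5 (excellent) for accuracy and completeness.
Reply as JSON: {{"score": <int>, "reason": "<one sentence>"}}"""

def judge(task: str, answer: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(task=task, answer=answer))
    return json.loads(raw)  # in practice, also handle malformed JSON
```

Improving the judge usually means tightening the rubric, adding examples of good and bad answers, or giving it reference material to check against.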
In-context learning, a.k.a. few-shot prompting
Give the LLM a task
Think of a task that you want the LLM to do. Ask the LLM the question in the most minimal form possible.
e.g. LLM-as-judge tasks, NLP tasks
Context
Add examples of the task done well and done poorly.
Try to improve performance
Does it make a difference if you use obvious examples or unobvious examples? What kinds of examples? How do you get examples?
Exercise: Basic In-context learning
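A minimal sketch of a zero-shot prompt versus a few-shot prompt for the same task; the task and the worked examples are invented, so swap in your own:

```python
task = "Normalize the date to ISO format (YYYY-MM-DD)."

zero_shot = f"{task}\nInput: 'the 3rd day of May, 1787'\nOutput:"

few_shot = f"""{task}

Input: '12 Febr. 1601'
Output: 1601-02-12

Input: 'Christmas Day, 1750'
Output: 1750-12-25

Input: 'the 3rd day of May, 1787'
Output:"""

# Send both prompts to the same model and compare:
# does adding worked examples change accuracy or output format?
```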
Any LLM
Define a simulation:
e.g. you are an 18th-century resident of San Miniato, or: simulate a conversation in the fifth century BCE
Can you make it better based on your expert knowledge?
a. What criteria would you evaluate in the output?
b. Can you build an LLM judge to assess the output?
Exercise: Simulation
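One common way to set up such a simulation is a persona-style system prompt plus a running message history; a minimal sketch, assuming the OpenAI Python client (the persona text and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI()

persona = (
    "You are an 18th-century resident of San Miniato. "
    "Answer in character, drawing only on knowledge plausible for that time and place. "
    "If asked about things after your lifetime, express confusion."
)

history = [{"role": "system", "content": persona}]

def talk(user_message: str) -> str:
    """Send one turn of conversation and keep the full history for context."""
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(talk("What news from Florence this season?"))
```

The evaluation criteria you chose above (anachronisms, plausible local detail, register) can then be turned into an LLM judge over the transcript.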
A2A (Agent2Agent), MCP (Model Context Protocol)
so-called "reasoning" models (a.k.a. "lots more tokens" models)
models that game or tamper with their rewards
see also https://www.anthropic.com/research/reward-tampering
Provocations from the Humanities for Generative AI Research: