Can we make an AI respect TEI XML?

An experiment with a small-scale, explainable AI

Presentation for DH2025 in Lisbon, Portugal and Balisage: The Markup Conference 2025

Wednesday, 16 July 2025 and Wednesday, 6 August 2025

Alexander C. Fisher, Hadleigh Jae Bills, and Elisa Beshero-Bondar

Penn State Erie, The Behrend College

Link to Slides: https://bit.ly/digitai-dh25

What DigitAI will be

A tool that assists scholars and editors in applying the TEI Guidelines to their own corpora

  • Built for users who are new to TEI or seeking deeper understanding
  • Acts as a TEI-aware tutor, NOT an auto-encoder
  • Helps scholars and editors understand best practices in TEI encoding
  • Helps TEI developers find inconsistencies in the Guidelines

Supports learning by providing relevant examples, explanations, and excerpts directly from the TEI Guidelines


Why TEI?

  • One of us regularly edits the TEI Guidelines
  • TEI Guidelines: delivered as
    • a website (and PDF, etc.)
    • a code schema (TEI-all.rng)
    • most importantly, built in TEI XML
  • p5.xml = a single highly structured, systematic code base and knowledge resource, containing:
    • chapters
    • examples
    • "coding specs" that define the building blocks of the TEI elements, attributes, macros, modules, datatypes, etc.
    • built from component parts when we release new versions of the Guidelines
    • one of the best-organized resources we can imagine for designing a knowledge base for a RAG system*
      * p5.xml is literally designed from the ground up to be the TEI community's "ground truth" about our encoding system!


p5.xml → XSLT → digitai-p5.json

TEI Guidelines p5.xml: a single XML document that contains the entire built TEI Guidelines. It's a hierarchical (tree) structure:

  • contains front matter and 24 chapters, each with sections and nested subsections.
    • Paragraphs in chapters
      • contain encoding that defines rules for elements, attribute classes, model classes (connecting related elements and attributes together), and datatype specifications.
  • We are writing XSLT to translate this XML tree into
    • a JSON structure (digitai-p5.json) for a database to ingest (a sketch of the target shape follows below)
    • query scripts (in Cypher) that direct the database to translate this JSON into a knowledge graph
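
To make the target concrete, here is a minimal Python sketch, not the project's actual XSLT: the file names, and the assumption that chapters and sections are TEI <div> elements under text/body with xml:id attributes, are ours. It walks p5.xml and emits a nested JSON skeleton in the spirit of digitai-p5.json:

import json
from xml.etree import ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def div_to_dict(div, seq):
    # Capture the division's xml:id, title, position, and nested subdivisions.
    head = div.find(f"{TEI}head")
    return {
        "ID": div.get(XML_ID),
        "TITLE": head.text if head is not None else None,
        "SEQUENCE": seq,
        "CONTAINS_SECTIONS": [div_to_dict(d, i + 1)
                              for i, d in enumerate(div.findall(f"{TEI}div"))],
    }

tree = ET.parse("p5.xml")
body = tree.getroot().find(f"{TEI}text/{TEI}body")
chapters = [div_to_dict(d, i + 1) for i, d in enumerate(body.findall(f"{TEI}div"))]
with open("digitai-p5-sketch.json", "w", encoding="utf-8") as out:
    json.dump({"CONTAINS_CHAPTERS": chapters}, out, indent=2)

The real transformation goes much further, capturing paragraphs, examples, and the elements, attributes, classes, and other specs each paragraph discusses, as the Cypher loading script below shows.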

Designing the Memory Graph

  • Stores TEI Guidelines as Nodes and Relationships
  • Enables retrieval of structured and semantically relevant text
  • Preserves relationships of the XML structure
  • Queried using Cypher, Neo4j's native query language
    • Allows us to retrieve precise, contextual excerpts (a retrieval sketch follows the loading script below)

Neo4j is like a digital card catalog: it doesn't just store texts, it understands how they relate to each other

WITH doc, doc_data 
FOREACH (part_data_1 IN doc_data.CONTAINS_PARTS | 
	MERGE (part:Part {name: part_data_1.PART}) SET part.sequence = part_data_1.SEQUENCE 
	MERGE (doc)-[:HAS_PART]->(part) 
	WITH part, part_data_1 
	FOREACH (chapter_data_2 IN part_data_1.CONTAINS_CHAPTERS | 
		MERGE (chapter:Chapter {chapter_id: chapter_data_2.ID}) 
        SET chapter.sequence = chapter_data_2.SEQUENCE, 
        chapter.title = chapter_data_2.CHAPTER,
        chapter.links = [x IN chapter_data_2.RELATES_TO WHERE x IS NOT NULL | x.ID] 
		MERGE (part)-[:HAS_CHAPTER]->(chapter) 
		WITH chapter, chapter_data_2 
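		// ... intermediate Section and Paragraph nesting levels elided on this slide; paragraph_data_6 is bound there ...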
FOREACH (example_data_7 IN paragraph_data_6.TEI_ENCODING_DISCUSSED.CONTAINS_EXAMPLES | 
	MERGE (example:Example {example: example_data_7.EXAMPLE}) 
    SET example.language = example_data_7.LANGUAGE 
	MERGE (paragraph)-[:HAS_EXAMPLE]->(example) 
	WITH example, example_data_7 
  FOREACH (paragraph_data_8 IN example_data_7.CONTAINS_END_PARAS | 
		MERGE (paragraph:TerminalPara {parastring: paragraph_data_8.PARASTRING}) 
        SET paragraph.files_mentioned = 
        [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.FILES_MENTIONED 
        WHERE x IS NOT NULL | x.FILE], 
        paragraph.parameter_entities_mentioned = 
          [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.PES_MENTIONED WHERE x IS NOT NULL | x.PE],
          paragraph.elements_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.ELEMENTS_MENTIONED 
          WHERE x IS NOT NULL | x.ELEMENT_NAME], paragraph.attributes_mentioned = 
          [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.ATTRIBUTES_MENTIONED 
          WHERE x IS NOT NULL | x.ATTRIBUTE_NAME], 
          paragraph.sequence = paragraph_data_8.SEQUENCE, 
          paragraph.frags_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.FRAGS_MENTIONED 
          WHERE x IS NOT NULL | x.FRAG], 
          paragraph.ns_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.NSS_MENTIONED
          WHERE x IS NOT NULL | x.NS], 
          paragraph.classes_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.CLASSES_MENTIONED
          WHERE x IS NOT NULL | x.CLASS],
          paragraph.modules_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.MODULES_MENTIONED
          WHERE x IS NOT NULL | x.MODULE], 
          paragraph.macros_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.MACROS_MENTIONED 
          WHERE x IS NOT NULL | x.MACRO], 
          paragraph.speclist_links = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.CONTAINS_SPECLISTS.SPECLIST
          WHERE x IS NOT NULL | x.ID], 
          paragraph.relaxng_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.RNGS_MENTIONED 
          WHERE x IS NOT NULL | x.RNG], 
          paragraph.datatypes_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.DATATYPES_MENTIONED 
          WHERE x IS NOT NULL | x.DATATYPE], paragraph.links = [x IN paragraph_data_8.RELATES_TO.SECTION
          WHERE x IS NOT NULL | x.ID],
          paragraph.parameter_entities_mentioned_ge = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.PES_MENTIONED 
          WHERE x IS NOT NULL | x.GE], 
          paragraph.schemas_mentioned = [x IN paragraph_data_8.TEI_ENCODING_DISCUSSED.SCHEMAS_MENTIONED 
          WHERE x IS NOT NULL | x.SCHEMA] 
	MERGE (example)-[:HAS_END_PARAGRAPH]->(paragraph) 
    ) 
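
Once the graph is loaded, short Cypher queries pull precise excerpts back out. Below is a minimal retrieval sketch using the official neo4j Python driver; the connection URI, credentials, and query are illustrative assumptions, while the label (TerminalPara) and properties (elements_mentioned, parastring) follow the loading script above:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Find paragraphs whose discussion mentions a given TEI element.
CYPHER = """
MATCH (p:TerminalPara)
WHERE $element IN p.elements_mentioned
RETURN p.parastring AS text
LIMIT 5
"""

with driver.session() as session:
    for record in session.run(CYPHER, element="persName"):
        print(record["text"])

driver.close()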

Neo4j's Graph Model JSON... Beyond Human Readability?

{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:0", "text": null, "labels": ["Document"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:1", "text": null, "labels": ["Part"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:2", "text": null, "labels": ["Part"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:3", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:4", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:5", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:6", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:7", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:8", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:9", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:10", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:11", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:12", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:13", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:14", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:15", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:16", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:17", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:18", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:19", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:20", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:21", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:22", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:23", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:24", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:25", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:26", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:27", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:28", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:29", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:30", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:31", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:32", "text": null, "labels": ["Chapter"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:33", "text": null, "labels": ["Section"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:34", "text": null, "labels": ["Section"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:35", "text": null, "labels": ["Section"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:36", "text": null, "labels": ["Section"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:37", "text": null, "labels": ["Section"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:38", "text": null, "labels": ["Section"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:39", "text": null, "labels": ["Section"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:40", "text": null, "labels": ["Section"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:41", "text": null, "labels": ["Section"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:42", "text": null, "labels": ["Section"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:43", "text": null, "labels": ["Section"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:44", "text": null, "labels": ["Section"]}
{"id": "4:223785fa-2e25-4598-887d-0a5b446743b2:45", "text": null, "labels": ["Section"]}

Designing a RAG with the TEI

Making a knowledge graph: the challenge!

[Diagram: <XML> (p5.xml) is transformed via <XSLT> and loaded with <Cypher> into Neo4j as Nodes and Relationships, forming the Knowledge Graph that grounds the LLM.]

Core 4— Qwen2:7B: The LLM

[Architecture diagram: the four core components are the Qwen LLM, the Neo4j database, BGE-M3, and FAISS. BGE-M3 generates vector embeddings from Neo4j node text; those RAG embeddings (.JSONL) are converted into a FAISS-formatted data map (.FAISS); a human prompt is embedded and compared against the data map; the top-matched embeddings are retrieved as text from Neo4j and used to construct the prompt sent to the Qwen LLM.]
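
The query-side script below assumes the embeddings and FAISS index already exist. For completeness, here is a hedged sketch of the index-building step (file names and the BGE-M3 checkpoint identifier are assumptions): embed the exported node texts and write them into a FAISS inner-product index, with a side file mapping FAISS rows back to Neo4j node ids.

import json
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")               # assumed BGE-M3 checkpoint

node_ids, texts = [], []
with open("neo4j_nodes.jsonl", encoding="utf-8") as f:   # JSONL node export (see earlier slide)
    for line in f:
        record = json.loads(line)
        if record.get("text"):                           # skip structural nodes with no text
            node_ids.append(record["id"])
            texts.append(record["text"])

# Normalized embeddings + inner product == cosine similarity.
embeddings = model.encode(texts, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

faiss.write_index(index, "rag_embeddings.faiss")
with open("id_map.json", "w", encoding="utf-8") as out:
    json.dump(node_ids, out)                             # FAISS row position -> Neo4j node id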

# === Setup (assumed: these imports and values are defined earlier in the full script; ===
# === the file names and model identifiers below are placeholders) ===
import json
import numpy as np
import requests
import faiss
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize as sk_normalize

model = SentenceTransformer("BAAI/bge-m3")          # BGE-M3 embedding model
index = faiss.read_index("rag_embeddings.faiss")    # FAISS index over node embeddings
id_map = json.load(open("id_map.json"))             # FAISS row position -> Neo4j node id
neo4jNodes = "neo4j_nodes.jsonl"                    # JSONL export of Neo4j node texts
normalize = True                                    # cosine similarity via L2-normalized vectors
llm_model = "qwen2:7b"                              # local Ollama model name

# === Get user input ===
query = input("❓ Enter your query: ").strip()
if not query:
    print("⚠️ No query provided. Exiting.")
    exit()
# === Encode the query ===
query_embedding = model.encode(query)

# Convert to 2D array and normalize if cosine similarity is enabled
if normalize:
    query_embedding = sk_normalize([query_embedding], norm="l2")
else:
    query_embedding = np.array([query_embedding])
query_embedding = query_embedding.astype("float32")

# === Perform FAISS similarity search ===
TOP_K = 5 # Number of top matching texts to retrieve
scores, indices = index.search(query_embedding, TOP_K)

# Filter out invalid results (-1 = no match)
matched_ids = [id_map[i] for i in indices[0] if i != -1]
if not matched_ids:
    print("⚠️ No relevant matches found in the FAISS index.")
    exit()
# === Fetch matching texts from JSONL file ===
def fetch_node_texts_by_ids(node_ids):
    texts = []
    with open(neo4jNodes, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record.get("id") in node_ids and record.get("text"):
                texts.append(record["text"])
    return texts

texts = fetch_node_texts_by_ids(matched_ids)

if not texts:
    print("❌ No node texts found for matched IDs in local file.")
    exit()
# === Construct prompt for the LLM ===
context = "\n".join(f"- {text}" for text in texts)

prompt = f"""You are a chatbot that helps people understand 
the TEI guidelines which specify how to encode machine-readable 
texts using XML.

Answer the question below in the **same language the question is asked in**.
Use examples from the provided context as needed — they can be in any language. 
Do not translate them.

Context:
{context}

Question:
{query}
"""

# === Send prompt to local Ollama LLM ===
def ask_ollama(prompt, model):
    try:
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": False
            }
        )
        return response.json().get("response", "[ERROR] Empty response from LLM.")
    except Exception as e:
        print("❌ Error while querying Ollama:", e)
        return "[ERROR] Could not get response from local LLM."

# === Query the model ===
print(f"🤖 Sending prompt to LLM ({llm_model})...")
answer = ask_ollama(prompt, llm_model)

# === Display the result ===
print("\n🧾 Response:\n")
print(answer)

XSLT Process so far

What did we learn?

 

  • Data structures and graph modeling
  • How to write Cypher Scripts
  • Differences between "fine-tuning", "RAG augmenting" and "training"
  • This is harder than we thought!!!
  • Also accessible to learn together (students and professor alike!)


What's next for DigitAI?

 

  • Finish refining the knowledge graph
  • Fine-tuning with the text and code specs of the P5 Guidelines
  • Optimization and packaging (user interface over local network, Docker container)


Questions?


GitHub: https://bit.ly/digitai-github

Slides: https://bit.ly/digitai-dh25

Jupyter Notebooks: https://bit.ly/digitai-jupyter