Enhancing Long-Term Memory using Hierarchical Aggregate Tree in
Large Language Models

Aadharsh Aadhithya

CB.EN.U4AIE20001

Project Review: Mini Project

Guides: Dr. Sachin Kumar S & Dr. Soman K.P

Problem Statement

  • Large Language Models are trained to complete sentences.
  • These models are fine-tuned on "instruction-following" datasets so that they follow instructions and respond in a human-like manner.
  • Figuring out the right input to give the model in order to get the desired output has turned out to be a skill of its own, often termed "prompt engineering".

Problem Statement

[Figure: Prompt → LLM → Output]

LLMs are "closed". The only way to feed in information about the external world is through the prompt.

However, the amount of information you can put in a prompt is limited by the model's "context length".

Hence there is a need to choose efficiently and intelligently what to put in the prompt, based on the user's input and the past conversation.

For example, in a long open-domain conversation we cannot put the entire conversation in the context (it might go on for years).

Existing Methods

Retrieval Augmented

- Some form of retrieval system is employed to retrieve the "relevant chunk" and inject it into the prompt.
- Drawback: retrieval has to go over the entire corpus.

Memory Augmented

- "Train" a memory-management model.
- Drawback: we need training data, which is expensive.

[1] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In ICLR.
[2] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS.
[3] Xinchao Xu, Zhibin Gou, Wenquan Wu, Zheng-Yu Niu, Hua Wu, Haifeng Wang, and Shihang Wang. 2022. Long time no see! Open-domain conversation with long-term persona memory. In Findings of ACL.

Related Work

* Preprint released on 29 Aug 2023; today is November 2023.

Many of these papers and preprints get accepted at conferences such as NeurIPS, ACL, and ICLR, which underscores the relevance of the current work.

This work proposes keeping a sort of "running summary" of the information.

However, it is prone to forgetting finer details when they are needed, and it does not scale to data such as books, novels, etc.

Methodology

I propose to use a new data structure called "Hierarchical Aggregate Tree" (HAT)

The tree consists of "nodes"

[Figure: a HAT with memory length 2]

Nodes in HAT

Attributes

  1. ID - a unique identifier (UID), generated from information about the node's children
  2. text - the content of this node
  3. prev_states - a hashmap/dictionary of all the child nodes this node has seen
  4. parent - a pointer to the node's parent
  5. aggregator - a function that sets this node's text to an aggregate of its children's text
  6. children - the list of child nodes

Operations

  1. Update Parent
  2. Add Child
  3. Remove Child
  4. Update Text
  5. Hash Children
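To make the node structure concrete, below is a minimal Python sketch of such a node. It is an illustration only, not the project's actual implementation: the class name HATNode, the _refresh helper, and the default whitespace-joining aggregator are assumptions.

import hashlib

class HATNode:
    """Minimal sketch of a HAT node (illustrative, not the actual implementation)."""

    def __init__(self, text="", aggregator=None):
        self.text = text              # content of this node
        self.aggregator = aggregator  # e.g. an LLM summariser; any callable works here
        self.parent = None            # pointer to the parent node
        self.children = []            # list of child nodes
        self.prev_states = {}         # children-hash -> previously computed text

    @property
    def id(self):
        # UID generated from information about the current children.
        return self.hash_children()

    def hash_children(self):
        key = "||".join(child.text for child in self.children)
        return hashlib.sha256(key.encode("utf-8")).hexdigest()

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        self._refresh()

    def remove_child(self, child):
        self.children.remove(child)
        child.parent = None
        self._refresh()

    def update_text(self, text):
        self.text = text
        self.update_parent()

    def update_parent(self):
        # Propagate a change upward so every ancestor stays consistent.
        if self.parent is not None:
            self.parent._refresh()

    def _refresh(self):
        # Re-derive this node's text from its children, reusing a cached
        # aggregate if this exact child set has been seen before.
        h = self.hash_children()
        if h not in self.prev_states:
            texts = [child.text for child in self.children]
            self.prev_states[h] = self.aggregator(texts) if self.aggregator else " ".join(texts)
        self.update_text(self.prev_states[h])

With this layout, Hash Children gives every child set a stable key, and prev_states lets a node fall back to a cached summary instead of calling the aggregator again.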


HAT

Attributes

  1. memory length - the number of nodes aggregated under one parent in each layer
  2. aggregator - a function that takes in texts, aggregates them, and returns a text
  3. layers - a list of nodes, layer by layer
  4. num layers - the number of layers currently in the tree

Operations

  1. Insert Node

Insert(node, L)

  1. Append the node to layer L.
  2. If the number of nodes in layer L is not a multiple of the memory length, add the node as a child of the last node in the layer above.
  3. If it is a multiple, detach the most recent memory-length nodes from the last node in the layer above, re-home them under a new node np, and call Insert(np, L-1).
  4. The recursion continues upward. When L == 0, create a new layer on top and add the existing top-level nodes as children of a new root.

Note: because children are hashed, removing children does not force the summaries to be recomputed; a previously computed aggregate for the same child set is simply reused.
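To illustrate the full insert routine, here is a minimal sketch of the HAT container under one plausible reading of the steps above. It reuses the HATNode sketch from earlier; the add_leaf helper, the lazy creation of parents, and the exact re-parenting at the root are assumptions for illustration, not the actual implementation.

class HAT:
    """Minimal sketch of the Hierarchical Aggregate Tree (illustrative only)."""

    def __init__(self, memory_length, aggregator=None):
        self.memory_length = memory_length  # nodes grouped under one parent per layer
        self.aggregator = aggregator        # e.g. an LLM summariser; any callable works here
        self.layers = [[], []]              # layer 0 is the root layer; the last layer holds leaves

    @property
    def num_layers(self):
        return len(self.layers)

    def add_leaf(self, text):
        """Insert one new leaf (e.g. a single chat turn) at the deepest layer."""
        self.insert(HATNode(text, aggregator=self.aggregator), self.num_layers - 1)

    def insert(self, node, L):
        layer = self.layers[L]
        layer.append(node)

        if L == 0:
            # Root layer: once more than one node accumulates here, create a
            # new layer on top and re-parent the current top nodes under it.
            if len(layer) > 1:
                new_root = HATNode("", aggregator=self.aggregator)
                for n in layer:
                    new_root.add_child(n)
                self.layers.insert(0, [new_root])
            return

        # Keep the running aggregate current: the newest node is attached to
        # the last node in the layer above (if that layer has a node yet).
        above = self.layers[L - 1]
        if above:
            above[-1].add_child(node)

        if len(layer) % self.memory_length == 0:
            # The latest group of `memory_length` nodes is complete: detach it
            # from its tentative parent, re-home it under a fresh node np, and
            # push np one layer up. Because child sets are hashed, the detached
            # parent falls back to a cached earlier summary instead of recomputing.
            group = layer[-self.memory_length:]
            if above:
                for n in reversed(group):
                    above[-1].remove_child(n)
            np_node = HATNode("", aggregator=self.aggregator)
            for n in group:
                np_node.add_child(n)
            self.insert(np_node, L - 1)

For conversations, add_leaf would be called once per chat turn, with the aggregator being a summarisation call to the LLM; the last node of each layer then keeps a running aggregate, so the top of the tree always summarises the whole dialogue.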

HAT for Conversations

[Figure: a HAT built over a conversation. Individual Speaker 1 and Speaker 2 chat turns form the leaves; these are aggregated into Level 1 summaries, then Level 2 summaries, up to a summary of the entire conversation at the root.]

More Generally

Given a task T, we now need to find the best strategy to traverse the tree.

Actions at each state: Up, Down, Left, Right

States: nodes

Train an agent?

Baseline: GPT as the agent.

Baseline: GPT as Agent

[Figure: the agent loop. The user query q and the context retrieved from the memory module (the HAT) are passed to the GPT agent; the agent outputs an action A; the memory manager applies A to move within the tree; when A == "OK", the retrieved context c is returned.]

Note: this framework can be used for a variety of tasks; for now, we concentrate on open-domain conversations.

Strategy

Start from the root node. If you need finer details, go down; go left for older information and right for more recent information.
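To make this concrete, the sketch below shows how a GPT agent could drive the traversal using a prompt like the one on the next slides. Everything here is an assumption for illustration: ask_llm, build_agent_prompt, parse_action, and the memory helpers get_text and in_bounds stand in for the actual prompt template and the HAT-backed memory module.

MAX_STEPS = 10  # assumed cap, in the spirit of the "7-8 traversals" guideline in the prompt

def parse_action(reply):
    """Pull the single-letter action out of an 'Action: X / Reason: ...' style reply."""
    for line in reply.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("action") and ":" in stripped:
            letters = [ch for ch in stripped.split(":", 1)[1].upper() if ch.isalpha()]
            return letters[0] if letters else ""
    return ""

def retrieve_context(memory, user_message, ask_llm, build_agent_prompt):
    """Let the LLM agent walk the HAT until it judges the context sufficient."""
    layer, node, err = 0, 0, ""   # start at the root: high-level summary of everything so far
    for _ in range(MAX_STEPS):
        context = memory.get_text(layer, node)  # text of the node we are currently standing on
        action = parse_action(ask_llm(build_agent_prompt(context, user_message, layer, node, err)))
        if action == "O":                       # current information sufficient
            return context
        if action == "S":                       # reset to the root
            layer, node, err = 0, 0, ""
            continue
        moves = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}  # 'U' treated as Up here
        if action in moves:
            dl, dn = moves[action]
            if memory.in_bounds(layer + dl, node + dn):
                layer, node, err = layer + dl, node + dn, ""
            else:
                err = f"Move {action} leaves the tree; choose another action."
        else:
            break                               # anything else is treated as "insufficient"
    return None  # caller falls back to answering without retrieved memory

In the actual system the traversal history and the reason behind each move, which the prompt exposes to the agent, would also be recorded at every step; that bookkeeping is omitted here.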


Prompt Used (tree-traversal agent)

You are an agent traversing a tree. Given a user message and details/personas from a conversation, your task is to determine if the context provided is sufficient to answer the user's question by the other speaker. You begin at the root node, which should contain a high-level summary of all information up to this point. If more information is needed, you can explore the tree, following the traversal history provided. Avoid circular paths and make sure to explore effectively based on the traversal history. If after 7-8 traversals the information remains unavailable, use the 'U' action. If you reach the bottom or edge of the tree, reset with the 'S' action. The tree's structure is as follows:

1) Tree depth corresponds to information resolution. Leaf nodes are individual sentences from chats, and higher nodes summarize their children. Descend for more details, ascend for high-level information.
2) Horizontal position reflects time. The rightmost node is the most recent. Depending on the query and information, decide whether to search in the past or future. Move left for past information, right for future information.

Layer '0' is the root. Layer number increases with depth, and node number increases to the right. Stay within the permitted limits of layers and nodes.

Pay attention to the 'err' field for feedback on your moves.

Your actions can only be:
- U: Up
- D: Down
- R: Right
- L: Left
- S: Reset (Return to root node)
- O: Current Information Sufficient to Respond
- U: Current Information Insufficient to Respond

#############
Traversal History: {self.memory.get_current_traversal()}
Latest Moves and Reasons: {self.memory.traversals[:-3]}
#############
Current Location: Layer: {self.memory.layer_num}, Node: {self.memory.node_num}
Total Layers: {len(self.memory.hst.layers)},
Total Nodes in Current Layer: {len(self.memory.hst.layers[self.memory.layer_num]) if (self.memory.layer_num < len(self.memory.hst.layers)) else 0}
       
#############
Details/Personas: {self.memory._get_memory()}
#############
User Message: {user_message}
############
Errors: {err}
############

Action:
Reason: (Tell reason why you chose the action)
 

Prompt Used (response generation)

You are in a natural conversation. Given with a memory and user query, respond (You are Speaker 2) such that you continue the conversation and answer the user query if any from the memory. If memory is not sufficient to answer, tell i dont know.

 

###########

memory: {mem}

###########

user query: {user_query}

##########

Speaker 2:

Dataset and Evaluation

Dataset

We use the "Multi-Session Chat" (MSC) dataset [5].

[5] Xu, Jing, Arthur Szlam, and Jason Weston. "Beyond goldfish memory: Long-term open-domain conversation." arXiv preprint arXiv:2107.07567 (2021).

Metrics 

BLEU (Bilingual Evaluation Understudy)

\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)

where the w_n are the n-gram weights, p_n is the ratio of n-grams in the candidate that match n-grams in any reference, and the brevity penalty is

\text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1-r/c)} & \text{if } c \leq r \end{cases}

with c the candidate length and r the reference length.
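For intuition, here is a small self-contained sketch of sentence-level BLEU following the formula above (uniform weights w_n, clipped n-gram matches for p_n, and the brevity penalty); it is illustrative only and not necessarily the exact implementation used for the reported numbers.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())        # clipped matches
        precisions.append(max(overlap, 1e-9) / max(sum(cand_counts.values()), 1))   # p_n, smoothed
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))   # brevity penalty BP
    w = 1.0 / max_n                                       # uniform weights w_n
    return bp * math.exp(sum(w * math.log(p) for p in precisions))

print(round(bleu("i love walking and running", "i love walking and running daily"), 3))  # 0.819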

Metrics 

F1

\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Precision:  proportion of words (or n-grams) in the candidate sentence that are also found in the reference sentence.

Recall: proportion of words (or n-grams) in the reference sentence that are also found in the candidate sentence
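A minimal sketch of word-overlap F1, matching the precision and recall definitions above (illustrative; tokenisation details may differ from the evaluation scripts actually used):

from collections import Counter

def f1_score(candidate, reference):
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())          # words shared by candidate and reference
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())      # shared / candidate length
    recall = overlap / sum(ref.values())          # shared / reference length
    return 2 * precision * recall / (precision + recall)

print(round(f1_score("yes i went for a run today", "yes i actually was able to run today"), 3))  # 0.533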

Metrics 

DISTINCT (Distinct-N)

Measures the diversity of the response

\text{DISTINCT-N} = \frac{|\{\text{Unique N-grams in the generated text}\}|}{|\{\text{Total N-grams in the generated text}\}|}
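And a matching sketch for DISTINCT-N (unique n-grams over total n-grams in the generated text), again illustrative:

def distinct_n(text, n=1):
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

reply = "i love to grill i love meat especially steak"
print(round(distinct_n(reply, 1), 2), round(distinct_n(reply, 2), 2))  # 0.78 0.88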

Results

  1. The MSC dataset has about 500 conversations in its test set, spanning 5 sessions.
  2. We follow the methodology described above and compare the summary generated at the end of every session with the gold summary given in the dataset.

         F1       BLEU-1/2       DISTINCT-1/2
Ours     38.88    37.98/16.18    0.76/0.88
[7]      25.35    28.52/11.63    -

[7] Wang, Qingyue, et al. "Recursively summarizing enables long-term dialogue memory in large language models." arXiv preprint arXiv:2308.15022 (2023).

* Results are reported only for Session 2; results for the other sessions are yet to be obtained. DISTINCT-1/2 is reported for our method only.


Beyond Numbers: An Example

Session 2

Speaker 1: What kind of car do you own? I have a jeep.

Speaker 2: I don't own my own car! I actually really enjoying walking and running, but then again, I live in a small town and semi-close to work.

Speaker 1: Ah I see! I like going to the gym to work out.

Speaker 2: I'm a computer programmer. What do you do for work?

Speaker 1: I work in marketing. Do you have any hobbies?

Speaker 2: I love to grill! I love meat, especially steak.

Speaker 1: Oh neat! I don't eat meat but I do like some grilled veggies.

Speaker 2: My family is from Alaska. Where are you from?

Speaker 1: I'm from New England. It must be cold up there.

Speaker 2: It's so cold! I think that's why I love wearing warm pants!

Speaker 1: Haha I feel you. I love wearing warm pants in the winter.

Speaker 2: Do you play any sports?


Session 3

Speaker 1: Did you manage to go for a run today?

Note: nowhere up to this point has Speaker 2 explicitly mentioned going for a run today.

[Figure: the usual LLM behaviour on this query]

Our Method

The agent's traversal for this query:

  1. Action: 'D' (Descending the tree)

    • Reason: The user is asking whether the other speaker went out on a run today. This requires exploring the more recent nodes in the tree for relevant information.
  2. Action: 'S' (Resetting and exploring the tree)

    • Reason: The current location in the tree does not contain specific information related to running, necessitating a reset to find relevant information.
  3. Action: 'D' (Descending the tree)

    • Reason: To find information about going for a run, descending the tree is necessary to explore previous messages.
  4. Action: '' (No specific action)

    • Reason: Not provided.
  5. Action: 'D' (Descending the tree)

    • Reason: To check if there is any mention of the other speaker going for a run, descending the tree is required to explore information in lower layers.
  6. Action: 'R' (Moving right in the tree)

    • Reason: Having already explored downwards and reset, moving to the right is necessary to find information possibly in the future.
  7. Action: 'D' (Descending the tree)

    • Reason: To find related information about going out on a run today, it's best to explore the lower nodes.
  8. Action: 'O' (OK with the current context)

    • Reason: Given the user's query about going out for a run and Speaker 2's persona, which includes enjoyment in walking and running, it is assumed Speaker 2 might have gone out for a run.
  • Layer: 3
  • Node: 1
  • Final Memory: "Speaker 2: I don't own my own car! I actually really enjoy walking and running, but then again, I live in a small town and am semi-close to work."


When we return this reasoning and context, the LLM is more likely to answer that yes, it went for a run.


Actual Response: 'Yes I actually was able too. I am considering joining the local gym. Do you prefer going to the gym?' (From Dataset)


Future Directions

  • Can we "train" our own agent for sequential decision making, instead of relying on GPT?
  • So far we have compared only the summaries; we should also compare the actual responses and report those results.
  • Try different strategies for exploring the tree.

Thank You
