Hello

Making a better Siri

with the TALL stack

User asks question

Robot thinks

Robot answers question

User asks question

(using voice)

Robot thinks

(using ChatGPT?)

Robot answers question

(using audio)

step 1.0: speech to text

$relativePath = fmt('%/%', messageAudioFolderPath(), $fileName);
$type = Storage::mimeType($relativePath);

$response = Http::acceptJson()
    ->withHeaders([
        'Authorization' => 'Token '.config('services.deepgram.secret'),
    ])
    ->withBody(
        Storage::get($relativePath), $type
    )
    ->post('https://api.deepgram.com/v1/listen');

if ($alternatives = $response->json('results.channels.0.alternatives')) {
    return $alternatives[0]['transcript'];
}

throw new ConvertQuestionAudioToTextException(
    'Could not convert question audio to text'
);
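The transcript lookup above can also be sketched in plain PHP, without the HTTP client. This is a minimal sketch: `firstTranscript` is a hypothetical helper name and the payload is made up, trimmed to the `results.channels.0.alternatives` keys the code above reads.

```php
<?php

// Hypothetical helper: mirrors the 'results.channels.0.alternatives' lookup
// and returns null instead of failing when any key is missing.
function firstTranscript(array $payload): ?string
{
    $alternatives = $payload['results']['channels'][0]['alternatives'] ?? [];

    if (! is_array($alternatives) || $alternatives === []) {
        return null;
    }

    $transcript = $alternatives[0]['transcript'] ?? null;

    return is_string($transcript) ? $transcript : null;
}

// Made-up payload, reduced to the keys the lookup touches.
$payload = [
    'results' => [
        'channels' => [
            ['alternatives' => [['transcript' => 'I want to paint my house']]],
        ],
    ],
];

$transcript = firstTranscript($payload);
```

A null return here corresponds to the point where the code above throws ConvertQuestionAudioToTextException.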

<div
    class="container mx-auto"
    x-data="{
        isRecording: false,
        blob: @entangle('blob').defer,
        async startRecording() {
            this.isRecording = true;
            return window.AudioRecorder.start();
        },
        async stopRecording() {
            const blob = await window.AudioRecorder.stop();
            const base64 = await this.blobToBase64(blob);

            this.isRecording = false;

            this.blob = base64;
        },
        blobToBase64(blob) {
            return new Promise(resolve => {
                const reader = new FileReader();
                reader.onloadend = () => resolve(reader.result);
                reader.readAsDataURL(blob);
            });
        },
    }"
>

How to Record Audio in JavaScript

By Reema Alzohairi

if (! function_exists('writeAudioFile')) {
    function writeAudioFile(string $blob): string
    {
        $tempName = tempnam(sys_get_temp_dir(), 'question');
        file_put_contents($tempName, file_get_contents($blob));
        
        $path = Storage::putFile(
            messageAudioFolderPath(),
            new File($tempName),
        );

        return str($path)->afterLast('/');
    }
}
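The recorder sends the audio as a data URL (the result of `readAsDataURL`), and `writeAudioFile` relies on `file_get_contents` to decode it. As a rough sketch of what that decoding involves, here is an explicit version; `decodeDataUrl` is a hypothetical helper, not part of the app, and the sample input is made up.

```php
<?php

// Hypothetical helper: split a data URL into its media type and raw bytes.
function decodeDataUrl(string $dataUrl): array
{
    // Media type (possibly with parameters), optional ";base64" marker, then the data.
    if (preg_match('#^data:([^,]*?)(;base64)?,(.*)$#s', $dataUrl, $matches) !== 1) {
        throw new InvalidArgumentException('Not a data URL');
    }

    $mime = $matches[1] !== '' ? $matches[1] : 'text/plain';

    $bytes = $matches[2] !== ''
        ? base64_decode($matches[3], true)
        : rawurldecode($matches[3]);

    if ($bytes === false) {
        throw new InvalidArgumentException('Invalid base64 payload');
    }

    return ['mime' => $mime, 'bytes' => $bytes];
}

// Made-up example: the payload is the base64 encoding of "Hello".
$decoded = decodeDataUrl('data:audio/webm;codecs=opus;base64,SGVsbG8=');
```

Decoding explicitly also exposes the media type the browser reported, which could be checked before the file is written.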

step 2.0: text to speech

$tempName = tempnam(sys_get_temp_dir(), 'answer').'.mp3';

Http::sink($tempName)
    ->withOptions([
        'curl' => [
            CURLOPT_USERPWD => 'apikey:'.config('services.watson-tts.secret'),
        ],
    ])
    ->accept('audio/mpeg')
    ->post(config('services.watson-tts.url').'/v1/synthesize', [
        'text' => $text,
    ])
    ->onError(function (Response $response) {
        throw new ConvertAnswerTextToAudioException($response->body());
    });

if ($path = Storage::putFile(messageAudioFolderPath(), new File($tempName))) {
    return str($path)->afterLast('/');
}

throw new ConvertAnswerTextToAudioException('Could not write answer audio file');

@if ($message->audio_file_name)
    <audio
        controls
        autoplay
    >
        <source
            src="/storage/messages/{{ $message->audio_file_name }}"
            type="audio/mpeg"
        >
        Your browser does not support the audio element.
    </audio>
@endif

step 3.0: asking the robot

$client = OpenAI::client(config('services.openai.secret'));

$messages = $conversation
    ->messages()
    ->get()
    ->map(fn(Message $message) => [
        'role' => $message->role,
        'content' => $message->text,
    ]);

$response = $client->chat()->create([
    'model' => 'gpt-3.5-turbo',
    'messages' => $messages,
    'max_tokens' => $maxTokens,
]);

if (isset($response['choices'][0]['message']['content'])) {
    return trim($response['choices'][0]['message']['content']);
}

throw new AskAiException($response->toArray());

step 4.0: putting it

all together

attempt 1: everything in the Livewire component

attempt 2: queued jobs, single interactions

class Message extends Model
{
    use HasFactory;
    use HasStates;

    protected $guarded = [];

    protected $casts = [
        'state' => MessageState::class,
    ];

    public function conversation(): BelongsTo
    {
        return $this->belongsTo(Conversation::class);
    }
}

abstract class MessageState extends State
{
    public static function config(): StateConfig
    {
        return parent::config()
            ->registerState([
                AskingAiMessageState::class,
                ConvertingAnswerTextToAudioMessageState::class,
                ConvertingQuestionAudioToTextMessageState::class,
                DoneMessageState::class,
                FailedMessageState::class,
                PrepareQuestionAudioMessageState::class,
                QueuedMessageState::class,
            ])
            ->default(
                QueuedMessageState::class
            )
            
            // ...

// system setting the prompt

->allowTransition(
    QueuedMessageState::class,
    DoneMessageState::class,
)
->allowTransition(
    QueuedMessageState::class,
    FailedMessageState::class,
)

// user asking a question

->allowTransition(
    QueuedMessageState::class,
    PrepareQuestionAudioMessageState::class,
)
->allowTransition(
    QueuedMessageState::class,
    AskingAiMessageState::class,
)
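The idea behind `allowTransition` can be illustrated with a plain lookup table. This is a sketch only, not how the state package works internally: it covers just the four transitions shown above, and the short string state names, `canTransition` and `transition` are assumptions made for the example.

```php
<?php

// Only the four transitions shown above; the string names are shortened
// stand-ins for the state classes (an assumption for this sketch).
const ALLOWED_TRANSITIONS = [
    'queued' => ['done', 'failed', 'prepare_question_audio', 'asking_ai'],
];

// True when moving from $from to $to is listed in the table.
function canTransition(string $from, string $to): bool
{
    return in_array($to, ALLOWED_TRANSITIONS[$from] ?? [], true);
}

// Returns the new state, or throws when the move is not allowed.
function transition(string $from, string $to): string
{
    if (! canTransition($from, $to)) {
        throw new LogicException("Transition from {$from} to {$to} is not allowed");
    }

    return $to;
}

$state = transition('queued', 'asking_ai');
```

Anything not listed is rejected, so a message cannot jump back from a finished state to a queued one by accident.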

$this->message->state->transitionTo(
    PrepareQuestionAudioMessageState::class
);
$this->prepareQuestionAudio();

$this->message->state->transitionTo(
    ConvertingQuestionAudioToTextMessageState::class
);
$this->convertQuestionAudioToText();

// ...

} catch (ConvertQuestionAudioToTextException $e) {
    $this->message->state->transitionTo(
        FailedMessageState::class
    );

    $this->message->reason_failed = $e->getMessage();
    $this->message->save();
}

step 0.5: preparing

question audio

try {
    FFMpeg::open(fmt('%/%', messageAudioFolderPath(), $fileName))
        ->export()
        ->inFormat(new Mp3())
        ->save(fmt(
            '%/%.%',
            messageAudioFolderPath(),
            str($fileName)->beforeLast('.'),
            'mp3',
        ));
} catch (Throwable $e) {
    throw new ConvertQuestionVideoToAudioException(
        'Could not convert question video to audio'
    );
}

step 3.5: splitting conversation contexts

public function mount(): void
{
    if ($id = session()->get('conversation')) {
        $this->conversation = Conversation::findOrFail($id);

        return;
    }

    $conversation = Conversation::create([
        'is_manager' => true,
    ]);

    $message = $conversation
        ->messages()
        ->create([
            'role' => 'system',
            'text' => file_get_contents(config('prompts.manager')),
        ]);

    $message->state->transitionTo(DoneMessageState::class);

    session()->put('conversation', $conversation->id);

    $this->conversation = $conversation;
}

Here is the list of commands:

- [list conversations]
- [create conversation]
- [forward message 'x' to conversation ID y] (where 'x' is the message you want to forward and 'y' is the ID of the conversation you are busy remembering)
- [reset]

These are the only things I want you to say. I do not want you to answer my questions directly because that is not your job. You are a conversation manager, acting as an LLM API between my database and nested conversations.

When I send you a message, and you do not already have an appropriate conversation ID in your memory, you look for an appropriate conversation using the '[list conversations]' command. I will give you back a list of conversations. You pick based on their summary, and you remember the ID of the conversation.

If I tell you there are no conversations, you issue the '[create conversation]' command so that I can create a new conversation. I will tell you an ID that you can remember.

If you have a conversation ID in memory then you can forward my original message to it with the '[forward message ...]' command.

For example:

I say: "I want to paint my house"
You say: "[list conversations]" (because you don't have a conversation ID in memory yet)
I say: "there are no conversations"
You say: "[create conversation]"
I say: "created conversation with ID 1"
You say: "[forward message 'I want to paint my house' to conversation 1]"

Another example:

I say: "I want to paint my house"
You say: "[list conversations]" (because you don't have a conversation ID in memory yet)
I say: "conversations: conversation about 'riding a bicycle' ID of 2"
You say: "[create conversation]" (because none of the conversations are relevant)
I say: "created conversation with ID 3"
You say: "[forward message 'I want to paint my house' to conversation 3]"

Another example:

I say: "I want to paint my house"
You say: "[list conversations]" (because you don't have a conversation ID in memory yet)
I say: "conversations: conversation about 'painting houses' ID of 3"
You say: "[forward message 'I want to paint my house' to conversation 3]" (because that is the most relevant conversation)

Another example:

I say: "I want to reset" or "reset" or "let's start over" or something similar
You say: "[reset]"
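
On the application side, the manager's replies have to be turned back into actions. A minimal sketch of that parsing, assuming the reply contains nothing but the bracketed command; `parseManagerCommand` and the returned array shape are hypothetical, not taken from the app.

```php
<?php

// Hypothetical parser for the manager's bracketed commands.
// Returns null when the reply is not one of the known commands.
function parseManagerCommand(string $reply): ?array
{
    $reply = trim($reply);

    if ($reply === '[list conversations]') {
        return ['command' => 'list'];
    }

    if ($reply === '[create conversation]') {
        return ['command' => 'create'];
    }

    if ($reply === '[reset]') {
        return ['command' => 'reset'];
    }

    // Accepts both "to conversation 3" and "to conversation ID 3".
    $pattern = '/^\[forward message \'(.*)\' to conversation (?:ID )?(\d+)\]$/s';

    if (preg_match($pattern, $reply, $matches) === 1) {
        return [
            'command' => 'forward',
            'message' => $matches[1],
            'conversation_id' => (int) $matches[2],
        ];
    }

    return null;
}

$parsed = parseManagerCommand("[forward message 'I want to paint my house' to conversation 1]");
```

A null result is a useful signal that the model has drifted from the prompt, for example by answering the question directly.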