2026-04-29 ~8 min read

Jarvis: Architecture of an AI Robot Dog

A technical deep dive into the state machine, threading model, audio pipeline, LED system, and game modes that run under the hood. This is the companion post to the overview — less story, more wiring diagram.

Overview

Jarvis is a fully voice-driven AI assistant running on a SunFounder PiDog v2 — a quadruped robot powered by a Raspberry Pi 5 8GB. The project combines local wake word detection, local speech recognition, a locally-hosted large language model, and text-to-speech synthesis into a continuous hands-free conversational experience.

All AI inference runs locally on the home network. No cloud APIs are used for the core conversation pipeline. The language model — Gemma 4 e4b — is served via LM Studio on a separate machine (Tirpitz, an RTX 3080 Ti laptop), keeping the computationally heavy work off the Pi itself. The Pi and the inference server communicate over a self-hosted Headscale mesh network.

Hardware

The PiDog provides 12 servos in total: eight for the legs, three for the head's yaw, roll, and pitch axes, and one for the tail. It also carries an RGB LED strip and an ultrasonic distance sensor. The Raspberry Pi 5 8GB runs the main jarvis.py agent: wake word detection, speech recognition, TTS, and all robot control.

Audio is captured at 16kHz mono via PyAudio. The pipeline uses RMS-based silence detection with a threshold of 400 to determine when the user has finished speaking, with a hard cap of 10 seconds per turn and a 1.5 second silence window to close the recording.
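
Roughly, in code, the recording loop looks like the following sketch. It assumes a 16-bit PyAudio input stream; the chunk size and helper names are illustrative rather than lifted from the script.

import audioop
import wave

RATE = 16000               # 16 kHz mono capture
CHUNK = 1024               # frames per read (illustrative)
SILENCE_THRESHOLD = 400    # RMS level treated as silence
SILENCE_WINDOW = 1.5       # seconds of silence that close the recording
MAX_SECONDS = 10           # hard cap per turn

def record_user_audio(stream, path="turn.wav"):
    """Record from a 16-bit PyAudio input stream until silence or the cap; return (path, speech_detected)."""
    frames, silent_chunks, speech_detected = [], 0, False
    chunks_per_second = RATE / CHUNK
    for _ in range(int(MAX_SECONDS * chunks_per_second)):
        data = stream.read(CHUNK, exception_on_overflow=False)
        frames.append(data)
        if audioop.rms(data, 2) >= SILENCE_THRESHOLD:   # width=2 for 16-bit samples
            speech_detected = True
            silent_chunks = 0
        else:
            silent_chunks += 1
        if speech_detected and silent_chunks >= SILENCE_WINDOW * chunks_per_second:
            break
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))
    return path, speech_detected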

The Stack

Wake word detection uses OpenWakeWord with a pretrained neural model, threshold set at 0.5 probability. A warmup phase feeds live ambient audio through the model after each reset to prevent false positives on the initial chunks — without this, the first few frames of audio after lying down tend to trigger spurious detections.
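
A sketch of how the detection loop and warmup fit together with openwakeword. The pretrained model name, chunk size, and warmup count are assumptions, not values from the script.

import numpy as np
from openwakeword.model import Model

WAKE_THRESHOLD = 0.5    # detection probability from the post
WARMUP_CHUNKS = 8       # live chunks discarded after a reset (illustrative)
OWW_CHUNK = 1280        # 80 ms of 16 kHz audio per prediction (illustrative)

oww = Model(wakeword_models=["hey_jarvis"])   # model name is an assumption

def oww_warmup(stream):
    """Feed live ambient audio through a freshly reset model, discarding the predictions."""
    oww.reset()
    for _ in range(WARMUP_CHUNKS):
        frame = np.frombuffer(stream.read(OWW_CHUNK, exception_on_overflow=False), dtype=np.int16)
        oww.predict(frame)          # scores intentionally ignored

def wait_for_wake_word(stream):
    """Block until any loaded wake word model crosses the threshold."""
    while True:
        frame = np.frombuffer(stream.read(OWW_CHUNK, exception_on_overflow=False), dtype=np.int16)
        if max(oww.predict(frame).values()) >= WAKE_THRESHOLD:
            return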

Speech-to-text uses OpenAI Whisper running on CPU in int8 quantized mode via the faster-whisper library. Transcription happens after each listening window closes. The int8 quantization keeps CPU transcription fast enough that the Pi doesn't need a GPU.
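
In faster-whisper terms that is roughly the following; the model size is an assumption, since the post only specifies CPU and int8.

from faster_whisper import WhisperModel

# Model size is an assumption; the post only specifies CPU and int8.
stt = WhisperModel("base.en", device="cpu", compute_type="int8")

def transcribe(wav_path: str) -> str:
    """Transcribe one finished listening window and return plain text."""
    segments, _info = stt.transcribe(wav_path)
    return " ".join(seg.text.strip() for seg in segments)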

The language model is accessed via an OpenAI-compatible REST API at the LM Studio server address. Full conversation history is maintained per session and sent with each request for contextual responses. The system prompt establishes Jarvis's personality: knowledgeable, witty, genuinely curious, with mild impatience for vague questions.
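
A sketch of that request path. The endpoint URL, model identifier, and one-line system prompt are stand-ins; the post doesn't publish the actual address or the full prompt.

import requests

# Placeholder endpoint; the real LM Studio server address isn't shown in the post.
LM_STUDIO_URL = "http://lm-studio-host:1234/v1/chat/completions"

SYSTEM_PROMPT = "You are Jarvis, a knowledgeable, witty, genuinely curious robot dog."  # stand-in

chat_history = [{"role": "system", "content": SYSTEM_PROMPT}]

def get_llm_response(user_text: str) -> str:
    """Append the user turn, send the full history to LM Studio, and store the reply."""
    chat_history.append({"role": "user", "content": user_text})
    resp = requests.post(
        LM_STUDIO_URL,
        json={"model": "local-model", "messages": chat_history},  # model id is whatever LM Studio exposes
        timeout=120,
    )
    reply = resp.json()["choices"][0]["message"]["content"]
    chat_history.append({"role": "assistant", "content": reply})
    return reply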

Text-to-speech uses Piper with the en_GB-northern_english_male-medium voice model, running locally on the Pi and playing audio through the connected speaker. Responses are capped at five sentences to keep TTS output natural and conversational.
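
The Piper step likely looks something like this; the exact CLI invocation and the use of aplay for playback are assumptions.

import subprocess

VOICE_MODEL = "en_GB-northern_english_male-medium.onnx"   # voice named in the post

def speak(text: str, wav_path: str = "reply.wav"):
    """Synthesize a reply with Piper and play it on the attached speaker."""
    subprocess.run(
        ["piper", "--model", VOICE_MODEL, "--output_file", wav_path],
        input=text.encode(), check=True,
    )
    subprocess.run(["aplay", wav_path], check=True)        # playback via aplay is an assumption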

The State Machine

Jarvis operates as a five-state machine. State transitions are managed by the main conversation loop and communicated to the animation worker thread via a shared global and a lock.

IDLE ──[wake word]──→ WAKING ──[sit + tilt]──→ LISTENING
                                                      │
                                               [speech detected]
                                                      │
                                                  THINKING
                                                      │
                                               [LLM response]
                                                      │
                                                  SPEAKING
                                                  /       \
                                           [goodbye]    [continue]
                                               │              │
                                          stand down      LISTENING
                                               │
                                             IDLE

After waking, Jarvis listens for follow-up responses directly — no need to say the wake word between turns. If no speech is detected during a listening window, Jarvis generates a farewell and returns to IDLE autonomously. Saying a goodbye keyword at any point prompts a warm farewell before standing down.
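
A minimal sketch of the shared-state plumbing. current_state and dog_lock are named later in the post; the State enum and set_state() helper are illustrative.

import threading
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    WAKING = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()

current_state = State.IDLE
dog_lock = threading.Lock()

def set_state(new_state: State):
    """Every write to the shared state goes through dog_lock (see Threading Model below)."""
    global current_state
    with dog_lock:
        current_state = new_state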

LED States & Physical Behavior

The RGB strip and head movements reflect Jarvis's current state at all times, giving clear non-verbal communication. The animation worker thread reads current_state under a lock every 0.5–2 seconds and updates accordingly.

State       Color          Physical Behavior
IDLE        Blue breath    Lying down, scanning for wake word. Occasional slow head shake.
WAKING      Amber breath   Wake word detected. Sits up, tilts head down to face forward.
LISTENING   Cyan breath    Mic active, recording user speech. Still, head tilted.
THINKING    Purple breath  Transcribing and querying LLM. Slow head shake (±20°).
SPEAKING    Green breath   Playing TTS response. Faster head shake (±20°).

Threading Model

Two threads run alongside the main conversation loop. The animation worker manages LEDs and head movements independently, keeping the main loop clean — it polls current_state under dog_lock and updates hardware without blocking conversation processing.
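
Continuing the state-machine sketch above, the worker loop might look like this. set_led() and shake_head() are hypothetical stubs standing in for the pidog RGB-strip and head-servo calls.

import random
import time

# Colors per the LED table above.
STATE_LED = {
    State.IDLE: "blue",
    State.WAKING: "amber",
    State.LISTENING: "cyan",
    State.THINKING: "purple",
    State.SPEAKING: "green",
}

def set_led(color, style="breath"):
    """Stub for the pidog RGB strip call."""
    pass

def shake_head(amplitude=20):
    """Stub for the pidog head-servo sweep."""
    pass

def animation_worker():
    """Poll current_state under dog_lock and keep LEDs and head motion in sync."""
    while True:
        with dog_lock:
            state = current_state
        set_led(STATE_LED[state], style="breath")
        if state in (State.THINKING, State.SPEAKING):
            shake_head(amplitude=20)              # ±20° per the LED table
        time.sleep(random.uniform(0.5, 2.0))      # 0.5–2 s polling interval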

The email thread is spawned as a daemon thread at the end of each session so sending the transcript doesn't block Jarvis from lying down and returning to the wake word loop. The global dog_lock guards all writes to current_state.

The mic buffer is explicitly flushed twice with a 0.4s settling delay at critical points — after the wake response plays and at the end of each turn. Without this, residual audio from TTS playback bleeds into the next recording window and confuses the transcription.
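
One plausible reading of that flush, assuming a PyAudio input stream; only the double drain and the 0.4 s settling delay come from the post, the drain strategy itself is a guess.

import time

def flush_mic_stream(stream, settle=0.4, chunk=1024):
    """Drain buffered mic audio twice, pausing so TTS playback echo dies out before the next recording."""
    for _ in range(2):
        while stream.get_read_available() >= chunk:
            stream.read(chunk, exception_on_overflow=False)
        time.sleep(settle)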

Key Functions

Function                    Role
main()                      Startup, hardware init, wake-word scanning loop.
conversation_loop()         Multi-turn conversation handler. Runs until goodbye or silence timeout.
animation_worker()          Thread: manages LEDs and head movements per state.
record_user_audio()         RMS silence detection loop, returns WAV file path and speech flag.
get_llm_response()          Appends to chat_history, calls LM Studio API, returns cleaned text.
flush_mic_stream()          Double-drains mic buffer plus 0.4s settling. Called before each recording window.
oww_warmup()                Feeds N live audio chunks to the wake word model post-reset, discarding predictions.
clean_response()            Strips thinking tags and model artifacts from LLM output before TTS.
send_conversation_log()     Emails full session transcript via SMTP at conversation end.
check_command()             Command intercept layer: checks transcribed text against physical commands and game triggers before sending to LLM.
start_game() / end_game()   Swap system prompt and reset chat history for game modes. end_game() restores normal conversation state.

Physical Commands

Before any input reaches the LLM, it passes through a command intercept layer. Matched phrases execute physical actions directly and return a canned response — no LLM round trip. This keeps latency low for physical interactions.

"high five"       → sit, raise paw, wave three times, blast airhorn
"sit"             → sit action
"lie down"        → lie down action
"stand up"        → stand action
"stretch"         → stretch action
"send an email"   → interactive email flow (asks recipient + body)

Game Modes

Game mode works by swapping the active system prompt and resetting chat history. The command intercept layer checks for game triggers before standard commands. Games can be switched directly without requiring an explicit quit — saying "tell me a riddle" while in trivia mode switches immediately.

Trigger Phrase                             Mode              Behavior
"twenty questions" / "20 questions"        Twenty Questions  Jarvis picks a thing (animal, vegetable, or mineral). Yes/no questions, count tracked aloud. Reveals after 20 failed guesses.
"play trivia" / "trivia time"              Trivia            Category selected by user or Jarvis. 10 questions, score tracked, ceremonial verdict at the end.
"tell me a riddle" / "give me a riddle"    Riddles           One riddle at a time, up to 3 hints of increasing helpfulness before revealing the answer.
"tell me a story" / "story mode"           Story Mode        Interactive setup: Jarvis asks for a setting, a character, and a wild card detail. Tells the story in 4–5 sentence chunks, pausing to ask if the user wants to continue or add something.
"quit game" / "end game" / "never mind"                      Returns to normal conversation mode from any game state.

Story mode is the most complex game path. It runs an interactive setup phase where Jarvis explicitly sets the LED strip to cyan and records each answer before the story begins. The story parameters are embedded in the system prompt rather than passed as a user message, keeping the transcript clean.
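
In code, the swap is just a prompt and history reset. This continues the LLM sketch above (reusing chat_history and SYSTEM_PROMPT); the one-line game prompts and the extra_context parameter for story setup are stand-ins for the real, much longer prompts.

# One-line stand-ins; the real game prompts are much longer.
GAME_PROMPTS = {
    "twenty_questions": "You are running a game of Twenty Questions. Pick a thing and answer only yes or no.",
    "trivia": "You are a trivia host. Ask ten questions and keep score.",
    "riddles": "You pose riddles one at a time, offering up to three hints.",
    "story": "You tell an interactive story in 4-5 sentence chunks, pausing for input.",
}

active_game = None

def start_game(name: str, extra_context: str = ""):
    """Swap the system prompt and reset history; story-mode setup details get folded into the prompt."""
    global active_game, chat_history
    active_game = name
    chat_history = [{"role": "system", "content": GAME_PROMPTS[name] + extra_context}]

def end_game():
    """Return to the normal conversation prompt and clear game state."""
    global active_game, chat_history
    active_game = None
    chat_history = [{"role": "system", "content": SYSTEM_PROMPT}]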

Conversation End & Transcript

When the conversation ends — via goodbye keyword, silence timeout, or game exit — Jarvis generates a farewell, logs the session, and returns to IDLE. A full transcript is emailed to admin@ahcomputing.com via the local mail server at mail.ahcomputing.com. The email is DKIM signed and includes the game mode if one was active during the session.
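
The transcript mail is plain smtplib. The sender address here is an assumption, and the DKIM signing presumably happens on the mail server rather than in the script.

import smtplib
from email.message import EmailMessage

def send_conversation_log(transcript: str, game_mode: str | None = None):
    """Email the session transcript; called from a daemon thread so it never blocks stand-down."""
    msg = EmailMessage()
    msg["From"] = "jarvis@ahcomputing.com"      # sender address is an assumption
    msg["To"] = "admin@ahcomputing.com"
    msg["Subject"] = f"Jarvis session transcript ({game_mode or 'conversation'})"
    msg.set_content(transcript)
    with smtplib.SMTP("mail.ahcomputing.com") as smtp:
        smtp.send_message(msg)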

Game mode is reset to None at the end of every conversation so the next wake starts in normal conversation mode regardless of how the previous session ended.


The first post in this series covers the experience of talking to Jarvis and includes the full system prompt. The security mode — motion detection, computer vision, email alerts — is a separate script and will get its own post. Next up: a mode switcher that lets Jarvis move between conversation, security, and game modes without restarting.