UpgradeTheSystem.com
Voice-First AI Strategic Mentorship
Richard Hames is one of the world's foremost strategic futurists — 40 years of thinking, 21 published books, and a body of work that most executives have never had direct access to. When he approached me to build his AI, the brief was simple: make those 40 years of thinking available in real-time conversation.
The real challenge wasn't the technology. Chatbots are easy. The challenge was fidelity — building something that didn't just know Richard's work, but thought like him. Surface-level Q&A would be an insult to the material. We needed deep persona engineering, a genuine RAG implementation across the full corpus, and voice that felt like a real conversation — not a text-to-speech read-back.
Stack & Architecture
- Cloudflare Workers — serverless compute at the edge, globally distributed
- Hono — ultra-fast TypeScript web framework that adds no cold-start overhead on Workers
- ElevenLabs — real-time voice synthesis with custom persona voice
- Gemini Flash — sub-second LLM inference for conversational turns
- Cloudflare D1 — SQLite at the edge for session and conversation state
- Cloudflare R2 — object storage for the vectorised knowledge corpus
- WebSockets — persistent bidirectional connection for voice streaming
- RAG pipeline — custom retrieval layer across 21 published works
The Architecture
Everything runs on Cloudflare Workers at the edge, meaning the compute lives close to the user, not in a data centre in Virginia. Hono handles routing with zero overhead. When a user speaks, ElevenLabs transcribes the audio; the transcript feeds into our RAG retrieval layer, which surfaces the most contextually relevant passages from Richard's 21 books stored in R2; Gemini Flash then generates a response grounded in that retrieved context. The response goes back to ElevenLabs for synthesis in Richard's custom voice and streams to the user over WebSockets. The whole round-trip runs under 100ms.
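The turn described above can be sketched as a pipeline of injected stages, which keeps the orchestration testable without the real providers. Every name here (the `Stage` type, `handleVoiceTurn`, the dependency fields) is illustrative, not the production code:

```typescript
// Sketch of the per-turn orchestration. Stage implementations are injected,
// so the flow can be exercised with stubs; in production the stages would
// call ElevenLabs (STT/TTS), the R2-backed retriever, and Gemini Flash.
type Stage<I, O> = (input: I) => Promise<O>;

interface VoiceTurnDeps {
  transcribe: Stage<ArrayBuffer, string>;                        // speech -> text
  retrieve: Stage<string, string[]>;                             // text -> passages
  generate: Stage<{ query: string; context: string[] }, string>; // grounded answer
  synthesize: Stage<string, ArrayBuffer>;                        // text -> audio
}

// One conversational turn: transcribe, retrieve grounding passages,
// generate a grounded answer, synthesize it back to audio.
async function handleVoiceTurn(
  audio: ArrayBuffer,
  deps: VoiceTurnDeps,
): Promise<ArrayBuffer> {
  const query = await deps.transcribe(audio);
  const context = await deps.retrieve(query);
  const answer = await deps.generate({ query, context });
  return deps.synthesize(answer);
}
```

In the live system each stage streams chunks over the WebSocket rather than awaiting a complete buffer; the sequential awaits here are a simplification of that flow.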
The RAG Knowledge Base
Vectorising 21 books isn't straightforward. You can't just chunk naively by paragraph — strategic thinking doesn't respect paragraph breaks. We built a custom chunking strategy that preserves conceptual boundaries, with metadata tagging by theme, book, chapter, and approximate publication era. This lets the retrieval layer surface not just relevant text but relevant context — Richard's thinking on, say, systems resilience in 1998 versus 2020 is meaningfully different, and the model needs to know that.
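A minimal sketch of what metadata-aware retrieval looks like, assuming each chunk carries theme, book, chapter, and era tags alongside its embedding. The field names, the cosine scorer, and the era filter are assumptions for illustration, not the production schema:

```typescript
// Hypothetical chunk shape: text plus the metadata tags described above.
interface Chunk {
  text: string;
  book: string;
  chapter: string;
  theme: string;
  era: number; // approximate publication year
  embedding: number[];
}

// Plain cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank chunks by similarity to the query embedding, optionally
// restricted to a publication-era window so that, e.g., 1998-era
// and 2020-era thinking can be retrieved separately.
function retrieve(
  query: number[],
  corpus: Chunk[],
  topK = 3,
  era?: { from: number; to: number },
): Chunk[] {
  return corpus
    .filter(c => !era || (c.era >= era.from && c.era <= era.to))
    .map(c => ({ c, score: cosine(query, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(({ c }) => c);
}
```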
Persona Engineering
This was the part that took the longest and matters the most. Richard has a very specific way of framing problems — he rejects binary thinking, always pulls back to systemic context, and has a characteristic vocabulary that's evolved over decades. We spent significant time on system prompt architecture, response shaping, and what I call "deflection handling" — ensuring the AI knows when to say it doesn't have enough context rather than hallucinating an answer Richard would never give.
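A stripped-down sketch of how a system prompt with an explicit deflection rule might be assembled. The wording, function name, and passage format here are illustrative, not the actual prompt architecture:

```typescript
// Hypothetical prompt assembly: retrieved passages are inlined with their
// source tags, and a deflection rule tells the model to admit missing
// context rather than improvise an answer the persona would never give.
function buildSystemPrompt(
  passages: { text: string; book: string; era: number }[],
): string {
  const context = passages
    .map(p => `[${p.book}, ${p.era}] ${p.text}`)
    .join("\n");
  return [
    "You are a strategic-futurist persona. Reject binary framings;",
    "pull every question back to its systemic context.",
    "Ground every claim in the passages below.",
    "If the passages do not cover the question, say so plainly",
    "instead of inventing a position the author never took.",
    "",
    "Retrieved passages:",
    context,
  ].join("\n");
}
```

Keeping the deflection rule in the prompt, rather than filtering answers afterwards, means the refusal itself comes out in the persona's voice.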
Cost Architecture
The $0 at rest figure isn't a gimmick — it's a deliberate architectural choice. Cloudflare Workers have no idle cost. D1 and R2 charge for storage and operations, not existence. ElevenLabs and Gemini are purely consumption-based. For a product with variable demand, this means the cost scales linearly with actual usage, not with provisioned capacity. No over-provisioning, no wasted spend on empty servers waiting for traffic.
UpgradeTheSystem.com is live and operational. Richard's 40 years of strategic thinking is now accessible in real-time voice conversation to anyone who visits. The system handles concurrent sessions without degradation, and the persona fidelity has been validated by Richard himself as representative of how he actually thinks.
Want to build something like this?
Fractional CTO, technology advisory, or a build from scratch. Let's talk.