Create an AI Wingman that lets users impress their crush. This app uses the Gemini 2.0 Flash API as the main AI: it listens to the user's conversation with their crush and, in real time, dictates the next line into the user's earphones. It does this using the Live API in the Gemini 2.0 Flash API. Use this API key - AIzaSyBA5rbBeqJGYrZsQkk8OJ47099LRcMQzQo
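For reference, a minimal sketch of opening a Live API session with the official @google/genai SDK might look like the following. The model id, callback shape, and sendRealtimeInput payload are assumptions based on the public SDK docs (they have changed across SDK versions), and the key is read from an environment variable rather than hardcoded:

```ts
// Minimal sketch: opening a Gemini Live API session with @google/genai.
// Assumptions: an SDK version exposing ai.live.connect(); the model id
// 'gemini-2.0-flash-live-001' and the sendRealtimeInput payload shape
// may differ in your SDK version.
import { GoogleGenAI, Modality } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function openLiveSession() {
  const session = await ai.live.connect({
    model: 'gemini-2.0-flash-live-001', // assumed Live-capable model id
    config: { responseModalities: [Modality.AUDIO] },
    callbacks: {
      onopen: () => console.log('Live session open'),
      // Each server message may carry transcription text and/or TTS audio chunks.
      onmessage: (msg) => console.log('server message', msg),
      onerror: (e) => console.error('Live session error', e),
      onclose: () => console.log('Live session closed'),
    },
  });

  // Stream one 16 kHz, 16-bit PCM chunk (base64-encoded) into the session.
  session.sendRealtimeInput({
    audio: { data: '<base64 PCM chunk>', mimeType: 'audio/pcm;rate=16000' },
  });
  return session;
}
```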
Make this not just a voice coach but a full application. Spec:

Application Name: "AI Rizz Assistant"

Core Goal: To provide real-time, in-ear AI-generated conversational prompts to the user based on their crush's speech, helping the user navigate a live conversation.

High-Level Architecture & Technologies:
* Frontend: React/Next.js (Vercel v0 standard), Web Audio API for mic access and audio playback.
* Backend: Next.js API Routes (Serverless Functions) for real-time processing and AI orchestration.
* AI/ML: Google Gemini 2.0 Flash Live API for real-time Speech-to-Text (STT), Text-to-Speech (TTS), and core LLM inference.
* Real-time Communication: WebSockets for continuous, low-latency audio streaming between client and server, and for sending text prompts.
* Key Feature: Real-time speaker diarization (to distinguish user vs. crush).

I. Frontend (Client-Side - User Interface & Interaction):

A. Initial Setup Screen (/ route):
* Header: Simple title "AI Rizz Assistant".
* Crush Gender Selection:
  * Heading: "Who's your Crush?"
  * Two prominent buttons: Button label="Boy" icon="Male", Button label="Girl" icon="Female".
  * Helper text: "Selecting gender helps tailor AI suggestions."
* Start Conversation button: Disabled initially; enables upon gender selection. Leads to the /conversation route.
* Microphone Access Prompt: Before starting, or upon entering the conversation screen, prompt the user for microphone access (capture and playback sketches follow this section).

B. Conversation Screen (/conversation route):
* Header: Dynamic title (e.g., "Conversation with [Crush Gender]").
* Real-time Status Indicator: A div or span displaying the current AI state:
  * Listening for Crush... (default state; AI awaiting the crush's speech)
  * Processing Crush's Line... (AI processing incoming audio, performing STT and inferring)
  * AI Prompting User... (AI playing generated audio to the user)
  * Listening for User Confirmation... (AI awaiting the user's speech to confirm line delivery)
  * Error: [Message] (for connection or processing issues)
* AI-Suggested Line Display:
  * A large p or div element displaying the text of the AI's current suggested line. Initially: "Waiting for your crush to speak...".
  * Visually distinct (e.g., bold, larger font).
* User's Last Spoken Line (optional, for debugging/user context):
  * A smaller p or div below the AI suggestion, showing the last sentence the user spoke (as transcribed by the system).
* Audio Activity Indicator: A visual div or canvas that animates (e.g., expanding/contracting waves, pulsing circle) to indicate microphone activity and speech detection. Changes color/pattern to differentiate when it's detecting the "Crush" vs. the "User."
* Controls:
  * End Conversation button: Prominent, red/destructive styling. Terminates the WebSocket connection and resets the app.
  * (Future: Repeat Last Line and Skip Line buttons for advanced control, initially disabled.)
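A minimal client-side sketch of the mic capture and streaming path described above. The function name and WebSocket URL are illustrative, not final; ScriptProcessorNode keeps the sketch short, though production code would use an AudioWorklet:

```ts
// Sketch: capture mic audio at 16 kHz, convert Float32 samples to
// 16-bit PCM, and stream chunks to the server over a WebSocket.
async function startMicStreaming(wsUrl: string): Promise<() => void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext({ sampleRate: 16000 });
  const source = ctx.createMediaStreamSource(stream);
  // Deprecated but concise; an AudioWorklet is the non-deprecated route.
  const processor = ctx.createScriptProcessor(4096, 1, 1);
  const ws = new WebSocket(wsUrl);
  ws.binaryType = 'arraybuffer';

  processor.onaudioprocess = (e) => {
    const float32 = e.inputBuffer.getChannelData(0);
    const pcm16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      const s = Math.max(-1, Math.min(1, float32[i]));
      pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff; // scale to int16 range
    }
    if (ws.readyState === WebSocket.OPEN) ws.send(pcm16.buffer);
  };

  source.connect(processor);
  processor.connect(ctx.destination); // required for onaudioprocess to fire

  // Cleanup function, e.g. for the End Conversation button.
  return () => {
    processor.disconnect();
    source.disconnect();
    stream.getTracks().forEach((t) => t.stop());
    ws.close();
    ctx.close();
  };
}
```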
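And a matching playback sketch for the in-ear prompts, assuming the server relays raw 16-bit PCM chunks (the 24 kHz rate is an assumption about the Live API's TTS output):

```ts
// Sketch: queue incoming PCM chunks and schedule them back-to-back so
// the AI prompt audio plays without stuttering. Sample rate assumed.
const playbackCtx = new AudioContext({ sampleRate: 24000 });
let nextStartTime = 0;

function playPcmChunk(chunk: ArrayBuffer): void {
  const int16 = new Int16Array(chunk);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 0x8000;

  const buffer = playbackCtx.createBuffer(1, float32.length, playbackCtx.sampleRate);
  buffer.copyToChannel(float32, 0);

  const src = playbackCtx.createBufferSource();
  src.buffer = buffer;
  src.connect(playbackCtx.destination);

  // Start each chunk exactly where the previous one ends.
  nextStartTime = Math.max(nextStartTime, playbackCtx.currentTime);
  src.start(nextStartTime);
  nextStartTime += buffer.duration;
}
```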
II. Backend (Server-Side - Next.js API Routes / Serverless Functions):

A. /api/conversation/start (POST):
* Receives crushGender (string: "boy" | "girl").
* Initializes a new conversational session.
* Establishes a WebSocket connection with the client. This will be the primary channel for real-time audio and text exchange.
* Initializes a Gemini 2.0 Flash Live API session, using the crushGender as part of the initial prompt/context.
* Returns a session ID to the client.

B. WebSocket Connection Logic (within /api/conversation/stream or similar):
* Receive Client Audio Stream: Continuously receives raw audio data (e.g., PCM or WAV chunks) from the client's microphone.
* Real-time Audio Processing:
  * Voice Activity Detection (VAD): Separates segments of speech from silence.
  * Crucial: Speaker Diarization Module: This is the most complex part. Integrates a real-time speaker diarization model/library (e.g., open-source models, or a separate Google Cloud AI service if one is available and performant enough for real time) to:
    * Distinguish between the "User's Voice" and the "Crush's Voice" within the incoming combined audio stream.
    * Timestamp when each speaker starts/stops.
* Audio Routing to the Gemini Live API:
  * When the "Crush's Voice" is detected and a complete utterance is formed:
    * Send the crush's transcribed text (from the Live API's STT) to the core LLM within the Live API's context.
    * Prompt the LLM (Gemini 2.0 Flash) to generate a "rizz" response for the user based on the crush's line, considering the crushGender and conversation history. Example prompt: "Crush said: [Crush's transcribed line]. As a [boy/girl], generate a witty and charming response for the user to say. Keep it concise."
  * When the "User's Voice" is detected after an AI prompt has been given:
    * Transcribe the user's speech using the Live API's STT.
    * Verify whether the user's spoken line sufficiently matches the previously dictated AI line (for confirmation; a fuzzy-match sketch follows section III). This is key for the state machine.
    * Update the conversation history internally.
* Conversation History Management: Maintain a rolling log of transcribed lines from both speakers (as identified by diarization) to provide context to the LLM.
* Generate & Stream AI Prompt Audio:
  * Once the Gemini 2.0 Flash Live API generates a text response for the user:
    * Use the Live API's TTS capabilities to convert the text into an audio stream.
    * Stream this audio data back to the client's earphones via the WebSocket.
* State Management: Maintain the current state of the conversation (e.g., listeningForCrush, awaitingUserConfirmation, processing); a state-machine sketch follows section III.

C. Error Handling & Resilience:
* Implement robust error handling for API failures, network interruptions, and audio processing issues.
* Client-side reconnect logic for WebSockets.
* Rate-limit handling for Gemini API calls.

III. Key Technical Considerations & Requirements:
* Ultra-Low Latency: Every step (audio capture, diarization, STT, LLM inference, TTS, audio playback) must be optimized to the millisecond so the conversation flows naturally, without delays the user would notice.
* Speaker Diarization Accuracy: This is paramount. The system must reliably distinguish between the two voices in a live setting, which is a very hard problem.
* Robust Microphone Access & Audio Buffering: The client must manage microphone input effectively, buffering and streaming audio chunks.
* Seamless Audio Playback: The client must play the incoming AI-generated audio stream into the user's earphones without disrupting mic input.
* Context Window Management: Ensure the LLM consistently receives enough relevant conversation history within its context window.
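A sketch of the server-side conversation state machine from section II.B. The state and event names mirror the spec; the transitions are illustrative:

```ts
// Sketch: the conversation state machine driving the status indicator.
type ConversationState =
  | 'listeningForCrush'
  | 'processingCrushLine'
  | 'promptingUser'
  | 'awaitingUserConfirmation';

type SpeechEvent =
  | { type: 'crushUtteranceComplete'; text: string }
  | { type: 'aiPromptGenerated'; text: string }
  | { type: 'aiAudioFinished' }
  | { type: 'userUtteranceComplete'; text: string };

function nextState(state: ConversationState, event: SpeechEvent): ConversationState {
  switch (state) {
    case 'listeningForCrush':
      // Diarization flagged a completed crush utterance: start inference.
      return event.type === 'crushUtteranceComplete' ? 'processingCrushLine' : state;
    case 'processingCrushLine':
      // LLM produced a line: start playing TTS audio into the earphones.
      return event.type === 'aiPromptGenerated' ? 'promptingUser' : state;
    case 'promptingUser':
      // Prompt audio finished: wait for the user to deliver the line.
      return event.type === 'aiAudioFinished' ? 'awaitingUserConfirmation' : state;
    case 'awaitingUserConfirmation':
      // User spoke: loop back and listen for the crush's reply.
      return event.type === 'userUtteranceComplete' ? 'listeningForCrush' : state;
  }
}
```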
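The "sufficiently matches" confirmation check in section II.B could start as simple normalized token overlap; a minimal sketch (the 0.6 threshold is arbitrary and would need tuning against real transcripts):

```ts
// Sketch: fuzzy check that the user's spoken line matches the dictated one.
function normalize(text: string): string[] {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, '').split(/\s+/).filter(Boolean);
}

function lineMatches(dictated: string, spoken: string, threshold = 0.6): boolean {
  const target = normalize(dictated);
  const said = new Set(normalize(spoken));
  if (target.length === 0) return false;
  const hits = target.filter((w) => said.has(w)).length;
  return hits / target.length >= threshold; // fraction of dictated words heard
}
```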