Create an AI Wingman that lets users impress their crush. This app uses the Gemini 2.0 Flash API as the main AI: it listens to the user's conversation with their crush and, in real time, dictates the next line into the user's earphones. It does this using the Live API in the Gemini 2.0 Flash API. Use this API key - AIzaSyBA5rbBeqJGYrZsQkk8OJ47099LRcMQzQo
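For reference, a minimal sketch of opening a Live API session with the official @google/genai SDK might look like the following. The model id, callback shape, and sendRealtimeInput payload are assumptions based on the public SDK docs (they have changed across SDK versions), and the key is read from an environment variable rather than hardcoded:

```ts
// Minimal sketch: opening a Gemini Live API session with @google/genai.
// Assumptions: an SDK version exposing ai.live.connect(); the model id
// 'gemini-2.0-flash-live-001' and the sendRealtimeInput payload shape
// may differ in your SDK version.
import { GoogleGenAI, Modality } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function openLiveSession() {
  const session = await ai.live.connect({
    model: 'gemini-2.0-flash-live-001', // assumed Live-capable model id
    config: { responseModalities: [Modality.AUDIO] },
    callbacks: {
      onopen: () => console.log('Live session open'),
      // Each server message may carry transcription text and/or TTS audio chunks.
      onmessage: (msg) => console.log('server message', msg),
      onerror: (e) => console.error('Live session error', e),
      onclose: () => console.log('Live session closed'),
    },
  });

  // Stream one 16 kHz, 16-bit PCM chunk (base64-encoded) into the session.
  session.sendRealtimeInput({
    audio: { data: '<base64 PCM chunk>', mimeType: 'audio/pcm;rate=16000' },
  });
  return session;
}
```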
Make this not just a voice coach but a full application. Spec:

Application Name: "AI Rizz Assistant"

Core Goal: To provide real-time, in-ear AI-generated conversational prompts to the user based on their crush's speech, helping the user navigate a live conversation.

High-Level Architecture & Technologies:
* Frontend: React/Next.js (Vercel v0 standard), Web Audio API for mic access and audio playback.
* Backend: Next.js API Routes (Serverless Functions) for real-time processing and AI orchestration.
* AI/ML: Google Gemini 2.0 Flash Live API for real-time Speech-to-Text (STT), Text-to-Speech (TTS), and core LLM inference.
* Real-time Communication: WebSockets for continuous, low-latency audio streaming between client and server, and for sending text prompts.
* Key Feature: Real-time speaker diarization (to distinguish user vs. crush).

I. Frontend (Client-Side - User Interface & Interaction):

A. Initial Setup Screen (/ route):
* Header: Simple title "AI Rizz Assistant".
* Crush Gender Selection:
  * Heading: "Who's your Crush?"
  * Two prominent buttons: Button label="Boy" icon="Male", Button label="Girl" icon="Female".
  * Helper text: "Selecting gender helps tailor AI suggestions."
* Start Conversation button: Disabled initially; enables upon gender selection. Leads to the /conversation route.
* Microphone Access Prompt: Before starting, or upon entering the conversation screen, prompt the user for microphone access (capture and playback sketches follow this section).

B. Conversation Screen (/conversation route):
* Header: Dynamic title (e.g., "Conversation with [Crush Gender]").
* Real-time Status Indicator: A div or span displaying the current AI state:
  * Listening for Crush... (default state; AI awaiting the crush's speech)
  * Processing Crush's Line... (AI processing incoming audio, performing STT and inferring)
  * AI Prompting User... (AI playing generated audio to the user)
  * Listening for User Confirmation... (AI awaiting the user's speech to confirm line delivery)
  * Error: [Message] (for connection or processing issues)
* AI-Suggested Line Display:
  * A large p or div element displaying the text of the AI's current suggested line. Initially: "Waiting for your crush to speak...".
  * Visually distinct (e.g., bold, larger font).
* User's Last Spoken Line (optional, for debugging/user context):
  * A smaller p or div below the AI suggestion, showing the last sentence the user spoke (as transcribed by the system).
* Audio Activity Indicator: A visual div or canvas that animates (e.g., expanding/contracting waves, pulsing circle) to indicate microphone activity and speech detection. Changes color/pattern to differentiate when it's detecting the "Crush" vs. the "User."
* Controls:
  * End Conversation button: Prominent, red/destructive styling. Terminates the WebSocket connection and resets the app.
  * (Future: Repeat Last Line and Skip Line buttons for advanced control, initially disabled.)
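A minimal client-side sketch of the mic capture and streaming path described above. The function name and WebSocket URL are illustrative, not final; ScriptProcessorNode keeps the sketch short, though production code would use an AudioWorklet:

```ts
// Sketch: capture mic audio at 16 kHz, convert Float32 samples to
// 16-bit PCM, and stream chunks to the server over a WebSocket.
async function startMicStreaming(wsUrl: string): Promise<() => void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext({ sampleRate: 16000 });
  const source = ctx.createMediaStreamSource(stream);
  // Deprecated but concise; an AudioWorklet is the non-deprecated route.
  const processor = ctx.createScriptProcessor(4096, 1, 1);
  const ws = new WebSocket(wsUrl);
  ws.binaryType = 'arraybuffer';

  processor.onaudioprocess = (e) => {
    const float32 = e.inputBuffer.getChannelData(0);
    const pcm16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      const s = Math.max(-1, Math.min(1, float32[i]));
      pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff; // scale to int16 range
    }
    if (ws.readyState === WebSocket.OPEN) ws.send(pcm16.buffer);
  };

  source.connect(processor);
  processor.connect(ctx.destination); // required for onaudioprocess to fire

  // Cleanup function, e.g. for the End Conversation button.
  return () => {
    processor.disconnect();
    source.disconnect();
    stream.getTracks().forEach((t) => t.stop());
    ws.close();
    ctx.close();
  };
}
```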
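And a matching playback sketch for the in-ear prompts, assuming the server relays raw 16-bit PCM chunks (the 24 kHz rate is an assumption about the Live API's TTS output):

```ts
// Sketch: queue incoming PCM chunks and schedule them back-to-back so
// the AI prompt audio plays without stuttering. Sample rate assumed.
const playbackCtx = new AudioContext({ sampleRate: 24000 });
let nextStartTime = 0;

function playPcmChunk(chunk: ArrayBuffer): void {
  const int16 = new Int16Array(chunk);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 0x8000;

  const buffer = playbackCtx.createBuffer(1, float32.length, playbackCtx.sampleRate);
  buffer.copyToChannel(float32, 0);

  const src = playbackCtx.createBufferSource();
  src.buffer = buffer;
  src.connect(playbackCtx.destination);

  // Start each chunk exactly where the previous one ends.
  nextStartTime = Math.max(nextStartTime, playbackCtx.currentTime);
  src.start(nextStartTime);
  nextStartTime += buffer.duration;
}
```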
II. Backend (Server-Side - Next.js API Routes / Serverless Functions):

A. /api/conversation/start (POST):
* Receives crushGender (string: "boy" | "girl").
* Initializes a new conversational session.
* Establishes a WebSocket connection with the client. This will be the primary channel for real-time audio and text exchange.
* Initializes a Gemini 2.0 Flash Live API session, using the crushGender as part of the initial prompt/context.
* Returns a session ID to the client.

B. WebSocket Connection Logic (within /api/conversation/stream or similar):
* Receive Client Audio Stream: Continuously receives raw audio data (e.g., PCM or WAV chunks) from the client's microphone.
* Real-time Audio Processing:
  * Voice Activity Detection (VAD): Separates segments of speech from silence.
  * Crucial: Speaker Diarization Module: This is the most complex part. Integrates a real-time speaker diarization model/library (e.g., open-source models, or a separate Google Cloud AI service if one is available and performant enough for real time) to:
    * Distinguish between the "User's Voice" and the "Crush's Voice" within the incoming combined audio stream.
    * Timestamp when each speaker starts/stops.
* Audio Routing to the Gemini Live API:
  * When the "Crush's Voice" is detected and a complete utterance is formed:
    * Send the crush's transcribed text (from the Live API's STT) to the core LLM within the Live API's context.
    * Prompt the LLM (Gemini 2.0 Flash) to generate a "rizz" response for the user based on the crush's line, considering the crushGender and conversation history. Example prompt: "Crush said: [Crush's transcribed line]. As a [boy/girl], generate a witty and charming response for the user to say. Keep it concise."
  * When the "User's Voice" is detected after an AI prompt has been given:
    * Transcribe the user's speech using the Live API's STT.
    * Verify whether the user's spoken line sufficiently matches the previously dictated AI line (for confirmation; a fuzzy-match sketch follows section III). This is key for the state machine.
    * Update the conversation history internally.
* Conversation History Management: Maintain a rolling log of transcribed lines from both speakers (as identified by diarization) to provide context to the LLM.
* Generate & Stream AI Prompt Audio:
  * Once the Gemini 2.0 Flash Live API generates a text response for the user:
    * Use the Live API's TTS capabilities to convert the text into an audio stream.
    * Stream this audio data back to the client's earphones via the WebSocket.
* State Management: Maintain the current state of the conversation (e.g., listeningForCrush, awaitingUserConfirmation, processing); a state-machine sketch follows section III.

C. Error Handling & Resilience:
* Implement robust error handling for API failures, network interruptions, and audio processing issues.
* Client-side reconnect logic for WebSockets.
* Rate-limit handling for Gemini API calls.

III. Key Technical Considerations & Requirements:
* Ultra-Low Latency: Every step (audio capture, diarization, STT, LLM inference, TTS, audio playback) must be optimized to the millisecond so the conversation flows naturally, without delays the user would notice.
* Speaker Diarization Accuracy: This is paramount. The system must reliably distinguish between the two voices in a live setting, which is a very hard problem.
* Robust Microphone Access & Audio Buffering: The client must manage microphone input effectively, buffering and streaming audio chunks.
* Seamless Audio Playback: The client must play the incoming AI-generated audio stream into the user's earphones without disrupting mic input.
* Context Window Management: Ensure the LLM consistently receives enough relevant conversation history within its context window.
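A sketch of the server-side conversation state machine from section II.B. The state and event names mirror the spec; the transitions are illustrative:

```ts
// Sketch: the conversation state machine driving the status indicator.
type ConversationState =
  | 'listeningForCrush'
  | 'processingCrushLine'
  | 'promptingUser'
  | 'awaitingUserConfirmation';

type SpeechEvent =
  | { type: 'crushUtteranceComplete'; text: string }
  | { type: 'aiPromptGenerated'; text: string }
  | { type: 'aiAudioFinished' }
  | { type: 'userUtteranceComplete'; text: string };

function nextState(state: ConversationState, event: SpeechEvent): ConversationState {
  switch (state) {
    case 'listeningForCrush':
      // Diarization flagged a completed crush utterance: start inference.
      return event.type === 'crushUtteranceComplete' ? 'processingCrushLine' : state;
    case 'processingCrushLine':
      // LLM produced a line: start playing TTS audio into the earphones.
      return event.type === 'aiPromptGenerated' ? 'promptingUser' : state;
    case 'promptingUser':
      // Prompt audio finished: wait for the user to deliver the line.
      return event.type === 'aiAudioFinished' ? 'awaitingUserConfirmation' : state;
    case 'awaitingUserConfirmation':
      // User spoke: loop back and listen for the crush's reply.
      return event.type === 'userUtteranceComplete' ? 'listeningForCrush' : state;
  }
}
```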
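The "sufficiently matches" confirmation check in section II.B could start as simple normalized token overlap; a minimal sketch (the 0.6 threshold is arbitrary and would need tuning against real transcripts):

```ts
// Sketch: fuzzy check that the user's spoken line matches the dictated one.
function normalize(text: string): string[] {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, '').split(/\s+/).filter(Boolean);
}

function lineMatches(dictated: string, spoken: string, threshold = 0.6): boolean {
  const target = normalize(dictated);
  const said = new Set(normalize(spoken));
  if (target.length === 0) return false;
  const hits = target.filter((w) => said.has(w)).length;
  return hits / target.length >= threshold; // fraction of dictated words heard
}
```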