🧠 FINAL PROMPT FOR AI DEVELOPER (J.A.R.V.I.S.-like Voice Assistant)

Project Goal: Build a fully voice-controlled AI assistant in Python inspired by Tony Stark's J.A.R.V.I.S. The assistant must feel intelligent, emotionally responsive, and futuristic. It should operate via voice commands in Hindi, English, or mixed Hinglish, and always respond in polished English.

🔹 Core Functionalities Required:

🎤 1. Voice Interface
- Wake-word based activation (e.g., "Hey Jarvis")
- Understand Hindi, English, and mixed Hinglish voice input
- Always respond in fluent, polite, intelligent English
- Use a high-quality text-to-speech engine (pyttsx3, gTTS, or better)
- Personal greetings (e.g., "Welcome back, sir", "I've missed your voice")
- Should understand casual and mixed speech such as:
  - "Jarvis, YouTube khol do" ("Jarvis, open YouTube")
  - "Jarvis, kal ke emails dikha do" ("Jarvis, show me yesterday's emails")
  - "Chrome me GitHub kholo" ("Open GitHub in Chrome")
  - "Mujhe yaad dilana 6 baje meeting hai" ("Remind me I have a meeting at 6")

🧠 2. Self-Learning & Correction Handling
- If the assistant doesn't understand a command:
  - Politely ask the user to explain
  - Learn from the explanation and save the new command
  - Execute it automatically the next time
- Store memory in a local database (JSON or SQLite)
- Match future similar commands using fuzzy logic (e.g., fuzzywuzzy), as in the sketch below
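As one minimal sketch of this learn-and-match loop, assuming a plain JSON file for storage and the standard library's difflib standing in for fuzzywuzzy (the file name and similarity threshold are illustrative):

```python
import json
import os
from difflib import SequenceMatcher

MEMORY_FILE = "learned_commands.json"  # illustrative path

def load_memory() -> dict:
    """Load previously learned phrase -> action mappings."""
    if os.path.exists(MEMORY_FILE):
        with open(MEMORY_FILE, encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_memory(memory: dict) -> None:
    with open(MEMORY_FILE, "w", encoding="utf-8") as f:
        json.dump(memory, f, ensure_ascii=False, indent=2)

def find_best_match(command: str, memory: dict, threshold: float = 0.8):
    """Return the stored action whose phrase best matches the command, if any."""
    best_phrase, best_score = None, 0.0
    for phrase in memory:
        score = SequenceMatcher(None, command.lower(), phrase.lower()).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    return memory[best_phrase] if best_score >= threshold else None

def handle(command: str) -> str:
    memory = load_memory()
    action = find_best_match(command, memory)
    if action is not None:
        return action  # a known or similar-enough command
    # Unknown command: ask for an explanation (by voice in the real assistant)
    # and store the new mapping for next time.
    explanation = input(f"I'm not sure what '{command}' means. Could you explain? ")
    memory[command] = explanation
    save_memory(memory)
    return explanation
```

With fuzzywuzzy installed, fuzz.ratio(a, b) (scored 0-100) could replace SequenceMatcher in the same structure.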
🌐 3. Auto Learning via YouTube / Internet
- If the user gives a command the assistant cannot handle:
  - Automatically search YouTube or the internet
  - Extract the video transcript
  - Use OpenAI GPT or a similar model to understand the concept
  - Learn what to do from that content
  - Reply: "I didn't know that, so I taught myself. I'm ready now."
- Always use safe, verified sources (e.g., filter for content rated above 4.5★ that has a transcript)

🖥 4. Full System Control
- Open/close applications
- File system access (create, delete, move, rename files/folders)
- Take screenshots, record the screen
- Lock, shut down, restart, or sleep the PC
- Volume, brightness, and display settings
- Clipboard read/write
- Control the mouse and keyboard via voice (pyautogui)

🔍 5. Live Screen Awareness
- Detect which window or app is currently active
- Use OCR (pytesseract) to extract on-screen text
- Recognize context and offer relevant help
- Example: "You're editing a document in Word. Would you like me to proofread it?"

🎞 6. Video Scene Understanding
- The user can say: "Jarvis, watch this video and tell me what's happening."
- Features:
  - Detect scene cuts (scenedetect)
  - Transcribe audio (whisper or speech_recognition)
  - Use GPT to summarize scene by scene
  - Detect emotion, action, and context in the video
  - Generate a screenplay-style summary
  - Learn from it and remember it for later

🧩 7. Personality & Emotional Touch
- Custom greetings based on time or weather
- Soft-spoken, optionally British-accented voice
- Emotional memory (e.g., remember user preferences or the last interaction)
- Fun replies when idle (e.g., jokes, quotes)

💾 8. Long-Term Memory & Recall
- Save all user commands, explanations, and responses
- Recall memory when asked:
  - "Jarvis, what did you learn yesterday?"
  - "Jarvis, kya kal ki video yaad hai?" ("Jarvis, do you remember yesterday's video?")

📦 Technologies & Libraries to Use

| Purpose | Suggested Tools |
| --- | --- |
| Voice recognition | speech_recognition, whisper |
| Text-to-speech | pyttsx3, gTTS, edge-tts |
| Language translation | googletrans, transformers, langdetect |
| GPT integration | openai.ChatCompletion, transformers |
| YouTube transcripts | youtube-transcript-api, yt_dlp |
| Video processing | opencv, scenedetect, moviepy |
| Screen reading | pygetwindow, pytesseract, pyautogui |
| Memory | json, sqlite3, tinydb |
| Fuzzy matching | fuzzywuzzy, difflib |
| GUI (optional) | tkinter, PyQt5, webview |

✅ Additional Requirements
- Modular, well-documented code
- Easy to add new commands (plugin-friendly)
- Works offline as much as possible
- Logs all unknown commands and how they were learned
- Future-ready: can later integrate with APIs like Gmail, Calendar, and Spotify

🧠 Example Interaction:

You: "Jarvis, mujhe kal ki video yaad hai jo maine dikhayi thi?" ("Jarvis, do you remember the video I showed you yesterday?")
Jarvis: "Yes, sir. It was a courtroom drama. A man confessed to a crime. Would you like me to summarize it again?"

You: "Jarvis, mujhe samajh nahi aaya pie chart banana." ("Jarvis, I didn't understand how to make a pie chart.")
Jarvis (after checking YouTube): "No worries. I found a video tutorial and learned the steps. Would you like me to generate a pie chart from a CSV now?"

🧩 Final Notes:
This assistant should feel like a true digital partner — one that speaks clearly, learns constantly, and evolves naturally. It should never feel "dumb": if it doesn't know something, it should go find out, just like a real intelligent being.

✅ Prompt Usage Instructions:
- Copy the above and paste it into ChatGPT, GitHub Copilot, Claude, or any developer briefing
- Use it as a project specification document
- Or give it to a freelancer / AI developer as your exact vision

Comprehensive Project Specification: J.A.R.V.I.S.-Style AI Assistant

1. Voice Interface (Hindi & Hinglish Input, English Response)
- Wake-word activation ("Hey Jarvis") ensuring immediate, voice-initiated interaction.
- Accepts Hindi, English, and mixed Hinglish speech, converting it to accurate, case-insensitive text.
- Always responds in eloquent, intelligent English, ensuring consistency and clarity.
- Tone of voice should be calm, warm, and slightly British, reminiscent of J.A.R.V.I.S.'s persona.
- Voice mode must include a whisper/quiet option for nighttime use and polite contexts.
- Context-aware greetings like "Good evening, sir. What shall we accomplish now?"
- Handles casual, colloquial inputs seamlessly, e.g., "Jarvis, kal ka reminder set kar do." ("Jarvis, set the reminder for tomorrow.")
- Deft at recognizing mixed commands, e.g., "Jarvis, open WhatsApp aur messages dikha." ("Jarvis, open WhatsApp and show my messages.")
- Supports language detection and auto-routing into the appropriate processing pipeline.
- The text-to-speech engine must support customization: speed, voice gender, and emotional tone.
- Gracefully prompts the user when speech is inaudible or unclear, e.g., "Could you repeat that, please?"
- Confirms voice commands clearly, summarizing back: "Sure — opening Google Chrome now."
- Maintains a buffer/state memory of prior commands for conversational continuity.
- Detects if the user is distracted or silent, and gently offers assistance.
- Fully modular interface: easy to swap speech engines or voices in the future (a minimal wake-word loop is sketched below).
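As a minimal sketch of that wake-word loop, assuming the speech_recognition and pyttsx3 packages (the wake word and greeting are illustrative, and the free Google recognizer stands in for whichever STT backend is chosen):

```python
import speech_recognition as sr
import pyttsx3

WAKE_WORD = "hey jarvis"  # illustrative wake word

recognizer = sr.Recognizer()
tts = pyttsx3.init()

def speak(text: str) -> None:
    """Speak a reply aloud through the local TTS engine."""
    tts.say(text)
    tts.runAndWait()

def listen() -> str:
    """Capture one utterance from the default microphone, return lowercase text."""
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source)
    try:
        # Passing language="hi-IN" gives the recognizer a hint for
        # Hindi/Hinglish input; omitted here for plain English.
        return recognizer.recognize_google(audio).lower()
    except sr.UnknownValueError:
        return ""

if __name__ == "__main__":
    speak("Welcome back, sir.")
    while True:
        heard = listen()
        if WAKE_WORD in heard:
            # Use whatever followed the wake word, or listen again for it.
            command = heard.split(WAKE_WORD, 1)[1].strip() or listen()
            speak(f"You said: {command}")  # dispatch to command handlers here
```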
2. Intelligent Self-Learning & Corrections
- Detects when a command isn't understood and responds with empathy, like "I'm not sure what you meant, sir."
- Asks for clarification in natural language, e.g., "Could you please explain what 'XYZ' means?"
- Captures the user's explanation and stores it in memory (JSON or SQLite) for future use.
- Uses fuzzy matching to detect similar commands, so learned behavior is matched automatically later.
- Can update existing mappings if the user redefines or refines a previously learned command.
- Includes timestamps, context tags, and execution logs for each learned rule.
- Provides a "learning summary" on demand: "Jarvis, show me what you learned today."
- Memory storage should be queryable, e.g., "What does 'design folder' do again?"
- Allows easy export or cleanup of learned commands for maintenance.
- Includes safeguards to prevent conflicting learned behaviors from overwriting core functions.
- Learning should be interactive; Jarvis confirms understanding: "Got it, I'll open 'Figma_Projects' next time."
- Edge-case handling: if the user is still unclear, asks again until the command is mapped confidently.
- Offers a training mode in which the user can teach multiple commands in one session ("Jarvis learning mode").
- Memory should persist across sessions and stay lightweight for quick lookup.
- The learning module should be decoupled for future plugin integration.

3. Auto-Learning from the Internet & YouTube
- When it encounters an unknown concept, Jarvis says, "Let me look that up for you in the background."
- Searches YouTube (and optionally the web) for the most relevant tutorials, prioritizing high ratings and available transcripts.
- Downloads the transcript of the top video using the YouTube Transcript API or similar.
- Feeds the transcript to GPT (or a similar model) with a prompt to explain it in simple, actionable terms for execution.
- Parses that explanation into a structured action (e.g., commands, a code snippet, a procedure).
- Confirms with the user: "I learned how to do that — shall I execute it now?"
- Stores the learned procedure in memory with a clear "source = learned from YouTube" tag.
- Provides a summary, "Here's what I learned: …", before executing.
- Filters results: discards overly long content and videos without text data.
- Allows repeated refinement; if the initial explanation isn't clear, automatically fetches a second resource.
- Stores metadata (video title, URL, date accessed) for traceability.
- Caches learned items so future requests do not hit the YouTube API.
- Provides a safe mode that restricts lookups to verified/trusted sources only (such as educational channels).
- Learns transparently: "By the way, I learned this from a tutorial at 10:32 in the video."
- The module is self-contained and can be toggled on/off or limited to manual activation (a transcript-to-lesson sketch follows this list).
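As one sketch of the transcript-to-lesson step, assuming the youtube-transcript-api and openai packages. The model name, placeholder video ID, and prompt wording are illustrative, and the call uses the modern openai>=1.0 client style rather than the older openai.ChatCompletion named in the table above:

```python
from openai import OpenAI
from youtube_transcript_api import YouTubeTranscriptApi

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def learn_from_video(video_id: str) -> str:
    """Fetch a video's transcript and distill it into actionable steps."""
    # Classic youtube-transcript-api call; newer 1.x releases use
    # YouTubeTranscriptApi().fetch(video_id) instead.
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    transcript = " ".join(seg["text"] for seg in segments)

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "Rewrite this tutorial transcript as short, "
                        "numbered, actionable steps."},
            {"role": "user", "content": transcript[:12000]},  # crude length cap
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(learn_from_video("REPLACE_WITH_VIDEO_ID"))  # placeholder ID
```

The learned steps could then be stored with a "source = learned from YouTube" tag, per the list above.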
4. Full System Control (Desktop & OS Interaction)
- Open, close, and control applications via voice (e.g., Chrome, VS Code, Notepad).
- Control system-level operations: shutdown, reboot, sleep, lock workstation.
- Manage volume and brightness: "Set volume to 30%" or "Dim the screen to 50%."
- Navigate with the mouse and keyboard via voice commands using pyautogui.
- Create, delete, rename, move, and copy files and folders, including search by name or date.
- Access clipboard contents and perform copy-paste via voice.
- Take screenshots or record the screen on request.
- Interact with external drives or cloud directories if required.
- Manage system notifications: read, dismiss, or open them by voice.
- Provide system health info (CPU, RAM, battery status) with alerts.
- Secure actions: ask for confirmation before destructive tasks like deleting folders.
- Provide undo/redo voice commands based on state history.
- Manage application window positions: move, resize, minimize, maximize.
- Allow macro creation: record a series of actions and replay them by voice.
- Safety first: every critical action should have a confirmation step or fallback.

5. Live Screen Awareness & Context Sensitivity
- Continuously monitor the active window or context to determine the foreground app or browser tab.
- Use OCR (pytesseract) to read visible text on screen, such as headlines, document titles, and dialog boxes.
- Capture screenshots on demand or periodically for context.
- Understand context, e.g., editing a Word document, watching YouTube, coding in an IDE.
- Provide proactive suggestions, e.g., "You're composing an email; should I proofread it?"
- Understand window titles, like "Settings", "Terminal", or "Figma_Projects – VSCode".
- Recognize when a video is playing, paused, or buffering, and adjust responses accordingly.
- Combine active-app detection with voice: "Jarvis, what's open?" gets answered with a list.
- Track user focus; if idle for a while, gently prompt: "Need help, sir?"
- Highlight meaningful screen events (popups, errors, notifications) and announce them by voice.
- Use semantic filtering; don't announce every minor change (e.g., cursor moves).
- Allow the user to ask "What's happening on screen?" and get a meaningful, contextual answer.
- Enforce privacy: screen data is analyzed locally and not stored beyond immediate need.
- Context awareness integrates with memory: "Last time you were editing report.docx."
- The component is decoupled and easy to enhance later with vision models (a minimal OCR sketch follows).
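As a minimal sketch of that screen-awareness check, assuming pygetwindow, pyautogui, and pytesseract plus a locally installed Tesseract binary; active-window detection is most reliable on Windows:

```python
import pyautogui
import pygetwindow as gw
import pytesseract  # requires the Tesseract OCR binary installed locally

def describe_screen() -> dict:
    """Return the active window title plus OCR'd on-screen text."""
    window = gw.getActiveWindow()  # most dependable on Windows
    title = window.title if window else "unknown"

    screenshot = pyautogui.screenshot()            # full-screen PIL image
    text = pytesseract.image_to_string(screenshot)  # raw OCR text

    return {"active_window": title, "visible_text": text.strip()}

if __name__ == "__main__":
    info = describe_screen()
    print(f"Active window: {info['active_window']}")
    # Simple context rule of the kind the spec describes:
    if "Word" in info["active_window"]:
        print("You're editing a document in Word. Would you like me to proofread it?")
```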
6. Video Scene Understanding & Script Generation
- On "Jarvis, watch this video," the assistant begins processing the visuals and audio.
- Detects scene transitions and timestamps using scenedetect or similar tools.
- Transcribes dialogue and narration via whisper, generating time-aligned speech text.
- For each scene, summarizes in screenplay-like detail: characters, setting, action.
- Recognizes emotional tone in speech and visual cues (facial expressions, tone of voice).
- Uses GPT (or a multimodal model) for narrative description: "He looks remorseful…"
- Produces a scene-by-scene script or summary with timestamps.
- Stores key learnings per scene as memory snippets.
- Supports various video formats, both local files and streams (via URL).
- Allows follow-ups such as "What did that man say at 02:32?", answered from the stored transcript.
- Offers question-based exploration: "Why was he crying?" or "What did the document say?"
- Memory-aware: if a scene repeats in the future, Jarvis recalls the previous summary.
- Modular: separate video-processing, transcription, summarization, and memory modules.
- Offers export: save the script as a DOC or text file for later analysis.
- Error handling: gracefully signals when a video format is unsupported or processing fails (a scene-detection sketch appears after section 7).

7. Personality, Emotional Touch & Long-Term Memory
- Jarvis greets you depending on the time of day: calming in the morning, soothing in the evening.
- Remembers preferences, e.g., "You prefer silent mode in the evenings."
- Includes periodic motivational quotes and gentle humor when idle ("Still here, sir?").
- Responds empathetically: "I sense you're tired. Want me to play something relaxing?"
- Stores personal data securely: the user's name, preferred mode, favorite applications.
- Offers memory recall, e.g., "Jarvis, remind me what you learned yesterday."
- Friendly voice personality: expressively calm, confident, and supportive.
- Encourages the user: "You've completed 5 tasks today — very productive!"
- Keeps a local dialogue log for introspection and debugging.
- Customizable mood settings: serious, playful, strictly professional.
- Offers mini actions, like telling a joke or an interesting fact when idle.
- Optionally integrates ambient sounds or a soft hum to feel alive.
- Excels at tone-matching: if the user sounds sad, replies with a soft, concerned expression.
- Efficient memory: recalls events like "the last file you opened" or "the last video you asked about."
- The personality module is pluggable for future upgrades (a memory-recall sketch closes out this spec).
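Picking the video pipeline back up from section 6, a minimal scene-plus-dialogue sketch, assuming the scenedetect and openai-whisper packages with ffmpeg on the PATH (the model size and file name are illustrative):

```python
import whisper
from scenedetect import detect, ContentDetector

def summarize_video(path: str) -> None:
    """List scene cuts and attach the dialogue spoken inside each scene."""
    scenes = detect(path, ContentDetector())       # [(start, end), ...]
    model = whisper.load_model("base")             # small, CPU-friendly model
    segments = model.transcribe(path)["segments"]  # time-aligned speech

    for i, (start, end) in enumerate(scenes, 1):
        s, e = start.get_seconds(), end.get_seconds()
        dialogue = " ".join(
            seg["text"].strip() for seg in segments
            if s <= seg["start"] < e
        )
        print(f"Scene {i} [{start.get_timecode()} - {end.get_timecode()}]")
        print(f"  Dialogue: {dialogue or '(none)'}")
        # Each scene's dialogue could now go to GPT for the
        # screenplay-style description section 6 describes.

if __name__ == "__main__":
    summarize_video("sample_video.mp4")  # illustrative file name
```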
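Finally, for the long-term memory and recall described in sections 2, 7, and 8, a minimal sketch using TinyDB (the file name, field names, and the "lesson" tag are illustrative assumptions):

```python
from datetime import date, timedelta

from tinydb import TinyDB, Query

db = TinyDB("jarvis_memory.json")  # illustrative file name

def remember(kind: str, content: str) -> None:
    """Log a command, explanation, or learned lesson with today's date."""
    db.insert({"kind": kind, "content": content, "date": date.today().isoformat()})

def recall_yesterday() -> list[str]:
    """Answer 'Jarvis, what did you learn yesterday?' from the local log."""
    yesterday = (date.today() - timedelta(days=1)).isoformat()
    entry = Query()
    hits = db.search((entry.kind == "lesson") & (entry.date == yesterday))
    return [hit["content"] for hit in hits]

if __name__ == "__main__":
    remember("lesson", "How to generate a pie chart from a CSV with matplotlib.")
    print(recall_yesterday() or "Nothing was learned yesterday, sir.")
```

The same store can back the dialogue log and preference memory from section 7; sqlite3 would work equally well where queries grow more complex.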