#4203 Speech-to-Text-to-LLM-to-Report
#4203 Provide a Python app to perform real-time speech-to-text and provide intermittent LLM analysis and reporting of the conversation
To build a real-time speech-to-text (STT) app with intermittent LLM analysis, the most efficient modern approach is to use a streaming SDK (like AssemblyAI or Deepgram) for low-latency transcription and a threading/buffer system to send chunks of that text to an LLM (like OpenAI's GPT-4o) every N seconds.
This must run locally, on-machine
To move this entirely on-machine, we shift from cloud APIs to local inference. This requires more hardware "oomph" (specifically a GPU with 8GB+ VRAM is ideal, though CPUs work slower), but it ensures total privacy and zero API costs.
We will use:
Faster-Whisper: A highly optimized version of OpenAI’s Whisper for real-time STT.
Ollama: The industry standard for running LLMs (like Llama 3 or Mistral) locally.
1. Prerequisites
First, install Ollama, then pull a model from the command line:
ollama pull llama3
Then, install the Python dependencies:
pip install faster-whisper pyaudio numpy ollama
2. The Local STT & Analysis App
This script uses a Thread-safe Queue to move text from the "Ear" (Whisper) to the "Brain" (Ollama) without blocking the audio stream.
import time
import threading
import queue
import ollama
from faster_whisper import WhisperModel
import pyaudio
import numpy as np

# --- CONFIGURATION ---
MODEL_SIZE = "base"      # Options: tiny, base, small, medium (base is fast for real-time)
LLM_MODEL = "llama3"     # The model you pulled via Ollama
ANALYSIS_INTERVAL = 30   # Seconds between LLM reports

class LocalVoiceAssistant:
    def __init__(self):
        print(f"Loading Whisper model ({MODEL_SIZE})...")
        # Use "cuda" if you have an NVIDIA GPU, otherwise "cpu"
        self.stt_model = WhisperModel(MODEL_SIZE, device="cpu", compute_type="int8")
        self.text_buffer = []
        self.analysis_queue = queue.Queue()
        self.running = True

    def listen_loop(self):
        """Captures audio from the mic and transcribes in real time."""
        p = pyaudio.PyAudio()
        stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                        input=True, frames_per_buffer=1024)
        print("🎤 Local STT Active. Speak into the mic...")
        audio_data = b""
        while self.running:
            chunk = stream.read(1024, exception_on_overflow=False)
            audio_data += chunk
            # Simple logic: process audio in 3-second blocks
            # (16000 samples/s * 2 bytes/sample * 3 s)
            if len(audio_data) >= 16000 * 2 * 3:
                audio_np = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
                segments, _ = self.stt_model.transcribe(audio_np, beam_size=5)
                for segment in segments:
                    if segment.text.strip():
                        print(f"[Live] {segment.text}")
                        self.text_buffer.append(segment.text)
                audio_data = b""  # Reset buffer
        stream.stop_stream()
        stream.close()
        p.terminate()

    def analysis_loop(self):
        """Periodically sends accumulated text to the local LLM."""
        last_analysis_time = time.time()
        while self.running:
            time.sleep(1)
            if time.time() - last_analysis_time >= ANALYSIS_INTERVAL:
                if self.text_buffer:
                    full_text = " ".join(self.text_buffer)
                    self.text_buffer = []  # Clear buffer
                    print("\n--- 🧠 OLLAMA ANALYZING ---")
                    try:
                        response = ollama.chat(model=LLM_MODEL, messages=[
                            {'role': 'system', 'content': 'Summarize this conversation segment concisely.'},
                            {'role': 'user', 'content': full_text},
                        ])
                        print(f"📝 LOCAL REPORT: {response['message']['content']}\n")
                    except Exception as e:
                        print(f"LLM Error: {e}")
                    last_analysis_time = time.time()

    def start(self):
        t1 = threading.Thread(target=self.listen_loop)
        t2 = threading.Thread(target=self.analysis_loop)
        t1.start()
        t2.start()
        t1.join()
        t2.join()

if __name__ == "__main__":
    assistant = LocalVoiceAssistant()
    try:
        assistant.start()
    except KeyboardInterrupt:
        assistant.running = False
        print("\nStopping...")
Why this setup?
Faster-Whisper: It's significantly faster than the standard OpenAI implementation. On a modern CPU, the base model can transcribe audio faster than real time.
Ollama: It manages the LLM lifecycle. If you have a Mac (M1/M2/M3) or a PC with an NVIDIA card, Ollama will automatically use the GPU to make analysis nearly instant.
Privacy: No audio or text ever leaves your machine.
Performance Tweak
If the transcription is lagging behind your speech:
Change device="cpu" to device="cuda" in the WhisperModel initialization (requires an NVIDIA GPU).
Switch MODEL_SIZE to "tiny". It's less accurate but extremely fast on older hardware.
Would you like me to help you refine the "System Prompt" to look for specific keywords or triggers in your local conversations?
Yes
To make this tool truly useful for your workflow, we should refine the "Brain" of the app to act as an automated S&MA (Safety & Mission Assurance) Analyst.
Given your focus on NASA standards like NPR 7150.2D and CMMI-DEV, we can program the local LLM to flag compliance risks, mission-critical keywords, and integration points for your Knowledge Graph projects.
Updated System Prompt
Replace the ollama.chat section in the previous script with this targeted logic. This forces the LLM to filter the transcript through your specific professional lens.
# --- REFINED ANALYSIS LOGIC ---
SYSTEM_PROMPT = """
You are an expert NASA S&MA (Safety & Mission Assurance) AI assistant.
Analyze the provided conversation transcript for:
1. COMPLIANCE: Note any mention of NPR 7150.2D, NASA-STD-8739.8B, or CMMI standards.
2. TECHNICAL KEYWORDS: Flag discussions on Semantic Search, RAG, Knowledge Graphs, or SQLite3.
3. MISSION IMPACT: Identify risks or mission contingency planning for 'Moon to Mars' or 'Gateway'.
4. ACTION ITEMS: Extract specific software engineering tasks for the user.
Format the output as a concise 'Mission Status Report'. If no relevant info is found, say 'No mission-critical data detected.'
"""
# Inside your analysis_loop:
response = ollama.chat(model=LLM_MODEL, messages=[
    {'role': 'system', 'content': SYSTEM_PROMPT},
    {'role': 'user', 'content': f"Transcript Segment: {full_text}"},
])
Adding a "Hotword" Trigger
If you want to trigger a manual report immediately (e.g., when you say "Note that" or "Record this"), you can add a simple string check in the listen_loop:
# Add this inside the 'for segment in segments' loop:
trigger_phrases = ["record this", "note that", "compliance check"]
if any(phrase in segment.text.lower() for phrase in trigger_phrases):
    print("\n⚠️ TRIGGER PHRASE DETECTED - FORCING ANALYSIS...")
    # You can call a manual analysis function here
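A minimal sketch of that manual path (both helper names, check_triggers and force_analysis, are assumptions for illustration, not part of the script above):

```python
def check_triggers(text, trigger_phrases=("record this", "note that", "compliance check")):
    """Return True if any trigger phrase occurs in the transcribed text."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in trigger_phrases)

def force_analysis(text_buffer):
    """Flush the accumulated transcript immediately instead of waiting for
    the ANALYSIS_INTERVAL timer. Returns the joined text (which you would
    then pass to ollama.chat) and clears the shared buffer in place."""
    full_text = " ".join(text_buffer)
    text_buffer.clear()
    return full_text
```

Wiring this in simply means calling force_analysis(self.text_buffer) from the segment loop when check_triggers fires.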
Strategic Performance Note
Since you are using Ollama locally, and you mentioned owning an NVIDIA RTX 3090 Ti, you should ensure the code is utilizing your VRAM:
For STT: Change to WhisperModel(MODEL_SIZE, device="cuda", compute_type="float16"). With 24GB of VRAM, you can easily run the large-v3 model in real time with near-perfect accuracy.
For LLM: Your 3090 Ti can run Llama 3.1 8B or Gemma 2 9B entirely in VRAM with zero lag.
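As a preview of that report-saving idea, the standard-library sqlite3 module is enough; a minimal sketch (the reports table name, its schema, and the returned row count are assumptions for illustration):

```python
import sqlite3
import time

def save_report(db_path, report_text):
    """Append one LLM report with a timestamp; creates the table on first
    use. Returns the total number of stored reports for easy checking."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS reports (ts REAL, report TEXT)")
    conn.execute("INSERT INTO reports VALUES (?, ?)", (time.time(), report_text))
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM reports").fetchone()[0]
    conn.close()
    return count
```

You would call save_report right after printing the LOCAL REPORT in analysis_loop.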
Would you like me to help you write a script that automatically saves these intermittent reports into a local SQLite3 database for your Knowledge Graph?
Does faster-whisper require a GPU?
No, faster-whisper does not strictly require a GPU. It is highly optimized to run on a CPU as well.
Because it uses the CTranslate2 engine, it is significantly more efficient than the original OpenAI Whisper implementation on standard processors.
Running on CPU
To run it on your CPU, you simply change the device parameter during initialization:
from faster_whisper import WhisperModel
# Use "cpu" for the device
# Use "int8" to make it faster/lighter on the CPU
model = WhisperModel("base", device="cpu", compute_type="int8")
Performance Expectations
Speed: On a modern CPU, the base or small models usually transcribe faster than real time. The large-v3 model will be noticeably slower on a CPU but still functional.
Latency: There might be a slight delay (a few seconds) between when audio is captured and when the text appears, compared to the near-instant response of a GPU.
Why you might still use your GPU
Since you have an NVIDIA RTX 3090 Ti, using the GPU is highly recommended for this specific task:
Zero CPU Load: It offloads the heavy lifting from your processor, leaving it free for your other solopreneur workflows or compilation tasks.
Higher Accuracy: You can run the large-v3 model (the most accurate) with almost no latency. On a CPU, large-v3 might struggle to keep up with real-time speech.
VRAM: Your 24GB of VRAM is massive for this; the model will only use about 2–4GB.
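For reference, the CPU/GPU toggle can be automated in pure Python; this helper (pick_device is an assumed name, not part of faster-whisper) probes ctranslate2 for a CUDA device and falls back to CPU:

```python
def pick_device():
    """Choose a faster-whisper (device, compute_type) pair at runtime.
    Falls back to CPU int8 if CUDA, or ctranslate2 itself, is unavailable."""
    try:
        import ctranslate2
        if ctranslate2.get_cuda_device_count() > 0:
            return "cuda", "float16"
    except Exception:
        pass
    return "cpu", "int8"

# Usage sketch: model = WhisperModel(MODEL_SIZE, device=device, compute_type=compute_type)
device, compute_type = pick_device()
```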
Would you like me to show you the 1-line change to the code to toggle between CPU and your 3090 Ti automatically?
How can I set audio in to internal audio?
To capture internal audio (system sounds) on Windows for your Python app, you typically have two options: using WASAPI Loopback (modern/clean) or Stereo Mix (legacy/simple).
Since you are running locally and likely want the highest quality for your transcription, I recommend using WASAPI Loopback via the PyAudioWPatch library.
Option 1: The "Modern" Way (WASAPI Loopback)
This method intercepts the digital audio stream going to your speakers without needing to enable hidden Windows settings.
1. Install the specialized library:
pip install PyAudioWPatch
2. Update your script to find the loopback device:
Replace your pyaudio.PyAudio() initialization and stream opening with this logic:
import pyaudiowpatch as pyaudio

def get_loopback_device(p):
    try:
        # Get default WASAPI info
        wasapi_info = p.get_host_api_info_by_type(pyaudio.paWASAPI)
    except OSError:
        print("WASAPI not found.")
        return None
    # Find the loopback analogue of your default speakers
    default_speakers = p.get_default_output_device_info()
    loopback_device = None
    for i in range(p.get_device_count()):
        dev = p.get_device_info_by_index(i)
        if dev["isLoopbackDevice"] and dev["hostApi"] == wasapi_info["index"]:
            # Match the name to your speakers if possible
            if default_speakers["name"] in dev["name"]:
                loopback_device = dev
                break
    if not loopback_device:
        # Fall back to the default loopback device
        loopback_device = p.get_default_wasapi_loopback()
    return loopback_device

# In your stream-opening code (reuse the same PyAudio instance):
p = pyaudio.PyAudio()
device = get_loopback_device(p)
stream = p.open(
    format=pyaudio.paInt16,
    channels=device["maxInputChannels"],
    rate=int(device["defaultSampleRate"]),
    input=True,
    input_device_index=device["index"],
)
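One caveat: WASAPI loopback devices usually deliver 2-channel audio, while the transcription code earlier feeds Whisper a mono stream. A minimal downmix sketch (assuming interleaved 16-bit stereo; stereo_to_mono is a hypothetical helper name):

```python
import numpy as np

def stereo_to_mono(raw: bytes) -> np.ndarray:
    """Convert interleaved 16-bit stereo PCM to the float32 mono array
    (range -1.0 to 1.0) that faster-whisper's transcribe() accepts."""
    samples = np.frombuffer(raw, dtype=np.int16).reshape(-1, 2)
    return (samples.mean(axis=1) / 32768.0).astype(np.float32)
```

In the listen loop, you would pass stereo_to_mono(audio_data) to transcribe() instead of the mono conversion shown earlier.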
Option 2: The "Old School" Way (Stereo Mix)
If you don't want to use extra libraries, you can use the built-in Windows "Stereo Mix."
Enable Stereo Mix in Windows:
Right-click the Speaker Icon in your taskbar -> Sounds (or Sound Settings > More Sound Settings).
Go to the Recording tab.
Right-click in the list and check "Show Disabled Devices."
Right-click Stereo Mix and select Enable.
Find the Index: Run a small script to print your device indices:

import pyaudio
p = pyaudio.PyAudio()
for i in range(p.get_device_count()):
    print(f"Index {i}: {p.get_device_info_by_index(i)['name']}")

Set the Index: In your main script, set input_device_index to the number corresponding to "Stereo Mix."
Pro-Tip for your Setup
Since you have a powerful RTX 3090 Ti, capturing internal audio is perfect for transcribing Zoom calls, YouTube research, or technical webinars directly into your Knowledge Graph.
Would you like me to help you modify the audio processing loop so it handles the 2-channel (stereo) audio that usually comes from internal system sounds? (Standard Whisper usually expects 1-channel mono).
From <https://gemini.google.com/app/446ecbcd63d1148d> Google Gemini 3.0 (Fast)