Building a Jarvis-like AI Program with ZeroLM & ChatGPT
Introduction
In a world overflowing with data, a voice-activated assistant like Jarvis from Iron Man can transform the way we interact with information. In this enhanced tutorial, we will build "ZLM GPT," a basic voice-activated assistant powered by ZeroLM and a handful of Python libraries.
Objective
Our goal is to create an AI assistant capable of understanding and responding to user voice input, using the ZeroLM API, a platform that leverages powerful language models such as GPT-3 and GPT-4.
Requirements
- Python (>= 3.9)
- A ZeroLM API key
- A system with a microphone (tested on macOS)
- Several Python libraries (listed below)
Prerequisites:
- Install the required libraries:
pip install requests numpy sounddevice soundfile pyglet python-dotenv
- Configure environment variables: create a .env file and store your ZeroLM API key securely in it:
ZERO_LM_API_KEY=<Your ZeroLM API Key>
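To verify the key loads correctly before going further, an optional sanity check like this can save debugging time later (it assumes the .env file sits in your working directory):
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
assert os.getenv("ZERO_LM_API_KEY"), "ZERO_LM_API_KEY not found - check your .env file"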
Assistant Workflow
- The assistant starts by recording ambient noise to set a sound threshold for detecting voice commands.
- It listens to the user and records the user’s voice if it is louder than the ambient noise level.
- The recorded voice is then converted to text.
- This text is sent to ZeroLM, and a response is generated.
- The response is converted back to speech and played to the user.
Code Breakdown
Below is a comprehensive walkthrough of the Python code used to create the AI assistant. Each section is modular and explained in detail for easy replication in other programming languages.
1. Import Libraries and Load Environment Variables:
import os
import requests
import numpy as np
import sounddevice as sd
import soundfile as sf
import pyglet
import json
from dotenv import load_dotenv
load_dotenv()
ZERO_LM_API_KEY = os.getenv("ZERO_LM_API_KEY")
2. Initialize Constants and Variables:
Set up the initial constants and variables used throughout the code, including the API URL and the audio recording parameters.
ZERO_LM_API_URL = 'https://zero-api.civai.co'
RATE = 44100
CHANNELS = 2
AMBIENT_SECONDS = 3
RECORD_SECONDS = 10
THRESHOLD_MULTIPLIER = 1.8
ambient_max_amplitude = None
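To make the threshold logic concrete before we reach the recording function, here is a tiny standalone sketch of the idea: a clip counts as speech only when its peak amplitude exceeds the ambient peak times THRESHOLD_MULTIPLIER. The sample values below are made up purely for illustration:
import numpy as np

# Illustrative int16 buffers (made-up values, not real recordings).
ambient = np.array([120, -340, 260], dtype=np.int16)
candidate = np.array([900, -1500, 700], dtype=np.int16)

ambient_peak = np.max(np.abs(ambient))         # 340
threshold = ambient_peak * 1.8                 # THRESHOLD_MULTIPLIER
print(np.max(np.abs(candidate)) > threshold)   # True -> treat as speech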
3. Define Utility Functions:
a. API Call Function:
Create a utility function to handle the API calls to ZeroLM.
def zerolm_call(endpoint, method="GET", params=None, data=None, files=None):
    headers = {'Authorization': f'Bearer {ZERO_LM_API_KEY}'}
    url = f"{ZERO_LM_API_URL}{endpoint}"
    if method == "POST":
        if files:
            # Let requests set the multipart boundary itself; a manual
            # Content-Type header would break the file upload.
            response = requests.post(url, headers=headers, files=files)
        else:
            headers['Content-Type'] = 'application/json'
            response = requests.post(url, headers=headers, json=data)
    else:
        response = requests.get(url, headers=headers, params=params)
    content_type = response.headers.get('Content-Type', '')
    if 'application/json' in content_type:
        return response.json()
    return response.content
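For reference, here is how this helper is invoked for the three endpoints the tutorial uses (the history string shown is just a placeholder):
# GET with query parameters (returns raw audio bytes from /tts):
audio_bytes = zerolm_call('/tts', params={"text": "Hello there"})

# POST with a JSON body (returns a parsed dict from /chat):
reply = zerolm_call('/chat', method="POST", data={'history': "User: Hello\n"})

# POST with a file upload (returns a parsed dict from /transcribe):
with open("audio.wav", "rb") as f:
    result = zerolm_call('/transcribe', method="POST", files={'file': f})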
b. Speech to Text:
Convert user’s speech to text using ZeroLM’s transcribe service.
def speech_to_text(audio_file_path):
    with open(audio_file_path, 'rb') as audio_file:
        transcription_content = zerolm_call('/transcribe', method="POST", files={'file': audio_file})
    return transcription_content.get('transcription')
c. Text to Speech:
Convert the received text response to speech using ZeroLM’s TTS service.
def text_to_speech(text):
    audio_content = zerolm_call('/tts', method="GET", params={"text": text})
    audio_out = "audio_output.wav"
    with open(audio_out, "wb") as out:
        out.write(audio_content)
    return audio_out
d. Audio Recording and Playing:
Record user's voice and play back the response received from ZeroLM.
def record_audio():
    global ambient_max_amplitude
    if ambient_max_amplitude is None:
        # Calibrate once: record a few seconds of ambient noise as the baseline.
        ambient_frames = sd.rec(int(RATE * AMBIENT_SECONDS), samplerate=RATE, channels=CHANNELS, dtype=np.int16)
        sd.wait()
        ambient_max_amplitude = np.max(np.abs(ambient_frames))
    while True:
        frames = sd.rec(int(RATE * RECORD_SECONDS), samplerate=RATE, channels=CHANNELS, dtype=np.int16)
        sd.wait()
        max_amplitude = np.max(np.abs(frames))
        # Keep the clip only if it is clearly louder than the ambient baseline.
        if max_amplitude > ambient_max_amplitude * THRESHOLD_MULTIPLIER:
            sf.write("audio.wav", frames, RATE)
            return "audio.wav"

def play_audio(audio_file):
    music = pyglet.media.load(audio_file)
    player = music.play()
    player.on_eos = pyglet.app.exit  # stop the event loop when playback ends
    pyglet.app.run()
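pyglet handles playback here, but if it proves finicky on your platform, the same job can be done with the sounddevice and soundfile libraries we already installed. A minimal alternative sketch (my own addition, not part of the original code):
def play_audio_alt(audio_file):
    # Blocking playback via sounddevice instead of pyglet.
    data, samplerate = sf.read(audio_file, dtype='float32')
    sd.play(data, samplerate)
    sd.wait()  # return only after playback finishes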
4. Main Loop:
if __name__ == "__main__":
    conversation_history = ""
    while True:
        audio_file_path = record_audio()
        transcription = speech_to_text(audio_file_path)
        print(f"You: {transcription}")
        conversation_history += f"User: {transcription}\n"
        response_text = zerolm_call('/chat', method="POST", data={'history': conversation_history}).get('response')
        print(f"ZLM GPT: {response_text}")
        conversation_history += f"ZLM GPT: {response_text}\n"
        response_audio_file = text_to_speech(response_text)
        play_audio(response_audio_file)
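One caveat worth noting: conversation_history grows without bound, so long sessions will eventually send very large prompts. A simple guard (my own addition, not part of the ZeroLM API) is to trim the oldest lines once the history passes a size limit:
MAX_HISTORY_CHARS = 4000  # arbitrary cap; tune to your model's context size

def trim_history(history, limit=MAX_HISTORY_CHARS):
    """Drop the oldest lines until the history fits within `limit` characters."""
    lines = history.splitlines(keepends=True)
    while lines and sum(len(line) for line in lines) > limit:
        lines.pop(0)  # discard the oldest exchange first
    return "".join(lines)

# In the main loop, apply it after each append:
# conversation_history = trim_history(conversation_history)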
Execution Flow:
- The main loop starts with ambient noise recording to set a threshold.
- The assistant then continuously listens to the user.
- When the user speaks, the assistant transcribes the speech, processes it using ZeroLM, and plays the response back to the user.
Conclusion:
With ZeroLM, constructing a voice-interactive assistant like Jarvis is no longer confined to fiction. This tutorial offers a foundation for building a simple yet capable assistant that can comprehend and respond intelligently to voice commands.
Feel free to expand on this foundation, integrating additional features and drawing on the extensive capabilities of ZeroLM to develop more advanced applications.
👉 Explore the complete & fully functional source code on GitHub Gist.
Disclaimer
Friendly reminder: While unlocking new potentials with this tutorial, please don’t create the next Ultron! If any world-conquering AIs arise, I'm claiming no responsibility!
Let's keep our inventions on the side of good and avoid any unnecessary calls to the Avengers 🙏.
Happy coding, and remember, create responsibly!