Train Your Own Generative AI Chatbot Using Movie Dialogues
Overview
This project demonstrates how to train a custom generative AI chatbot by fine-tuning Microsoft's pre-trained DialoGPT model from Hugging Face on the Cornell Movie-Dialogs Corpus. The result is a personalized chatbot with human-like conversational ability, suitable for entertainment, storytelling, or virtual assistants.
Key Features
- Uses real-world movie dialogue to mimic natural human conversation.
- Fine-tunes Microsoft’s DialoGPT (DialoGPT-small) for chat applications.
- Prepares the data in a question-answer format for easy training.
- Fully customizable: extend with your own data, deploy to web or voice interfaces.
Dataset Used
Cornell Movie-Dialogs Corpus
Contains over 220,000 conversational exchanges between characters from 617 movies.
Files used:
- movie_lines.txt: all movie lines with metadata
- movie_conversations.txt: sequences of conversation line IDs
Working Steps
1. Get the Dataset
Download from: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Extract movie_lines.txt and movie_conversations.txt into your working directory.
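Before writing the parser, it helps to know that each record in these files separates its fields with " +++$+++ "; the loading code in step 2 relies on this. An optional quick peek (assuming both files are in your working directory) shows the raw format:

# Optional: peek at the raw file format. Fields are separated by " +++$+++ ".
with open("movie_lines.txt", encoding="utf-8", errors="ignore") as f:
    print(f.readline().strip())   # e.g. L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
with open("movie_conversations.txt", encoding="utf-8", errors="ignore") as f:
    print(f.readline().strip())   # e.g. u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', ...]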
2. Prepare the Dataset
import ast

def load_conversations(lines_file, conv_file):
    # Map each line ID to its utterance text.
    id2line = {}
    with open(lines_file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.strip().split(" +++$+++ ")
            if len(parts) == 5:
                id2line[parts[0]] = parts[4]
    # Build (prompt, response) pairs from consecutive lines of each conversation.
    conversations = []
    with open(conv_file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.strip().split(" +++$+++ ")
            if len(parts) == 4:
                utterance_ids = ast.literal_eval(parts[3])  # safer than eval()
                for i in range(len(utterance_ids) - 1):
                    first, second = utterance_ids[i], utterance_ids[i + 1]
                    if first in id2line and second in id2line:
                        conversations.append((id2line[first], id2line[second]))
    return conversations

conversations = load_conversations("movie_lines.txt", "movie_conversations.txt")
print("Example pair:\n", conversations[0])
3. Install Required Libraries
pip install torch transformers datasets accelerate
PyTorch is required by the model and the Trainer; recent versions of the Hugging Face Trainer also expect accelerate to be installed.
4. Tokenize and Prepare Dataset
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
# DialoGPT has no dedicated padding token, so reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token

# Use a subset of the pairs to keep training time manageable.
pairs = [{"input": q, "output": a} for q, a in conversations[:10000]]
dataset = Dataset.from_list(pairs)

def tokenize(example):
    # Concatenate prompt and response, each terminated by the EOS token.
    input_text = example["input"] + tokenizer.eos_token
    output_text = example["output"] + tokenizer.eos_token
    full_text = input_text + output_text
    tokens = tokenizer(full_text, truncation=True, padding="max_length", max_length=128)
    # For causal LM fine-tuning, the labels are the input IDs themselves.
    # (Padded positions also count toward the loss here; masking them with -100 is a common refinement.)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_dataset = dataset.map(tokenize, remove_columns=["input", "output"])
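To verify the preprocessing (an optional check, not part of the original steps), decode one tokenized example back to text; you should see the prompt and response joined by the end-of-text token, followed by padding:

# Optional sanity check: decode the first tokenized example back to text.
sample = tokenized_dataset[0]
print(tokenizer.decode(sample["input_ids"], skip_special_tokens=False)[:300])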
5. Fine-Tune the Model
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

training_args = TrainingArguments(
    output_dir="./genai-moviebot",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=1000,
    save_total_limit=2,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()
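Before saving, you can run a quick smoke test (optional, not part of the original steps) to see the fine-tuned model answer a single prompt. The .to(model.device) call matters because the Trainer may have moved the model to a GPU:

# Optional smoke test: generate one reply with the freshly fine-tuned model.
prompt = "Hello, how are you?"
inputs = tokenizer(prompt + tokenizer.eos_token, return_tensors="pt").to(model.device)
reply_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(reply_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))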
6. Save Your Fine-Tuned Model
model.save_pretrained("./moviebot")
tokenizer.save_pretrained("./moviebot")
7. Chat with Your MovieBot
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./moviebot")
model = AutoModelForCausalLM.from_pretrained("./moviebot")

def chat_with_moviebot():
    print("MovieBot: Let's chat! (type 'quit' to exit)")
    history = None
    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            break
        # Encode the new user turn and append it to the running conversation history.
        new_input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors='pt')
        input_ids = torch.cat([history, new_input_ids], dim=-1) if history is not None else new_input_ids
        output = model.generate(input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
        # Decode only the newly generated tokens (everything after the prompt).
        response = tokenizer.decode(output[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
        print("MovieBot:", response)
        history = output

chat_with_moviebot()
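Greedy decoding (the default used above) tends to produce short, repetitive replies. A common refinement is to sample with top-k/top-p and a temperature; if you want to try it, replace the model.generate(...) call inside chat_with_moviebot with something like the following (the values are illustrative, not tuned):

# Sampling-based generation for more varied replies (values are illustrative).
output = model.generate(
    input_ids,
    max_length=1000,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)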
Technologies Used
- Python 3
- Hugging Face Transformers
- PyTorch
- Datasets Library
- Cornell Movie Dialogs Dataset
Applications
- AI-based Virtual Assistants
- Role-playing or Movie Character Bots
- Entertainment or Chat Simulation Tools
- Fine-tuned Voice Agents or Story Narrators
Future Enhancements
- Train with multi-turn conversations
- Add voice interaction with gTTS or pyttsx3
- Build a GUI using Gradio or Streamlit (a minimal Gradio sketch is shown after this list)
- Deploy the model using Hugging Face Spaces, Flask, or FastAPI
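As a starting point for the GUI idea above, here is a minimal sketch that wraps the saved ./moviebot model from step 6 in a Gradio chat interface. It assumes Gradio is installed (pip install gradio); gr.ChatInterface calls the function with the latest user message and the visible chat history:

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./moviebot")
model = AutoModelForCausalLM.from_pretrained("./moviebot")

def respond(message, history):
    # For simplicity, condition only on the current message, not the full history.
    input_ids = tokenizer.encode(message + tokenizer.eos_token, return_tensors="pt")
    output = model.generate(input_ids, max_length=200, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)

gr.ChatInterface(respond).launch()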