Train Your Own Generative AI Chatbot Using Movie Dialogues
Overview
This project demonstrates how to train a custom generative AI chatbot by fine-tuning the pre-trained DialoGPT model from Hugging Face on the Cornell Movie-Dialogs Corpus. The result is a chatbot that replies in the style of movie dialogue, which makes it a good starting point for entertainment, storytelling, or virtual-assistant projects.
Key Features
- Uses real-world movie dialogue to mimic natural human conversation.
- Fine-tunes Microsoft’s DialoGPT (DialoGPT-small) for chat applications.
- Prepares the data in a question-answer format for easy training.
- Fully customizable: extend with your own data, deploy to web or voice interfaces.
Dataset Used
Cornell Movie-Dialogs Corpus
Contains over 220,000 conversational exchanges between characters from 617 movies.
Files used:
- movie_lines.txt: All movie lines with metadata
- movie_conversations.txt: Sequence of conversation line IDs
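To make the file layout concrete, here is a minimal sketch of how one record from each file breaks apart. Fields in both files are separated by ` +++$+++ `. The sample records below are made up for illustration (they only follow the corpus layout), and the helper names `parse_line_record` and `parse_conversation_record` are hypothetical.

```python
import ast

SEP = " +++$+++ "

def parse_line_record(record):
    """Split a movie_lines.txt record into (line_id, text)."""
    line_id, character_id, movie_id, character_name, text = record.split(SEP)
    return line_id, text

def parse_conversation_record(record):
    """Split a movie_conversations.txt record into its list of line IDs."""
    character_a, character_b, movie_id, id_list = record.split(SEP)
    # The last field is a Python-style list literal; literal_eval parses it
    # without executing arbitrary code.
    return ast.literal_eval(id_list)

# Made-up sample records in the corpus layout (not real corpus data).
sample_lines = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Are you coming tonight?",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ I would not miss it.",
]
sample_conversation = "u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2']"

id2line = dict(parse_line_record(r) for r in sample_lines)
ids = parse_conversation_record(sample_conversation)
print([id2line[i] for i in ids])
```

The conversation record only stores line IDs, so the text of each turn has to be looked up in the mapping built from movie_lines.txt.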
Working Steps
1. Get the Dataset
Download the corpus from https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html, then extract movie_lines.txt and movie_conversations.txt into your working directory.
2. Prepare the Dataset
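import ast

SEP = " +++$+++ "

def load_conversations(lines_file, conv_file):
    # Map each line ID to its text (fields: lineID, characterID, movieID, name, text)
    id2line = {}
    with open(lines_file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.strip().split(SEP)
            if len(parts) == 5:
                id2line[parts[0]] = parts[4]

    # Build (prompt, reply) pairs from consecutive lines of each conversation
    conversations = []
    with open(conv_file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.strip().split(SEP)
            if len(parts) == 4:
                # literal_eval parses the ID list without executing arbitrary code
                utterance_ids = ast.literal_eval(parts[3])
                for first, second in zip(utterance_ids, utterance_ids[1:]):
                    # Lines with empty text were skipped above, so their IDs may be missing
                    if first in id2line and second in id2line:
                        conversations.append((id2line[first], id2line[second]))
    return conversations

conversations = load_conversations("movie_lines.txt", "movie_conversations.txt")
print("Example pair:\n", conversations[0])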
3. Install Required Libraries
pip install datasets transformers torch accelerate
4. Tokenize and Prepare Dataset
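from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
# DialoGPT ships without a padding token; reuse EOS so padding="max_length" works
tokenizer.pad_token = tokenizer.eos_token

# Only the first 10,000 pairs are used
pairs = [{"input": q, "output": a} for q, a in conversations[:10000]]
dataset = Dataset.from_list(pairs)

def tokenize(example):
    # Format: prompt <eos> reply <eos>
    input_text = example["input"] + tokenizer.eos_token
    output_text = example["output"] + tokenizer.eos_token
    full_text = input_text + output_text
    tokens = tokenizer(full_text, truncation=True, padding="max_length", max_length=128)
    # Copy the inputs as labels, masking padded positions with -100 so the loss ignores them
    tokens["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens

tokenized_dataset = dataset.map(tokenize)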
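One detail worth checking when padding every example to a fixed length: padded positions are usually excluded from the loss by setting their labels to -100, the value Hugging Face causal language models skip when computing the loss. The sketch below shows the idea on plain Python lists, without loading a tokenizer. The helper name `mask_padding` and the token IDs are made up for illustration; 50256 is the GPT-2 end-of-text ID, which stands in for padding here.

```python
IGNORE_INDEX = -100  # label value skipped by the loss in Hugging Face causal LMs

def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

# Made-up token IDs: 50256 is the GPT-2 end-of-text ID, reused here as padding.
input_ids      = [15496, 50256, 17250, 50256, 50256, 50256]
attention_mask = [1,     1,     1,     1,     0,     0]

labels = mask_padding(input_ids, attention_mask)
print(labels)  # [15496, 50256, 17250, 50256, -100, -100]
```

The two real end-of-text tokens (attention mask 1) stay in the labels, so the model still learns where a turn ends; only the trailing padding is ignored.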
5. Fine-Tune the Model
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

training_args = TrainingArguments(
    output_dir="./genai-moviebot",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=1000,
    save_total_limit=2,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()
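As a rough guide to what this configuration implies, the arithmetic below estimates the number of optimizer steps and the periodic checkpoints. It is a sketch under two assumptions that are not stated in the post: training runs on a single device, and gradient accumulation is left at its default of 1.

```python
import math

num_examples = 10_000   # pairs kept in the tokenization step
batch_size = 4          # per_device_train_batch_size
num_epochs = 3          # num_train_epochs
save_steps = 1_000      # save_steps

# Assumptions: one device, gradient_accumulation_steps left at 1.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch)        # 2500
print(total_steps)            # 7500
print(checkpoint_steps[-2:])  # [6000, 7000]
```

With save_total_limit=2, only the two most recent checkpoints are kept on disk. Depending on the Transformers version, an extra checkpoint may also be written at the final step, so it is safer to save the finished model explicitly.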
6. Save Your Fine-Tuned Model
model.save_pretrained("./moviebot")
tokenizer.save_pretrained("./moviebot")
7. Chat with Your MovieBot
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./moviebot")
model = AutoModelForCausalLM.from_pretrained("./moviebot")

def chat_with_moviebot():
    print("MovieBot: Let's chat! (type 'quit' to exit)")
    history = None  # token IDs of the conversation so far
    while True:
        user_input = input("You: ")
        if user_input.strip().lower() == "quit":
            break
        # Encode the new message and append it to the running history
        new_input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors='pt')
        input_ids = torch.cat([history, new_input_ids], dim=-1) if history is not None else new_input_ids
        # max_length counts prompt plus reply, so very long chats leave little room for new text
        output = model.generate(input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
        # Decode only the newly generated tokens
        response = tokenizer.decode(output[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
        print("MovieBot:", response)
        history = output

chat_with_moviebot()
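Because the chat loop feeds the whole conversation back into the model on every turn, a long chat will eventually approach the 1024-token context window that DialoGPT inherits from GPT-2. One possible remedy is to keep only the most recent tokens of the history. The sketch below shows the idea on a plain Python list; the helper name `truncate_history` and the token budget are hypothetical, not part of the original project.

```python
def truncate_history(token_ids, max_tokens):
    """Keep only the most recent max_tokens IDs of the conversation."""
    if max_tokens <= 0:
        # Guard: token_ids[-0:] would return the whole list, not an empty one.
        return []
    return token_ids[-max_tokens:]

history = list(range(10))             # stand-in for 10 conversation token IDs
print(truncate_history(history, 4))   # [6, 7, 8, 9]
print(truncate_history(history, 50))  # shorter than the budget, so unchanged
```

For a PyTorch tensor of shape (1, n), as used in the chat loop, the equivalent slice would be `history[:, -max_tokens:]`. Cutting at a fixed token count can split a turn in half; trimming at an end-of-text boundary would be a cleaner variant.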
Technologies Used
- Python 3
- Hugging Face Transformers
- PyTorch
- Datasets Library
- Cornell Movie-Dialogs Corpus
Applications
- AI-based Virtual Assistants
- Role-playing or Movie Character Bots
- Entertainment or Chat Simulation Tools
- Fine-tuned Voice Agents or Story Narrators
Future Enhancements
- Train with multi-turn conversations
- Add voice interaction with gTTS or pyttsx3
- Build a GUI using Gradio or Streamlit
- Deploy the model using Hugging Face Spaces, Flask, or FastAPI
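For the multi-turn enhancement, one possible approach is to extend the same prompt-plus-reply format to several turns of context, joined by the end-of-text token. The sketch below builds such training strings from one conversation. The helper name `build_multi_turn_examples` and the `max_context_turns` parameter are hypothetical, and the token string is hard-coded for illustration; in the real pipeline `tokenizer.eos_token` would be used instead.

```python
EOS = "<|endoftext|>"  # GPT-2 / DialoGPT end-of-text token; use tokenizer.eos_token in practice

def build_multi_turn_examples(utterances, max_context_turns=3):
    """Turn one conversation into training strings made of up to
    max_context_turns previous turns followed by the reply."""
    examples = []
    for i in range(1, len(utterances)):
        start = max(0, i - max_context_turns)
        turns = utterances[start:i + 1]  # context turns plus the reply
        examples.append(EOS.join(turns) + EOS)
    return examples

# Made-up conversation for illustration.
conversation = ["Hi.", "Hello.", "How are you?", "Fine."]
for text in build_multi_turn_examples(conversation, max_context_turns=2):
    print(text)
```

Each resulting string can be tokenized the same way as the single-turn pairs, so the rest of the training code would not need to change.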