Sunday, 13 April 2025

Train Your Own Generative AI Chatbot Using Movie Dialogues

Overview

This project demonstrates how to train a custom generative AI chatbot by fine-tuning Microsoft's pre-trained DialoGPT model (loaded via Hugging Face) on the Cornell Movie-Dialogs Corpus. The result is a personalized chatbot with human-like conversational ability, suitable for entertainment, storytelling, or virtual assistant applications.

Key Features

  • Uses real-world movie dialogue to mimic natural human conversation.
  • Fine-tunes Microsoft’s DialoGPT (DialoGPT-small) for chat applications.
  • Prepares the data in a question-answer format for easy training.
  • Fully customizable: extend with your own data, deploy to web or voice interfaces.

Dataset Used

Cornell Movie-Dialogs Corpus
Contains over 220,000 conversational exchanges between characters from 617 movies.
Files used:

  • movie_lines.txt: All movie lines with metadata
  • movie_conversations.txt: Sequence of conversation line IDs
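
Each record in these files separates its fields with the delimiter " +++$+++ ". As the parsing code in step 2 assumes, the layouts look roughly like this (the IDs and text shown here are illustrative):

movie_lines.txt:         lineID +++$+++ characterID +++$+++ movieID +++$+++ character name +++$+++ utterance text
movie_conversations.txt: characterID1 +++$+++ characterID2 +++$+++ movieID +++$+++ ['L194', 'L195', 'L196']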

Working Steps

1. Get the Dataset

Download from: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Extract movie_lines.txt and movie_conversations.txt from the downloaded archive.
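
If you prefer to script this step, a minimal Python sketch follows. The exact archive filename (cornell_movie_dialogs_corpus.zip) is an assumption based on the corpus page, so verify the link before running:

import urllib.request
import zipfile

# NOTE: assumed archive location; confirm on the corpus page linked above.
URL = "https://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip"

urllib.request.urlretrieve(URL, "cornell_corpus.zip")
with zipfile.ZipFile("cornell_corpus.zip") as zf:
    zf.extractall(".")  # the .txt files land inside the extracted corpus folder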

2. Prepare the Dataset

import ast

def load_conversations(lines_file, conv_file):
    # Map each line ID (e.g. 'L1045') to its utterance text.
    id2line = {}
    with open(lines_file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.strip().split(" +++$+++ ")
            if len(parts) == 5:
                id2line[parts[0]] = parts[4]

    # Turn each conversation into consecutive (prompt, response) pairs.
    conversations = []
    with open(conv_file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.strip().split(" +++$+++ ")
            if len(parts) == 4:
                # parts[3] is a Python-style list literal such as "['L194', 'L195']";
                # ast.literal_eval parses it safely (unlike eval).
                utterance_ids = ast.literal_eval(parts[3])
                for i in range(len(utterance_ids) - 1):
                    first, second = utterance_ids[i], utterance_ids[i + 1]
                    # Skip IDs that are missing from movie_lines.txt.
                    if first in id2line and second in id2line:
                        conversations.append((id2line[first], id2line[second]))
    return conversations

conversations = load_conversations("movie_lines.txt", "movie_conversations.txt")
print("Example pair:\n", conversations[0])

3. Install Required Libraries

pip install datasets transformers torch

PyTorch is included because both the Trainer in step 5 and the chat loop in step 7 depend on it.

4. Tokenize and Prepare Dataset

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
# DialoGPT's tokenizer ships without a pad token; reuse the EOS token so
# that padding="max_length" below does not raise an error.
tokenizer.pad_token = tokenizer.eos_token

# Use a subset (the first 10,000 pairs) to keep fine-tuning fast.
pairs = [{"input": q, "output": a} for q, a in conversations[:10000]]
dataset = Dataset.from_list(pairs)

def tokenize(example):
    # Concatenate prompt and response, each terminated by EOS,
    # which is the turn separator DialoGPT was trained with.
    input_text = example["input"] + tokenizer.eos_token
    output_text = example["output"] + tokenizer.eos_token
    full_text = input_text + output_text
    tokens = tokenizer(full_text, truncation=True, padding="max_length", max_length=128)
    # For causal language modeling, the labels are the input IDs themselves.
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# Drop the raw text columns so the Trainer only sees tensor-friendly fields.
tokenized_dataset = dataset.map(tokenize, remove_columns=["input", "output"])
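
One optional refinement, sketched below under the padding scheme used above: set the label at every padding position to -100 so the loss ignores padding tokens instead of learning to predict them. The function name tokenize_masked is just an illustrative alternative to the tokenize above.

def tokenize_masked(example):
    full_text = example["input"] + tokenizer.eos_token + example["output"] + tokenizer.eos_token
    tokens = tokenizer(full_text, truncation=True, padding="max_length", max_length=128)
    # -100 is the label index that PyTorch's cross-entropy loss ignores,
    # so padded positions contribute nothing to the gradient.
    tokens["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens

tokenized_dataset = dataset.map(tokenize_masked, remove_columns=["input", "output"])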

5. Fine-Tune the Model

from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

training_args = TrainingArguments(
    output_dir="./genai-moviebot",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=1000,
    save_total_limit=2,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()

6. Save Your Fine-Tuned Model

model.save_pretrained("./moviebot")
tokenizer.save_pretrained("./moviebot")

7. Chat with Your MovieBot

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./moviebot")
model = AutoModelForCausalLM.from_pretrained("./moviebot")

def chat_with_moviebot():
    print("MovieBot: Let's chat! (type 'quit' to exit)")
    history = None
    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            break
        # Encode the new user turn, terminated by EOS as DialoGPT expects.
        new_input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors='pt')
        # Append it to the running conversation history, if any.
        input_ids = torch.cat([history, new_input_ids], dim=-1) if history is not None else new_input_ids
        output = model.generate(input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
        # Decode only the newly generated tokens, not the whole history.
        response = tokenizer.decode(output[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
        print("MovieBot:", response)
        history = output  # carry the full exchange forward as context

chat_with_moviebot()
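
Note that history grows with every turn and will eventually exceed both max_length=1000 and the model's 1024-token context window. A simple guard, assuming you are willing to forget the oldest turns, is to trim the context before generating. The token budget here is illustrative:

# Inside the chat loop, before model.generate(...):
MAX_HISTORY_TOKENS = 512  # illustrative budget, safely below DialoGPT's 1024-token context
if input_ids.shape[-1] > MAX_HISTORY_TOKENS:
    input_ids = input_ids[:, -MAX_HISTORY_TOKENS:]  # keep only the most recent tokens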

Technologies Used

  • Python 3
  • Hugging Face Transformers
  • PyTorch
  • Datasets Library
  • Cornell Movie-Dialogs Corpus

Applications

  • AI-based Virtual Assistants
  • Role-playing or Movie Character Bots
  • Entertainment or Chat Simulation Tools
  • Fine-tuned Voice Agents or Story Narrators

Future Enhancements

  • Train with multi-turn conversations
  • Add voice interaction with gTTS or pyttsx3
  • Build a GUI using Gradio or Streamlit (see the sketch after this list)
  • Deploy the model using Hugging Face Spaces, Flask, or FastAPI
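
As a starting point for the GUI idea, here is a minimal sketch using Gradio's Blocks API. It assumes the fine-tuned model was saved to ./moviebot as in step 6, and the respond helper is an illustrative name; each message is handled as a single turn rather than with the running history from step 7:

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./moviebot")
model = AutoModelForCausalLM.from_pretrained("./moviebot")

def respond(message, chat_history):
    # Encode the user message with the EOS turn separator, then generate a reply.
    input_ids = tokenizer.encode(message + tokenizer.eos_token, return_tensors="pt")
    output = model.generate(input_ids, max_length=200, pad_token_id=tokenizer.eos_token_id)
    reply = tokenizer.decode(output[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
    chat_history.append((message, reply))
    return "", chat_history  # clear the textbox, update the chat window

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox(label="You")
    msg.submit(respond, [msg, chatbot], [msg, chatbot])

demo.launch()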
