Train Your Own Generative AI Chatbot Using Movie Dialogues

 



Overview

This project shows how to fine-tune a custom generative AI chatbot on the Cornell Movie-Dialogs Corpus, starting from the pre-trained DialoGPT model on Hugging Face. The result is a chatbot that replies in the style of movie dialogue, which suits entertainment, storytelling, or virtual-assistant prototypes.

Key Features

  • Uses real-world movie dialogue to mimic natural human conversation.
  • Fine-tunes Microsoft’s DialoGPT (DialoGPT-small) for chat applications.
  • Prepares the data as (utterance, reply) pairs taken from consecutive lines of each conversation.
  • Fully customizable: extend with your own data, deploy to web or voice interfaces.

Dataset Used

Cornell Movie-Dialogs Corpus
Contains over 220,000 conversational exchanges between characters from 617 movies.
Files used:

  • movie_lines.txt: All movie lines with metadata
  • movie_conversations.txt: Sequence of conversation line IDs
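Both files separate their fields with the string " +++$+++ ". As a rough illustration, the sketch below splits one row of each file into named fields. The sample rows and the field names are assumptions based on the layout the loading code in step 2 relies on (five fields per line, four per conversation), so check them against your downloaded copy.

```python
import ast

SEP = " +++$+++ "

# Illustrative rows in the layout of each file (verify against your own copy).
sample_line = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
sample_conv = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"

def parse_line(raw):
    """Split a movie_lines.txt row into its five fields (names are assumed)."""
    line_id, character_id, movie_id, character, text = raw.strip().split(SEP)
    return {
        "line_id": line_id,
        "character_id": character_id,
        "movie_id": movie_id,
        "character": character,
        "text": text,
    }

def parse_conversation(raw):
    """Split a movie_conversations.txt row into its four fields (names are assumed)."""
    first_id, second_id, movie_id, ids = raw.strip().split(SEP)
    return {
        "first_character_id": first_id,
        "second_character_id": second_id,
        "movie_id": movie_id,
        # The last field is a Python-style list literal of line IDs.
        "line_ids": ast.literal_eval(ids),
    }

parsed_line = parse_line(sample_line)
parsed_conv = parse_conversation(sample_conv)
print(parsed_line["text"])
print(parsed_conv["line_ids"])
```

The line IDs in the conversation row are the keys used to look up the actual text in movie_lines.txt.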

Working Steps

1. Get the Dataset

Download from: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Extract movie_lines.txt and movie_conversations.txt into your working directory.

2. Prepare the Dataset

import ast

def load_conversations(lines_file, conv_file):
    # Map each line ID (e.g. "L1045") to its utterance text.
    id2line = {}
    with open(lines_file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.strip().split(" +++$+++ ")
            if len(parts) == 5:
                id2line[parts[0]] = parts[4]

    # Build (utterance, reply) pairs from consecutive lines of each conversation.
    conversations = []
    with open(conv_file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.strip().split(" +++$+++ ")
            if len(parts) == 4:
                # literal_eval parses the ID list without executing arbitrary code.
                utterance_ids = ast.literal_eval(parts[3])
                for first, second in zip(utterance_ids, utterance_ids[1:]):
                    # Lines with empty text were skipped above, so check both IDs.
                    if first in id2line and second in id2line:
                        conversations.append((id2line[first], id2line[second]))
    return conversations

conversations = load_conversations("movie_lines.txt", "movie_conversations.txt")
print("Example pair:\n", conversations[0])
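Movie scripts contain both empty and very long lines, so it can help to filter the pairs before training. The following is a minimal sketch and not part of the original pipeline; `filter_pairs` and its word limits are hypothetical names and values chosen for illustration.

```python
def filter_pairs(pairs, min_words=1, max_words=30):
    """Keep pairs where both sides have between min_words and max_words words."""
    kept = []
    for prompt, reply in pairs:
        prompt, reply = prompt.strip(), reply.strip()
        if all(min_words <= len(s.split()) <= max_words for s in (prompt, reply)):
            kept.append((prompt, reply))
    return kept

# Toy data standing in for the output of load_conversations.
toy_pairs = [
    ("Hi.", "Hello there."),
    ("", "Nobody asked anything."),        # dropped: empty prompt
    ("Why?", " ".join(["word"] * 50)),     # dropped: reply too long
]
filtered = filter_pairs(toy_pairs)
print(filtered)
```

In the real pipeline this would be applied as `conversations = filter_pairs(conversations)` before building the dataset.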

3. Install Required Libraries

pip install datasets transformers torch accelerate

4. Tokenize and Prepare Dataset

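from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
# DialoGPT ships without a pad token; reuse EOS so padding="max_length" works.
tokenizer.pad_token = tokenizer.eos_token

pairs = [{"input": q, "output": a} for q, a in conversations[:10000]]

dataset = Dataset.from_list(pairs)

def tokenize(example):
    # Join the pair as input + EOS + output + EOS, so EOS marks the end of each turn.
    input_text = example["input"] + tokenizer.eos_token
    output_text = example["output"] + tokenizer.eos_token
    full_text = input_text + output_text
    tokens = tokenizer(full_text, truncation=True, padding="max_length", max_length=128)
    # Use -100 on padded positions so the loss ignores them.
    tokens["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens

tokenized_dataset = dataset.map(tokenize)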
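To make the training format concrete, this sketch builds the text that `tokenize` hands to the tokenizer, using plain strings instead of a real tokenizer. It assumes the end-of-sequence token is the GPT-2 string `<|endoftext|>`, which is what `tokenizer.eos_token` should return for DialoGPT; `build_training_text` is a hypothetical helper.

```python
EOS = "<|endoftext|>"  # assumed value of tokenizer.eos_token for DialoGPT

def build_training_text(prompt, reply, eos=EOS):
    """Join one (prompt, reply) pair, ending each turn with the EOS token."""
    return prompt + eos + reply + eos

text = build_training_text("Where are you going?", "Home.")
print(text)
```

Each training example is therefore a single string with two turns, and the model learns to continue the first turn with the second.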

5. Fine-Tune the Model

from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

training_args = TrainingArguments(
    output_dir="./genai-moviebot",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=1000,
    save_total_limit=2,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()
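The arguments above determine how long training runs and how many checkpoints end up on disk. The arithmetic below is a rough sketch that assumes the 10,000 pairs from step 4, a single device, and no gradient accumulation; other setups change the numbers. It counts only the periodic saves at multiples of save_steps, and some library versions may also save once more at the final step.

```python
import math

num_examples = 10_000      # pairs kept in step 4
batch_size = 4             # per_device_train_batch_size
epochs = 3                 # num_train_epochs
save_steps = 1000
save_total_limit = 2

# One optimizer step per batch; the last partial batch still counts.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Periodic checkpoints are written every save_steps steps.
periodic_checkpoints = total_steps // save_steps
# Older checkpoints are deleted once the limit is exceeded.
checkpoints_kept = min(periodic_checkpoints, save_total_limit)

print(steps_per_epoch, total_steps, periodic_checkpoints, checkpoints_kept)
```

Under these assumptions the run takes 7,500 steps and only the two most recent checkpoints remain in ./genai-moviebot.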

6. Save Your Fine-Tuned Model

model.save_pretrained("./moviebot")
tokenizer.save_pretrained("./moviebot")

7. Chat with Your MovieBot

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./moviebot")
model = AutoModelForCausalLM.from_pretrained("./moviebot")

def chat_with_moviebot():
    print("MovieBot: Let's chat! (type 'quit' to exit)")
    history = None
    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            break
        new_input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors='pt')
        input_ids = torch.cat([history, new_input_ids], dim=-1) if history is not None else new_input_ids
        output = model.generate(input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
        response = tokenizer.decode(output[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
        print("MovieBot:", response)
        history = output

chat_with_moviebot()
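Because `history` grows with every turn, a long chat can eventually exceed the model's context window (1,024 tokens for GPT-2-based models such as DialoGPT). One possible safeguard, sketched here on plain Python lists of token IDs, is to keep only the most recent tokens. `truncate_history` and the 512-token budget are illustrative assumptions, not part of the original script.

```python
def truncate_history(token_ids, max_tokens=512):
    """Return the last max_tokens IDs, dropping the oldest ones."""
    if max_tokens <= 0:
        return []
    return token_ids[-max_tokens:]

# Toy token IDs standing in for a real conversation history.
toy_history = list(range(10))
recent = truncate_history(toy_history, max_tokens=4)
print(recent)
```

On the 2-D tensor used in the chat loop, the equivalent would be slicing the last dimension, for example `history[:, -512:]`, before concatenating the new input.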

Technologies Used

  • Python 3
  • Hugging Face Transformers
  • PyTorch
  • Datasets Library
  • Cornell Movie Dialogs Dataset

Applications

  • AI-based Virtual Assistants
  • Role-playing or Movie Character Bots
  • Entertainment or Chat Simulation Tools
  • Fine-tuned Voice Agents or Story Narrators

Future Enhancements

  • Train with multi-turn conversations
  • Add voice interaction with gTTS or pyttsx3
  • Build a GUI using Gradio or Streamlit
  • Deploy the model using Hugging Face Spaces, Flask, or FastAPI
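As a starting point for the multi-turn idea, the sketch below turns one conversation (a list of utterances in order) into training strings that include a few previous turns as context. `multi_turn_examples`, `max_context`, and the `<|endoftext|>` separator are assumptions for illustration.

```python
EOS = "<|endoftext|>"  # assumed end-of-turn token, as in the single-turn setup

def multi_turn_examples(utterances, max_context=3, eos=EOS):
    """Build one training string per reply, with up to max_context prior turns."""
    examples = []
    for i in range(1, len(utterances)):
        start = max(0, i - max_context)
        turns = utterances[start:i + 1]   # context turns followed by the reply
        examples.append(eos.join(turns) + eos)
    return examples

toy_conversation = ["Hi.", "Hello.", "How are you?", "Fine."]
examples = multi_turn_examples(toy_conversation, max_context=2)
for example in examples:
    print(example)
```

Each string could then be tokenized the same way as the single-turn pairs, with a larger max_length to fit the extra context.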
