Sunday, 13 April 2025

Train Your Own Generative AI Chatbot Using Movie Dialogues

Overview

This project demonstrates how to train a custom generative AI chatbot by fine-tuning Microsoft's pre-trained DialoGPT model (loaded via Hugging Face) on the Cornell Movie-Dialogs Corpus. The result is a personalized chatbot with human-like conversational ability, suitable for entertainment, storytelling, or virtual assistant applications.

Key Features

  • Uses real-world movie dialogue to mimic natural human conversation.
  • Fine-tunes Microsoft’s DialoGPT (DialoGPT-small) for chat applications.
  • Prepares the data in a question-answer format for easy training.
  • Fully customizable: extend with your own data, deploy to web or voice interfaces.

Dataset Used

Cornell Movie-Dialogs Corpus
Contains over 220,000 conversational exchanges between characters from 617 movies.
Files used:

  • movie_lines.txt: All movie lines with metadata
  • movie_conversations.txt: Sequence of conversation line IDs
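
Each record in these files separates its fields with the delimiter " +++$+++ ". As the parsing code in step 2 assumes, the layouts look roughly like this (the IDs and text shown here are illustrative):

movie_lines.txt:         lineID +++$+++ characterID +++$+++ movieID +++$+++ character name +++$+++ utterance text
movie_conversations.txt: characterID1 +++$+++ characterID2 +++$+++ movieID +++$+++ ['L194', 'L195', 'L196']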

Working Steps

1. Get the Dataset

Download from: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Extract movie_lines.txt and movie_conversations.txt from the downloaded archive.
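
If you prefer to script this step, a minimal Python sketch follows. The exact archive filename (cornell_movie_dialogs_corpus.zip) is an assumption based on the corpus page, so verify the link before running:

import urllib.request
import zipfile

# NOTE: assumed archive location; confirm on the corpus page linked above.
URL = "https://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip"

urllib.request.urlretrieve(URL, "cornell_corpus.zip")
with zipfile.ZipFile("cornell_corpus.zip") as zf:
    zf.extractall(".")  # the .txt files land inside the extracted corpus folder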

2. Prepare the Dataset

import ast

def load_conversations(lines_file, conv_file):
    # Map each line ID (e.g. 'L1045') to its utterance text.
    id2line = {}
    with open(lines_file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.strip().split(" +++$+++ ")
            if len(parts) == 5:
                id2line[parts[0]] = parts[4]

    # Turn each conversation into consecutive (prompt, response) pairs.
    conversations = []
    with open(conv_file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.strip().split(" +++$+++ ")
            if len(parts) == 4:
                # parts[3] is a Python-style list literal such as "['L194', 'L195']";
                # ast.literal_eval parses it safely (unlike eval).
                utterance_ids = ast.literal_eval(parts[3])
                for i in range(len(utterance_ids) - 1):
                    first, second = utterance_ids[i], utterance_ids[i + 1]
                    # Skip IDs that are missing from movie_lines.txt.
                    if first in id2line and second in id2line:
                        conversations.append((id2line[first], id2line[second]))
    return conversations

conversations = load_conversations("movie_lines.txt", "movie_conversations.txt")
print("Example pair:\n", conversations[0])

3. Install Required Libraries

pip install datasets transformers torch

PyTorch is included because both the Trainer in step 5 and the chat loop in step 7 depend on it.

4. Tokenize and Prepare Dataset

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
# DialoGPT's tokenizer ships without a pad token; reuse the EOS token so
# that padding="max_length" below does not raise an error.
tokenizer.pad_token = tokenizer.eos_token

# Use a subset (the first 10,000 pairs) to keep fine-tuning fast.
pairs = [{"input": q, "output": a} for q, a in conversations[:10000]]
dataset = Dataset.from_list(pairs)

def tokenize(example):
    # Concatenate prompt and response, each terminated by EOS,
    # which is the turn separator DialoGPT was trained with.
    input_text = example["input"] + tokenizer.eos_token
    output_text = example["output"] + tokenizer.eos_token
    full_text = input_text + output_text
    tokens = tokenizer(full_text, truncation=True, padding="max_length", max_length=128)
    # For causal language modeling, the labels are the input IDs themselves.
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# Drop the raw text columns so the Trainer only sees tensor-friendly fields.
tokenized_dataset = dataset.map(tokenize, remove_columns=["input", "output"])
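
One optional refinement, sketched below under the padding scheme used above: set the label at every padding position to -100 so the loss ignores padding tokens instead of learning to predict them. The function name tokenize_masked is just an illustrative alternative to the tokenize above.

def tokenize_masked(example):
    full_text = example["input"] + tokenizer.eos_token + example["output"] + tokenizer.eos_token
    tokens = tokenizer(full_text, truncation=True, padding="max_length", max_length=128)
    # -100 is the label index that PyTorch's cross-entropy loss ignores,
    # so padded positions contribute nothing to the gradient.
    tokens["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens

tokenized_dataset = dataset.map(tokenize_masked, remove_columns=["input", "output"])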

5. Fine-Tune the Model

from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

training_args = TrainingArguments(
    output_dir="./genai-moviebot",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=1000,
    save_total_limit=2,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()

6. Save Your Fine-Tuned Model

model.save_pretrained("./moviebot")
tokenizer.save_pretrained("./moviebot")

7. Chat with Your MovieBot

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./moviebot")
model = AutoModelForCausalLM.from_pretrained("./moviebot")

def chat_with_moviebot():
    print("MovieBot: Let's chat! (type 'quit' to exit)")
    history = None
    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            break
        # Encode the new user turn, terminated by EOS as DialoGPT expects.
        new_input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors='pt')
        # Append it to the running conversation history, if any.
        input_ids = torch.cat([history, new_input_ids], dim=-1) if history is not None else new_input_ids
        output = model.generate(input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
        # Decode only the newly generated tokens, not the whole history.
        response = tokenizer.decode(output[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
        print("MovieBot:", response)
        history = output  # carry the full exchange forward as context

chat_with_moviebot()
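
Note that history grows with every turn and will eventually exceed both max_length=1000 and the model's 1024-token context window. A simple guard, assuming you are willing to forget the oldest turns, is to trim the context before generating. The token budget here is illustrative:

# Inside the chat loop, before model.generate(...):
MAX_HISTORY_TOKENS = 512  # illustrative budget, safely below DialoGPT's 1024-token context
if input_ids.shape[-1] > MAX_HISTORY_TOKENS:
    input_ids = input_ids[:, -MAX_HISTORY_TOKENS:]  # keep only the most recent tokens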

Technologies Used

  • Python 3
  • Hugging Face Transformers
  • PyTorch
  • Datasets Library
  • Cornell Movie-Dialogs Corpus

Applications

  • AI-based Virtual Assistants
  • Role-playing or Movie Character Bots
  • Entertainment or Chat Simulation Tools
  • Fine-tuned Voice Agents or Story Narrators

Future Enhancements

  • Train with multi-turn conversations
  • Add voice interaction with gTTS or pyttsx3
  • Build a GUI using Gradio or Streamlit (see the sketch after this list)
  • Deploy the model using Hugging Face Spaces, Flask, or FastAPI
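
As a starting point for the GUI idea, here is a minimal sketch using Gradio's Blocks API. It assumes the fine-tuned model was saved to ./moviebot as in step 6, and the respond helper is an illustrative name; each message is handled as a single turn rather than with the running history from step 7:

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./moviebot")
model = AutoModelForCausalLM.from_pretrained("./moviebot")

def respond(message, chat_history):
    # Encode the user message with the EOS turn separator, then generate a reply.
    input_ids = tokenizer.encode(message + tokenizer.eos_token, return_tensors="pt")
    output = model.generate(input_ids, max_length=200, pad_token_id=tokenizer.eos_token_id)
    reply = tokenizer.decode(output[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
    chat_history.append((message, reply))
    return "", chat_history  # clear the textbox, update the chat window

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox(label="You")
    msg.submit(respond, [msg, chatbot], [msg, chatbot])

demo.launch()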
