Sunday, 13 April 2025

Train Your Own Generative AI Chatbot Using Movie Dialogues

 

Overview

This project shows how to fine-tune a custom generative AI chatbot on the Cornell Movie-Dialogs Corpus, starting from the pre-trained DialoGPT model available on Hugging Face. The result is a chatbot that replies in the style of movie dialogue, suitable for entertainment, storytelling, or as a starting point for a virtual assistant.

Key Features

  • Uses movie dialogue to approximate natural human conversation.
  • Fine-tunes Microsoft’s DialoGPT (DialoGPT-small) for chat applications.
  • Prepares the data as (prompt, reply) pairs of consecutive lines for easy training.
  • Fully customizable: extend it with your own data, or deploy it to web or voice interfaces.

Dataset Used

Cornell Movie-Dialogs Corpus
Contains over 220,000 conversational exchanges between characters from 617 movies.
Files used:

  • movie_lines.txt: the text of every line, with metadata (line ID, character, movie)
  • movie_conversations.txt: the ordered line IDs that make up each conversation
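
Both files separate their fields with the string " +++$+++ ". The sketch below parses one row of each file to show the layout. The two sample rows are illustrative stand-ins written in the corpus format rather than read from disk, and the helper names parse_line and parse_conversation are hypothetical.

```python
import ast

SEP = " +++$+++ "

# Illustrative rows in the corpus format (assumed samples, not loaded from the files)
sample_line = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
sample_conv = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"

def parse_line(raw):
    """Split a movie_lines.txt row into its five fields."""
    line_id, character_id, movie_id, character, text = raw.strip().split(SEP)
    return {
        "line_id": line_id,
        "character_id": character_id,
        "movie_id": movie_id,
        "character": character,
        "text": text,
    }

def parse_conversation(raw):
    """Split a movie_conversations.txt row; the last field is a Python-style list of line IDs."""
    first_char, second_char, movie_id, ids = raw.strip().split(SEP)
    return {
        "characters": (first_char, second_char),
        "movie_id": movie_id,
        "line_ids": ast.literal_eval(ids),  # safe parsing of the list literal
    }

print(parse_line(sample_line)["text"])
print(parse_conversation(sample_conv)["line_ids"])
```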

Working Steps

1. Get the Dataset

Download the corpus archive from: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Extract movie_lines.txt and movie_conversations.txt into your working directory.

2. Prepare the Dataset

import ast

SEP = " +++$+++ "

def load_conversations(lines_file, conv_file):
    # Map each line ID (e.g. "L1045") to its utterance text
    id2line = {}
    with open(lines_file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.strip().split(SEP)
            if len(parts) == 5:
                id2line[parts[0]] = parts[4]

    # Turn each conversation into (utterance, reply) pairs of consecutive lines
    conversations = []
    with open(conv_file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.strip().split(SEP)
            if len(parts) == 4:
                # literal_eval parses the ID list without executing arbitrary code
                utterance_ids = ast.literal_eval(parts[3])
                for first, second in zip(utterance_ids, utterance_ids[1:]):
                    # Lines with empty text never reach id2line, so skip those pairs
                    if first in id2line and second in id2line:
                        conversations.append((id2line[first], id2line[second]))
    return conversations

conversations = load_conversations("movie_lines.txt", "movie_conversations.txt")
print("Example pair:\n", conversations[0])
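
To see what the loader produces, the following sketch applies the same consecutive-pair logic to a small made-up conversation and then drops pairs that are empty or very long. The helper names and the 30-word limit are assumptions chosen for illustration; they are not part of the pipeline above.

```python
def make_pairs(utterances):
    """Pair every utterance with the one that follows it."""
    return list(zip(utterances, utterances[1:]))

def keep_pair(pair, max_words=30):
    """Keep a pair only if both sides are non-empty and at most max_words long (assumed limit)."""
    return all(0 < len(text.split()) <= max_words for text in pair)

# Made-up four-line conversation, used only for illustration
toy_conversation = ["Where are you going?", "Home.", "", "Wait for me!"]

toy_pairs = make_pairs(toy_conversation)
filtered_pairs = [p for p in toy_pairs if keep_pair(p)]
print(len(toy_pairs), "pairs before filtering,", len(filtered_pairs), "after")
```

A filter like this could be applied to the conversations list before building the training set, if short and clean exchanges are preferred.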

3. Install Required Libraries

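pip install datasets transformers torch accelerate

PyTorch (torch) is the backend used in the steps below, and recent versions of the Trainer class also require accelerate.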

4. Tokenize and Prepare Dataset

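from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
# DialoGPT's GPT-2 tokenizer has no padding token; reuse EOS so padding works
tokenizer.pad_token = tokenizer.eos_token

# Use only the first 10,000 pairs
pairs = [{"input": q, "output": a} for q, a in conversations[:10000]]
dataset = Dataset.from_list(pairs)

def tokenize(example):
    # Training text: prompt <eos> reply <eos>
    full_text = example["input"] + tokenizer.eos_token + example["output"] + tokenizer.eos_token
    tokens = tokenizer(full_text, truncation=True, padding="max_length", max_length=128)
    # Label padded positions -100 so the loss skips them
    tokens["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens

tokenized_dataset = dataset.map(tokenize, remove_columns=["input", "output"])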
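
Because every example is padded to 128 tokens, most positions in a short exchange are padding. Hugging Face causal language models skip positions whose label is -100 when computing the loss, so padded positions are usually given that label. The sketch below shows the idea on plain lists rather than real tokenizer output; the token IDs are made up (apart from 50256, GPT-2's end-of-text ID), and mask_padding is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]

# Made-up example: 5 real tokens followed by 3 padding tokens.
# The padding token shares the EOS ID, so only the attention mask tells them apart.
input_ids = [101, 202, 50256, 303, 50256, 50256, 50256, 50256]
attention_mask = [1, 1, 1, 1, 1, 0, 0, 0]

labels = mask_padding(input_ids, attention_mask)
print(labels)
```

Masking by attention mask rather than by token ID matters here: the end-of-sequence tokens that close the prompt and the reply keep their labels, so the model still learns where a reply ends.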

5. Fine-Tune the Model

from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

training_args = TrainingArguments(
    output_dir="./genai-moviebot",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=1000,
    save_total_limit=2,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()
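
The settings above determine how long training runs and when checkpoints are written. The following back-of-the-envelope sketch works through the arithmetic for 10,000 examples, assuming a single device and no gradient accumulation; training_plan is a hypothetical helper, not a Transformers function.

```python
import math

def training_plan(num_examples, batch_size, epochs, save_steps):
    """Estimate optimizer steps and checkpoint steps (single device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))
    return steps_per_epoch, total_steps, checkpoint_steps

steps_per_epoch, total_steps, checkpoint_steps = training_plan(
    num_examples=10000, batch_size=4, epochs=3, save_steps=1000
)
print(steps_per_epoch, total_steps, checkpoint_steps)
```

With these numbers there are 2,500 steps per epoch and 7,500 in total, and step-based checkpoints fall at steps 1000 through 7000. Since save_total_limit=2, only the most recent two of those should remain on disk.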

6. Save Your Fine-Tuned Model

model.save_pretrained("./moviebot")
tokenizer.save_pretrained("./moviebot")

7. Chat with Your MovieBot

import torch  # needed for torch.cat in the chat loop
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./moviebot")
model = AutoModelForCausalLM.from_pretrained("./moviebot")

def chat_with_moviebot():
    print("MovieBot: Let's chat! (type 'quit' to exit)")
    history = None
    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            break
        new_input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors='pt')
        input_ids = torch.cat([history, new_input_ids], dim=-1) if history is not None else new_input_ids
        output = model.generate(input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
        response = tokenizer.decode(output[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
        print("MovieBot:", response)
        history = output

chat_with_moviebot()
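
The loop above appends every turn to the history, so a long chat eventually exceeds what the model can attend to (DialoGPT, like GPT-2, has a 1,024-token context window). One possible safeguard, sketched here on plain token-ID lists, is to keep only the most recent turns that fit a token budget. The function name trim_history and the 512-token budget are assumptions for illustration.

```python
def trim_history(turns, max_tokens=512):
    """Keep the most recent turns whose combined length fits within max_tokens.

    turns is a list of token-ID lists, oldest first.
    """
    kept, total = [], 0
    for turn in reversed(turns):
        if total + len(turn) > max_tokens:
            break  # this turn and everything older is dropped
        kept.append(turn)
        total += len(turn)
    return list(reversed(kept))

# Made-up turns of 300, 200, 150, and 100 tokens
turns = [[1] * 300, [2] * 200, [3] * 150, [4] * 100]
trimmed = trim_history(turns, max_tokens=512)
print([len(t) for t in trimmed])
```

To use something like this in the chat loop, the history would need to be stored per turn and flattened into a single tensor before each call to model.generate.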

Technologies Used

  • Python 3
  • Hugging Face Transformers
  • PyTorch
  • Hugging Face Datasets
  • Cornell Movie-Dialogs Corpus

Applications

  • AI-based Virtual Assistants
  • Role-playing or Movie Character Bots
  • Entertainment or Chat Simulation Tools
  • Fine-tuned Voice Agents or Story Narrators

Future Enhancements

  • Train with multi-turn conversations
  • Add voice interaction with gTTS or pyttsx3
  • Build a GUI using Gradio or Streamlit
  • Deploy the model using Hugging Face Spaces, Flask, or FastAPI
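
As a starting point for the multi-turn idea, the sketch below builds training strings that include several preceding turns as context, each terminated by the end-of-sequence token. This is one possible formatting under stated assumptions, not a tested recipe: build_multiturn_examples and max_context_turns are hypothetical names, and the toy conversation is made up.

```python
def build_multiturn_examples(turns, eos_token="<|endoftext|>", max_context_turns=3):
    """For each reply, join up to max_context_turns preceding turns plus the reply.

    Every turn is followed by eos_token, mirroring the single-turn format.
    """
    examples = []
    for i in range(1, len(turns)):
        start = max(0, i - max_context_turns)
        window = turns[start:i + 1]  # context turns followed by the reply
        examples.append("".join(turn + eos_token for turn in window))
    return examples

# Made-up conversation; "<eos>" is used in place of the real token for readability
toy_turns = ["Hi.", "Hello.", "How are you?", "Fine."]
multiturn_examples = build_multiturn_examples(toy_turns, eos_token="<eos>", max_context_turns=2)
for example in multiturn_examples:
    print(example)
```

Strings built this way could be tokenized in place of the single-turn text from step 4, though max_length would probably need to be raised to fit the longer context.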


