Day 20 - 30 Days 30 ML Projects: Create a Topic Model Using Latent Dirichlet Allocation (LDA)

Hey, it is day 20 of the 30 Days 30 ML Projects challenge. Here is the full code for the LDA topic modeling project using the New York Times Comments Dataset. It includes all steps: data preprocessing, LDA model building, and visualization.

If you want to see the code, you can find it here: GIT REPO.

Code Flow

1. Importing Libraries

import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim import corpora
import gensim
import pyLDAvis.gensim  # note: newer pyLDAvis releases expose this module as pyLDAvis.gensim_models
  • Pandas: We use this to load and manipulate the dataset.
  • nltk: The Natural Language Toolkit (NLTK) helps with text preprocessing (tokenization, stopwords, lemmatization).
  • re: This is the regular expressions library in Python, which we use to remove unwanted characters (e.g., punctuation).
  • gensim: A library for topic modeling that implements LDA, as well as word embeddings, text similarity, and more.
  • pyLDAvis: A visualization tool specifically for LDA topic models, which provides an interactive display of the topics.

2. Loading the Dataset

df = pd.read_csv('your_dataset.csv')  # Replace with the path to your dataset

# Select the 'snippet' column
text_data = df['snippet'].dropna().tolist()
  • We load the New York Times Comments Dataset into a pandas DataFrame. This dataset has multiple columns, but we’re interested in the text-related columns.
  • We use the snippet column, which contains short text snippets of the articles. This will be our source of text for topic modeling.
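Before preprocessing, it is worth confirming that the column exists and peeking at a few raw snippets. A quick sanity-check sketch, assuming the DataFrame loaded above:

# Preview the available columns and a few raw snippets
print(df.columns.tolist())
print(df['snippet'].dropna().head(3))
print(f"Number of snippets: {len(text_data)}")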

3. Preprocessing Setup

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
  • Download NLTK Resources: We download NLTK’s stopwords, tokenization (punkt), and lemmatization resources (wordnet) so that we can preprocess the text.
  • Stopwords: Words like “and,” “is,” “in,” etc., which don’t contribute much to the meaning of the text, are removed to improve the quality of the topic modeling.
  • Lemmatizer: This reduces words to their base form. For example, “running” becomes “run.” Lemmatization helps group words with similar meanings.
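To see what these resources do in isolation, here is a tiny sketch using the stop_words set and lemmatizer defined above (the example words are illustrative, not taken from the dataset). Note that WordNetLemmatizer treats words as nouns by default, so verb forms like "running" only reduce to "run" when pos='v' is passed:

# Stopword lookup: common function words are in the set, content words are not
print('and' in stop_words, 'apple' in stop_words)   # True False
# Lemmatization: nouns are reduced with the default POS, verbs need pos='v'
print(lemmatizer.lemmatize('apples'))                # 'apple'
print(lemmatizer.lemmatize('running', pos='v'))      # 'run'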

4. Text Preprocessing

def preprocess(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and numbers (keep only letters and whitespace)
    text = re.sub(r'[^a-z\s]+', ' ', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords and lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return tokens
  • Convert to Lowercase: We convert everything to lowercase to avoid “Apple” and “apple” being treated as separate words.
  • Remove Punctuation and Numbers: Using regular expressions, we remove unwanted characters like punctuation (. or ,) and numbers.
  • Tokenization: Tokenizing breaks the text into individual words.
  • Stopword Removal and Lemmatization: We remove stopwords and apply lemmatization to reduce words to their base form.
cleaned_data = [preprocess(text) for text in text_data]
  • We apply the preprocess function to each snippet of text and store the cleaned, tokenized data.
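For instance, running the function on a made-up sentence (not from the dataset) shows the kind of token list it produces:

sample = "The Apples were running 3 miles, and the markets opened!"
print(preprocess(sample))
# Expected output (roughly): ['apple', 'running', 'mile', 'market', 'opened']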

5. Creating Dictionary and Corpus

dictionary = corpora.Dictionary(cleaned_data)
corpus = [dictionary.doc2bow(text) for text in cleaned_data]
  • Dictionary: The dictionary maps each unique word in the dataset to a unique integer ID. This is necessary for Gensim’s LDA model to operate.
  • Corpus: The corpus is a Bag-of-Words (BoW) representation of the text. It converts each document (snippet) into a list of tuples where each tuple represents the word’s ID and its count in the document.

Example

If the cleaned text looks like this: ['apple', 'banana', 'apple'], the dictionary might map:

  • “apple” → 0
  • “banana” → 1

The corpus for this snippet would be [(0, 2), (1, 1)], meaning “apple” appears twice and “banana” appears once.
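You can reproduce this toy example directly with gensim (a small sketch using the made-up document above, not the real dataset):

toy_docs = [['apple', 'banana', 'apple']]
toy_dict = corpora.Dictionary(toy_docs)
print(toy_dict.token2id)              # e.g. {'apple': 0, 'banana': 1}
print(toy_dict.doc2bow(toy_docs[0]))  # [(0, 2), (1, 1)]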

6. Building the LDA Model

lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=10)
  • LDA Model: We create the LDA model using Gensim.
    • corpus: The corpus (Bag-of-Words) representation of the cleaned data.
    • num_topics=5: We ask LDA to find 5 topics. You can adjust this to any number of topics you expect; a coherence-scoring sketch for picking this value follows this list.
    • id2word=dictionary: This parameter maps the word IDs in the corpus back to actual words.
    • passes=10: This specifies how many times the algorithm should pass over the entire corpus. More passes can lead to better topic distribution but will take longer.
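If you are unsure how many topics to ask for, a common approach is to train a few models and compare their coherence scores. A minimal sketch using gensim's CoherenceModel (the candidate values and the 'c_v' measure are arbitrary choices, not part of the original setup):

from gensim.models import CoherenceModel

for k in [3, 5, 7, 10]:
    model = gensim.models.ldamodel.LdaModel(corpus, num_topics=k, id2word=dictionary, passes=10)
    score = CoherenceModel(model=model, texts=cleaned_data, dictionary=dictionary,
                           coherence='c_v').get_coherence()
    print(f"num_topics={k}: coherence={score:.3f}")
# Higher coherence generally indicates more interpretable topics.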

7. Displaying Topics

topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)
  • Print Topics: This displays the top 10 words in each of the 5 topics. LDA uses a probabilistic approach to assign words to topics, so each topic is represented by the words most likely to appear in that topic.

Output Example:

0: 0.025*"apple" + 0.018*"banana" + 0.015*"market" + ...
1: 0.021*"company" + 0.019*"technology" + 0.017*"innovation" + ...
  • Interpretation: The output indicates that words like “apple,” “banana,” and “market” are prominent in Topic 0, while words like “company,” “technology,” and “innovation” are more frequent in Topic 1.

8. Visualizing Topics using pyLDAvis

pyLDAvis.enable_notebook()
lda_vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(lda_vis)
  • pyLDAvis: This tool provides an interactive visualization for topic models. It shows how the topics are distributed across documents and which words are strongly associated with each topic.
  • Interactivity: You can explore each topic by clicking on it and seeing the most frequent words in that topic.
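If you are running this as a plain script rather than in a notebook, the same interactive view can be written to a standalone HTML file (a small sketch; the output filename is arbitrary):

# Save the interactive visualization and open the file in a browser
pyLDAvis.save_html(lda_vis, 'lda_topics.html')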

Final Thoughts:

  • Adjust Number of Topics: Based on the coherence of the topics, you might want to adjust num_topics to a different number (e.g., 3, 7, 10).
  • Preprocessing: If the results are not meaningful, consider improving the preprocessing step (e.g., adding more custom stopwords).
  • Further Analysis: You can explore which articles are most strongly associated with each topic and gain more insights from the model, as shown in the sketch after this list.
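For the last point, get_document_topics returns the topic mixture of a single document, which can be used to find the snippets most strongly associated with a given topic. A minimal sketch (the topic index and number of results are arbitrary choices):

# Topic mixture for the first snippet: a list of (topic_id, probability) pairs
print(lda_model.get_document_topics(corpus[0]))

# Rank snippets by their probability of belonging to topic 0
topic_id = 0
scores = []
for i, bow in enumerate(corpus):
    for t, p in lda_model.get_document_topics(bow):
        if t == topic_id:
            scores.append((p, text_data[i]))
for p, snippet in sorted(scores, reverse=True)[:5]:
    print(f"{p:.2f}  {snippet[:80]}")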

Gratitude

This was my first problem using NLTK. I did not perform well, but I plan to solve another 10-15 problems on it in the future to get a better understanding of this topic.

Stay Tuned for day 21!
