Day 20 - 30 Days 30 ML Projects: Create a Topic Model Using Latent Dirichlet Allocation (LDA)
Hey, it's day 20 of the 30 Days 30 ML Projects challenge. Here is the full code for your LDA topic modeling project using the New York Times Comments Dataset. It includes all steps: data preprocessing, LDA model building, and visualization.
If you want to see the code, you can find it here: GIT REPO.
Code Flow
1. Importing Libraries
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim import corpora
import gensim
import pyLDAvis.gensim
- Pandas: We use this to load and manipulate the dataset.
- nltk: The Natural Language Toolkit (NLTK) helps with text preprocessing (tokenization, stopwords, lemmatization).
- re: This is the regular expressions library in Python, which we use to remove unwanted characters (e.g., punctuation).
- gensim: A library for topic modeling that implements LDA, as well as word embeddings, text similarity, and more.
- pyLDAvis: A visualization tool specifically for LDA topic models, which provides an interactive display of the topics.
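One version note before moving on: recent pyLDAvis releases (3.x and later) renamed the gensim helper module, so depending on your installed version you may need a fallback import along these lines:
# pyLDAvis renamed its gensim module in version 3.x; try the new name first
try:
    import pyLDAvis.gensim_models as gensimvis  # pyLDAvis >= 3.x
except ImportError:
    import pyLDAvis.gensim as gensimvis         # older pyLDAvis
If you end up on the new module, call gensimvis.prepare(...) in step 8 instead of pyLDAvis.gensim.prepare(...).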
2. Loading the Dataset
df = pd.read_csv('your_dataset.csv') # Replace with the path to your dataset
# Select the 'snippet' column
text_data = df['snippet'].dropna().tolist()
- We load the New York Times Comments Dataset into a pandas DataFrame. This dataset has multiple columns, but we’re interested in the text-related columns.
- We use the snippet column, which contains short text snippets of the articles. This will be our source of text for topic modeling.
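Before preprocessing, it is worth a quick sanity check that the column loaded as expected (a minimal sketch; the exact set of columns depends on which CSV from the dataset you use):
# Peek at the data (column names depend on the CSV you downloaded)
print(df.shape)
print(df.columns.tolist())
print(text_data[:3])  # first few snippets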
3. Preprocessing Setup
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
- Download NLTK Resources: We download NLTK’s stopwords, tokenization (punkt), and lemmatization resources (wordnet) so that we can preprocess the text.
- Stopwords: Words like “and,” “is,” “in,” etc., which don’t contribute much to the meaning of the text, are removed to improve the quality of the topic modeling.
- Lemmatizer: This reduces words to their base form, so “cars” becomes “car” and, when tagged as a verb, “running” becomes “run” (see the demo below). Lemmatization helps group words with similar meanings.
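A quick demonstration of both resources. One caveat worth knowing: WordNet's lemmatizer treats every word as a noun unless you pass a part-of-speech tag, so verb forms like “running” only reduce to “run” with pos='v':
# Stopwords: common words that will be filtered out
print('the' in stop_words)                       # True

# Lemmatization: plural nouns reduce by default ...
print(lemmatizer.lemmatize('cars'))              # 'car'
# ... but verb forms need an explicit part-of-speech tag
print(lemmatizer.lemmatize('running'))           # 'running' (treated as a noun)
print(lemmatizer.lemmatize('running', pos='v'))  # 'run'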
4. Text Preprocessing
def preprocess(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and numbers (keep only letters and whitespace)
    text = re.sub(r'[^a-z\s]+', ' ', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords and lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return tokens
- Convert to Lowercase: We convert everything to lowercase to avoid “Apple” and “apple” being treated as separate words.
- Remove Punctuation and Numbers: Using regular expressions, we remove unwanted characters like punctuation (. or ,) and numbers.
- Tokenization: Tokenizing breaks the text into individual words.
- Stopword Removal and Lemmatization: We remove stopwords and apply lemmatization to reduce words to their base form.
cleaned_data = [preprocess(text) for text in text_data]
- We apply the preprocess function to each snippet of text and store the cleaned, tokenized data.
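Here is what the full pipeline does to one made-up sentence (the output assumes the preprocess function above; note that “running” survives because the lemmatizer defaults to nouns):
sample = "The Markets Rallied in 2019, and Investors Were Running Back."
print(preprocess(sample))
# -> ['market', 'rallied', 'investor', 'running', 'back']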
5. Creating Dictionary and Corpus
dictionary = corpora.Dictionary(cleaned_data)
corpus = [dictionary.doc2bow(text) for text in cleaned_data]
- Dictionary: The dictionary maps each unique word in the dataset to a unique integer ID. This is necessary for Gensim’s LDA model to operate.
- Corpus: The corpus is a Bag-of-Words (BoW) representation of the text. It converts each document (snippet) into a list of tuples where each tuple represents the word’s ID and its count in the document.
Example
If the cleaned text looks like this: ['apple', 'banana', 'apple'], the dictionary might map:
- “apple” → 0
- “banana” → 1
The corpus for this snippet would be [(0, 2), (1, 1)], meaning “apple” appears twice and “banana” appears once.
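In code, the same toy example runs end to end like this (gensim typically assigns new IDs in alphabetical order within a document, so the exact mapping may vary):
from gensim import corpora

toy_docs = [['apple', 'banana', 'apple']]
toy_dict = corpora.Dictionary(toy_docs)

print(toy_dict.token2id)                            # {'apple': 0, 'banana': 1}
print([toy_dict.doc2bow(doc) for doc in toy_docs])  # [[(0, 2), (1, 1)]]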
6. Building the LDA Model
lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=10)
- LDA Model: We create the LDA model using Gensim.
- corpus: The corpus (Bag-of-Words) representation of the cleaned data.
- num_topics=5: We ask LDA to find 5 topics. You can adjust this to any number of topics you expect (a sketch for choosing it follows this list).
- id2word=dictionary: This parameter maps the word IDs in the corpus back to actual words.
- passes=10: This specifies how many times the algorithm should pass over the entire corpus. More passes can lead to better topic distribution but will take longer.
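There is no single right value for num_topics. One common way to choose it is to train a few candidate models and compare their coherence scores using gensim's CoherenceModel (higher is generally better); a sketch:
from gensim.models import CoherenceModel

# Train a model per candidate topic count and compare coherence
for k in [3, 5, 7, 10]:
    candidate = gensim.models.ldamodel.LdaModel(
        corpus, num_topics=k, id2word=dictionary, passes=10
    )
    cm = CoherenceModel(
        model=candidate, texts=cleaned_data,
        dictionary=dictionary, coherence='c_v'
    )
    print(f"num_topics={k}  c_v coherence={cm.get_coherence():.3f}")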
7. Displaying Topics
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)
- Print Topics: This displays the top 10 words in each of the 5 topics. LDA uses a probabilistic approach to assign words to topics, so each topic is represented by the words most likely to appear in that topic.
Output Example:
(0, '0.025*"apple" + 0.018*"banana" + 0.015*"market" + ...')
(1, '0.021*"company" + 0.019*"technology" + 0.017*"innovation" + ...')
- Interpretation: The output indicates that words like “apple,” “banana,” and “market” are prominent in Topic 0, while words like “company,” “technology,” and “innovation” are more frequent in Topic 1.
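Beyond the per-topic word lists, the trained model can also report the topic mixture of a single document, which is useful for spot-checking (a small sketch using gensim's get_document_topics; the numbers shown are made up):
# Topic mixture of the first snippet; each tuple is (topic_id, probability)
print(lda_model.get_document_topics(corpus[0]))
# e.g. [(0, 0.71), (3, 0.24)]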
8. Visualizing Topics using pyLDAvis
pyLDAvis.enable_notebook()
lda_vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(lda_vis)
- pyLDAvis: This tool provides an interactive visualization for topic models. It shows how the topics are distributed across documents and which words are strongly associated with each topic.
- Interactivity: You can explore each topic by clicking on it and seeing the most frequent words in that topic.
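If you are working outside a notebook, the same interactive view can be saved to a standalone HTML file with pyLDAvis's save_html:
# Save the interactive visualization to a file you can open in a browser
pyLDAvis.save_html(lda_vis, 'lda_topics.html')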
Final Thoughts:
- Adjust Number of Topics: Based on the coherence of the topics, you might want to adjust num_topics to a different number (e.g., 3, 7, 10).
- Preprocessing: If the results are not meaningful, consider improving the preprocessing step (e.g., adding more custom stopwords; a sketch follows this list).
- Further Analysis: You can explore which articles are most strongly associated with each topic and gain more insights from the model.
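For the custom-stopwords suggestion above, a minimal sketch (the extra words are purely illustrative; pick yours by eyeballing the topics for corpus-specific noise):
# Add corpus-specific noise words to the standard stopword list
custom_stops = {'said', 'new', 'york', 'time', 'would', 'one'}  # illustrative
stop_words = set(stopwords.words('english')) | custom_stops
# Then re-run preprocess() and rebuild the dictionary, corpus, and LDA model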
Gratitude
This was my first problem using NLTK. I did not perform as well as I had hoped, but I plan to solve a good 10-15 problems on it in the future to get a better understanding of this topic.
Stay tuned for Day 21!