An example seeded topic model
seeded_lda.Rmd
1. What is LDA?
Latent Dirichlet Allocation (LDA) is a probabilistic topic modelling algorithm used to discover hidden topics in a collection of documents.
Each document is represented as a mixture of topics.
Each topic is represented as a mixture of words.
Example: If you analyse news articles, LDA might find topics like "Politics," "Sports," or "Technology," each characterised by related words.
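The "mixture" idea can be sketched with illustrative numbers (these are made up, not model output):

```r
# Illustrative only: one document expressed as a mixture of three topics
doc_topic_mix <- c(Politics = 0.50, Sports = 0.25, Technology = 0.25)
# The proportions form a probability distribution over topics
sum(doc_topic_mix) # 1
```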
2. What is Seeded LDA?
Problem with Standard LDA
LDA is an unsupervised method: it does not know in advance what the topics should be, so the discovered topics may not always align with what you expect.
Solution: Seeded LDA
In Seeded LDA, we provide seed words for each topic. This helps guide the topic formation process so that topics align with meaningful categories.
Example Use Case:
If analysing scientific literature, we may seed topics with specific words:
“Climate Change” → [“warming”, “carbon”, “emissions”]
“Biodiversity” → [“species”, “habitat”, “conservation”]
3. Seeded LDA Example in R
We will use the textmineR package to perform Seeded LDA.
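Step 1: Load Required Packages
The code below relies on tm (corpus handling and cleaning) and textmineR (DTM creation and model fitting); a minimal, assumed setup is:

```r
# Install once, if needed:
# install.packages(c("tm", "textmineR"))
library(tm)        # Corpus(), tm_map(), stopwords()
library(textmineR) # CreateDtm(), FitLdaModel()
```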
Step 2: Load Example Text Data
We will use a small set of short example texts.
documents <- c(
"Climate change is causing rising temperatures and increasing emissions.",
"Renewable energy sources like solar and wind help reduce carbon footprint.",
"Biodiversity is under threat due to habitat destruction and pollution.",
"Technology companies invest in artificial intelligence and data science.",
"Deep learning and neural networks are advancing AI research."
)
Each document is a separate text entry, and we want to identify themes in them.
Step 3: Preprocess Text Data
LDA works best with cleaned text. We will:
- Convert text to lowercase
- Remove stopwords (e.g., “the”, “is”, “and”)
- Remove punctuation
- Convert text into a document-term matrix (DTM)
# Create a corpus
corpus <- Corpus(VectorSource(documents))
# Preprocess: Convert to lowercase, remove stopwords, remove punctuation
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
# Create a document-term matrix (DTM)
dtm <- CreateDtm(doc_vec = sapply(corpus, as.character),
                 doc_names = paste0("doc", 1:length(documents)),
                 ngram_window = c(1, 2)) # Unigrams & bigrams
Step 4: Define Seed Words
We guide the model using seed words for each topic:
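The seed-word definition itself does not appear above; a plausible version, matching the three themes in the example documents, is sketched below (only the first two word lists are given earlier, so the third topic's seeds are an assumption):

```r
# Named list: one character vector of seed words per topic
seed_words <- list(
  climate_change = c("warming", "carbon", "emissions"),
  biodiversity   = c("species", "habitat", "conservation"),
  technology     = c("ai", "intelligence", "data") # assumed seeds for the tech/AI topic
)
```

Step 5 uses length(seed_words) as the number of topics, so this list implies k = 3.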
Step 5: Fit Seeded LDA Model
We train the Seeded LDA model:
# Train Seeded LDA
lda_model <- FitLdaModel(dtm = dtm,
                         k = length(seed_words), # Number of topics
                         iterations = 500,
                         burnin = 100,
                         seed_words = seed_words)
Step 6: View Topic Assignments
After training, check which topics dominate each document:
lda_model$theta # Topic proportions per document
#> t_1 t_2 t_3
#> doc1 0.5772358 0.08943089 0.3333333
#> doc2 0.1578947 0.68421053 0.1578947
#> doc3 0.3009709 0.30097087 0.3980583
#> doc4 0.3333333 0.17073171 0.4959350
#> doc5 0.2520325 0.41463415 0.3333333
Here, each row is a document and each column is a topic; the values are the estimated topic proportions for that document.
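To read off each document's dominant topic, take the column with the highest proportion in each row of theta. A self-contained sketch, using the printed values above as a mock theta matrix:

```r
# theta: topic proportions per document (values copied from the output above)
theta <- rbind(
  doc1 = c(0.577, 0.089, 0.333),
  doc2 = c(0.158, 0.684, 0.158),
  doc3 = c(0.301, 0.301, 0.398),
  doc4 = c(0.333, 0.171, 0.496),
  doc5 = c(0.252, 0.415, 0.333)
)
colnames(theta) <- c("t_1", "t_2", "t_3")
# Dominant topic per document: index of the largest proportion in each row
dominant <- apply(theta, 1, which.max)
dominant # doc1: 1, doc2: 2, doc3: 3, doc4: 3, doc5: 2
```

With a fitted model, the same call would be apply(lda_model$theta, 1, which.max).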
4. How to Interpret Results?
Each topic is a probability distribution over words. Each document is a probability distribution over topics. Words that strongly define topics can be extracted:
get_top_terms <- function(phi, num_words = 10) {
  apply(phi, 1, function(topic) {
    names(sort(topic, decreasing = TRUE))[1:num_words]
  })
}
get_top_terms(lda_model$phi) # Top words for each topic
#> t_1 t_2
#> [1,] "advancing" "advancing"
#> [2,] "advancing_ai" "advancing_ai"
#> [3,] "ai" "ai"
#> [4,] "ai_research" "ai_research"
#> [5,] "artificial" "artificial"
#> [6,] "artificial_intelligence" "artificial_intelligence"
#> [7,] "biodiversity" "biodiversity"
#> [8,] "biodiversity_threat" "biodiversity_threat"
#> [9,] "carbon" "carbon"
#> [10,] "carbon_footprint" "carbon_footprint"
#> t_3
#> [1,] "advancing"
#> [2,] "advancing_ai"
#> [3,] "ai"
#> [4,] "ai_research"
#> [5,] "artificial"
#> [6,] "artificial_intelligence"
#> [7,] "biodiversity"
#> [8,] "biodiversity_threat"
#> [9,] "carbon"
#> [10,] "carbon_footprint"
5. Standard LDA vs. Seeded LDA
Feature         | Standard LDA                 | Seeded LDA
----------------|------------------------------|-----------------------------------
Topic Discovery | Unsupervised (random topics) | Semi-supervised (guided topics)
Accuracy        | Can be noisy                 | More aligned with domain knowledge
Flexibility     | Fully data-driven            | Requires predefined seed words
Use Case        | General topic discovery      | When you have expected topics