Reducing outliers in BERTopic #34

@ddenz

Description

Hi, this is a great package and it has improved the topic modelling I've been doing with BERTopic. Thanks!

However, I am encountering a problem when updating the topics to reduce the size of the outlier topic: the vectorizer doesn't expose the method BERTopic needs for this useful functionality. A code snippet is below:

```python
from bertopic import BERTopic
from keyphrase_vectorizers import KeyphraseCountVectorizer

docs = ...  # load the documents

vectorizer = KeyphraseCountVectorizer(spacy_exclude=['parser', 'ner'])

topic_model = BERTopic(vectorizer_model=vectorizer)
topics, probs = topic_model.fit_transform(docs)

new_topics = topic_model.reduce_outliers(docs, topics)
topic_model.update_topics(docs, topics=new_topics)
```

This gives: `AttributeError: 'KeyphraseCountVectorizer' object has no attribute 'build_tokenizer'`

I understand that your package does its own alternative tokenization, so I guess falling back to the default scikit-learn tokenizer from its vectorizer classes would not preserve this package's results.

It would be great if you could provide a fix or suggest a workaround for this.
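In the meantime, a stopgap I've sketched (my own idea, not part of either package, and untested against a full BERTopic run) is to monkey-patch a `build_tokenizer` method onto the vectorizer using scikit-learn's default token pattern. Note this falls back to plain word tokens wherever BERTopic calls the tokenizer, so the keyphrase tokenization is not preserved at that step:

```python
import re

# Stopgap sketch: provide the build_tokenizer method BERTopic expects,
# using scikit-learn's default token pattern (words of 2+ word characters).
# This yields plain word tokens, NOT keyphrases.
_token_pattern = re.compile(r"(?u)\b\w\w+\b")

def build_tokenizer(self):
    # Return a callable that splits text into word-level tokens.
    return _token_pattern.findall

# Monkey-patch onto the class before calling reduce_outliers:
# KeyphraseCountVectorizer.build_tokenizer = build_tokenizer
```

Alternatively, since `reduce_outliers` accepts a `strategy` argument, a strategy that doesn't tokenize the documents (e.g. `strategy="embeddings"`) might sidestep the missing method entirely, though I haven't verified that.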

Metadata

Labels

enhancement (New feature or request)
