Reducing outliers in BERTopic #34

@ddenz

Description

Hi, this is a great package and it has improved the topic modelling I've been doing with BERTopic. Thanks!

However, I am encountering a problem when updating the topics to reduce the size of the outlier topic: the vectorizer doesn't expose the method BERTopic needs for this useful functionality. A code snippet is below:

```python
from bertopic import BERTopic
from keyphrase_vectorizers import KeyphraseCountVectorizer

docs = ...  # load the documents

vectorizer = KeyphraseCountVectorizer(spacy_exclude=['parser', 'ner'])

topic_model = BERTopic(vectorizer_model=vectorizer)
topics, probs = topic_model.fit_transform(docs)

new_topics = topic_model.reduce_outliers(docs, topics)
topic_model.update_topics(docs, topics=new_topics)
```

This gives: `AttributeError: 'KeyphraseCountVectorizer' object has no attribute 'build_tokenizer'`

I understand that your package does its own alternative tokenization, so I guess falling back to the default scikit-learn tokenizer from its vectorizer classes would not preserve this package's results.

It would be great if you could provide a fix or suggest a workaround for this.
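In the meantime, a stopgap I've sketched (my own idea, not part of either package, and untested against a full BERTopic run) is to monkey-patch a `build_tokenizer` method onto the vectorizer using scikit-learn's default token pattern. Note this falls back to plain word tokens wherever BERTopic calls the tokenizer, so the keyphrase tokenization is not preserved at that step:

```python
import re

# Stopgap sketch: provide the build_tokenizer method BERTopic expects,
# using scikit-learn's default token pattern (words of 2+ word characters).
# This yields plain word tokens, NOT keyphrases.
_token_pattern = re.compile(r"(?u)\b\w\w+\b")

def build_tokenizer(self):
    # Return a callable that splits text into word-level tokens.
    return _token_pattern.findall

# Monkey-patch onto the class before calling reduce_outliers:
# KeyphraseCountVectorizer.build_tokenizer = build_tokenizer
```

Alternatively, since `reduce_outliers` accepts a `strategy` argument, a strategy that doesn't tokenize the documents (e.g. `strategy="embeddings"`) might sidestep the missing method entirely, though I haven't verified that.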

Metadata

Labels

enhancement (New feature or request)
