Hi, this is a great package and it has improved topic modelling I've been doing with BERTopic. Thanks!
However, I am encountering a problem when updating the topics to reduce the size of the outlier topic: KeyphraseCountVectorizer seems to be missing a method that BERTopic needs for this functionality. A code snippet is below:
from bertopic import BERTopic
from keyphrase_vectorizers import KeyphraseCountVectorizer
docs = ... # load the documents
vectorizer = KeyphraseCountVectorizer(spacy_exclude=['parser', 'ner'])
topic_model = BERTopic(vectorizer_model=vectorizer)
topics, probs = topic_model.fit_transform(docs)
new_topics = topic_model.reduce_outliers(docs, topics)
topic_model.update_topics(docs, topics=new_topics)
This gives: AttributeError: 'KeyphraseCountVectorizer' object has no attribute 'build_tokenizer'
I understand that your package does alternative tokenization, so I guess using the default scikit-learn tokenizer from their vectorizer classes would not preserve the results of your package.
It would be great if you could provide a fix or suggest a workaround for this.
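In the meantime, a possible stopgap I considered (purely a sketch, and possibly wrong about your internals): scikit-learn's CountVectorizer.build_tokenizer() returns a callable that maps a document to a list of tokens, and BERTopic appears to expect that same contract. One could attach a fallback build_tokenizer to the vectorizer class. The snippet below demonstrates the monkeypatch pattern on a stand-in class (FakeKeyphraseVectorizer is illustrative, not your real class) so it runs without the libraries installed; note that a whitespace split would only avoid the AttributeError, it would not reproduce the keyphrase tokenization:

```python
class FakeKeyphraseVectorizer:
    """Stand-in for KeyphraseCountVectorizer, which lacks build_tokenizer."""
    pass

def build_tokenizer(self):
    # Fallback tokenizer following scikit-learn's contract:
    # build_tokenizer() returns a callable doc -> list of tokens.
    # A plain whitespace split avoids the crash but loses keyphrases.
    return lambda doc: doc.split()

# Monkeypatch the missing method onto the class.
FakeKeyphraseVectorizer.build_tokenizer = build_tokenizer

tok = FakeKeyphraseVectorizer().build_tokenizer()
print(tok("topic modelling with BERTopic"))
# → ['topic', 'modelling', 'with', 'BERTopic']
```

With the real KeyphraseCountVectorizer, the same assignment would presumably let update_topics run, but since the tokens would no longer be keyphrases, a proper fix inside the package would be much better.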