Divide by zero error when trying to use KeyphraseCountVectorizer with BERTopic #26

@Pratik--Patel

I am trying to use KeyphraseCountVectorizer with BERTopic, following the example provided here: https://github.com/TimSchopf/KeyphraseVectorizers#topic-modeling-with-bertopic-and-keyphrasevectorizers

from keyphrase_vectorizers import KeyphraseCountVectorizer
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# load text documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# only use subset of the data
docs = docs[:5000]

# train topic model with KeyphraseCountVectorizer
keyphrase_topic_model = BERTopic(vectorizer_model=KeyphraseCountVectorizer())
keyphrase_topics, keyphrase_probs = keyphrase_topic_model.fit_transform(docs)

This produces the following warning:

RuntimeWarning: divide by zero encountered in true_divide
  idf = np.log((avg_nr_samples / df)+1)

I have run it on several datasets of different sizes, and the error occurs every time.
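
For reference, the warning itself can be reproduced in isolation with NumPy. This is only a minimal sketch: it assumes that `df` is an array of per-term frequencies and that at least one entry is zero. Whether that is what actually happens inside BERTopic is an assumption, not something confirmed here. The names `df` and `avg_nr_samples` are taken from the warning line; the helper `idf_from_counts` and all values are made up for illustration.

```python
import warnings

import numpy as np


def idf_from_counts(df, avg_nr_samples):
    """Evaluate the expression from the warning line: log(avg_nr_samples / df + 1)."""
    df = np.asarray(df, dtype=float)
    return np.log((avg_nr_samples / df) + 1)


# Hypothetical term frequencies; the last term never occurs (df == 0).
df = np.array([4.0, 2.0, 0.0])
avg_nr_samples = 3.0

# Record warnings instead of printing them so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    idf = idf_from_counts(df, avg_nr_samples)

# True if NumPy reported a division by zero while computing idf.
zero_division_warned = any("divide by zero" in str(w.message) for w in caught)

print(idf)
print(zero_division_warned)
```

In this sketch the zero entry yields an infinite value in `idf` rather than an exception, which would be consistent with the message being a `RuntimeWarning` rather than a hard failure.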

I have also posted the same question in the BERTopic repository: MaartenGr/BERTopic#1050

Any ideas about what causes this would be very helpful.

Labels

bug: Something isn't working
