afafelwafi/TweetsPopularity

TweetsPopularity

I chose this subject in my third year of school because it covers the whole landscape, from abstract statistical concepts to very practical computer science skills.

Project's idea:

We want to detect, as soon as possible, tweets that are likely to become popular, where the popularity of a tweet is defined as the number of times it will be retweeted, i.e. forwarded by users to their followers. Because retweets propagate with a cascade effect, a tweet and all the retweets it triggers build up what is hereafter called a cascade. Once we are able to predict the popularity of tweets, it is straightforward to identify the most promising ones, those with the highest expected popularity. To summarize, the problem is to guess the final size of cascades just by observing the beginning of these cascades during a given observation time window, say the first ten minutes.
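To make the problem statement concrete, here is a minimal sketch of a cascade and of what an observer sees inside an observation window. The `Cascade` class, its field names, and the choice of seconds as the time unit are assumptions made for illustration; they are not taken from the project code.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Cascade:
    """Hypothetical container: one tweet plus the retweets it triggered."""

    tweet_id: str
    # Retweet times in seconds after the original tweet, sorted ascending.
    retweet_times: list[float]

    def popularity(self) -> int:
        # Popularity as defined above: the number of retweets.
        return len(self.retweet_times)

    def observed(self, window: float) -> list[float]:
        # Partial time series: only the retweets seen within the window.
        return self.retweet_times[: bisect_right(self.retweet_times, window)]


cascade = Cascade("t1", [12.0, 95.0, 340.0, 610.0, 2400.0])
partial = cascade.observed(600)  # the first ten minutes
```

The prediction task is then to estimate `cascade.popularity()` when only `partial` is known.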

Architecture:

image

Components:

  • The tweet generator simulates the real-time arrival of tweets by reading them from files and posting them into a Kafka topic called tweets. The C++ source code of the generator is provided and ready to use. Only one node runs the generator code.
  • The logger is also provided and ready to use. It is meant for debugging and setting up the application: it collects and displays log messages from the processing nodes you develop (i.e. collector, estimator, predictor, learner) so that you can better understand the state of your distributed application.
  • The tweet collector is in charge of gathering retweets and grouping them into cascades before delivering them to the subsequent processing stages. First, it transmits the partial time series of cascades to the estimator through the topic cascade_series so that the final size of each cascade can be estimated. Second, it transmits the final size of finished cascades through the topic cascade_properties so that the prediction error can be computed and a training dataset can be built to train random forests.
  • The Hawkes estimator is in charge of estimating the parameters of a Hawkes process given its partial time series. This is likely the critical node, as it runs an optimization algorithm (see ModStat labwork) that can be relatively slow for large cascades. Once the parameters have been estimated, their values are posted into the topic cascade_properties.
  • The predictor is in charge of predicting the popularity of a tweet, given the parameters of the underlying Hawkes process estimated and sent by the estimator nodes. In version V1 this prediction step could have been done directly in the estimator nodes (it takes a single line of code), but in V1' the prediction also integrates random forests that are trained offline. For this reason, the estimation and prediction tasks have been split into two distinct stages. Every predictor therefore embeds a random forest that is received from the learner and trained specifically for a given observation window. We want to manage multiple observation windows of different lengths, and we want every predictor to hold a single random forest: this keeps the memory footprint of predictors small and makes it easy to introduce a new observation window by dynamically adding a new predictor. Every predictor node is therefore specialized for one observation length, typically 5, 10, 20 or 30 minutes for the News dataset (the appropriate observation windows depend on the processed tweets). Once the popularity has been computed, the predictor posts it to the alerts topic. Later on, when the predictor learns from the collector the real size of the cascade, it uses this information to submit a new training example to the learner through the topic samples. In addition, it submits a new prediction error to the topic stats. Finally, each predictor refreshes its random forest model as soon as a new one corresponding to its observation length becomes available in the topic models.
  • The learner stage is in charge of collecting training samples and learning random forest models for the various observation lengths. The learner and predictor stages have been split because the predictor must react in real time and stay responsive, whereas learning random forests can work offline but might occasionally require a large amount of processing time. Learners periodically train new models from the set of samples collected so far and post these models into the topic models to broadcast them to predictors. It’s up to you to define a criterion for deciding when to run a new training.
  • The dashboard is a tool that dynamically displays the hot topics posted in the alerts topic, i.e. the tweets with the highest predicted popularity. If you have enough time, alerts can optionally be sent to a visualization tool such as Grafana through the Loki log management tool.
  • The monitor is another tool that tracks the performance of your system, i.e. the evolution over time of the prediction errors posted in the stats topic. Again, if you have enough time, the prediction errors can be sent to Grafana through Prometheus, a tool for collecting and querying time series.
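The message flow around a predictor node can be sketched in memory, without Kafka. This is only an illustration under stated assumptions: the `Predictor` class, the message fields (`cid`, `window`, `params`, `size`), and the relative-error metric are hypothetical, and the `predict` callable stands in for the Hawkes-based prediction combined with the random forest. Only the topic names (alerts, samples, stats) come from the description above.

```python
from typing import Callable


class Predictor:
    """In-memory sketch of one predictor node for a single observation window."""

    def __init__(self, window: int, predict: Callable[[dict], float]):
        self.window = window    # observation length handled by this node
        self.predict = predict  # placeholder for Hawkes formula + random forest
        self.pending = {}       # cascade id -> (params, predicted size)

    def on_parameters(self, msg: dict) -> list[tuple[str, dict]]:
        """Handle estimated parameters; return (topic, message) pairs to post."""
        if msg["window"] != self.window:
            return []  # another predictor is specialized for this window
        predicted = self.predict(msg["params"])
        self.pending[msg["cid"]] = (msg["params"], predicted)
        alert = {"cid": msg["cid"], "window": self.window, "predicted": predicted}
        return [("alerts", alert)]

    def on_size(self, msg: dict) -> list[tuple[str, dict]]:
        """Handle the real cascade size; emit a training sample and an error."""
        entry = self.pending.pop(msg["cid"], None)
        if entry is None:
            return []  # no prediction was made for this cascade
        params, predicted = entry
        true_size = msg["size"]
        # Assumed metric: absolute relative error, guarded against size 0.
        error = abs(predicted - true_size) / max(true_size, 1)
        sample = {"window": self.window, "params": params, "size": true_size}
        stat = {"cid": msg["cid"], "window": self.window, "error": error}
        return [("samples", sample), ("stats", stat)]


# Toy prediction rule, for demonstration only.
node = Predictor(600, lambda params: 2 * params["n_obs"])
alerts = node.on_parameters({"cid": "t1", "window": 600, "params": {"n_obs": 50}})
feedback = node.on_size({"cid": "t1", "size": 80})
```

Returning `(topic, message)` pairs instead of publishing directly keeps the routing logic testable on its own; a real node would pass these pairs to a Kafka producer.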
