With the rise of original content from streaming giants, our company seeks to launch a new movie studio. However, entering the film industry without prior experience introduces financial risks. This project analyzes box office data to uncover what types of films perform best and provides actionable insights to guide content, budgeting, and release strategies.
The objective is to help studio leadership make informed decisions on film selection and production through data. The analysis identifies factors associated with high-grossing movies, such as genre, release timing, ratings, and budget efficiency.
The analysis combines five public datasets and an IMDb SQLite database to provide a rich foundation for exploration.
- The Numbers (`tn.movie_budgets.csv`): Production budgets and gross earnings.
- Box Office Mojo (`bom.movie_gross.csv`): Studio-level revenue by year.
- Rotten Tomatoes (`rt.movie_info.tsv` & `rt.reviews.tsv`): Film metadata and critic reviews.
- TMDB (`tmdb.movies.csv`): Popularity scores, ratings, genres.
- IMDb (`im.db`): Movie metadata, ratings, and relationships.
Each dataset underwent cleaning and preprocessing including:
- Currency symbol removal and numeric conversion.
- Null value imputation or dropping based on column importance.
- Genre normalization and date standardization.
- Creation of engineered fields: `profit`, `ROI`.
- Merging datasets on title, release year, or ID.
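
The currency cleanup and engineered fields can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the column names (`production_budget`, `worldwide_gross`), the sample rows, and the ROI definition (profit divided by production budget) are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample rows; the real column names in tn.movie_budgets.csv may differ.
raw = pd.DataFrame({
    "movie": ["Film A", "Film B"],
    "production_budget": ["$10,000,000", "$2,500,000"],
    "worldwide_gross": ["$35,000,000", "$1,000,000"],
})

def to_number(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from currency strings and convert to numbers."""
    stripped = series.str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns unparseable values into NaN instead of raising
    return pd.to_numeric(stripped, errors="coerce")

def add_profit_and_roi(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric money columns plus profit and ROI."""
    out = df.copy()
    for col in ("production_budget", "worldwide_gross"):
        out[col] = to_number(out[col])
    out["profit"] = out["worldwide_gross"] - out["production_budget"]
    # Assumed definition: ROI = profit / production budget
    out["ROI"] = out["profit"] / out["production_budget"]
    return out

cleaned = add_profit_and_roi(raw)
print(cleaned[["movie", "profit", "ROI"]])
```

Expressing ROI as a ratio rather than a percentage is a choice made here for simplicity; either convention works as long as it is applied consistently across the merged datasets.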
Data profiling, missing value checks, duplicate handling, and outlier detection were used to ensure the reliability of the insights. Rows missing critical values were dropped or imputed, and outliers were filtered using the interquartile range (IQR).
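
As an illustration of IQR-based outlier filtering, here is a small sketch. The helper name `filter_iqr`, the column name, and the conventional 1.5 multiplier are assumptions; the project does not state which threshold was used.

```python
import pandas as pd

def filter_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends by default
    return df[df[column].between(lower, upper)]

# Hypothetical values (e.g. gross in millions) with one obvious outlier
sample = pd.DataFrame({"worldwide_gross": [10, 12, 11, 13, 12, 500]})
filtered = filter_iqr(sample, "worldwide_gross")
print(filtered)
```

Because box office figures are heavily right-skewed, a filter like this can also remove genuine blockbusters, so it is worth checking which rows are dropped before relying on the filtered data.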
- Genres: Thriller and mixed-genre films show the highest average profits.
- Ratings: Higher IMDb ratings correlate with increased worldwide gross.
- Budgets: ROI is not guaranteed by large budgets; an optimal mid-range budget is often more effective.
- Seasonality: Summer and winter holidays are prime periods for box office performance.
- Statistical Tests:
  - ANOVA indicated that genre and rating are associated with significant differences in revenue and vote count.
  - Pearson correlation showed a weak negative relationship between runtime and average rating.
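
The two tests named above can be run with SciPy along these lines. The data below is invented purely to make the example runnable and does not reproduce the project's results; the column names are likewise assumptions.

```python
import pandas as pd
from scipy import stats

# Made-up illustrative data, not taken from the project datasets
movies = pd.DataFrame({
    "genre": ["Thriller"] * 3 + ["Comedy"] * 3 + ["Drama"] * 3,
    "worldwide_gross": [120.0, 135.0, 128.0, 80.0, 95.0, 88.0, 40.0, 52.0, 47.0],
    "runtime_minutes": [95, 110, 102, 88, 120, 131, 99, 140, 105],
    "averagerating": [7.1, 6.8, 7.0, 6.9, 6.5, 6.2, 7.2, 6.0, 6.7],
})

# One-way ANOVA: does mean gross differ across genres?
groups = [g["worldwide_gross"].to_numpy() for _, g in movies.groupby("genre")]
f_stat, anova_p = stats.f_oneway(*groups)

# Pearson correlation between runtime and average rating
r, pearson_p = stats.pearsonr(movies["runtime_minutes"], movies["averagerating"])

print(f"ANOVA: F={f_stat:.2f}, p={anova_p:.4f}")
print(f"Pearson: r={r:.2f}, p={pearson_p:.4f}")
```

One-way ANOVA assumes roughly normal residuals and similar variances across groups; with skewed revenue data, a log transform or a non-parametric alternative such as `scipy.stats.kruskal` may be worth considering.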
- Prioritize High-Yield Genres: Thriller, Comedy, and genre blends.
- Invest in Quality: Critical acclaim is linked to higher financial success.
- Time Releases Strategically: Target summer and winter holidays for maximum exposure.
Talanta-hela
┣ data
┃ ┣ tn.movie_budgets.csv
┃ ┣ bom.movie_gross.csv
┃ ┣ rt.movie_info.tsv
┃ ┣ rt.reviews.tsv
┃ ┣ tmdb.movies.csv
┃ ┗ im.db
┣ visuals
┃ ┣ freshness_by_genre.png
┃ ┗ genre_distribution.png
┣ notebooks
┃ ┣ StatisticalTesting_on_hypotheses.ipynb
┃ ┗ data_validation.ipynb
┣ outputs
┃ ┗ final_report.docx
┗ README.md

- Python (Pandas, NumPy)
- Matplotlib, Seaborn, Plotly (Visualizations)
- SQLite (IMDb)
- SciPy (ANOVA, correlation tests)
- Jupyter Notebooks
This project is provided for educational and exploratory purposes.