This repository contains a comprehensive text analysis pipeline developed in Python for extracting insights from a collection of URLs.
The `extraction.py` script extracts text content from a list of URLs. When a URL returns a 404 error, the script records that URL in an error log.
- Install Python: Ensure you have Python installed on your system. If not, download and install the latest version of Python.
- Open Terminal/Command Prompt: Launch your terminal or command prompt.
- Navigate to File Directory: Use the `cd` command to navigate to the directory where the `extraction.py` file is located.
- Execute Script: Run the script by entering the following command: `python extraction.py`
- Complete Execution: Allow the script to complete the data extraction process. Make sure all necessary dependencies are installed prior to execution.
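
To make the extraction step more concrete, the following is a minimal sketch of fetching pages and collecting the URLs that return a 404. It is not the code in `extraction.py`: it uses only the standard library, and the names `html_to_text`, `fetch_html`, `extract_all`, `write_error_log`, and the file name `error_log.txt` are assumptions made for illustration.

```python
import urllib.error
import urllib.request
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Reduce an HTML document to its text content, joined by spaces."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_html(url, timeout=10):
    """Download a page; urllib raises HTTPError for 4xx/5xx responses."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_all(urls, fetch=fetch_html):
    """Return ({url: text}, [urls that returned 404])."""
    texts, not_found = {}, []
    for url in urls:
        try:
            texts[url] = html_to_text(fetch(url))
        except urllib.error.HTTPError as err:
            if err.code == 404:
                not_found.append(url)  # recorded for the error log
            else:
                raise  # other HTTP errors are not handled in this sketch
    return texts, not_found


def write_error_log(not_found, path="error_log.txt"):
    """Write one failed URL per line (hypothetical log file name)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(not_found))
```

The `fetch` parameter is injectable so the 404 handling can be exercised without network access. Only 404 responses are logged here, matching the behavior described above; how the actual script treats other failures is not stated.
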
The `analysis.py` script uses the extracted text to perform sentiment analysis, calculate readability metrics, and write the resulting metrics to an Excel output file.
- Install Required Libraries: Ensure that all necessary Python libraries and dependencies (such as pandas, textblob, nltk, etc.) are installed on your system.
- Open Terminal/Command Prompt: Launch your terminal or command prompt.
- Navigate to File Directory: Use the `cd` command to navigate to the directory where the `analysis.py` file is located.
- Execute Script: Run the script by entering the following command: `python analysis.py`
- Wait for Completion: Allow the script to run and complete the analysis process. Make sure that any required data files, such as input data or text files, are available in the specified directories as indicated within the script.
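
As a rough illustration of the kind of metrics the analysis step produces, the sketch below computes a word-list sentiment polarity and the Gunning Fog index for one text. It is a stand-in, not the logic of `analysis.py`: it avoids third-party libraries so it runs on its own, the tiny word lists are placeholders, the choice of the Fog index is an assumption (the specific readability metrics are not listed above), and the name `analyze_text` is hypothetical.

```python
import re

# Hypothetical, deliberately tiny word lists; a real run would use a full lexicon.
POSITIVE_WORDS = {"good", "great", "excellent", "positive", "success"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "negative", "failure"}


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def analyze_text(text):
    """Return a dict of simple sentiment and readability metrics, or None if empty."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        return None

    positive = sum(w in POSITIVE_WORDS for w in words)
    negative = sum(w in NEGATIVE_WORDS for w in words)
    scored = positive + negative
    # Polarity lies in [-1, 1]; 0.0 when no listed sentiment word occurs.
    polarity = (positive - negative) / scored if scored else 0.0

    # "Complex" words are those with more than two syllables.
    complex_words = sum(count_syllables(w) > 2 for w in words)
    avg_sentence_length = len(words) / len(sentences)
    pct_complex = 100 * complex_words / len(words)
    # Gunning Fog index: 0.4 * (average sentence length + % complex words).
    fog_index = 0.4 * (avg_sentence_length + pct_complex)

    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "polarity": polarity,
        "complex_word_count": complex_words,
        "avg_sentence_length": avg_sentence_length,
        "fog_index": fog_index,
    }
```

One such dict per URL could be collected into a list and written out with `pandas.DataFrame(rows).to_excel("output.xlsx", index=False)`, where `rows` and the file name are again assumptions; writing `.xlsx` files with pandas requires an Excel engine such as openpyxl.
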
Following these instructions will help you to successfully execute the extraction.py and analysis.py scripts and perform the desired data extraction and analysis tasks.