The AI Data Extraction Dashboard is a web-based tool that lets users upload a CSV file or connect to their Google Sheets account, select specific columns, and extract information from web search results using AI. The tool uses the Google Sheets API to retrieve spreadsheet data, integrates with SerpApi for web search, and leverages the Groq API to extract specific information from the results. The extracted data can be downloaded in CSV format for further analysis.
- Upload CSV files or connect to Google Sheets dynamically.
- Select a column for processing and enter query templates to fetch specific information from web search results.
- Use AI (Groq API) to process search results and extract structured data.
- Download the extracted data as a CSV file.
To get started with the project, follow the steps below.
- Python 3.7 or higher
- Install the required Python packages using the following command:

```bash
pip install -r requirements.txt
```

The `requirements.txt` should include the following libraries:

```text
streamlit
pandas
serpapi
requests
python-dotenv
google-auth
google-auth-oauthlib
google-api-python-client
```
Ensure you have the following environment variables set up:
- `SERPAPI_KEY`: Your SerpApi key for performing web searches.
- `GROQ_API_KEY`: Your Groq API key for extracting information from the web search results.
To set these variables, you can create a `.env` file in the project root and add your keys:

```text
SERPAPI_KEY=your_serpapi_key
GROQ_API_KEY=your_groq_api_key
```
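A minimal sketch of how the app might read these keys at startup; the helper name `require_api_keys` is illustrative, not from the project:

```python
import os

# If python-dotenv is installed, load variables from the .env file;
# otherwise fall back to whatever is already in the environment.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass


def require_api_keys():
    """Return both keys, raising a clear error if either is missing."""
    missing = [name for name in ("SERPAPI_KEY", "GROQ_API_KEY")
               if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return os.getenv("SERPAPI_KEY"), os.getenv("GROQ_API_KEY")
```

Failing fast with a named list of missing variables makes misconfiguration obvious before any search or extraction work begins.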
- Create a project in the Google Developer Console.
- Enable the Google Sheets API.
- Download the `credentials.json` file and place it in the project root.
- When running the project for the first time, it will prompt you to authenticate using your Google account.
Once everything is set up, you can run the application using Streamlit:
```bash
streamlit run app.py
```

This will launch the dashboard in your browser, where you can interact with the tool.
- CSV Upload: Click on the "Upload CSV file" button to upload your file. The dashboard will display the first few rows of the CSV for review.
- Google Sheets: Enter the Google Sheets ID and the range (e.g., `Sheet1!A1:Z100`) to fetch data directly from your Google Sheets.
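The Sheets API returns the requested range as a list of rows. A stdlib-only sketch of turning that payload into records, assuming the first row holds the column headers (the helper name is illustrative):

```python
def values_to_records(values):
    """Convert the Sheets API 'values' payload (a list of rows, where the
    first row holds the column headers) into a list of dicts.
    Short rows are padded with empty strings so every record has all columns."""
    if not values:
        return []
    header, *rows = values
    return [
        dict(zip(header, row + [""] * (len(header) - len(row))))
        for row in rows
    ]
```

Padding matters in practice: the Sheets API omits trailing empty cells, so rows can come back shorter than the header.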
- Column Selection: Select the column from your dataset that contains the entities you want to search for (e.g., names or keywords).
- Information Type: Enter the type of information you're looking for (e.g., emails, phone numbers).
- Query Template: Define the template for the query you want to generate. For example: "Get the {information} of {entity}".
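The template expands once per entity via ordinary string substitution; a minimal sketch (function name illustrative, not from the project):

```python
def build_queries(entities, template, information):
    """Fill the {entity} and {information} placeholders for each entity."""
    return [template.format(entity=e, information=information) for e in entities]
```

Each generated string becomes one web search query in the processing step that follows.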
The tool will perform web searches using SerpApi to retrieve organic search results, then use Groq API to extract the specified information from those results. The extraction will be done in parallel for each entity.
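A sketch of the parallel pattern described above, using a thread pool; `process_entity` stands in for the real search-plus-extraction step and is an assumption, not the project's actual function:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(entities, process_entity, max_workers=5):
    """Run process_entity(entity) concurrently and collect results per entity.
    A failure for one entity is recorded instead of aborting the whole batch."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_entity, e): e for e in entities}
        for future in as_completed(futures):
            entity = futures[future]
            try:
                results[entity] = future.result()
            except Exception as exc:
                results[entity] = f"error: {exc}"
    return results
```

Threads suit this workload because each task is dominated by network I/O to SerpApi and Groq, not CPU time.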
After processing, the extracted data will be displayed on the dashboard. You can download the results in CSV format by clicking the "Download Extracted Data" button.
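Under the hood, a download like this typically serializes the result rows to CSV bytes that are handed to Streamlit's `st.download_button`; a stdlib-only sketch (names illustrative):

```python
import csv
import io


def rows_to_csv_bytes(rows, fieldnames):
    """Serialize a list of dicts to UTF-8 CSV bytes suitable for a download button."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")
```

In the app this would be wired up roughly as `st.download_button("Download Extracted Data", data=csv_bytes, file_name="results.csv", mime="text/csv")`.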
Used for retrieving data from Google Sheets, allowing users to dynamically fetch data based on the Google Sheets ID and range. The project uses the Google API Client to authenticate and interact with Google Sheets.
SerpApi is used for retrieving organic search results from Google. It allows us to perform web searches programmatically, which can then be used for further processing.
The Groq API is used for extracting information from web search results. It processes the search result data using a language model to extract structured information based on the user-defined query template.
While the core features align with the outlined requirements, several additional functionalities have been incorporated into the project to enhance user experience and flexibility:
Feature: A dynamic description of the dataset is generated based on the columns of the uploaded CSV or Google Sheet. Purpose: This feature helps users understand the structure and content of the dataset before proceeding with further actions, enhancing usability.
Feature: The web search and LLM processing are handled in parallel for multiple entities to speed up the data retrieval and extraction process. Purpose: This enables quicker processing of large datasets, ensuring a smooth user experience even with many entities.
Feature: A retry mechanism is implemented for both web searches and LLM processing to handle issues like rate limits or temporary failures in API calls. Purpose: Ensures reliability and robustness of the system, reducing the chances of failures due to external service interruptions or rate-limiting issues.
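One common shape for such a retry mechanism is exponential backoff; this is a sketch of the pattern, not the project's exact implementation:

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff
    (base_delay, 2*base_delay, 4*base_delay, ...).
    Re-raises the last exception once all attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Backing off exponentially is the usual answer to rate limits: each retry waits longer, giving the external service time to recover.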
Feature: Detailed error messages and status updates are provided to users throughout the process. Purpose: Users are kept informed in case of errors, which helps them understand and correct any issues with their input or the system.
Feature: A progress bar tracks the web search and extraction steps as they run. Purpose: Users can follow processing progress through a visual, dynamically updating progress bar in the application.
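In Streamlit such a bar is typically driven by `st.progress`; a sketch that decouples the processing loop from the UI so the reporting callback can be anything (names illustrative):

```python
def run_with_progress(items, worker, report_progress):
    """Process items one by one, reporting completion as a 0.0-1.0 fraction.
    In the app, report_progress would wrap a Streamlit progress bar update."""
    results = []
    total = len(items)
    for index, item in enumerate(items, start=1):
        results.append(worker(item))
        report_progress(index / total)
    return results
```

Keeping the callback generic also makes the loop easy to unit-test without a running Streamlit session.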
This README provides an overview of how to set up, use, and understand the dependencies of the AI Data Extraction Dashboard project.