A tool to scrape LinkedIn profiles and company pages.
-
Clone this repository
-
Create a virtual environment and activate it
-
Install uv (if not already installed):
# On macOS and Linux curl -LsSf https://astral.sh/uv/install.sh | sh # On Windows powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
-
Install the required dependencies using uv:
# Install project dependencies uv install # Or synchronize all dependencies uv sync
-
Ensure your virtual environment is activated after running
uv sync:# On macOS/Linux source venv/bin/activate # On Windows .\venv\Scripts\activate
Before using the scraper, you need to set up your LinkedIn credentials in the .env file:
- Copy the
.env.examplefile to.env - Edit the
.envfile and add your LinkedIn username and password:LINKEDIN_USERNAME=your_email@example.com LINKEDIN_PASSWORD=your_password - Register at api.together.xyz to get an API key
- Add your Together API key to the
.envfile:TOGETHER_API_KEY=your_api_key
The LinkedIn scraper can be used to scrape either a LinkedIn profile or a company page.
The scraper accepts the following command line arguments:
--profile: Specify that you want to scrape a LinkedIn profile 👤--company: Specify that you want to scrape a LinkedIn company page 🏢--name: Specify the profile username or company name to scrape (required)--llm: Enable LLM-powered extraction to improve data quality 🤖
Note: You must use either --profile or --company, but not both.
-
To scrape a LinkedIn profile:
python run.py --profile --name username
-
To scrape a LinkedIn company page:
python run.py --company --name companyname
-
To scrape a LinkedIn profile with LLM extraction:
python run.py --profile --name username --llm
-
To scrape a LinkedIn company page with LLM extraction:
python run.py --company --name companyname --llm
The scraped data will be saved to the data/ directory in JSON format:
- Profiles:
data/profile_username.json - Companies:
data/company_companyname.json
The scraper includes an advanced LLM (Large Language Model) extraction feature that significantly improves data quality:
- How it works: Rather than relying purely on traditional HTML parsing, the
--llmflag activates the LLM extractor that processes the raw HTML through the Together AI platform. - Advantages:
- Extracts structured data even from complex HTML layouts
- Better handles dynamic content and varied page structures
- Produces more consistent JSON output format
- Intelligently identifies relevant information from profile/company pages
The LLM extractor uses the DeepSeek-V3 model via the Together API to transform raw HTML into well-structured profile or company data.
To test the LLM extraction functionality:
pytest tests/test_llm.pyThis will verify that the LLM can properly extract information from HTML content.
- ⏱️ The scraper uses browser automation to navigate LinkedIn, so it may take some time to complete.
- 🔒 LinkedIn may occasionally show CAPTCHA or verification screens, which can cause the scraping to fail.
⚠️ Excessive use of this tool may lead to your LinkedIn account being temporarily restricted.
This project utilizes or is inspired by these excellent open-source projects:
- uv - An extremely fast Python package and project manager, written in Rust.
- python-blueprint - A best practices template for Python projects.