🏥 DATASUS Hospitalization Analysis (BI Project)

A comprehensive Business Intelligence solution to extract, transform, and analyze public hospitalization data (SIHSUS) from the Brazilian Health System (DATASUS).

Scope: State of Bahia (BA) | Period: 2019 to 2024.

🏗️ Project Architecture

This project follows a complete End-to-End Data Engineering pipeline:

Extraction: Collecting .dbc files from DATASUS public FTP.
Transformation (Bronze Layer): Converting binary .dbc to .dbf (using Tabwin) and then to .csv (using Python/Pandas).
Loading (Silver Layer): Ingesting raw CSV data into PostgreSQL (public schema).
Modeling (Gold Layer): Transforming data into a Star/Snowflake Schema Data Warehouse (dw schema).
Visualization: Interactive Dashboards using Microsoft Power BI.

🚀 1. Setup & Installation

This project uses Python. First, create a virtual environment and install dependencies.

On Windows

python -m venv venv
.\venv\Scripts\activate
pip install -r requirements.txt

On macOS and Linux

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

⚙️ 2. ETL Process Pipeline

The project follows a linear pipeline to process public health data.

Step 1: Extraction (Download)

We acquire the raw data files from the official DATASUS FTP server.

Source: DATASUS File Transfer Portal
System: SIHSUS (Hospital Information System).
File Type: RD (Reduced AIH) - Provides hospitalization details.

Data Scope:

Location: State of Bahia (BA).
Period: Jan/2019 to Dec/2024.
Format: Compressed binary files (.dbc).

Step 2: Decompression (DBC to DBF)

DATASUS files use a proprietary compression format (.dbc) that is not natively readable by standard data tools.

Use the TabWin software (official tool from DATASUS).
Navigate to Arquivo > Comprime/Expande .DBF.
Select the downloaded .dbc files.
Save the resulting .dbf files into the input_dbf/ project directory.

Step 3: Conversion (DBF to CSV)

We use Python to convert the legacy .dbf format into standard .csv files, handling character encoding (Latin-1) and delimiters automatically.

Ensure your .dbf files are in input_dbf/ and run:

python src/convert_dbf_to_csv.py

Output: Clean .csv files will be generated in the output_csv/ folder.

Step 4: Loading to Database

Finally, we ingest the CSV data into the PostgreSQL database. This script reads the files, cleans header names, and inserts them into the public schema (Staging Area).

python src/import_data.py

🗄️ 3. Data Warehouse Modeling

In this stage, raw data from the Staging Area (public schema) is transformed into a structured Dimensional Model inside the dw schema. This structure is optimized for high-performance analytical queries (OLAP).

Architecture Strategy

We implemented a Hybrid Star/Snowflake Schema. While most dimensions link directly to the Fact table, the Establishment dimension is normalized to link with Location, ensuring data integrity and reducing redundancy.

Key Components

Fact Table (Fato_Internacoes): Contains over 5 million records. It stores quantitative metrics (Cost, ICU Value, Days of Stay) and foreign keys to all dimensions.
- Granularity: One row per hospitalization authorization (AIH).
Dimensions:
- Dim_Pessoa: Stores patient demographics. We applied strict data cleaning rules here to standardize Age Units and Race/Color codes (mapping legacy codes like '1' to standard '01').
- Dim_Estabelecimento: Linked to Dim_Localizacao (Snowflake pattern) to organize hospitals by municipality.
- Dim_CID & Dim_Procedimento: Includes logic to handle "orphan" codes (codes present in historical records but missing from official tables).

🛠️ Building the Data Warehouse

To reproduce the environment, execute the SQL scripts located in the sql/ folder in the following order:

Schema Definition: Creates the tables and relationships (Foreign Keys).
```
psql -d dados_brutos_sus -f sql/01_create_schema.sql
```
Dimension Loading: Extracts unique values from raw data, cleans inconsistencies, and populates dimension tables.
```
psql -d dados_brutos_sus -f sql/02_populate_dimensions.sql
```
Fact Table Loading: Performs lookups to generate surrogate keys and loads transactional data (filtered for 2019-2024).
```
psql -d dados_brutos_sus -f sql/03_populate_fact.sql
```

📊 4. Business Intelligence (Power BI)

The final deliverable is an interactive Power BI Dashboard containing 9 analytical screens, designed to support strategic decision-making in public health management.

💡 Explore More: > * 📊 Access the Interactive Dashboard Online (No login required)

🖼️ View all screenshots in the Screenshots Folder

📥 Download the source Power BI File (.pbix)

Dashboard Structure:

Executive Overview: High-level KPIs (Admissions, Costs, Mortality) and geographical distribution map.
Time Analysis: Historical evolution and seasonality comparison (Year-over-Year).
Clinical Profile: Top Diagnoses (ICD-10) and Procedures, highlighting diseases with long hospital stays.
Patient Demographics: Population analysis via Age Pyramid, Sex, and Race/Color distribution.
Mortality Analysis: Lethality rates, deadliest diseases, and risk correlation by age group.
Age Cycle Analysis: Heatmap visualization showing the epidemiological transition of diseases across different life stages.
COVID-19 Monitor: Dedicated panel filtered for Coronavirus codes (B342) to analyze pandemic impact.
Financial Audit (Part 1): Cost breakdown using Decomposition Tree (Drill-down: Municipality → Disease → Procedure).
Financial Audit (Part 2): Detailed composition of expenses (Hospital Services vs. Professional Fees vs. ICU costs).

🛠️ Technologies

Language: Python 3.10+
Libraries: Pandas, SQLAlchemy, Psycopg2, Dbf.
Database: PostgreSQL 16.
Visualization: Microsoft Power BI.
IDE: VS Code & pgAdmin 4.

✒️ Authors

Filipe Neves - Data Engineer & BI Developer - LinkedIn Profile
- Responsible for the end-to-end development lifecycle: Data Extraction, Python ETL scripts, Data Warehouse Modeling (SQL), and Power BI Dashboard implementation.
Elias Reis - Project Collaborator & Repository Maintainer - LinkedIn Profile
- Responsible for repository management, documentation review, and academic support.

Project developed for the Special Topics in Databases course in the Information Systems degree program at Universidade do Estado da Bahia (UNEB).

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
bi		bi
docs		docs
sql		sql
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🏥 DATASUS Hospitalization Analysis (BI Project)

🏗️ Project Architecture

🚀 1. Setup & Installation

On Windows

On macOS and Linux

⚙️ 2. ETL Process Pipeline

Step 1: Extraction (Download)

Step 2: Decompression (DBC to DBF)

Step 3: Conversion (DBF to CSV)

Step 4: Loading to Database

🗄️ 3. Data Warehouse Modeling

Architecture Strategy

Key Components

🛠️ Building the Data Warehouse

📊 4. Business Intelligence (Power BI)

Dashboard Structure:

🛠️ Technologies

✒️ Authors

About

Uh oh!

Releases

Packages

Languages

License

FNevs/datasus-analytics

Folders and files

Latest commit

History

Repository files navigation

🏥 DATASUS Hospitalization Analysis (BI Project)

🏗️ Project Architecture

🚀 1. Setup & Installation

On Windows

On macOS and Linux

⚙️ 2. ETL Process Pipeline

Step 1: Extraction (Download)

Step 2: Decompression (DBC to DBF)

Step 3: Conversion (DBF to CSV)

Step 4: Loading to Database

🗄️ 3. Data Warehouse Modeling

Architecture Strategy

Key Components

🛠️ Building the Data Warehouse

📊 4. Business Intelligence (Power BI)

Dashboard Structure:

🛠️ Technologies

✒️ Authors

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages