A comprehensive Business Intelligence solution to extract, transform, and analyze public hospitalization data (SIHSUS) from the Brazilian Health System (DATASUS).
Scope: State of Bahia (BA) | Period: 2019 to 2024.
This project follows a complete End-to-End Data Engineering pipeline:
- Extraction: Collecting
.dbcfiles from DATASUS public FTP. - Transformation (Bronze Layer): Converting binary
.dbcto.dbf(using Tabwin) and then to.csv(using Python/Pandas). - Loading (Silver Layer): Ingesting raw CSV data into PostgreSQL (
publicschema). - Modeling (Gold Layer): Transforming data into a Star/Snowflake Schema Data Warehouse (
dwschema). - Visualization: Interactive Dashboards using Microsoft Power BI.
This project uses Python. First, create a virtual environment and install dependencies.
python -m venv venv
.\venv\Scripts\activate
pip install -r requirements.txtpython3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
The project follows a linear pipeline to process public health data.
We acquire the raw data files from the official DATASUS FTP server.
- Source: DATASUS File Transfer Portal
- System: SIHSUS (Hospital Information System).
- File Type:
RD(Reduced AIH) - Provides hospitalization details.
Data Scope:
- Location: State of Bahia (BA).
- Period: Jan/2019 to Dec/2024.
- Format: Compressed binary files (
.dbc).
DATASUS files use a proprietary compression format (.dbc) that is not natively readable by standard data tools.
- Use the TabWin software (official tool from DATASUS).
- Navigate to
Arquivo>Comprime/Expande .DBF. - Select the downloaded
.dbcfiles. - Save the resulting
.dbffiles into theinput_dbf/project directory.
We use Python to convert the legacy .dbf format into standard .csv files, handling character encoding (Latin-1) and delimiters automatically.
Ensure your .dbf files are in input_dbf/ and run:
python src/convert_dbf_to_csv.pyOutput: Clean .csv files will be generated in the output_csv/ folder.
Finally, we ingest the CSV data into the PostgreSQL database. This script reads the files, cleans header names, and inserts them into the public schema (Staging Area).
python src/import_data.pyIn this stage, raw data from the Staging Area (public schema) is transformed into a structured Dimensional Model inside the dw schema. This structure is optimized for high-performance analytical queries (OLAP).
We implemented a Hybrid Star/Snowflake Schema. While most dimensions link directly to the Fact table, the Establishment dimension is normalized to link with Location, ensuring data integrity and reducing redundancy.
-
Fact Table (
Fato_Internacoes): Contains over 5 million records. It stores quantitative metrics (Cost, ICU Value, Days of Stay) and foreign keys to all dimensions.- Granularity: One row per hospitalization authorization (AIH).
-
Dimensions:
Dim_Pessoa: Stores patient demographics. We applied strict data cleaning rules here to standardize Age Units and Race/Color codes (mapping legacy codes like'1'to standard'01').Dim_Estabelecimento: Linked toDim_Localizacao(Snowflake pattern) to organize hospitals by municipality.Dim_CID&Dim_Procedimento: Includes logic to handle "orphan" codes (codes present in historical records but missing from official tables).
To reproduce the environment, execute the SQL scripts located in the sql/ folder in the following order:
-
Schema Definition: Creates the tables and relationships (Foreign Keys).
psql -d dados_brutos_sus -f sql/01_create_schema.sql
-
Dimension Loading: Extracts unique values from raw data, cleans inconsistencies, and populates dimension tables.
psql -d dados_brutos_sus -f sql/02_populate_dimensions.sql
-
Fact Table Loading: Performs lookups to generate surrogate keys and loads transactional data (filtered for 2019-2024).
psql -d dados_brutos_sus -f sql/03_populate_fact.sql
The final deliverable is an interactive Power BI Dashboard containing 9 analytical screens, designed to support strategic decision-making in public health management.
💡 Explore More: > * 📊 Access the Interactive Dashboard Online (No login required)
- 🖼️ View all screenshots in the Screenshots Folder
- 📥 Download the source Power BI File (.pbix)
- Executive Overview: High-level KPIs (Admissions, Costs, Mortality) and geographical distribution map.
- Time Analysis: Historical evolution and seasonality comparison (Year-over-Year).
- Clinical Profile: Top Diagnoses (ICD-10) and Procedures, highlighting diseases with long hospital stays.
- Patient Demographics: Population analysis via Age Pyramid, Sex, and Race/Color distribution.
- Mortality Analysis: Lethality rates, deadliest diseases, and risk correlation by age group.
- Age Cycle Analysis: Heatmap visualization showing the epidemiological transition of diseases across different life stages.
- COVID-19 Monitor: Dedicated panel filtered for Coronavirus codes (
B342) to analyze pandemic impact. - Financial Audit (Part 1): Cost breakdown using Decomposition Tree (Drill-down: Municipality → Disease → Procedure).
- Financial Audit (Part 2): Detailed composition of expenses (Hospital Services vs. Professional Fees vs. ICU costs).
- Language: Python 3.10+
- Libraries: Pandas, SQLAlchemy, Psycopg2, Dbf.
- Database: PostgreSQL 16.
- Visualization: Microsoft Power BI.
- IDE: VS Code & pgAdmin 4.
-
Filipe Neves - Data Engineer & BI Developer - LinkedIn Profile
- Responsible for the end-to-end development lifecycle: Data Extraction, Python ETL scripts, Data Warehouse Modeling (SQL), and Power BI Dashboard implementation.
-
Elias Reis - Project Collaborator & Repository Maintainer - LinkedIn Profile
- Responsible for repository management, documentation review, and academic support.
Project developed for the Special Topics in Databases course in the Information Systems degree program at Universidade do Estado da Bahia (UNEB).

