Merged
2 changes: 2 additions & 0 deletions .coveragerc
@@ -11,3 +11,5 @@ exclude_lines =
ignore_errors = True
omit =
    tests/*
    sqlquerygraph.py
    loader.py
4 changes: 4 additions & 0 deletions .gitignore
@@ -25,6 +25,10 @@ __pycache__/
# neo4j
data/databases/*
data/transactions/*
data/dbms/*
neo4j/databases/*
neo4j/transactions/*
neo4j/dbms/*
logs/*

# tests / coverage reports
48 changes: 40 additions & 8 deletions README.md
@@ -10,13 +10,20 @@ Parse your SQL queries and represent their structure as a graph.

Currently, we can represent how the tables in a set of SQL query scripts depend on each other.

```cypher
MATCH p=(r:Reporting)-[:HAS_TABLE_DEPENDENCY]->()-[:HAS_TABLE_DEPENDENCY]->()
WHERE r.table_name='user_activity'
RETURN p
```
![](./guide/img/table_dependency.png)

## Requirements
To run the code in here, ensure your system meets the following requirements:
- Unix-like operating system (macOS, Linux, ...) - though it might work on Windows;
- Python 3.8 or above; and
- [Poetry](https://python-poetry.org/docs/) installed.
- [`direnv`](https://direnv.net/) installed, including shell hooks;
- [`.envrc`](https://github.com/avisionh/sqlquerygraph/blob/main/.envrc) allowed/trusted by `direnv` to use the environment variables - see [below](#allowingtrusting-envrc);
- [`.envrc`](https://github.com/avisionh/sqlquerygraph/blob/main/.envrc) allowed/trusted by `direnv` to use the environment variables - see [below](#set-up);

<!--Note there may be some Python IDE-specific requirements around loading environment variables, which are not considered here. -->

@@ -41,34 +48,59 @@ python sqlquerygraph.py -sd 'sql' -ed 'neo4j' -rd 'github_repos' 'analytics' 're

### Run neo4j graph database
We use [neo4j](https://neo4j.com/) for this project to visualise the dependencies between tables. To install neo4j locally using Docker Compose, follow the below instructions:
1. Install Docker
1. Install and open Docker
+ For Mac OSX, install Docker and Docker Compose together [here](https://docs.docker.com/docker-for-mac/install/).
+ For Linux, install Docker [here](https://docs.docker.com/engine/install/) and then follow these [instructions](https://docs.docker.com/compose/install/) to install docker-compose.
+ For Windows, install Docker and Docker Compose together [here](https://docs.docker.com/docker-for-windows/install/).
1. Create a new file, `.secrets`, in the directory where this `README.md` file sits, and store the following in there. This allows you to set the password for your local neo4j instance without exposing it.
```shell script
```
export NEO4J_AUTH=neo4j/<your_password>
export NEO4J_AUTH=neo4j
export NEO4J_AUTH=<your_password>
```
1. Within this directory that has the `docker-compose.yml` file, run the below in your shell/terminal:
1. Update your `.env` file to take in the new `.secrets` file you created by entering the below in your shell/terminal:
```shell script
direnv allow
```
1. Download the neo4j image. Within this directory that has the `docker-compose.yml` file, run the below in your shell/terminal:
```shell script
docker-compose up -d
docker-compose up
```
1. If this is the first time you have downloaded the neo4j Docker image, wait a while (possibly up to an hour, depending on your machine specs). If you have downloaded the image before (for example, by following these instructions previously), wait a few minutes. Then launch neo4j locally by opening your web browser and entering the following web address:
- http://localhost:7474/browser/
1. If this is the first time you have downloaded the neo4j Docker image, wait a while (possibly up to an hour, depending on your machine specs). If you have downloaded the image before (for example, by following these instructions previously), wait a few minutes. You will know it is ready when the following message appears in your terminal:
```
...
neo4j | 2021-05-26 06:40:15.270+0000 INFO Bolt enabled on 0.0.0.0:7687.
neo4j | 2021-05-26 06:40:16.412+0000 INFO Remote interface available at http://localhost:7474/
neo4j | 2021-05-26 06:40:16.414+0000 INFO Started.
```
Then launch neo4j locally via opening your web browser and entering the following web address:
- http://localhost:7474/
1. The username and password will be:
```
username: neo4j
password: <your_password>
```
1. Load the data into the database by entering the following in a separate terminal:
```
docker exec -it neo4j bash
# move .csv files into neo4j's import/ directory
mv data/*csv import/
```
1. In your local terminal:
```shell script
python loader.py --file 'neo4j/<name_of_cypher_file>'
```
1. When you have finished playing with your local neo4j instance, remember to stop it running by executing the below in your shell/terminal:
```shell script
# see name of container running, which most likely is called 'neo4j'
docker ps
# stop container running
docker stop neo4j
docker stop <name_of_container>
```
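Before running `loader.py`, it can help to confirm that the Bolt port published in `docker-compose.yml` accepts connections. The following is a minimal sketch using only the Python standard library; the function name `wait_for_port` and its default values are illustrative assumptions and are not part of this repository. An open port only shows that something is listening, not that neo4j has finished starting, so the `Started.` log line remains the more reliable signal.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 7687)` would poll the Bolt port for up to a minute before giving up.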

***

## Acknowledgements
This builds on the excellent [moz-sql-parser](https://github.com/mozilla/moz-sql-parser) package.

With thanks also to the [Google Cloud Public Dataset Program](https://cloud.google.com/solutions/datasets); the SQL queries in this repo are based on the program's [GitHub repos](https://console.cloud.google.com/marketplace/product/github/github-repos) dataset.
16 changes: 0 additions & 16 deletions data/loader.sh

This file was deleted.

12 changes: 6 additions & 6 deletions docker-compose.yml
@@ -1,5 +1,5 @@
# https://thibaut-deveraux.medium.com/how-to-install-neo4j-with-docker-compose-36e3ba939af0
version: '3.9'
version: '3.8'

services:
  neo4j:
@@ -9,15 +9,15 @@ services:
    # pass .env file to container
    env_file: .env
    ports:
      - 7474:7474
      - 7687:7687
      - 7474:7474 # web client
      - 7687:7687 # db default port
    volumes:
      # cannot move files to import/ folder in neo4j because it's read-only
      # https://neo4j.com/docs/operations-manual/current/configuration/file-locations/
      # but can move from docker neo4j bash terminal
      - ./neo4j:/data
      - ./loader.py:/loader.py
    environment:
      - NEO4j_dbms.security.auth_enabled='true'
      # listen to incoming connections
      - NEO4J_dbms.connector.bolt.listen_address=0.0.0.0:7687
      # Raise memory limits
      - NEO4J_dbms_memory_pagecache_size=2G
      - NEO4J_dbms.memory.heap.initial_size=2G
Binary file added guide/img/table_dependency.png
29 changes: 29 additions & 0 deletions loader.py
@@ -0,0 +1,29 @@
import os
import argparse

from py2neo import Graph

NEO4J_AUTH = (os.getenv(key="NEO4J_USERNAME"), os.getenv(key="NEO4J_PASSWORD"))

g = Graph(auth=NEO4J_AUTH, host="localhost", port=7687, scheme="bolt")


if __name__ == """__main__""":
    argp = argparse.ArgumentParser()
    argp.add_argument("-f", "--file", type=str, help="Path for where Cypher query is.")
    args = argp.parse_args()

    print(f"Reading {args.file}\n")
    print("*******************************************\n")
    with open(file=args.file, mode="r") as f:
        queries = f.read()

    print(f"Formatting {args.file} for importing into neo4j\n")
    print("*******************************************\n")
    queries = queries.split(sep=";")
    queries = [txt for txt in queries if txt != "\n"]

    print(f"Executing {args.file} in neo4j\n")
    print("*******************************************\n")
    for query in queries:
        g.evaluate(cypher=query)
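`loader.py` splits the Cypher script on `;` and drops fragments that are exactly `"\n"`. A slightly more tolerant variant is sketched below as a standalone function. The name `split_statements` is hypothetical, and the choice to strip surrounding whitespace is an assumption about what blank fragments may look like (for example, an empty string at the end of the file or a fragment containing only blank lines). Like the original, this naive split would break on a `;` inside a string literal, and it does not remove `//` comments.

```python
from typing import List


def split_statements(script: str) -> List[str]:
    """Split a Cypher script on ';' and drop fragments that contain only whitespace."""
    fragments = script.split(";")
    return [fragment.strip() for fragment in fragments if fragment.strip()]


# Two statements taken from the clean-up section of example_import.cypher
sample = "MATCH (a)-[r]->()\nDELETE a, r;\n\nMATCH (a)\nDELETE a;\n"
statements = split_statements(sample)
print(len(statements))  # prints 2
```

Each returned string could then be passed to `g.evaluate(cypher=...)` in the same loop that `loader.py` already uses.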
1 change: 1 addition & 0 deletions neo4j/analytics_analytics_dependency.csv
@@ -0,0 +1 @@
table_dataset,table_name,dependency_dataset,dependency_name
1 change: 1 addition & 0 deletions neo4j/analytics_github_repos_dependency.csv
@@ -0,0 +1 @@
table_dataset,table_name,dependency_dataset,dependency_name
1 change: 1 addition & 0 deletions neo4j/analytics_tables.csv
@@ -0,0 +1 @@
table_dataset,table_name
50 changes: 50 additions & 0 deletions neo4j/example_import.cypher
@@ -0,0 +1,50 @@
// Create constraints on table_name property to ensure each label has unique table_name
CREATE CONSTRAINT table_name_ConstraintReporting ON (r:Reporting)
ASSERT r.table_name IS UNIQUE;
CREATE CONSTRAINT table_name_ConstraintAnalytics ON (a:Analytics)
ASSERT a.table_name IS UNIQUE;
CREATE CONSTRAINT table_name_ConstraintGithub_Repos ON (g:GithubRepos)
ASSERT g.table_name IS UNIQUE;

// Create table nodes to join later
USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM "file:///reporting_tables.csv" AS csvLine
CREATE (:Reporting {table_name: toString(csvLine.table_name), table_dataset: toString(csvLine.table_dataset), import_datetime: datetime()});

USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM "file:///analytics_tables.csv" AS csvLine
CREATE (:Analytics {table_name: toString(csvLine.table_name), table_dataset: toString(csvLine.table_dataset), import_datetime: datetime()});

USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM "file:///github_repos_tables.csv" AS csvLine
CREATE (:GithubRepos {table_name: toString(csvLine.table_name), table_dataset: toString(csvLine.table_dataset), import_datetime: datetime()});

// Load table dependency data
USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM "file:///reporting_analytics_dependency.csv" AS csvLine
MERGE (r:Reporting {table_name: toString(csvLine.table_name), table_dataset: toString(csvLine.table_dataset)})
MERGE (a:Analytics {table_name: toString(csvLine.dependency_name), table_dataset: toString(csvLine.dependency_dataset)})
CREATE (r)-[:HAS_TABLE_DEPENDENCY {import_datetime: datetime()}]->(a);
USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM "file:///reporting_github_repos_dependency.csv" AS csvLine
MERGE (r:Reporting {table_name: toString(csvLine.table_name), table_dataset: toString(csvLine.table_dataset)})
MERGE (g:GithubRepos {table_name: toString(csvLine.dependency_name), table_dataset: toString(csvLine.dependency_dataset)})
CREATE (r)-[:HAS_TABLE_DEPENDENCY {import_datetime: datetime()}]->(g);

USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM "file:///analytics_analytics_dependency.csv" AS csvLine
MERGE (a1:Analytics {table_name: toString(csvLine.table_name), table_dataset: toString(csvLine.table_dataset)})
MERGE (a2:Analytics {table_name: toString(csvLine.dependency_name), table_dataset: toString(csvLine.dependency_dataset)})
CREATE (a1)-[:HAS_TABLE_DEPENDENCY {import_datetime: datetime()}]->(a2);
USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM "file:///analytics_github_repos_dependency.csv" AS csvLine
MERGE (a:Analytics {table_name: toString(csvLine.table_name), table_dataset: toString(csvLine.table_dataset)})
MERGE (g:GithubRepos {table_name: toString(csvLine.dependency_name), table_dataset: toString(csvLine.dependency_dataset)})
CREATE (a)-[:HAS_TABLE_DEPENDENCY {import_datetime: datetime()}]->(g);

// Delete all nodes with relationships
MATCH (a)-[r]->()
DELETE a, r;

// Delete all nodes with no relationships
MATCH (a)
DELETE a;

// Drop constraints and correspondingly, index
CALL db.constraints;
DROP CONSTRAINT table_name_ConstraintReporting;
DROP CONSTRAINT table_name_ConstraintAnalytics;
DROP CONSTRAINT table_name_ConstraintGithub_Repos;
1 change: 1 addition & 0 deletions neo4j/github_repos_tables.csv
@@ -0,0 +1 @@
table_dataset,table_name
1 change: 1 addition & 0 deletions neo4j/reporting_analytics_dependency.csv
@@ -0,0 +1 @@
table_dataset,table_name,dependency_dataset,dependency_name
1 change: 1 addition & 0 deletions neo4j/reporting_github_repos_dependency.csv
@@ -0,0 +1 @@
table_dataset,table_name,dependency_dataset,dependency_name
1 change: 1 addition & 0 deletions neo4j/reporting_tables.csv
@@ -0,0 +1 @@
table_dataset,table_name
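The CSV files added in this change contain only header rows. As a rough illustration of the shape a populated `*_dependency.csv` file would take, the sketch below serialises dependency records under the same four-column header. The helper name `dependencies_to_csv` and the example row are hypothetical; in particular, the dependency table name is made up and does not come from this repository.

```python
import csv
import io

FIELDNAMES = ["table_dataset", "table_name", "dependency_dataset", "dependency_name"]


def dependencies_to_csv(rows):
    """Serialise dependency dicts to CSV text using the *_dependency.csv header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical row for illustration only
example_rows = [
    {
        "table_dataset": "reporting",
        "table_name": "user_activity",
        "dependency_dataset": "analytics",
        "dependency_name": "example_table",
    }
]
print(dependencies_to_csv(example_rows))
```

The column names match the `csvLine.*` fields that `example_import.cypher` reads in its `LOAD CSV WITH HEADERS` statements.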