Describe the bug
When the matching condition is on columns and combined with OR, the result retuned by LEFT JOIN is similar to INNER JOIN, which is incorrect.
LEFT JOIN should not reduce the number of rows returned, i.e. it should return all the records from the left table, and the matched records from the right table. When there is no match, the result is a NULL, i.e. display as empty.
INNER JOIN reduces the number of rows, i.e. it only returns the records that have matching values in both tables.
To Reproduce
- Use DataFusion CLI:
# in dir datafusion/datafusion-cli
$ cargo build
$ ./target/debug/datafusion-cli
- Write sample data
CREATE OR REPLACE TABLE employees(emp_id INT, name VARCHAR) AS VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
CREATE OR REPLACE TABLE department(emp_id INT, department VARCHAR) AS VALUES (1, 'HR'), (3, 'Engineering'), (4, 'Sales');
- This is how
employees table and department table look like
> SELECT * FROM employees;
+--------+-------+
| emp_id | name |
+--------+-------+
| 1 | Alice |
| 2 | Bob |
| 3 | Carol |
+--------+-------+
3 row(s) fetched.
Elapsed 0.009 seconds.
> SELECT * FROM department;
+--------+-------------+
| emp_id | department |
+--------+-------------+
| 1 | HR |
| 3 | Engineering |
| 4 | Sales |
+--------+-------------+
3 row(s) fetched.
Elapsed 0.006 seconds.
- Query
> SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'Bob');
+--------+-------+-------------+
| emp_id | name | department |
+--------+-------+-------------+
| 1 | Alice | HR |
| 1 | Alice | Engineering |
| 1 | Alice | Sales |
| 2 | Bob | HR |
| 2 | Bob | Engineering |
| 2 | Bob | Sales |
+--------+-------+-------------+ <-- should have one more row for Carol
6 row(s) fetched.
Elapsed 0.013 seconds.
> SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'NotExist1' OR e.name = 'NotExist2');
+--------+------+------------+
| emp_id | name | department |
+--------+------+------------+
+--------+------+------------+ <-- should have three rows, 1 row for Alice, 1 row for Bob, and 1 row for Carol
0 row(s) fetched.
Elapsed 0.014 seconds.
> SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'NotExist');
+--------+-------+-------------+
| emp_id | name | department |
+--------+-------+-------------+
| 1 | Alice | HR |
| 1 | Alice | Engineering |
| 1 | Alice | Sales |
+--------+-------+-------------+ <-- should have two more rows, 1 row for Bob and 1 row for Carol
3 row(s) fetched.
Elapsed 0.014 seconds.
Expected behavior
Postgres shows the expected results.
- In psql cli, drop tables and create new tables with data
DROP TABLE IF EXISTS employees;
DROP TABLE IF EXISTS department;
# create tables
CREATE TABLE employees(emp_id INT, name VARCHAR);
CREATE TABLE department(emp_id INT, dept_name VARCHAR);
# write sample data
INSERT INTO employees (emp_id, name) VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
INSERT INTO department (emp_id, dept_name) VALUES (1, 'HR'), (3, 'Engineering'), (4, 'Sales');
- This is how two tables look like:
chunchun=# SELECT * FROM employees;
emp_id | name
--------+-------
1 | Alice
2 | Bob
3 | Carol
(3 rows)
chunchun=# SELECT * FROM department;
emp_id | dept_name
--------+-------------
1 | HR
3 | Engineering
4 | Sales
(3 rows)
- Query
SELECT e.emp_id, e.name, d.department
FROM employees AS e
LEFT JOIN department AS d
ON (e.name = 'Alice' OR e.name = 'Bob');
emp_id | name | department
--------+-------+-------------
1 | Alice | HR
1 | Alice | Engineering
1 | Alice | Sales
2 | Bob | HR
2 | Bob | Engineering
2 | Bob | Sales
3 | Carol |
(7 rows)
SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'NotExist1' OR e.name = 'NotExist2');
emp_id | name | department
--------+-------+------------
1 | Alice |
2 | Bob |
3 | Carol |
(3 rows)
SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'NotExist');
emp_id | name | department
--------+-------+-------------
1 | Alice | HR
1 | Alice | Engineering
1 | Alice | Sales
2 | Bob |
3 | Carol |
(5 rows)
Additional context
No response
Describe the bug
When the matching condition is on columns and combined with
OR, the result retuned byLEFT JOINis similar toINNER JOIN, which is incorrect.LEFT JOINshould not reduce the number of rows returned, i.e. it should return all the records from the left table, and the matched records from the right table. When there is no match, the result is aNULL, i.e. display as empty.INNER JOINreduces the number of rows, i.e. it only returns the records that have matching values in both tables.To Reproduce
# in dir datafusion/datafusion-cli $ cargo build $ ./target/debug/datafusion-cliemployeestable anddepartmenttable look likeExpected behavior
Postgres shows the expected results.
Additional context
No response