Introduction to SQL Joins
A simplified, but not oversimplified, tutorial
I. Premise
The goal of this article is to give a simple yet rigorous treatment of the different join operations in SQL. In the end, you will have learned how to interpret and use a wide range of SQL join operations with confidence. An excellent resource is Wikipedia but I would like to condense it even further.
Important aspects of SQL joins that we will not address:
- how the constraints / data types of columns are changed after join
- how NULL entries are dealt with (we assume that the table has no NULLs)
- other edge cases
This tutorial presumes some familiarity with set theory and SQL SELECT... FROM...
query.
II. The inner join
There are really only three join types in SQL really – inner join, left/right (outer) join, and outer join. Let’s start with inner join.
Consider the following setting. There are two tables, TA and TB. The columns of TA are A and the columns of TA are B. We can think of a column as a domain (set of admissible values, e.g. integers, floats, strings) together with a name, and each row of a table as an element in the Cartesian product of these domains.
As an example, if A consists of two columns, “id” (positive integer-valued) and “last_name” (string-valued), then a row is just an (id, last_name) pair where id is a positive integer and last_name is a string.
The inner join of TA and TB returns a table with columns (domain) A x B with the rows (a, b) that satisfy a relation ~ on A x B.
The relation is abstract and assigns a value of TRUE or FALSE to each of a~b. We abbreviate “a~b is TRUE” using “a~b”.
Typically, the ~ relation is some sort of equality constraint on the common column (e.g. “id”) of the tables. A slightly more complicated example is if A and B are integer domains and a~b could mean “a+1 < b”. Then, 3~5 is TRUE, 1~1 is FALSE, and 5~3 is FALSE, etc. [1]
The basic syntax for inner join looks like:
SELECT [columns]
FROM [TA] INNER JOIN [TB]
ON [relation]
Example 1: TA has columns “id” and “last_name”, and TB has columns “id” and “first_name”. Want a table with columns “id”, “last_name”, and “first_name”.
Solution:
SELECT TA.*, TB.first_name
FROM TA INNER JOIN TB
ON TA.id = TB.id
Note that if we do SELECT *
instead, then the resulting joint table will have two “id” columns, which will mess things up. We will discuss how to handle them in the latter sections.
III. The outer joins
Now we can talk about the outer joins. The only difference between the left outer join and the inner join is that, if there is no data b in TB such that a~b (is TRUE), then we add the entry (a, NUL) where NUL is a special element. [2]
The left outer join of TA and TB returns a table with columns (domain) A x B with the rows (a, b) that satisfy a relation ~ on A x B, and also (a, NUL) where there is no b in TB such that a~b.
The right outer join behaves similarly. As an exercise, write out a definition of the right outer join in a similar fashion as above. [3]
The outer join (also called “full outer join”) is sort of a combination of the left and right outer joins. We add the entry (a, NUL) if there is no b in TB such that a~b, and we add the entry (NUL, b) if there is no a in TA such that a~b.
And this is it! As for the syntax, it’s the same as the basic syntax for inner join, but replace INNER JOIN
by LEFT JOIN
/ RIGHT JOIN
/ OUTER JOIN
.
IV. The USING
keyword
As we mentioned in Section II, the issue of duplicate columns does arise when using joins without explicitly selecting columns. The USING
keyword is one way to deal with that. It tells SQL which column(s) the join operation “used” for joining the tables and is limited to testing equality on those columns. The columns have to appear in both TA and TB.
Thus, in Example 1, we could also have done this:
SELECT *
FROM TA INNER JOIN TB
USING (id)
This will be transformed into an ON
clause that checks for equality between TA.id
and TB.id
, and then only one copy of the “id” column will be retained in the joint table.
V. Aliasing
Another way to deal with duplicate columns is aliasing – just rename the columns so that they are distinct. This is useful when you actually want to keep both of the duplicate columns.
Example 2: TA has columns “id” and “phone_no” (that stands for home phone number), and TB has columns “id” and “phone_no” (that stands for mobile phone number). Want a table with columns “id” and both phone numbers.
Solution:
SELECT id, TA.phone_no as home_phone_no, TB.phone_no as mobile_phone_no
FROM TA INNER JOIN TB
USING (id)
The data from the two tables will be combined, with the “phone_no” from table TA being under the column “home_phone_no” in the joint table, and the “phone_no” from table TB being under the column “mobile_phone_no” in the joint table.
VI. Other useful joins
We finish by describing two other common and useful joins: the cross join and the natural join.
The cross join of TA and TB returns a table with columns (domain) A x B with all the rows (a, b) for a in TA and b in TB.
The cross join is a special case of one of the joins. As an exercise, define the cross join in terms of one of the basic join operations. [4]
Now, the natural join. It is actually best to describe it as a special case of the “inner join… using…” operations.
Given tables TA and TB with common columns C, the natural join of TA and TB is the inner join of TA and TB “USING
” C (i.e. a~b if and only if they agree on all the common columns).
To use these joins, simply replace the INNER JOIN
etc. by CROSS JOIN
and NATURAL JOIN
in the basic syntax. Note that the ON
modifier is not applicable to cross join and natural join (since both of them have a default relation).
VII. Summary
That concludes the tutorial. To summarize the key points:
- The inner join of TA and TB is all data (a, b) that satisfy a relation ~.
- ~ need not be equality constraints and can be a general assignment of TRUE/FALSE values on all (a, b) pairs.
- The left/right/full outer joins makes all a in TA/all b in TB/both appear in the joint table, possibly with NUL values in the remaining fields.
- Cross join is inner join with no constraints (i.e. all TRUE relation)
- Natural join is inner join with equality constraints on all common columns
That’s it! Let me know if there are any questions or comments.
Footnotes
[1] The reader should not confuse the general relation here with the notion of equivalence relation. In particular, from the example presented just now, one can see that the relation needs not be reflexive, nor symmetric, not transitive.
[2] It is deliberately misspelled to distinguish from the SQL NULL values.
[3] The right outer join of TA and TB returns a table with columns (domain) A x B with the rows (a, b) that satisfy a relation ~ on A x B, and also (NUL, b) where there is no a in TA such that a~b.
[4] The cross join of TA and TB is the inner join of TA and TB where a~b is always TRUE.
License
Copyright 2016-present George Cushen.
Released under the MIT license.