Ten Shades of Intermediate SQL

Foreword

As I continue my journey in SQL, I would like to share with you a set of tools/functions that I found useful in exercises. For me, the biggest obstacle to expertise is not necessarily the lack of understanding (although, admittedly, I haven’t gained a good understanding of the time complexity of certain operations), but the unawareness of the existence of certain useful functions. This issue is especially pronounced if you are a self-learner like me.

The functions that I am going to introduce, I picked them up along the way of solving SQL 50, an SQL problem collection on Leetcode. The language standard will be MySQL.

Here is the list (click for quick access):

GROUP BY.. HAVING..
DISTINCT
LIMIT
IF(arg1, arg2, arg3)
ROUND(arg1, arg2)
UNION and UNION ALL
LIKE and REGEXP
GROUP_CONCAT
WITH.. AS..
ROW_NUMBER() OVER (PARTITION BY..) etc.

The approach will be through examples and the purpose is to expose the reader to these tools rather than to provide a complete understanding.

1. `GROUP BY.. HAVING..`

GROUP BY is a useful function for aggregating information within a group. For example, if you have a table “transactions”, with columns “transactionDate” and “amount”, and if you want to find out the total sales of each day, you would write the following query:

SELECT transactionDate as `date`, SUM(amount) as `totalSales`

FROM transactions

GROUP BY transactionDate;

HAVING is to groups like WHERE is to rows. It filters the groups by the specified condition. Thus, you can think of the following template

SELECT [columns] FROM [table]

GROUP BY [grouping]

HAVING [condition]

as a shorthand for

SELECT * FROM

(SELECT [columns] FROM [table]

GROUP BY [grouping]) as [alias]

WHERE [condition]

As a follow-up to the previous example, if you want to find out the days with total sales at least $100000, you would write the following query:

SELECT transactionDate as `date`, SUM(amount) as `totalSales`

FROM transactions

GROUP BY transactionDate

HAVING SUM(amount) >= 100000;

2. `DISTINCT`

As the name suggests, DISTINCT eliminates duplicate results returned by the SQL.

There are two useful ways for the DISTINCT function. The first, most typical way, is with the SELECT function. For example, if you are given a table “loginLog” of the ID “id” and login times “loginTime” of users, and you want to find out which users have logged in at least once, you would write the following query:

SELECT DISTINCT id FROM loginLog;

The DISTINCT keyword here ensures that users with multiple logins will not appear multiple times in the query result.

Note that you may use DISTINCT for SELECT statements with multiple columns:

SELECT DISTINCT [column 1], [column 2], … FROM [table]

You can think of it as first doing the normal query without DISTINCT, then removing the duplicates. Thus, it is possible that two records in the query result match on some fields (but not all).

The second use of DISTINCT is with aggregate functions like COUNT and SUM. For example, the following query counts the total number of users who have logged in at least once:

SELECT COUNT(DISTINCT id) as numUsers FROM loginLog;

3. `LIMIT`

(This section presumes knowledge of ORDER BY.)

The function LIMIT limits the number of rows output by the query. It is often used in combination with ORDER BY to select the top X results.

For example, if you want to select the 10 fastest runtimes (of submissions to a programming problem in an online judge) from a table “submissionLog” with columns “submissionId”, “result” (“success” or “fail”), and “runtime” (NULL if result is “fail”), we could write the following query:

SELECT * FROM submissionLog

WHERE result = “success”

ORDER BY runtime

LIMIT 10;

It is also possible to impose a lower limit to the rows returned. The basic syntax is LIMIT [offset] [number of rows] Thus, to limit the output to the 11th to 30th rows, we could write LIMIT 10 20.

4. `IF (arg1, arg2, arg3)`

This is exactly the same as the IF function in MS Excel. It checks whether the condition specified by arg1 is true. If yes, it returns the value specified by arg2, else it returns the value specified by arg3.

In the following example, you are asked to write a query to compute the total score of students on their assignments. If a submission is late, it gets a 25% penalty. Given a table “assignmentLog” with columns “studentId”, “assignment”, “score”, and “isLate”, (primary key being (studentId, assignment)), we would write the query as

SELECT studentId, SUM(IF(isLate, 0.75 * score, score)) as totalScore

FROM assignmentLog

GROUP BY studentId;

5. `ROUND(arg1, arg2)`

The ROUND function was useful for a lot of Leetcode problems as the requirement was to round decimal outputs to two decimal places. The syntax is ROUND([result], 2). Change 2 to the number of desired decimal places as needed.

6. `UNION` and `UNION ALL`

UNION ALL takes the results from two tables with matching columns, and put the results together. The basic syntax is

SELECT * FROM [Table 1] UNION ALL [Table 2];

UNION, which is really the more commonly-used function, puts the results from two tables together and removes duplicates afterwards. This is a rather important point because it means that duplicate entries within a table get eliminated after UNIONing with another table.

An exercise where UNION is useful is this (link to Leetcode). An exercise where UNION ALL is useful is this (link to Leetcode).

7. `LIKE` and `REGEXP`

LIKE and REGEXP are string matching functions. LIKE has limited functionality that can be nevertheless useful in simple situations. REGEXP leverages the full power of regular expression (see documentation).

The pattern string for LIKE is a combination of single characters, “_”, and “%”. The “_” is a special character that matches with any character. The “%” is a special character that matches with any string (even empty string).

As an example, suppose you have a table “names” with one column “name”, and you wish to find all names that start with “A”, you would write the following query:

SELECT * FROM names WHERE name LIKE “A%”;

To find all names that does not contain exactly five characters, you would write the following query (using LENGTH is, of course, fine as well):

SELECT * FROM names WHERE name NOT LIKE “_____”;

(If it is hard to tell, there are five _s between the quotes.)

The workings of regular expressions is out of scope here. Click on the link at the beginning of this section to learn about it. The query for selection all names that matches a certain pattern string is

SELECT * FROM names WHERE name REGEXP [pattern string];

8. `GROUP_CONCAT`

The function GROUP_CONCAT is an advanced aggregation function. By default, it takes a bunch of data specified by the query and enumerates them in a comma-separated list.

Let’s illustrate this function using an example (simplified from a Leetcode problem; link). Before going into the example, if you have not learned about the function DISTINCT, here is an intro.

If we have a table “sales” with columns “date” and “product”, indicating a sale of the product on the specified date, and if we want to list all items sold on any day, we could use the query

SELECT `date`, GROUP_CONCAT(DISTINCT product)

FROM sales

GROUP BY `date`;

Within each date group, all the products after removing duplicates gets concatenated into a comma-separated list.

In MySQL, we may use GROUP_CONCAT in conjunction with ORDER BY, but not with LIMIT (yet). Anyway, if in the above example we wish to order the items by alphabetical order, we could use the query

SELECT `date`,

GROUP_CONCAT(DISTINCT product ORDER BY product)

FROM sales

GROUP BY `date`;

9. `WITH.. AS..`

The WITH.. AS.. is one of the most important SQL features that keep complex queries clean and simple. It creates a common table expression (CTE), a temporary data set that can be used by other queries. It can be used for recursive (!) queries, but it is out of scope here. We will stick with non-recursive queries.

The basic syntax is

WITH [name] AS [query]

SELECT…

Then, in the SELECT query below, you may refer to the query results in the WITH clause by “name”.

Let’s look at an example. If you have a query “blackboxQuery” that returns “event”, “gold”, “silver”, “bronze”, representing the gold, silver, and bronze medalists in an event, and you want to use that query result to produce a table that returns a list of all the medalists, events, and the type of medal they get, we could write the following query

WITH m AS [content of blackboxQuery]

SELECT m.gold as medalist, m.event as event, “Gold” as medalType FROM m

UNION

SELECT m.silver as medalist, m.event as event, “Silver” as medalType FROM m

UNION

SELECT m.bronze as medalist, m.event as event, “Bronze” as medalType FROM m;

10. `ROW_NUMBER() OVER (PARTITION BY..)` etc.

Let’s first talk about ROW_NUMBER(). It assigns a row number to each entry of a table, in the order specified. There are similar functions such as RANK() and DENSE_RANK(). The rank is one plus the number of entries strictly before an entry. The dense rank is one plus the number of distinct entries strictly before an entry. Refer to the documentation for more detail.

As an example, if you wish to compute the rank of a student in an examination, and you are working with a table “exam” with columns “studentId”, “name”, and “score”, you would do the following:

SELECT name, RANK() OVER (ORDER BY score DESC) as rank

FROM exam;

(Note that ROW_NUMBER() won’t handle ties well.)

With PARTITION BY, the table is partitioned into sub-tables, and row numbers are assigned separately within each sub-table.

As an example, if you have a table “log” of sports events “event”, participants “participant”, and their finish time “finishTime” (the lower the better), and if you want to query the ranks of each record, you would do the following:

SELECT *,

ROW_NUMBER() OVER (PARTITION BY event ORDER BY finishTime) as rank

FROM log;

Summary

To summarize, we have discussed a bunch of intermediate-level MySQL functions. Here is a quick summary:

GROUP BY.. HAVING..: WHERE but for groups
DISTINCT: removes duplicates from tables
LIMIT: limit output rows, best used with ORDER BY
IF(arg1, arg2, arg3): if arg1 then arg2 else arg3
ROUND(arg1, arg2): rounds arg1 to arg2 decimal places
UNION and UNION ALL: put together entries from two tables
LIKE and REGEXP: string matching
GROUP_CONCAT: list elements beloging to a group
WITH.. AS..: refactors code, making a query result reusable
ROW_NUMBER() OVER (PARTITION BY..) etc.: gives a rank to each entry

With these functions added to your arsenal, you should be able to perform your everyday SQL queries with ease. Do practice either with projects or with online judge exercises, because practice makes perfect. Take care.

Foreword

1. GROUP BY.. HAVING..

2. DISTINCT

3. LIMIT

4. IF (arg1, arg2, arg3)

5. ROUND(arg1, arg2)

6. UNION and UNION ALL

7. LIKE and REGEXP

8. GROUP_CONCAT

9. WITH.. AS..

10. ROW_NUMBER() OVER (PARTITION BY..) etc.