What is a set

Sets are the data structures of mathematics. They give us a formal language for math that's like a programming language, but much more flexible, while still being precise. They are tremendously useful for data scientists and the kinds of mathematics they need.

A set is a collection of objects (called elements) with two key properties: the elements are unordered within the set, and there are no duplicates. A set is often defined by listing its elements, comma-delimited, with \( \{\} \)s.

So, for example, sets \( \{2,\text{cat},3/5\} \) and \( \{\text{cat},3/5,2\} \) are the same, since the order of elements is not meaningful. Similarly, sets \( \{\text{blue},22,\text{gold}\} \) \( \{\text{blue},\text{gold},22,\text{blue}\} \) are the same, since the duplicate "\( \text{blue} \)" element is ignored.


Important! Mathematical objects have "type", just like objects in programs, and a set is a type. So, 4 and \( \{4\} \) are different! The first is the number four, and the second is the set whose only element is the number four. So "\( 4 \)" is a number and "\( \{4\} \)" is a set. They have different type and therefore cannot be the same. By the same reason, \( \{4\} \) and \( \{\{4\}\} \) are not the same. Also by the same reason, \( 0 \) and \( \emptyset \) and "" are all different, because the first is a number, the second is a set, and the third is a string.

Set basics

So, how does this relate to databases

A set is a collection of distinct objects. In databases, a set can be thought of as a table and each row represents an element of the set. As an example, a table named Midn where each row contains a student’s information can be considered a set of Midn and each row is an individual element.

Lets now review the results of two SELECT statements.

SELECT Alpha, First, Last FROM Midn;
+--------+--------+-----------+
| Alpha  | First  | Last      |
+--------+--------+-----------+
| 261000 | Eddard | Stark     |
| 261001 | Rob    | Stark     |
| 261002 | Bran   | Stark     |
| 261234 | Tyrian | Lannister |
| 261235 | Cersei | Lannister |
| 261236 | Tywin  | Lannister |
| 261237 | Jaime  | Lannister |
+--------+--------+-----------+
7 rows in set (0.02 sec)
SELECT Alpha, AcYr, Course, EndOfTerm FROM Grades;
+--------+------+--------+-----------+
| Alpha  | AcYr | Course | EndOfTerm |
+--------+------+--------+-----------+
| 261001 | 2023 | NL110  | C         |
| 261002 | 2023 | NL110  | A         |
| 261234 | 2023 | HE111  | B         |
| 261234 | 2023 | HH104  | A         |
| 261234 | 2023 | NL110  | A         |
| 261234 | 2023 | NS101  | B         |
| 261234 | 2023 | SC111  | C         |
+--------+------+--------+-----------+
7 rows in set (0.01 sec)

You can consider any queries we make against a table to be a subset of the table, remember a subset can be the entire table or any smaller combination of the possible elements. If we had performed a query like SELECT * FROM Grades WHERE Course='NL110', then this would have created a subset of the table Grades only showing elements related to NL110.

Remember each row is an element of the larger set (table), and each element is also a subset of the larger set. A single element in our case would be:

+--------+------+--------+-----------+
| Alpha  | AcYr | Course | EndOfTerm |
+--------+------+--------+-----------+
| 261001 | 2023 | NL110  | C         |
+--------+------+--------+-----------+

If we were to JOIN the two tables without specifying what columns we wanted to use, our results would be Cartesian Product (as described above) where the results would be a new set of all possible pairs of the elements of the original two sets.

SELECT Midn.Alpha, First, Last, AcYr, Course, EndOfTerm FROM Midn, Grades;
+--------+--------+-----------+------+--------+-----------+
| Alpha  | First  | Last      | AcYr | Course | EndOfTerm |
+--------+--------+-----------+------+--------+-----------+
| 261237 | Jaime  | Lannister | 2023 | NL110  | C         |
| 261236 | Tywin  | Lannister | 2023 | NL110  | C         |
| 261235 | Cersei | Lannister | 2023 | NL110  | C         |
| 261234 | Tyrian | Lannister | 2023 | NL110  | C         |
| 261002 | Bran   | Stark     | 2023 | NL110  | C         |
| 261001 | Rob    | Stark     | 2023 | NL110  | C         |
| 261000 | Eddard | Stark     | 2023 | NL110  | C         |
| 261237 | Jaime  | Lannister | 2023 | NL110  | A         |
| 261236 | Tywin  | Lannister | 2023 | NL110  | A         |
| 261235 | Cersei | Lannister | 2023 | NL110  | A         |
| 261234 | Tyrian | Lannister | 2023 | NL110  | A         |
| 261002 | Bran   | Stark     | 2023 | NL110  | A         |
| 261001 | Rob    | Stark     | 2023 | NL110  | A         |
| 261000 | Eddard | Stark     | 2023 | NL110  | A         |
| 261237 | Jaime  | Lannister | 2023 | HE111  | B         |
| 261236 | Tywin  | Lannister | 2023 | HE111  | B         |
| 261235 | Cersei | Lannister | 2023 | HE111  | B         |
| 261234 | Tyrian | Lannister | 2023 | HE111  | B         |
| 261002 | Bran   | Stark     | 2023 | HE111  | B         |
| 261001 | Rob    | Stark     | 2023 | HE111  | B         |
| 261000 | Eddard | Stark     | 2023 | HE111  | B         |
| 261237 | Jaime  | Lannister | 2023 | HH104  | A         |
| 261236 | Tywin  | Lannister | 2023 | HH104  | A         |
| 261235 | Cersei | Lannister | 2023 | HH104  | A         |
| 261234 | Tyrian | Lannister | 2023 | HH104  | A         |
| 261002 | Bran   | Stark     | 2023 | HH104  | A         |
| 261001 | Rob    | Stark     | 2023 | HH104  | A         |
| 261000 | Eddard | Stark     | 2023 | HH104  | A         |
| 261237 | Jaime  | Lannister | 2023 | NL110  | A         |
| 261236 | Tywin  | Lannister | 2023 | NL110  | A         |
| 261235 | Cersei | Lannister | 2023 | NL110  | A         |
| 261234 | Tyrian | Lannister | 2023 | NL110  | A         |
| 261002 | Bran   | Stark     | 2023 | NL110  | A         |
| 261001 | Rob    | Stark     | 2023 | NL110  | A         |
| 261000 | Eddard | Stark     | 2023 | NL110  | A         |
| 261237 | Jaime  | Lannister | 2023 | NS101  | B         |
| 261236 | Tywin  | Lannister | 2023 | NS101  | B         |
| 261235 | Cersei | Lannister | 2023 | NS101  | B         |
| 261234 | Tyrian | Lannister | 2023 | NS101  | B         |
| 261002 | Bran   | Stark     | 2023 | NS101  | B         |
| 261001 | Rob    | Stark     | 2023 | NS101  | B         |
| 261000 | Eddard | Stark     | 2023 | NS101  | B         |
| 261237 | Jaime  | Lannister | 2023 | SC111  | C         |
| 261236 | Tywin  | Lannister | 2023 | SC111  | C         |
| 261235 | Cersei | Lannister | 2023 | SC111  | C         |
| 261234 | Tyrian | Lannister | 2023 | SC111  | C         |
| 261002 | Bran   | Stark     | 2023 | SC111  | C         |
| 261001 | Rob    | Stark     | 2023 | SC111  | C         |
| 261000 | Eddard | Stark     | 2023 | SC111  | C         |
+--------+--------+-----------+------+--------+-----------+
49 rows in set (0.00 sec)

Of course this is not what we want to retrieve from the database as all of our students grades were mixed, but it is the effect of what happens when we don't properly join the tables. In reality this is a CROSS JOIN that would be the result of something similar to SELECT * FROM Midn CROSS JOIN Grades;.

What we really wanted to find was the intersection between the two tables, but a traditional set intersection would only show elements of the set (rows in our case) that were identical between the two queries. If we only wanted to find the Alphas this could be accomplished by using INTERSECT in MySQL.

SELECT Alpha FROM Midn INTERSECT SELECT Alpha FROM Grades;
+--------+
| Alpha  |
+--------+
| 261001 |
| 261002 |
| 261234 |
+--------+
3 rows in set (0.00 sec)

In our case, we want to define what fields (or columns) to compare, in essense we are performing a MySQL intersection by running a operation over sets with our two tables.

SELECT Midn.Alpha, First, Last, AcYr, Course, EndOfTerm FROM Midn INNER JOIN Grades USING(Alpha);
+--------+--------+-----------+------+--------+-----------+
| Alpha  | First  | Last      | AcYr | Course | EndOfTerm |
+--------+--------+-----------+------+--------+-----------+
| 261001 | Rob    | Stark     | 2023 | NL110  | C         |
| 261002 | Bran   | Stark     | 2023 | NL110  | A         |
| 261234 | Tyrian | Lannister | 2023 | HE111  | B         |
| 261234 | Tyrian | Lannister | 2023 | HH104  | A         |
| 261234 | Tyrian | Lannister | 2023 | NL110  | A         |
| 261234 | Tyrian | Lannister | 2023 | NS101  | B         |
| 261234 | Tyrian | Lannister | 2023 | SC111  | C         |
+--------+--------+-----------+------+--------+-----------+
7 rows in set (0.00 sec)

We can think of our INNER JOIN as the intersection over our two resulting sets above, \(A\) and \(B\), and their respective rows \(x\) and \(y\), performed by the operation over the appropriate columns. $$ C = \sum_{x \in A} \sum_{y \in B} f(x,y,*columns) $$ Again, this action is really just the intersection of the rows of the tables as we are allowing ourselves a bit of leeway in the requirements for the elements (or comparison of the elements) in the set to support intersection, or if you dislike that concept, imagine that the database is finding the proper intersection using the identical columns and then adding the additionally requested columns.

Continuing this thought, when we are comparing these two tables via the Alpha column, it is as if the other columns in the respective tables are dependent on this particular column as if the remaining columns support this column or could be defined by it. This is actually the concept of functional dependency that we will begin exploring in the next lecture.

Problems

  1. Let \(A = \{x1, x3, x5, \ldots\}\). Give an element of \(A\) other than \(x1\), \(x3\), and \(x5\).
  2. Let \(B = \{\{0\},\{1\},\{2\}\}\). True or false, the elements of \(B\) are numbers.
  3. Let \(U = \{\ a,\ \{a\},\ \{\ \}\ \}\). Write out all the subsets of \(U\). Hint: there are eight subsets. Why? \(|U|=3\) and for each of the three elements you have two choices: put the element in the subset, or leave it out. This gives \(2\cdot 2\cdot 2 = 8\) possible outcomes.
  4. What is the difference between the following:
    • INNER JOIN
    • OUTER JOIN
    • LEFT JOIN
    • JOIN
  5. What are the following NATURAL JOIN, CROSS JOIN, and RIGHT JOIN, and why should we avoid these?