Hi all of my Jedi Knights and Sith Lords,
Today you will JOIN me in my adventure to teach you more SQL (with Attitude) to continue on towards our final goal of creating a game. Yes, again, you heard that right!! We are making a game! It's going to go very slowly, b/c (let's face it) I also have a day job and a life outside of Steemit. Maybe not much of a life outside of Steemit, but it's there and my wife appreciates that I pay attention to her.
So, today's Lesson: Using the JOIN statement.
Using the JOIN Statement in Your Queries
Of the many statements and keywords that you will use in querying data, the JOIN is the one that will be the most confusing in actual use but the most effective in filtering unwanted data out of the results. The JOIN keyword essentially tells us to take a pair of tables and join them together. It's like if you had a pair of Excel-based spreadsheets and needed to include both worksheets as a single result-set. In Excel you'd have to find some common linkage or know what corresponds to what. Then you'd have to actually do some Excel magic (which I'm not going into any further as this is not an Excel Tutorial).
In SQL, the JOIN will effectively do this for you as long as you know what the Foreign Keys are between the tables. When I say Foreign Keys, I mean those data columns that will be the same in both tables. As a quick example, let's say I have school children and backpacks that I've set up in a database (don't ask why... maybe the teacher wants to store this information for some organizational reason). The child has been given a unique ID that they are set up under in a table called "SchoolChildren"; and the backpacks are given a name-tag that is attached to it... in the database, the "BackPack" table has a column called "backpack_id" which directly corresponds to the child's ID. The child's ID would be a Primary Key and backpack's "backpack_id" would be a Foreign Key. Let's say now that child "10101" has 5 backpacks (1 for homework, 1 for Show-and-Tell, 1 for Snacks, 1 for extra clothes, and 1 for toys... it could happen). How do you tie back all 5 backpacks? by giving each one the same "10101" value for its "backpack_id".
Certain SQL-based systems can query to pull in all of the data regardless of commonalities; however, Microsoft SQL Server Management Studio (which is the interface I use) does not allow an OUTER JOIN which does this. What we can do, though, is pull in all of the data from 1 table regardless of commonality with the other (or vice versa). To do so, we would use the LEFT OUTER JOIN and RIGHT OUTER JOIN to specify which table to use as the base table.
As an example, I have mocked up those 2 tables and added some data to each:
[The 2 Tables separately]
...and now, if I perform a LEFT OUTER JOIN and a RIGHT OUTER JOIN against the 2 tables I can pull back all data from 1 table and any corresponding details from the other.
[LEFT OUTER JOIN (image, top results) & RIGHT OUTER JOIN (image, bottom results)]
The LEFT OUTER JOIN takes all data from the left side of the JOIN "conversation" and any pertinent details from the right side of it. To use the above examples, the one using the LEFT OUTER JOIN shows all data from the "SchoolChildren" table and only the common details in the "BackPack" table.
Conversely, the RIGHT OUTER JOIN takes all data from the right side of the JOIN "conversation" and any pertinent details from the left side of it. Again using the above examples, the one using the RIGHT OUTER JOIN shows all data from the "BackPack" table and only the common details in the "SchoolChildren" table.
Lastly, there is another common JOIN statement that only shows data if it's common to both tables... the INNER JOIN statement (as can be seen below). This statement tells the SQL interface to query and return results only if the details occur in both of the tables.
[The INNER JOIN Query]
Extending the JOIN Statements
It is accurate to say, then, that you can join together multiple tables in this way as long as there are common links between the details. I have personally had to deal with joining as many as 20 tables together (loose estimate) to pull in certain details that pertain to all items. This data is then used within a tool called Crystal Reports that allows the report to query data to populate information on the report. It is fundamental to my normal job.
Merging data in this way becomes more and more complicated by which JOIN you used as well as which field filters (the ON clause that you can see in the image examples) that you use. If you use the incorrect filters, the wrong data will populate or you will find yourself missing the data completely. I have done this many times over my years of querying data. This is why I like to have a second set of eyes on my queries at times... as the thought process goes, someone else might see something that you do not.
The ON Clause
Final thoughts for the day will explain the ON clause of a SQL query. The ON clause is almost like the WHERE clause that works to filter certain information based on the tables that are being joined together to merge into a single result set. This is where you can make your queries slightly less bulky.
Where you can use the WHERE clause after-the-fact to limit results, the ON clause tends to work before-the-fact and only pulls data that links to the tables in question. Additionally, the biggest positive is that you can link many different filtered rows to not show up with a single ON clause segment that might take many WHERE clause segments.
In effect, using the ON clause instead of the WHERE clause can speed up the data returns as it will start to filter out data at the get-go instead of waiting until all data is returned. To explain this a bit further, think of it in this processing way:
(assuming there are 4 tables joined with 20,000 rows in each, trying to filter down to 42 rows of corresponding data)
- Select data.
- Pull all (80,000) rows.
- Filter against WHERE clause.
- Remove 39,958 rows of data.
- Select data from first and second table.
- Remove rows based on the ON clause.
- Take remaining rows (let's say 1,255) and join to 3rd table.
- Join and remove rows based on the ON clause.
- Take the remaining rows (lets say 433) and join to 4th table.
- Join and remove rows based on the ON clause.
While the ON clause, visually, has more steps that are apparent (and I'm really condensing the process for brevity's sake), it is quite conceivable that using a WHERE clause to join 4 tables each with 20,000 rows might take a few minutes at a time, then another minute or more to remove the redundant data. The ON clause, conversely, would have 40,000 rows the first time (might take a couple minutes to return) but then the next processes take less than a minute total because it's not searching through the sheer volume of rows in the first 2 tables to return data.
It is very conceivable and logical to think that the WHERE clause could take double or triple the time. I don't have any real-world results and examples to show this in action, but having worked a while in the SQL world, I have had to determine the best way to pull data... otherwise built-in "Time Out" errors could cause the longer queries to fail. In the IT world and Data Analytics world, speed is almost as important as accuracy.
With that I'll leave you to review this information and try to really absorb the meat of what I'm saying.
See you next time.