Lesson 1: Retrieving Data From a Relational Database

Lesson 1: Retrieving Data From a Relational Database jls27

Overview

Overview eab14

One of the core skills required by database professionals is the ability to manipulate and retrieve data held in a database using Structured Query Language (SQL, sometimes pronounced "sequel"). As we'll see, queries can be written to create tables, add records to tables, update existing table records, and retrieve records meeting certain criteria. This last function, retrieving data through SELECT queries, is arguably the most important because of its utility in answering the questions that led the database developer to create the database in the first place. SELECT queries are also the most fun type to learn about, so, in this first lesson, we will focus on using them to retrieve data from an existing database. Later, we'll see how new databases can be designed, implemented, populated and updated using other types of queries.

Objectives

At the successful completion of this lesson, students should be able to:

  • use the Query Builder GUI in MS-Access to create basic SQL SELECT queries;
  • understand how settings made in the Query Builder translate to the various clauses in an SQL statement;
  • use aggregation functions on grouped records to calculate counts, sums, and averages;
  • construct a query based on another query (subquery).

Questions?

Conversation and comments in this course will take place within the course discussion forums. If you have any questions now or at any point during this week, please feel free to post them to the Lesson 1 Discussion Forum.

Checklist

Checklist jls27

Lesson 1 is one week in length (see the Canvas Calendar for specific due dates). To finish this lesson, you must complete the activities listed below:

  1. Download the baseball stats.accdb (right-click and Save As) Access database that will be referenced during the lesson.
  2. Work through Lesson 1.
  3. Complete Project 1 and upload its deliverables to the Project 1 Dropbox.
  4. Complete the Lesson 1 Quiz.

SELECT Query Basics

SELECT Query Basics jls27

Why MS-Access?

A number of RDBMS vendors provide a GUI to aid their users in developing queries. These can be particularly helpful to novice users, as it enables them to learn the overarching concepts involved in query development without getting bogged down in syntax details. For this reason, we will start the course with Microsoft Access, which provides perhaps the most user-friendly interface.

A. Download an Access database and review its tables

Throughout this lesson, we'll use a database of baseball statistics to help demonstrate the basics of SELECT queries.

  1. Click to download the baseball database.
  2. Open the database in MS Access.

    Screen capture of MS Access Navigation Pane, access to queries.

    One part of the Access interface that you'll use frequently is the "Navigation Pane," which is situated on the left side of the application window. The top of the Navigation Pane is just beneath the "Ribbon" (the strip of controls that runs horizontally along the top of the window).

    The Navigation Pane provides access to the objects stored in the database, such as tables, queries, forms, and reports. When you first open the baseball_stats.accdb database, the Navigation Pane should appear with the word Tables at the top, indicating that it is listing the tables stored in the database (PLAYERS, STATS, and TEAMS).

  3. Double-click on a table's name in the Navigation Pane to open it. Open all three tables and review the content. Note that the STATS table contains an ID for each player rather than his name. The names associated with the IDs are stored in the PLAYERS table.

B. Write a simple SELECT query

With our first query, we'll retrieve data from selected fields in the STATS table.

  1. Click on the Create tab near the top of the application window.
  2. Next, click on the Query Design button (found on the left side of the Create Ribbon in the group of commands labeled as Queries).
    When you do this in Access 2010 and higher, the ribbon switches to the Design ribbon.
  3. In the Show Table dialog, double-click on the STATS table to add it to the query and click Close.
  4. Double-click on PLAYER_ID in the list of fields in the STATS table to add that field to the design grid below.
  5. Repeat this step to add the YEAR and RBI fields.

    Screen capture showing StatsTable Query1 Design.
  6. At any time, you can view the SQL that's created by your GUI settings by accessing the View drop-down list on the far-left side of Design Ribbon (it is also available when you have the Home tab selected, as shown below).
    As you go through the next steps, look at the SQL that corresponds to the queries you are building.

    Screen capture of View List with SQL view circled in red.

C. Restrict the returned records to a desired subset

  1. From the same View drop-down list, select Design View to return to the query design GUI.
  2. In the design grid, set the Criteria value for the RBI field to >99.

    Screen capture showing the entry of a selection criterion.
  3. Test the query by clicking on the red exclamation point on the top left next to the View dropdown (it should return 103 records).

D. Sort the returned records

  1. Return to Design View by selecting it from the View dropdown.
  2. In the design grid, click in the Sort cell under the RBI column and select Descending from the drop-down list. This will sort the records from highest RBI total to lowest.
  3. Test the query.

E. Add additional criteria to the selection

  1. Return to Design View and set the Criteria value for the YEAR field to >1989. This will limit the results to seasons of over 100 RBI since 1990.
  2. Test the query (it should return 53 records).
  3. Return to Design View and modify the Criteria value for the YEAR field to >1989 And <2000, which will further limit the results to just the 1990s.
  4. Test the query (it should return 34 records).
  5. Return to Design View and change the Criteria value for the YEAR field back to >1989, beneath that cell (in the or: cell) add <1960.

    As you should be able to guess, I'm asking you to write a query that identifies 100-RBI seasons since 1989 OR prior to 1960. However, the query as written at this point doesn't quite yield that result; look at the WHERE line in the SQL view. Instead, it would return 100-RBI seasons since 1989 and all seasons prior to 1960 (not just the 100-RBI ones). To produce the desired result, you need to repeat the >99 criterion in the RBI field's or: cell. Check the SQL view to see the change.

    Screen capture of Query1Stats that indicates changes as entered in StatsTable Query1 Design.
  6. Test the query (it should return 74 records).

You've probably recognized by now that the output from these queries is not particularly human friendly. In the next part of the lesson, we'll see how to use a join between the two tables to add the names of the players to the query output.


Credit for all screenshots: © Penn State is licensed under CC BY-NC-SA 4.0

Joining Data From Multiple Tables

Joining Data From Multiple Tables jls27

One of the essential functions of an RDBMS is the ability to pull together data from multiple tables. In this section, we'll make a join between our PLAYERS and STATS tables to produce output in which the stats are matched up with the players who compiled them.

A. Retrieve data from multiple tables using a join

  1. Return to Design View and click on the Show Table button (near the middle of the Ribbon).
  2. In the Show Table dialog, double-click on the PLAYERS table to add it to the query, then Close the Show Table dialog. Create the join by click-holding on the PLAYER_ID field name in one of the tables and dragging to the PLAYER_ID field name in the other table. If you make the wrong join connection, you can right-click on the line and delete the join.

    Note:

    You'll probably notice the key symbol next to PLAYER_ID in the PLAYERS table and next to PLAYER_ID, YEAR and TEAM in the STATS table. This symbol indicates that the field is part of the table's primary key, a topic we'll discuss more later. For now, it's good to understand that a field need not be a key field to participate in a join, and that while the fields often share the same name (as is the case here), that is not required. What's important is that the fields contain matching values of compatible data types (e.g., a join between a numeric zip code field and a string zip code field will not work because of the mismatch between data types).

    Screen capture of Joining - Stats, Players, Tables

  3. Modify the query so that it again just selects 100-RBI seasons without any restrictions on the YEAR.
  4. Double-click on the FIRST_NAME and LAST_NAME fields to add them to the query.
  5. Test the query. Looking at the output, you may be thinking that the PLAYER_ID value is not really of much use anymore and that the names would be more intuitive on the left side rather than on the right.
  6. To remove the PLAYER_ID field from the output, return to Design View and click on the thin gray bar above the PLAYER_ID field in the design grid to highlight that field. Press the Delete key on your keyboard to remove that field from the design grid (and the query results).
  7. To move the name fields, click on the FIRST_NAME field's selection bar in the design grid and drag to the right so that you also select the LAST_NAME field. With both fields selected, click the selection bar and drag them to the left so that they now appear before the YEAR and RBI fields.

    Screen capture of Joining - Reorder, Remove, Fields
  8. Test the query.

B. Use the Expression Builder to concatenate values from multiple fields

  1. Return to Design View and right-click on the Field cell of the FIRST_NAME column; on the context menu that appears, click on Build to open the Expression Builder.

    You should see the FIRST_NAME field listed in the expression box already. Note the Expression Elements box in the lower left of the dialog, which provides access to the Functions built into Access (e.g., the LCase function for turning a string into all lower-case characters), to the tables and fields in the baseball_stats.accdb database (e.g., STATS), to a list of Constants (e.g., True and False) and to a list of Operators (e.g., +, -, And, Or). All of these elements can be used to produce expressions in either the Field or Criteria rows of the design grid. Note that it's also OK to type the expression directly if you know the correct syntax. Here, we're going to create an expression that will concatenate the FIRST_NAME value with the LAST_NAME value in one column.
  2. First, delete the FIRST_NAME text from the expression box.
  3. Expand the list of objects in the baseball_stats.accdb database by clicking on the plus sign to its left. Then expand the list of Tables.
  4. Click on PLAYERS to display its fields in the middle box. Double-click on the FIRST_NAME field to add it to the expression. (Note that the field is added to the expression in a special syntax that includes the name of its parent table. This is done in case there are other tables involved in the query that also have a FIRST_NAME field. If there were, this syntax would ensure that the correct FIRST_NAME field would be utilized.)
  5. The concatenation character in Access is the ampersand (&). Click in the expression text box after [PLAYERS]![FIRST_NAME] to get a cursor and type:

    & ' ' &

    So, you type &-space-single quote-space-single quote-space-&. The purpose of this step is to add a space between the FIRST_NAME and LAST_NAME values in the query results.
  6. Double-click on the LAST_NAME field in the field list to add it to the expression.
  7. Finally, place the cursor at the beginning of the expression and type the following, followed by a space:

    PLAYER: 

    This specifies the column's name in the output. Your expression should now look like this:

    Screen capture of Expression Builder Elements and Categories - Joining Player Name Concat

  8. When you've finished writing the expression, click OK to dismiss the Expression Builder.
  9. The LAST_NAME column is now redundant, so remove it from the design grid as described above.
  10. Run the query and confirm that the name values concatenated properly.

C. Sorting by a field not included in the output

Suppose you wanted to sort this list by the last name, then first name, and then by the number of RBI. This would make it easier to look at each player's best seasons in terms of RBI production.

  1. In Design View, double-click on LAST_NAME, then FIRST_NAME to add them back to the design grid.
  2. Choose the Ascending option in the Sort cell for both the LAST_NAME and FIRST_NAME fields.
  3. Again, these fields are redundant with the addition of the PLAYER expression. Uncheck the Show checkbox for both fields so that they are not included in the query results. Note that the ordering of the Sort fields in the design grid is important. The results are sorted first based on the left-most Sort field, then by the next Sort field, then the next, etc.
  4. If necessary, click on the thin gray selection bar above the RBI field in the design grid to select that column. Click and drag to the right so that it is moved to the right of the _NAME columns.
  5. Choose Descending as the Sort option for the RBI field.

    Screen capture of Design View, Sorting By RBI LastName, First Name
  6. Run the query and confirm that the records sorted properly. (Hank Aaron should be listed first.)

D. Performing an outer join

If you were to count the number of players being displayed by our query, you'd find that there are 10 (PLAYER_IDs 1 through 10).

  1. Open the PLAYERS table and note that there is a player (Alex Rodriguez, PLAYER_ID 11) who is not appearing in the query output.
  2. Now open the STATS table and note that there are records with a PLAYER_ID of 12, but no matching player with that ID in the PLAYERS table. These records aren't included in the query output either.

    The unmatched records in these tables aren't appearing in the query output because the join between the tables is an inner join. Inner joins display data only from records that have a match in the join field values.

    To display data from the unmatched records, we need to use an outer join.

    Close the tables.
  3. Leave Query1 open, as we'll come back to it in a moment. Create a new query (Create > Query Design).
  4. Add the PLAYERS and STATS tables to the design grid of the new query, and then close the Show Table dialog.
  5. Create a join between the PLAYER_ID fields.
  6. Add the PLAYER_ID, FIRST_NAME and LAST_NAME fields from the PLAYERS table to the design grid.
  7. Add the PLAYER_ID, YEAR and HR fields from the STATS table.
  8. Choose the Sort Ascending option for the PLAYER_ID field (the one from the STATS table) and the YEAR field.
  9. Right-click on the line connecting the PLAYER_ID fields and select Join Properties.

    Note that join option 1 is currently selected (only include rows where the joined fields from both tables are equal).
  10. Click on the Include ALL records from 'PLAYERS' option, then click OK. Whether this is option 2 or 3 will depend on whether you clicked and dragged from PLAYERS to STATS or from STATS to PLAYERS. In any case, note that the line connecting the key fields is now an arrow pointing from the PLAYERS table to the STATS table.
  11. Test the query. The query outputs PLAYER_ID 11 (at the top) in addition to 1 through 10. Because there are no matching records for PLAYER_ID 11 in the STATS table, the STATS columns in the output contain NULL values.
  12. Return to Design View and bring up the Join Properties dialog again. This time choose option 3, so that all records from the STATS table will be shown, then click OK. The arrow connecting the key fields is now reversed, pointing from the right (STATS) table to the left (PLAYERS) table.
  13. Test the query. If you scroll to the bottom, you'll see that the query outputs the stats for PLAYER_ID 12. This player has no match in the PLAYERS table, so the PLAYERS columns in the output contain NULL values.
  14. Close Query2 without saving.

E. Performing a cross join

Another type of join that we'll use later when we write spatial queries in PostGIS is the cross join, which produces the Cartesian, or cross, product of the two tables. If one table has 10 records and the other 5, the cross join will output 50 records.

  1. Create another new query adding the PLAYERS and TEAMS tables to the design grid. You won't see a join line automatically connecting fields in the two tables (they don't share a field in common). This lack of a line connecting the tables is actually what we want, as it will cause a cross join to be performed.
  2. Add the PLAYER_ID field from the PLAYERS table and the ABBREV field from the TEAMS table to the design grid.
  3. Test the query. Note that the query outputs 275 records (the product of 11 players and 25 teams). Each record represents a different player/team combination.
  4. Close Query2 without saving.

The usefulness of this kind of join may not be obvious given this particular example. For a better example, imagine a table of products that includes each product's retail price and a table of states and their sales taxes. Using a cross join between these two tables, you could calculate the sales price of each product in each state.

F. Creating a query with multiple joins

Finally, let's put together a query that joins together data from 3 tables.

  1. Create another new query adding the PLAYERS, STATS and TEAMS tables to the design grid.
  2. Create the PLAYER_ID-based join between the PLAYERS and STATS tables.
  3. Click on the TEAM field in the STATS table and drag over to the ABBREV field in the TEAMS table, (recall that field names do not have to match when creating a join).
  4. Add the FIRST_NAME and LAST_NAME fields from the PLAYERS table, the YEAR and HR fields from the STATS table and the CITY and NICKNAME fields from the TEAMS table.
  5. Test the query.
  6. Close Query2 without saving.

In this way, it is possible to create queries containing several joins. However, in a real database, you are likely to notice a drop in performance as the number of joins increases.

In this section, we saw how to pull together data from two or more related tables, how to concatenate values from two text fields, and how to sort query records based on fields not included in the output. This enabled us to produce a much more user- friendly output than we had at the end of the previous section. In the next section, we'll further explore the power of SELECT queries by compiling career statistics and calculating new statistics on the fly from values stored in the STATS table.


Credit for all screenshots: © Penn State is licensed under CC BY-NC-SA 4.0

Aggregating Data

Aggregating Data jls27

One of the beauties of SQL is in its ability to construct groupings from the values in a table and produce summary statistics for those groupings. In our baseball example, it is quite easy to calculate each player's career statistics, seasons played, etc. Let's see how this type of query is constructed by calculating each player's career RBI total.

A. Adding a GROUP BY clause to calculate the sum of values in a field

  1. Return to Design View (your Query 1 design should still be there) and remove the LAST_NAME and FIRST_NAME fields from the design grid.
  2. With the Design tab selected, click on the Totals button on the right side of the Ribbon in the Show/Hide group of buttons. This will add a Total row to the design grid with each field assigned the value Group By.
  3. Click in the Total cell under the RBI column and select Sum from the drop-down list (note the other options available on this list, especially Avg, Max, and Min).
  4. Grouping by both the player's name and the year won't yield the desired results, since the data in the STATS table are essentially already grouped in that way. We want the sum of the RBI values for each player across all years, so remove the YEAR column from the design grid and remove the >99 from the Criteria row.
  5. Modify the query so that the results are sorted from highest career RBI total to lowest.

    Screen capture of Aggregate SumRBI, sorted from highest career RBI total to lowest.
  6. Run the query and note that the RBI column name defaults to 'SumOfRBI'. You can override this by returning to Design View and entering a custom name (such as CareerRBI) and a colon before the RBI field name (following the PLAYER field example).

B. Use the Expression Builder to perform a calculation based on multiple field values

Home run hitters are often compared based on the frequency of their home run hitting (i.e., 'Player X averages a home run every Y times at bat'). Let's calculate this value for each player and season.

  1. Click on the Totals button on the query Design toolbar to toggle off the Total row in the design grid.
  2. Remove the RBI field from the design grid.
  3. Add the YEAR field to the design grid.
  4. In the third column (which is empty), right-click in the Field cell and click on Build to pull up the Expression Builder as done earlier.
  5. Navigate inside the database and click on STATS to see a list of its fields.
  6. Use the GUI to build the following expression:

    ABPERHR: [STATS]![AB]/[STATS]![HR]
  7. If a player had 0 HR in a season, this calculation will generate an error (can't divide by 0, test the query and look for #Div/0!), so add the HR field to the design grid, add >0 to its Criteria cell and uncheck its Show checkbox.
  8. Sort the query results (not shown below) by the ABPERHR value from lowest to highest (since low values are better than high values).

    Screen capture to show Aggregate_ABPERHR, and position of GUI to build the expression.
  9. Run the query and note that the calculated values include several digits after the decimal point. These digits aren't really necessary, so return to Design View, right-click on the ABPERHR field and select Properties.
  10. A Property Sheet pane will appear on the right side of the window. Select Fixed from the Format drop-down list and set the Decimal Places value to 1.
  11. Re-run the query and confirm that you get the desired results.

Credit for all screenshots: © Penn State is licensed under CC BY-NC-SA 4.0

SELECT Query Syntax

SELECT Query Syntax jls27

Hopefully, you've found the MS-Access query builder to be helpful in learning the basics of retrieving data from a database. As you move forward, it will become important for you to learn some of the syntax details (i.e., how to write the statements you saw in SQL View without the aid of a graphical query builder). That will be the focus of this part of the lesson.

A. The SELECT clause

All SELECT queries begin with a SELECT clause whose purpose is to specify which columns should be retrieved. The desired columns are separated by commas. Our first query in this lesson had a SELECT clause that looked like the following:

SELECT STATS.PLAYER_ID, STATS.HR, STATS.RBI

Note that each column name is preceded by the name of its parent table, followed by a dot. This is critical when building a query involving joins and one of the desired columns is found in multiple tables. Including the parent table eliminates any confusion the SQL interpreter would have in deciding which column should be retrieved. However, you should keep in mind that the table name can be omitted when the desired column is unambiguous. For example, because our simple query above is only retrieving data from one table, the SELECT clause could be reduced to:

SELECT PLAYER_ID, HR, RBI

Selecting all columns

The easiest way to retrieve data from a table is to substitute the asterisk character (*) for the list of columns:

SELECT *

This will retrieve data from all of the columns. While it's tempting to use this syntax because it's so much easier to type, you should be careful to do so only when you truly want all of the data held in the table or when the table is rather small. Retrieving all the data from a table can cause significant degradation in the performance of an SQL-based application, particularly when the data is being transmitted over the Internet. Grab only what you need!

Case sensitivity in SQL

Generally speaking, SQL is a case-insensitive language. You'll find that the following SELECT clause will retrieve the same data as the earlier ones:

SELECT player_id, hr, rbi

Why did I say "generally speaking?" There are some RDBMS that are case sensitive depending on the operating system they are installed on. Also, some RDBMS have administrative settings that make it possible to turn case sensitivity on. For these reasons, I suggest treating table and column names as if they are case sensitive.

On the subject of case, one of the conventions followed by many SQL developers is to capitalize all of the reserved words (i.e., the words that have special meaning in SQL) in their queries. Thus, you'll often see the words SELECT, FROM, WHERE, etc., capitalized. You're certainly not obligated to follow this convention, but doing so can make your queries more readable, particularly when table and column names are in lower or mixed case.

Finally, recall that we renamed our output columns (assigned aliases) in some of the earlier query builder exercises by putting the desired alias and a colon in front of the input field name or expression. Doing this in the query builder results in the addition of the AS keyword to the query's SELECT clause, as in the following example:

SELECT PLAYER_ID, HR AS HomeRuns, RBI

B. The FROM clause

The other required clause in a SELECT statement is the FROM clause. This specifies the table(s) containing the columns specified in the SELECT clause. In the first queries we wrote, we pulled data from a single table, like so:

SELECT STATS.PLAYER_ID, STATS.HR, STATS.RBI FROM STATS

Or, omitting the parent table from the column specification:

SELECT PLAYER_ID, HR, RBI FROM STATS

The FROM clause becomes a bit more complicated when combining data from multiple tables. The Access query builder creates an explicit inner join by default. Here is the code for one of our earlier queries:

SELECT PLAYERS.FIRST_NAME, PLAYERS.LAST_NAME, STATS.YEAR, STATS.RBI
FROM PLAYERS INNER JOIN STATS ON PLAYERS.PLAYER_ID = STATS.PLAYER_ID

An inner join like this one causes a row-by-row comparison to be conducted between the two tables looking for rows that meet the criterion laid out in the ON predicate (in this case, that the PLAYER_ID value in the PLAYERS table is the same as the PLAYER_ID value in the STATS table). When a match is found, a new result set is produced containing all of the columns listed in the SELECT clause. If one of the tables has a row with a join field value that is not found in the other table, that row would not be included in the result.

We also created a couple of outer joins to display data from records that would otherwise not appear in an inner join. In the first example, we chose to include all data from the left table and only the matching records from the right table. That yielded the following code:

SELECT [PLAYERS]![FIRST_NAME] & ' ' & [PLAYERS]![LAST_NAME] AS PLAYER, STATS.YEAR, STATS.RBI

    FROM STATS LEFT JOIN PLAYERS ON STATS.PLAYER_ID = PLAYERS.PLAYER_ID

Note that the only difference between this FROM clause and the last is that the INNER keyword was changed to the word LEFT. As you might be able to guess, our second outer join query created code that looks like this:

SELECT [PLAYERS]![FIRST_NAME] & ' ' & [PLAYERS]![LAST_NAME] AS PLAYER, STATS.YEAR, STATS.RBI

    FROM STATS RIGHT JOIN PLAYERS ON STATS.PLAYER_ID = PLAYERS.PLAYER_ID

Here the word LEFT is replaced with the word RIGHT. In practice, RIGHT JOINs are more rare than LEFT JOINs since it's possible to produce the same results by swapping the positions of the two tables and using a LEFT JOIN. For example:

SELECT [PLAYERS]![FIRST_NAME] & ' ' & [PLAYERS]![LAST_NAME] AS PLAYER, STATS.YEAR, STATS.RBI

    FROM PLAYERS LEFT JOIN STATS ON STATS.PLAYER_ID = PLAYERS.PLAYER_ID

Recall that the cross join query we created was characterized by the fact that it had no line connecting key fields in the two tables. As you might guess, this translates to a FROM clause that has no ON predicate:

SELECT STATS.PLAYER_ID, TEAMS.ABBREV

    FROM STATS, TEAMS;

In addition to lacking an ON predicate, the clause also has no form of the word JOIN. The two tables are simply separated by commas.

Finally, we created a query involving two inner joins. Here is how that query was translated to SQL:

SELECT PLAYERS.FIRST_NAME, PLAYERS.LAST_NAME, STATS.YEAR, STATS.HR, TEAMS.CITY, TEAMS.NICKNAME

    FROM (PLAYERS INNER JOIN STATS ON PLAYERS.PLAYER_ID = STATS.PLAYER_ID) INNER JOIN TEAMS ON STATS.TEAM = TEAMS.ABBREV;

The first join is enclosed in parentheses to control the order in which the joins are carried out. Like in algebra, operations in parentheses are carried out first. The result of the first join then participates in the second join.

C. The WHERE clause

As we saw earlier, it is possible to limit a query's result to records meeting certain criteria. These criteria are spelled out in the query's WHERE clause. This clause includes an expression that evaluates to either True or False for each row. The "True" rows are included in the output; the "False" rows are not. Returning to our earlier examples, we used a WHERE clause to identify seasons of 100+ RBIs:

SELECT STATS.PLAYER_ID, STATS.YEAR, STATS.RBI

    FROM STATS

    WHERE STATS.RBI > 99

As with the specification of columns in the SELECT clause, columns in the WHERE clause need not be prefaced by their parent table if there is no confusion as to where the columns are coming from.

This statement exemplifies the basic column-operator-value pattern found in most WHERE clauses. The most commonly used operators are:

Commonly used operators and their descriptors
OperatorDescription
=equals
<>not equals
>greater than
>=greater than or equal to
<less than
<=less than or equal to
BETWEENwithin a value range
LIKEmatching a search pattern
INmatching an item in a list

The usage of most of these operators is straightforward. But let's talk for a moment about the LIKE and IN operators. LIKE is used in combination with the wildcard character (the * character in Access and most other RDBMS, but the % character in some others) to find rows that contain a text string pattern. For example, the clause WHERE FIRST_NAME LIKE 'B*' would return Babe Ruth and Barry Bonds. WHERE FIRST_NAME LIKE '*ar*' would return Barry Bonds and Harmon Killebrew. WHERE FIRST_NAME LIKE '*ank' would return Hank Aaron and Frank Robinson.

Note:

Text strings like those in the WHERE clauses above must be enclosed in quotes. Single quotes are recommended, though double quotes will also work in most databases. This is in contrast to numeric values, which should not be quoted. The key point to consider is the data type of the column in question. For example, zip code values may look numeric, but if they are stored in a text field (as they should be because some begin with zeroes), they must be enclosed in quotes.

The IN operator is used to identify values that match an item in a list. For example, you might find seasons compiled by members of the Giants franchise (which started in New York and moved to San Francisco) using a WHERE clause like this:

    WHERE TEAM IN ('NYG', 'SFG')

Finally, remember that WHERE clauses can be compound. That is, you can evaluate multiple criteria using the AND and OR operators. The AND operator requires that both the expression on the left and the expression on the right evaluate to True, whereas the OR operator requires that at least one of the expressions evaluates to True. Here are a couple of simple illustrations:

    WHERE TEAM = 'SFG' AND YEAR > 2000
    WHERE TEAM = 'NYG' OR TEAM = 'SFG'

Sometimes WHERE clauses require parentheses to ensure that the filtering logic is carried out properly. One of the queries you were asked to write earlier sought to identify 100+ RBI seasons compiled before 1960 or after 1989. The SQL generated by the query builder looks like this:

SELECT STATS.PLAYER_ID, STATS.YEAR, STATS.RBI

    FROM STATS

    WHERE (((STATS.YEAR)>1989) AND ((STATS.RBI)>99)) OR (((STATS.YEAR)<1960) AND ((STATS.RBI)>99));

It turns out that this statement is more complicated than necessary for a couple of reasons. The first has to do with the way I instructed you to build the query:

Screen capture of the result of RBI>99 criterion inefficiently being added to the query twice.

This resulted in the RBI>99 criterion inefficiently being added to the query twice. A more efficient approach would be to merge the two-year criteria into a single row in the design grid, as follows:

Screen capture of a more efficient version of the statement which puts a set of parentheses around the two year criteriaof the RBI condition, it produces a more readable query.

This yields a more efficient version of the statement:

SELECT STATS.PLAYER_ID, STATS.YEAR, STATS.RBI

    FROM STATS

    WHERE (((STATS.YEAR)>1989 Or (STATS.YEAR)<1960) AND ((STATS.RBI)>99));

It is difficult to see, but this version puts a set of parentheses around the two year criteria so that the condition of being before 1960 or after 1989 is evaluated separately from the RBI condition.

One reason the logic is difficult to follow is because the query builder adds parentheses around each column specification. Eliminating those parentheses and the table prefixes produces a more readable query:

SELECT PLAYER_ID, YEAR, RBI

    FROM STATS

    WHERE (YEAR>1989 Or YEAR<1960) AND RBI>99;

D. The GROUP BY clause

The GROUP BY clause was generated by the query builder when we calculated each player's career RBI total:

SELECT LAST_NAME, FIRST_NAME, Sum(RBI) AS SumOfRBI

    FROM STATS

    GROUP BY LAST_NAME, FIRST_NAME;

The idea behind this type of query is that you want the RDBMS to find all of the unique values (or value combinations) in the field (or fields) listed in the GROUP BY clause and include each in the result set. When creating this type of query, each field in the GROUP BY clause must also be listed in the SELECT clause. In addition to the GROUP BY fields, it is also common to use one of the SQL aggregation functions in the SELECT clause to produce an output column. The function we used to calculate the career RBI total was Sum(). Other useful aggregation functions include Max(), Min(), Avg(), Last(), First() and Count(). Because the Count() function is only concerned with counting the number of rows associated with each grouping, it doesn't really matter which field you plug into the parentheses. Frequently, SQL developers will use the asterisk rather than some arbitrary field. This modification to the query will return the number of years in each player's career, in addition to the RBI total:

SELECT LAST_NAME, FIRST_NAME, Sum(RBI) AS SumOfRBI, Count(*) AS Seasons

    FROM STATS

    GROUP BY LAST_NAME, FIRST_NAME;

Note: In actuality, this query would over count the seasons played for players who switched teams mid-season (e.g., Mark McGwire was traded in the middle of 1997 and has a separate record in the STATS table for each team). We'll account for this problem using a sub-query later in the lesson.

E. The ORDER BY clause

The purpose of the ORDER BY clause is to specify how the output of the query should be sorted. Any number of fields can be included in this clause, so long as they are separated by commas. Each field can be followed by the keywords ASC or DESC to indicate ascending or descending order. If this keyword is omitted, the sorting is done in an ascending manner. The query below sorts the seasons in the STATS table from the best RBI total to the worst:

SELECT PLAYER_ID, YEAR, RBI

    FROM STATS

    ORDER BY RBI DESC;

That concludes our review of the various clauses found in SELECT queries. You're likely to memorize much of this syntax if you're called upon to write much SQL from scratch. However, don't be afraid to use the Access query builder or other GUIs as an aid in producing SQL statements. I frequently take that approach when writing database-related PHP scripts. Even though the data is actually stored in a MySQL database, I make links to the MySQL tables in Access and use the query builder to develop the queries I need. That workflow (which we'll explore in more depth later in the course) is slightly complicated by the fact that the flavor of SQL produced by Access differs slightly from industry standards. We'll discuss a couple of the important differences in the next section.


Credit for all screenshots: © Penn State is licensed under CC BY-NC-SA 4.0

Differences Between MS-Access and Standard SQL

Differences Between MS-Access and Standard SQL jls27

If you plan to use the SQL generated by the Access query builder in other applications as discussed on the previous page, you'll need to be careful of some of the differences between Access SQL and other RDBMS.

A. Table/Column Specification in the Expression Builder

We used the Expression Builder to combine the players' first and last names, and later to calculate their home run per at-bat ratio. Unfortunately, the Expression Builder specifies table/column names in a format that is unique to Access. Here is a simplified version of the HR/AB query that doesn't bother including names from the PLAYERS table:

SELECT STATS.PLAYER_ID, STATS.YEAR, [STATS]![AB]/[STATS]![HR] AS ABPERHR

	FROM STATS;

The brackets and exclamation point used in this expression is non-standard SQL and would not work outside of Access. A safer syntax would be to use the same table.column notation used earlier in the SELECT clause:

SELECT STATS.PLAYER_ID, STATS.YEAR, STATS.AB/STATS.HR AS ABPERHR

	FROM STATS;

Or, to leave out the parent table name altogether:

SELECT PLAYER_ID, YEAR, AB/HR AS ABPERHR

	FROM STATS;

Note that this modified syntax will work in Access.

B. String Concatenation

In our queries that combined the players' first and last names, the following syntax was used:

SELECT [PLAYERS]![FIRST_NAME] & ' ' & [PLAYERS]![LAST_NAME]....

As mentioned above, columns should be specified using the table.column notation rather than with brackets and exclamation points. In addition to the different column specification, the concatenation itself would also be done differently in other RDBMS. While Access uses the & character for concatenation, the concatenation operator in standard SQL is ||. However, there is considerable variation in the adherence to this standard. For a comparison of a number of the major RDBMS, see http://troels.arvin.dk/db/rdbms/#functions-concat.

Note:

One of the points of comparison in the linked page is whether or not there is automatic casting of values. All this is getting at is what happens if the developer tries to concatenate a string with some other data type (say, a number). Some SQL implementations will automatically cast, or convert, the number to a string. Others are less flexible and require the developer to perform such a cast explicitly.


Credit for all screenshots: © Penn State is licensed under CC BY-NC-SA 4.0

Subqueries

Subqueries jls27

There comes a time when every SQL developer has a problem that is too difficult to solve using only the methods we've discussed so far. In our baseball stats database, difficulty arises when you consider the fact that players who switch teams mid-season have a separate row for each team in the STATS table. For example, Mark McGwire started the 1997 season with OAK, before being traded to and spending the rest of that season with STL. The STATS table contains two McGwire-1997 rows; one for his stats with OAK and one for his stats with STL.

If you wanted to identify each player's best season (in terms of batting average, home runs or runs batted in), you wouldn't be able to do a straight GROUP BY on the player's name or ID because that would not account for the mid-season team switchers. What's needed in this situation is a multistep approach that first computes each player's yearly aggregated stats, then identifies the maximum value in that result set.

A novice's approach to this problem would be to output the results from the first query to a new table, then build a second query on top of the table created by the first. The trouble with this approach is that it requires re-running the first query each time the STATS table changes.

A more ideal solution to the problem can be found through the use of a subquery. Let's have a look at how to build a subquery in Access.

A. Building a query on top of another saved query

  1. Create a new query and add the STATS table to the design canvas.
  2. Click on the Totals button to add the Total row to the design grid.
  3. Bring the following fields down into the design grid: PLAYER_ID, YEAR, AB, HITS, HR, RBI, and TEAM.
  4. Set the value in the Total cell to Group By for the PLAYER_ID and YEAR fields.
  5. Set the value in the Total cell to Sum for the AB, HITS, HR, and RBI fields.
  6. Set the value in the Total cell to Count for the TEAM field. These settings specify that we want to group by the player ID and year, sum the player's stats across each player/year grouping, and include a count of the number of teams the player played for each year.
  7. Run the query and confirm that it produces 227 records. (Compare that against the number of records in the STATS table. Do you understand why there is a difference?)
  8. Save the query as STATS_AGGR.
  9. Now, create a new query. In the Show Table dialog, double-click on PLAYERS to add it to the design canvas.
  10. Next, in the Show Table dialog, switch from the Tables tab to the Queries tab and double-click on STATS_AGGR to add it to the design canvas. Let's identify each player's best season in terms of home runs.
  11. Again, click on the Totals button.
  12. Bring down to the design grid the FIRST_NAME and LAST_NAME fields from the PLAYERS table and SumOfHR from the STATS_AGGR query.
  13. Confirm that the value in the Total cell is set to Group By for the FIRST_NAME and LAST_NAME fields.
  14. Set the value in the Total cell for the SumOfHR field to Max.
  15. Save this query as QueryBasedOnAQuery.

B. Building a query within another query.

  1. Switch to the SQL View of the QueryBasedOnAQuery query (sorry about that). Note that an inner join is conducted between PLAYERS and STATS_AGGR. When Access executes this query, it needs to first evaluate the SQL stored in the STATS_AGGR query before it can join that result set to PLAYERS. In your mind, you can imagine replacing the STATS_AGGR part of the PLAYERS INNER JOIN STATS_AGGR expression with the SQL stored in STATS_AGGR. In fact, our next step will be to literally replace STATS_AGGR with its SQL code.
  2. Re-open the STATS_AGGR query and switch to SQL View.
  3. Highlight and copy the SQL (leave behind the trailing semi-colon).
  4. Return to the SQL View of the QueryBasedOnAQuery query.
  5. Remove the STATS_AGGR from the PLAYERS INNER JOIN STATS_AGGR and paste in the code on your clipboard.
  6. Insert parentheses around the code you just pasted and add AS STATS_AGG after the closing parenthesis. Note that we're giving the subquery an alias of STATS_AGG to avoid confusion with the saved query with the name STATS_AGGR.
  7. Replace the query's references to STATS_AGGR with STATS_AGG, (there is one reference to STATS_AGGR in the SELECT clause and one in the ON clause). Your query should look like this:
    SELECT PLAYERS.FIRST_NAME, PLAYERS.LAST_NAME, Max(STATS_AGG.SumOfHR) AS MaxOfSumOfHR
        FROM PLAYERS INNER JOIN (SELECT STATS.PLAYER_ID, STATS.YEAR, Sum(STATS.AB) AS SumOfAB, Sum(STATS.HITS) AS SumOfHITS, Sum(STATS.HR) AS SumOfHR, Sum(STATS.RBI) AS SumOfRBI, Count(STATS.TEAM) AS CountOfTEAM
        FROM STATS
        GROUP BY STATS.PLAYER_ID, STATS.YEAR) AS STATS_AGG
        ON PLAYERS.PLAYER_ID = STATS_AGG.PLAYER_ID
        GROUP BY PLAYERS.FIRST_NAME, PLAYERS.LAST_NAME;
  8. Test the query. It should report the highest single-season home run total for each player.

This approach is a bit more complicated than the first, since it cannot be carried out with the graphical Query Builder. However, the advantage is that all of the code involved can be found in one place, and you don't have a second intermediate query cluttering the database's object list.

This completes the material on retrieving data from an RDBMS using SELECT queries. In the next section, you'll be posed a number of questions that will test your ability to write SELECT queries on your own.

Project 1: Writing SQL SELECT Queries

Project 1: Writing SQL SELECT Queries jls27

To demonstrate that you've learned the material from Lesson 1, please build queries in your Access database for each of the problems posed below. Some tips:

  • Remember that players who played for multiple teams in the same season (i.e., were traded mid-season) will have a separate row in the STATS table for each team-year combination.
  • Even if the requested query does not explicitly ask for player names in the output, please include names to make the output more informative.
  • Some of the problems require the use of sub-queries. You will receive equal credit whether you choose to embed your sub-query within the larger query or save your sub-query separately and build your final query upon that saved sub-query.
  • You should treat the words 'season' and 'year' in the problems below as synonymous.

Here are the problems to solve:

  1. Display all records/fields from the STATS table for players on the San Francisco Giants (SFG).
  2. Output the batting average (HITS divided by AB) for each player and season. Format this value so that exactly 3 digits appear after the decimal point.

    For example, Babe Ruth had a batting average of .200 in 1914 (2 hits in 10 at-bats).
     
  3. Display all records/fields from the STATS table along with the names of the players in the format “Ruth, Babe”.
  4. Calculate the number of seasons each player played for each of his teams. (Players who were on two teams in the same season can be counted as having played for both teams.)

    For example, Mark McGwire played 12 seasons for the Oakland Athletics and 5 seasons for the St Louis Cardinals (1997 being counted for both teams).
     
  5. List the players based on the number of home runs hit in their rookie (first) seasons, from high to low.

    For example, Frank Robinson hit 38 home runs in his rookie season of 1956.
     
  6. Display the names of players who played in New York (any team beginning with 'NY').  Your output should include only the player names, with no duplicates.
     
  7. Sort the players by the number of seasons played (high to low). (Do not double-count in cases in which a player switched teams mid-season.)

    For example, Sammy Sosa played 18 seasons, with 1989, which he split between CHW and TEX counted as 1 season, not 2.
     
  8. Sort the players by their career batting average (career hits divided by career at-bats, high to low).

    For example, Harmon Killebrew had a career batting average of .256 (2086 hits in 8147 at-bats).
     
  9. Sort the players by their number of seasons with 100 or more RBI (high to low).

    For example, Hank Aaron had 100 or more RBI in a season 11 times.
     
  10. Sort the players based on the number of years between their first and last seasons of 40 or more HR (from most to least).

    For example, Willie Mays had his first 40-HR season in 1954 and his last in 1965, a difference of 11 years.
     

Name your queries so that it's easy for me to identify them (e.g., Proj1_1 through Proj1_10).

Deliverables

This project is one week in length. Please refer to the Canvas Calendar for the due date.

  1. Upload your Access database (.accdb file) containing your solutions to the 10 query exercises above to the Project 1 Dropbox. (100 of 100 points)
  2. Complete the Lesson 1 quiz.