Posts

DSC_0045 Zach Doty Cover Photo Group By SQL Statement Function

GROUP BY SQL Statement

Introducing the GROUP BY SQL Statement in PostgreSQL

‘Ello SQL geeks! Welcome back to our SQL learning journey. We left off with a beginner SQL skills challenge and the aggregate SQL functions: MIN, MAX, AVG and SUM. Today we’re looking at the GROUP BY statement. We’ll learn about this function in PostgreSQL and walk through usage of this handy SQL statement.

 

About the GROUP BY SQL Statement/Clause

From my simple understanding, GROUP BY functions like a hybrid of the following:

  • SELECT DISTINCT keyword (If used without an aggregate function like SUM), and,
  • An Excel pivot table, rolling up aggregate figures (Count, Sum, Average, etc.) into unique rows

If you’re familiar with Excel Pivot tables, then you’ll recognize here the power of this function.

 

Let’s take a look at some examples to clarify.

 

First Look at Using the GROUP BY Function

To better illustrate the power of GROUP BY, we’ll first show its usage without aggregate functions. Consider the following:

If we query the address table of our sample database with a generic SELECT * FROM address; we get back an atrocious 605 rows of data. Aggregate and useless!

2017-04-27-001-GROUB-BY-Generic-SELECT-Query-Pitfall

In contrast, if we call the GROUP BY function, we’ll get back a cleaner output, with fewer rows – only unique values returned. While this is an incremental improvement to analyzing data, there’s much left to be desired.

2017-04-27-002-GROUP-BY-Without-Aggregate-Function

 

What’s missing here? How about that pivot table-esque functionality? This is where the power of using GROUP BY with aggregate functions gets awesome.

Using the GROUP BY SQL Statement with Aggregate Functions

As with most analysis, a single data point or data series rarely holds significant insight value on its own. Let’s drive home that point by leveraging the GROUP BY statement with the SUM aggregate function. Below, we compare the ratings of films in our sample database in aggregate by replacement cost. Perhaps this could serve in-store strategies for loss prevention.

SELECT rating, SUM(replacement_cost)
FROM film
GROUP BY rating;

2017-04-27-003-GROUP-BY-SUM-Aggregate-Function

If we extend this functionality to more real world examples, we could use the following for GROUP BY:

  • Grouping page-level / URL data to roll up clickstream analytics data
  • Large scale analysis of CRM data for customer segmentation analysis
  • Analyzing returns for financial data

The list could (and I’m sure it does) go on.

Let’s take this one step further and reduce potential future workload, by building sort functionality into our query. Below, we add a line to get most expensive ratings to least.

SELECT rating, SUM(replacement_cost)
FROM film
GROUP BY rating
ORDER BY SUM(replacement_cost) DESC;

2017-04-27-004-GROUP-BY-SUM-ORDER-BY-SUM

By the looks of this, no need to guard Land Before Time 8. 🙂

Extending our lesson: you can use the COUNT, AVG, and other aggregate functions to analyze as desired.

 

Wrap Up

Alright, this was a relatively gentle introduction into more advanced SQL functions. GROUP BY is a rather critical function, so in our next article, we’ll be doing yet another skills challenge. Joy!

Feel free to catch up on our other articles that help you learn PostgreSQL. Also, check out some how-to’s on developing Amazon Alexa skills, and a new series on getting started with Machine Learning. As always, please share with your colleagues and share thoughts in the comments below. Cheers.

DSC_0007 Zach Doty Cover Photo for Beginner PostgreSQL Skills Challenge

Beginner SQL Skills Challenge!

Howdy, SQL geeks. Hope this post finds you swell!

Over the past few months, we’ve gained a ton of ground in learning SQL, or at least I have. 🙂

Let’s take a moment to:

  1. Test our knowledge of SQL skills learned thus far
  2. Start seeing SQL queries less as statements of code, and more as real world business challenges

In this article, we’ll have a recap plus three sections:

  • Recap of the training database we’ve been working with
  • General statements of each business problem
  • Hints and thoughts about how to approach each problem
  • Solutions to each problem

 

Recap: Our Training Database

Our training database is a best/old faithful. We’ve been using the surprisingly popular DVD rental database in a .tar file for our practice database.

Contained within this databases are various tables with fictitious information, including: customer information, film production information, business/pricing information and so forth.

In our challenges, we’ll execute various SQL queries to extract pieces of insight for business tasks. It’s assumed in this article that you’ve installed PostgreSQL via pgAdmin and have followed this article series so far, using the DVD rental training database.

Without further ado, let’s begin.

 

The SQL Challenges

Alright, here we go:

  1. How rentals were returned after July 17, 2005?
  2. How many actors have a last name that starts with the letter A?
  3. How many unique districts are represented in the customer database?
  4. Can you return the actual list of districts from challenge #3?
  5.  How many films have a rating of R and a replacement cost between $5 and $15?
  6. How many films have the word Truman somewhere in the title?

 

How to Approach the Challenges

Right, then. In this section, we’ll add some color commentary (read: hints) to our challenges. This should help you understand the mechanics of the solution, while ensuring you can’t see the answers all in one screen. 😉

 

Challenge 1: Rentals Returned After July 17, 2005

As with all challenges, a problem well stated is half (or more) solved. So let’s look at the high levels of the ask, and work our way down. We need information on rentals, so this means we’ll probably be querying the rentals table.

We would want to first examine the rentals table in a concise manner by doing:

SELECT * FROM rental LIMIT 5;

Once you’ve surveyed the table, we really only need one column returned (pun not intended) and we only want the sum figure of returns, where (HINT!!) the return date was after (think a logical operator here) July 17, 2005.

 

Challenge 2: Actors That Have a Last Name Beginning with “A”

Like our first challenge, let’s work from the “top down”. We need actor information, so querying the actor table would be a great place to start. Similar to last time, we need a count of values matching a condition. The difference versus challenge #1  is we need to find match a pattern like or such as an actor’s last name that begins with the letter A.

 

Challenge 3: Number of Unique Districts in the Customer Database

The title and description could cause some confusion here. You may need to do some basic SELECT * FROM table_name LIMIT X; queries to make sure you’ve got the right table. Once you do, we’re looking for an amount of distinct values in the database. Order of operations matters.

 

Challenge 4: Returning the Actual Lists of Districts from Challenge #3

Not much to hint at here – getting challenge #3 is the key here. You’ll really only be simplifying the correct query in challenge #3 to get the correct answer here.

 

Challenge 5: Cheap, (Mildly) Naughty Films

This one might be the longest query yet in this challenge. So we’re looking up film information, thus should know which table to query. We’re returning a value where a certain rating must be returned, and (HINT) we need to layer in one more lens of qualification. That lens dictates we specify a range between (cough, hint!) two values.

 

Challenge 6: Where in the World are Films Containing “Truman”?

This challenge is more of a recency test than retention of older concepts. You’ll need to employ pattern matching again for this business challenge/query to find titles that have some match like Truman in the title.

 

 

Challenge Solutions

Is that your final answer? Below are the queries, with screenshots of what I did.

 

Solution 1: Rentals Returned After July 17, 2005

SELECT COUNT(return_date) FROM rental
WHERE return_date > ‘2005-07-17’;

2017-04-15-001-Challenge-1-Sol-Count-From-Where

 

Solution 2: Actors That Have a Last Name Beginning with “A”

SELECT COUNT(*) FROM actor
WHERE last_name LIKE ‘A%’;

2017-04-15-002-Challenge-2-Sol-Count-From-Where-Like

 

Solution 3: Number of Unique Districts in the Customer Database

SELECT COUNT(DISTINCT(district)) FROM address;

2017-04-15-003-Challenge-3-Sol-Count-Distinct

 

Solution 4: Returning the Actual Lists of Districts from Challenge #3

SELECT DISTINCT(district) FROM address;

2017-04-15-004-Challenge-4-Sol-Select-Distinct

 

Solution 5: Cheap, (Mildly) Naughty Films

SELECT COUNT(*) FROM film
WHERE rating = ‘R’
AND replacement_cost BETWEEN 5 AND 15;

2017-04-15-005-Challenge-4-Sol-Count-From-Where-Between-And

 

Solution 6: Where in the World are Films Containing “Truman”?

SELECT COUNT(*) FROM film
WHERE title LIKE ‘%Truman%’;

2017-04-15-006-Challenge-4-Sol-Count-Where-Like-Wildcard

 

Wrap Up

Well done for completing these challenges! You shall indeed pass. 🙂

Soon, we’ll be covering aggregate SQL functions, such as MIN, MAX, AVG and SUM.

If you’re just joining us, here’s a running list of our articles so to date (4/16/2017):