JOINS in PostgreSQL

Welcome back, SQL-ites! Today, we’re going to take a significant leap forward in our journey of learning PostgreSQL.

A quick recap on where we’ve been recently:

In broad terms, most of our PostgreSQL training so far has focused on how best to select and segment our data. However, there is much more to be accomplished in big data manipulation. Enter JOINS.

 

About JOINS & PostgreSQL

JOINS are at the heart of what makes most SQL databases relational databases. Specifically, when we're able to tie multiple data sources together, combined with sorting, segmentation and some intuition, we can get some really powerful stuff.

So what does a JOIN do? It allows you to relate data in one table to the data in other tables.

Think about some of our practice runs in PostgreSQL so far. We've looked at DVD rentals, customers, transactions, IDs, and so on. What if we wanted to link customer names with IDs in more detail, or transaction IDs with film names, or all of the above? Weaving data into an interactive story could allow us to see really cool things, such as which customers prefer certain genres, or perhaps unveil renting habits for a timely promotional offer to smooth out our top line revenue.

 

BASIC JOINS Syntax in PostgreSQL

As with many of our learning adventures, there is complexity and there are layers to our education. 🙂

First, it’s important to note that there are actually several types of JOINS, so this isn’t just an amorphous, blob-lump function. Here are some of the important kinds:

  1. INNER JOIN
  2. OUTER JOIN
  3. self-join

To start things out, we’re going to look at INNER JOIN first.

INNER JOINS Concepts for PostgreSQL

We’re going to jump headlong into this, at the expense of immediate clarity. Joy! Let’s say we have two tables which we’ll relate data between.

 

2017-09-10-001-JOINSA-Concept-Example-1

Above, we’ve pulled out some key points of interest between our tables, unique columns, and more importantly, a shared key (column) of customer_id between the two tables. We’ll explain more about keys in a minute. Ish.

We’ll use the shared column of (example) customer_id to further mash up the data. Let’s take a look at syntax now.

 

INNER JOINS Basic Syntax

If things aren't 100% connected for you conceptually, I completely understand. Please do bear with me, as seeing some examples may help "close the loop". Here's what an INNER JOIN query might look like in PostgreSQL:

 

SELECT A.pka, A.c1, B.pkb, B.c2

FROM A

INNER JOIN B ON A.pka = B.fka;

 

Let’s break this down a bit more.

The first line is extremely similar to what we've been doing all along: calling in data. The subtle difference here is that we prefix each column with its table name. So, it's A.pka, A.c1, B.pkb, B.c2: table name, a dot, then the column name.

Next, we specify the primary table we're first selecting from, per usual. Then we specify the "donor" table that will be combined with the primary table via INNER JOIN. Further, we make this work by using the ON keyword, which specifies the condition that matches the shared key columns (here, A.pka = B.fka). 🙂

 

More Explanation on the Inner Workings of JOINS (Pun Not Intended)

This has been a bit of a tricky concept for me to learn so far, so we’re going to walk through the nuts and bolts of JOINS as operated by PostgreSQL.

When a JOIN is executed, PostgreSQL conceptually takes every row in our "Table A" and goes through Table B, looking for rows that match our specified condition of A.pka = B.fka. So combining our example and layman's terms, the JOIN looks for common occurrences of the customer_id so that additional, previously disparate data from separate tables may be awesomely fused together. 🙂

If a match is found between the two tables, it combines columns of both A and B rows into one row, and adds that newly combined row to the data set you get back.
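To make that matching process concrete, here's a minimal sketch in plain Python of the row-by-row idea described above. The rows are made up for illustration, and the column names simply mirror the generic A/B syntax. This is only a mental model: PostgreSQL's query planner is free to pick faster strategies (such as hash or merge joins) that return the same rows.

```python
# Hypothetical rows; column names mirror the generic A/B syntax above.
table_a = [
    {"pka": 1, "c1": "Ann"},
    {"pka": 2, "c1": "Bob"},
    {"pka": 3, "c1": "Cy"},
]
table_b = [
    {"pkb": 10, "fka": 1, "c2": 4.99},
    {"pkb": 11, "fka": 1, "c2": 2.99},
    {"pkb": 12, "fka": 3, "c2": 0.99},
]


def inner_join(left, right):
    """Naive nested-loop inner join on left.pka = right.fka."""
    result = []
    for a in left:            # take every row in Table A...
        for b in right:       # ...and walk through Table B
            if a["pka"] == b["fka"]:
                # Match found: combine both rows into one output row.
                result.append({**a, **b})
    return result


joined = inner_join(table_a, table_b)
for row in joined:
    print(row)
```

Notice that Bob (pka = 2) has no matching row in table_b, so he doesn't appear in the output at all, while Ann appears twice, once per matching row.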

 

Pardon the Interruption: PostgreSQL “life” hacks for JOINS

A couple of quick items that may prove useful to you as we get more familiar with joins.

  1. Sometimes your tables may share column names (Analytics data, anyone? Can I get an amen?), so you may need to use the syntax table_name.column_name for clarity.
  2. If you or someone you love has created excessively long table names, you can introduce a shorter alias (via the AS statement) for the table, and use accordingly.
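Here's a small sketch of both tips together. To keep it runnable anywhere, it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, with two tiny made-up tables that borrow the customer and payment names from our sample database. The aliases c and p are arbitrary choices for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption:
# the join and alias syntax shown here is the same in both).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE payment (payment_id INTEGER PRIMARY KEY,
                          customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO payment VALUES (10, 1, 4.5), (11, 2, 2.0);
""")

# customer_id exists in both tables, so it must be qualified.
# The aliases c and p save us from typing the full table names.
alias_rows = conn.execute("""
    SELECT c.customer_id, c.first_name, p.amount
    FROM customer AS c
    INNER JOIN payment AS p ON p.customer_id = c.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(alias_rows)  # [(1, 'Ann', 4.5), (2, 'Bob', 2.0)]
```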

 

Hey, I Thought We Were Going to Learn About INNER JOIN?

We are! And what’s better, we’re going to use a Venn diagram to explain it. 🙂 #DataScience.

Okay, the INNER JOIN PostgreSQL clause returns rows in Table A that have corresponding rows in Table B. Super scientific visualization below.

2017-09-11-002-JOINSA-Concept-Example-2

The overlapping area above is the set of results that will be returned to us.

 

INNER JOINS PostgreSQL Code Examples

Putting all the pieces together, here’s our first INNER JOINS code sample:

SELECT
    customer.customer_id,  -- qualified, because this column exists in both tables
    first_name,
    last_name,
    email,
    amount,
    payment_date
FROM
    customer
INNER JOIN payment ON payment.customer_id = customer.customer_id;

 

And below, what we get back from pgAdmin.

2017-09-11-003-JOINSA-INNER-JOINS-First-Example

 

Okay, a few notes. We went out of our way to say that overlapping column names need table names for distinction. Here's what we didn't say: you don't need to specify table names for columns that are unique to one table. Above, you can see that we've nicely meshed together customer ID with both customer information and payment information. Awesome!

I know this is a bit of a rough stop- but I’m really trying to work on getting more sleep these days. So it’s 12:30am, and I’ve got to head to bed. We’ll be back with updates on this post and more progress forward. Cheers!

Update – 9/16/2017 – Add’l INNER JOINS Examples

Welcome back again, SQL-ites! We're back with a weekend update from our first look at INNER JOINS in PostgreSQL. Above, we left off with a basic example of INNER JOINS. To quickly recap, we might think of JOINS as basic data composition. Often in marketing and business, useful data begins in silos, so it has to be combined before we can do much with it.

Our next example of INNER JOINS is a slight modification to the first full example. We have the same basic INNER JOINS syntax, but have added ORDER BY customer.customer_id ASC at the end, to clean up the presentation of our data a bit.

SELECT customer.customer_id,
    first_name,
    last_name,
    email,
    amount,
    payment_date
FROM customer
INNER JOIN payment ON payment.customer_id = customer.customer_id
ORDER BY customer.customer_id ASC;

2017-09-16-004-JOINSA-INNER-JOINS-Second-Example

 

Using WHERE with INNER JOINS

Let’s continue to add modifications to our basic INNER JOINS statement. Below, we harness the power of both combination and selection to specify joined results, but for a specific customer ID. A use case for this might involve isolating or routing specific data after it’s been joined, for reporting purposes.

SELECT customer.customer_id,
    first_name,
    last_name,
    email,
    amount,
    payment_date
FROM customer
INNER JOIN payment ON payment.customer_id = customer.customer_id
WHERE customer.customer_id = 2;

2017-09-16-005-JOINSA-INNER-JOINS-Third-Example
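If you'd like to poke at these join variations without loading the full sample database, here's a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the three customers and three payments are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption:
# INNER JOIN, WHERE and ORDER BY behave the same way for this query).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,
                           first_name TEXT, last_name TEXT);
    CREATE TABLE payment (payment_id INTEGER PRIMARY KEY,
                          customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray'), (3, 'Cy', 'Fox');
    INSERT INTO payment VALUES (10, 1, 4.5), (11, 2, 2.0), (12, 2, 7.5);
""")

base_query = """
    SELECT customer.customer_id, first_name, last_name, amount
    FROM customer
    INNER JOIN payment ON payment.customer_id = customer.customer_id
"""

# Every matched pair, sorted by customer (then payment, for a stable order).
all_rows = conn.execute(
    base_query + " ORDER BY customer.customer_id ASC, payment_id ASC"
).fetchall()

# Same join, narrowed to a single customer with WHERE.
one_customer = conn.execute(
    base_query + " WHERE customer.customer_id = ? ORDER BY payment_id", (2,)
).fetchall()

print(all_rows)
print(one_customer)  # [(2, 'Bob', 'Ray', 2.0), (2, 'Bob', 'Ray', 7.5)]
```

Customer 3 never made a payment, so the INNER JOIN leaves that customer out of both result sets.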

DSC_0050 Zach Doty AS PostgreSQL Statement Cover Photo

JOINS Foundations: The AS PostgreSQL Statement

Intro to JOINS: the AS PostgreSQL Statement

What’s up SQL people? We’re back, and better than ever, in our foray to learn PostgreSQL. Since we’ve completed some intermediate skills challenges & learned GROUP BY, it’s time to examine JOINS.

Before we examine JOINS, there’s a key foundation piece we must cover first: the AS statement in PostgreSQL. Let’s jump in.

 

About the AS Statement

The AS statement in PostgreSQL enables us to give a column or a table a temporary alias, almost like a variable, for use within a query. The alias lasts only for that query; nothing in the underlying table is actually renamed.

It’s a simple statement, so let’s see it in action.

 

1. Basic AS Statement Example

Our most basic example is a basic query where perhaps a column wasn’t named to our liking. Consider the following.

SELECT rental_rate AS film_cost

FROM film

LIMIT 10;

2017-08-30-001-AS-Statement-Syntax-Example-1

Great for an introductory example, but not inherently useful. Read on as we apply the AS statement more deeply.

2. Semi-Intermediate AS Statement Example

Let's provide an example that's a bit more engaged. For example, if we use aggregate functions, the column output doesn't have a clean name attached to it. But no longer! The AS statement allows us to rename the summation output of a GROUP BY query to something we'll recognize.

 

SELECT customer_id, SUM(amount) AS customer_ltv

FROM payment

GROUP BY customer_id

ORDER BY customer_ltv DESC

LIMIT 7;

2017-08-30-002-AS-Statement-Syntax-aggregate-example-2
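To see the alias actually land in the output, here's a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The four payments are made up for illustration, and the LIMIT is lowered to 2 to suit the tiny table.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption:
# AS on an aggregate works the same way for this query).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payment (payment_id INTEGER PRIMARY KEY,
                          customer_id INTEGER, amount REAL);
    INSERT INTO payment VALUES (1, 1, 4.5), (2, 1, 2.5), (3, 2, 9.0), (4, 3, 1.0);
""")

cursor = conn.execute("""
    SELECT customer_id, SUM(amount) AS customer_ltv
    FROM payment
    GROUP BY customer_id
    ORDER BY customer_ltv DESC
    LIMIT 2
""")

# The alias shows up as the column name in the result set.
column_names = [description[0] for description in cursor.description]
ltv_rows = cursor.fetchall()

print(column_names)  # ['customer_id', 'customer_ltv']
print(ltv_rows)      # [(2, 9.0), (1, 7.0)]
```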

 

This is something more useful for intermediate PostgreSQL work!

 

Wrap Up

We aren’t spending much further time here since this is a simple application and the JOINS statement is the function we’re truly after. If you’re just joining this series, check out our home page on how to learn PostgreSQL.

DSC_0012 Zach Doty PostgreSQL cover photo

PostgreSQL Intermediate Skills Test

Hey there, SQL-ites. Time for another (yes, another!) PostgreSQL skills challenge. It’s our last stop before moving into more intermediate and advanced PostgreSQL skills, such as JOINS.

If you’re just joining this series on SQL, we previously covered the HAVING statement & GROUP BY. Here’s the home page of our journey in learning PostgreSQL. Today, we do a more rounded knowledge check of older PostgreSQL statements, such as SELECT WHERE, COUNT & ORDER BY.

We have 3 problems, and provide the winning code for each, based on our sample database. In the past, I’ve given more explanation to the problems, but because I’m trying to get some traction myself in moving forward, we’ll only have problem & solution laid out here.

Alright, let’s go!

1. Return the customer IDs of customers who have spent at least $110 with the staff member who has an ID of 2.

The answer should be customers 187 and 148.

SELECT customer_id, SUM(amount)
FROM payment
WHERE staff_id=2
GROUP BY customer_id
HAVING SUM(amount) >= 110;

 

2. How many films begin with the letter J?

The answer should be 20.

SELECT COUNT(*) FROM film

WHERE title LIKE 'J%';

 

3. What customer has the highest customer ID number whose name starts with an ‘E’ and has an address ID lower than 500?

The answer is Eddie Tomlin.

SELECT customer_id, first_name, last_name, address_id
FROM customer
WHERE address_id < 500 AND first_name LIKE 'E%'
ORDER BY customer_id DESC;

 

We’ll see you on the other side soon, for some JOINS awesomeness. 🙂

DSC_0300 Zach Doty Cover Photo for HAVING PostgreSQL Clause

The HAVING Clause in PostgreSQL

Howdy SQL-ites! Welcome back to our next step in learning SQL. After a long hiatus, we recently stumbled our way through some intermediate SQL statement challenges, namely on GROUP BY.

Today, we’re back in the flow, and learning about the HAVING clause in PostgreSQL.

 

About the HAVING Clause in PostgreSQL

The HAVING clause is most often seen as a helper with the GROUP BY statement. GROUP BY was pretty awesome, right? How might we improve upon it?

Let’s consider two concepts:

  1. Our occasional parallels of PostgreSQL to Excel, and,
  2. Our previous intermediate challenge example.

Relating PostgreSQL’s GROUP BY to Excel

If we hearken back to our first encounter with GROUP BY, we compared GROUP BY to pivot tables. Specifically, if we have a big data table that records data of recurring themes, e.g., your customer base, it can be more helpful to work with the data aggregated and assembled than raw.

However, GROUP BY is only an intermediate step in data analysis. If we think about our final challenge in the last post, we had to limit the results to 5. Even if we group data, it’s neither segmented nor useful toward analysis.

Thus, we need an additional method for winnowing down our GROUP BY results.

 

Meet the HAVING Clause

So, about the HAVING clause. It's most often used in tandem with GROUP BY to filter out grouped rows that don't meet a certain condition. Think of it as similar to the WHERE clause, just an accompaniment to GROUP BY.

 

Let’s take a look at basic syntax:

SELECT column, aggregate(column_2)

FROM table

GROUP BY column

HAVING condition;

 

Off the bat, we should see this is extremely familiar if we've covered GROUP BY. The differentiation is the additional HAVING condition at the end. This condition could be something like HAVING SUM(column_2) < 50.

The Difference Between WHERE and HAVING

Some of you sharp folks may want to know, “what’s the difference between WHERE and HAVING?” That would be an excellent question. Here’s the difference:

The WHERE clause sets conditions on individual rows, before the GROUP BY clause has been applied. The HAVING clause specifies conditions on grouped rows, created by the GROUP BY clause.
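Here's a minimal sketch of that difference, using Python's built-in sqlite3 module as a stand-in for PostgreSQL and a made-up four-row payment table. Both queries use the same threshold of 5, but one applies it to individual rows and the other to the grouped totals.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption:
# WHERE and HAVING are evaluated in the same order in both).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payment (payment_id INTEGER PRIMARY KEY,
                          customer_id INTEGER, amount REAL);
    INSERT INTO payment VALUES (1, 1, 3.0), (2, 1, 4.0), (3, 2, 6.0), (4, 3, 2.0);
""")

# WHERE: drop individual payments of 5 or less, then group what is left.
where_rows = conn.execute("""
    SELECT customer_id, SUM(amount)
    FROM payment
    WHERE amount > 5
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()

# HAVING: group every payment first, then drop groups totaling 5 or less.
having_rows = conn.execute("""
    SELECT customer_id, SUM(amount)
    FROM payment
    GROUP BY customer_id
    HAVING SUM(amount) > 5
    ORDER BY customer_id
""").fetchall()

print(where_rows)   # [(2, 6.0)]
print(having_rows)  # [(1, 7.0), (2, 6.0)]
```

Customer 1 never made a single payment over 5, so WHERE removes both of their rows. Their total of 7 does clear the bar, though, so HAVING keeps them.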

Let’s run some examples.

 

Using the HAVING Clause in PostgreSQL

Here’s our first example, very similar to our previous skills challenge:

SELECT customer_id, SUM(amount)
FROM payment
GROUP BY customer_id
HAVING SUM(amount) > 150;

Above, we have added another segmentation layer with the HAVING clause.

You can see we’re pulling information about how much our customers have paid us. Further, we specify that we only want to see customers with a lifetime purchase amount of greater than $150.

 

Let’s look at another example of the HAVING clause. Say for example, we want to know which store has served the most customers. Below, we’ll execute the following code:

SELECT store_id, COUNT(customer_id)
FROM customer
GROUP BY store_id
HAVING COUNT(customer_id) >275;

Above, we've selected the store ID column and a count of customer IDs from the customer table. Further, we group by the store ID, because we want store-level data, but we only want to see the stores which have served more than 275 customers. Below, we can see only one store has done so. 🙂

 

Combining Usage of the WHERE & HAVING Clauses

We mentioned earlier the WHERE and HAVING clauses are different, somewhat in function, but mostly in order of execution. Here's what we didn't say: you can actually use them in tandem. There is a great theoretical use case for this. Unfortunately, our sample database is a bit small, but here goes.

Let's think about the film table (used in previous examples). Perhaps we want to analyze the films by rental rate, but only films with certain ratings. For example, perhaps we're no longer interested in carrying NC-17 films, but still want to get an aggregated view of how much, on average, each film rents for, by rating. Additionally, we want to see which ratings, if any, have an average rental rate of less than $3.

Here’s the code we would use:

SELECT rating, ROUND(AVG(rental_rate),2)
FROM film
WHERE rating IN ('R','G','PG','PG-13')
GROUP BY rating
HAVING AVG(rental_rate)<3;
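To watch WHERE and HAVING work in tandem without the full database, here's a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The eight films and their rental rates are made up for illustration, and the avg_rate alias and ORDER BY are additions for readability.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption:
# IN, ROUND, AVG and HAVING behave the same way for this query).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, rating TEXT, rental_rate REAL);
    INSERT INTO film VALUES
        (1, 'G', 1.0), (2, 'G', 3.0),
        (3, 'PG', 4.0), (4, 'PG', 5.0),
        (5, 'R', 2.0), (6, 'R', 3.0),
        (7, 'NC-17', 1.0), (8, 'NC-17', 1.0);
""")

rating_rows = conn.execute("""
    SELECT rating, ROUND(AVG(rental_rate), 2) AS avg_rate
    FROM film
    WHERE rating IN ('R', 'G', 'PG', 'PG-13')  -- row filter: runs before grouping
    GROUP BY rating
    HAVING AVG(rental_rate) < 3                -- group filter: runs after grouping
    ORDER BY rating
""").fetchall()

print(rating_rows)  # [('G', 2.0), ('R', 2.5)]
```

NC-17 has the lowest average here, but WHERE removes those rows before any averaging happens. PG survives the WHERE filter, then gets dropped by HAVING because its average is 4.5.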

 

 

Wrap Up

Alright, that concludes our section for today. It feels good to be back. 🙂

Hopefully you found this section on the HAVING PostgreSQL statement useful. If you need to backtrack or further explore, here are some useful/recent links:

DSC_0006 Zach Doty Intermediate GROUP BY SQL Skills Challenge Cover Photo

Intermediate SQL Skills Challenge: GROUP BY

Hey there SQL-ites! Wow, it’s been awhile since I’ve last posted. Work has been crazy busy again, and just life in general. Sure feels good to be back, learning again! I’m daringly dropping straight back into my learnings from where we left off…in April!? Crazy.

Anyway, let’s get back to brass tacks. Before life and work got really crazy for me, we covered:

Another quick recap note, we’ve been using the famous DVD rental training database for our work. On to the good stuff.

 

GROUP BY SQL Skills Challenge #1

Let’s say it’s time for quarterly reviews, who doesn’t love evaluations? ¯_(ツ)_/¯ Implement your knowledge of GROUP BY against the following problem:

We need to figure out how much revenue our employees have realized from DVD rentals, and how many transactions each handled.

 

GROUP BY SQL Skills Answer #1

Let's talk through the problem and dissect it before presenting code. "A problem well-stated is a problem half-solved" – paraphrase of some smart person.

  1. We’re talking about revenue, so we’ll need to be dealing with the payment table.
  2. We’re evaluating employees (staff), SUM of revenue, and COUNT of transactions.
  3. If we’re aggregating this data, we’re GROUPing BY employee.
  4. We’re also ORDERing the employees BY who handled the most transactions.

That said, here’s the code:

SELECT staff_id, SUM(amount), COUNT(amount)
FROM payment
GROUP BY staff_id
ORDER BY COUNT(amount) DESC;

…with our results!
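To try the same shape of query on something small, here's a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The five payments and two staff IDs are made up, so the numbers are not the real database results.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption:
# SUM, COUNT and GROUP BY behave the same way for this query).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payment (payment_id INTEGER PRIMARY KEY,
                          staff_id INTEGER, amount REAL);
    INSERT INTO payment VALUES
        (1, 1, 4.0), (2, 1, 2.0),
        (3, 2, 1.0), (4, 2, 1.5), (5, 2, 2.5);
""")

# One output row per staff member: total revenue and number of transactions.
staff_rows = conn.execute("""
    SELECT staff_id, SUM(amount), COUNT(amount)
    FROM payment
    GROUP BY staff_id
    ORDER BY COUNT(amount) DESC
""").fetchall()

print(staff_rows)  # [(2, 5.0, 3), (1, 6.0, 2)]
```

In this toy data, staff member 2 handled more transactions but brought in less revenue, which is exactly why the challenge asks for both numbers.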

 

GROUP BY SQL Skills Challenge #2

Well done on your challenge! Here’s the second:

It’s time to do inventory, flashbacks of retail and restaurant wonder for all of us. 🙂

In the name of forecasting and planning, we need to figure out the average replacement cost of our movies, by rating.

 

GROUP BY SQL Skills Answer #2

Ok, let’s walk through our problem.

  1. We need to use the film table here, since we're gathering information on the movies.
  2. We’re GROUPing our films BY rating
  3. We’re using an aggregate function to determine the average replacement cost of each rating.

Drum roll, here’s a winning code snippet:

SELECT rating, AVG(replacement_cost)
FROM film
GROUP BY rating
ORDER BY AVG(replacement_cost) DESC;

With the output:

 

Are there more challenges we should be covering? Yes. However, I’m trying to do better about getting more sleep these days. Unlike past SQL articles, it’s still (barely) before midnight. So we’ll take a quick breather, possibly update this article, but definitely keep moving forward. Cheers!

 

Update- 8/20/2017 —

GROUP BY SQL Skills Challenge #3!

Alright SQL-ites. After getting some rest, I’ve regrouped a few days later to cover the last challenge:

From our database, we want to get the customer ID’s of the top 5 customers, by money spent, so we can send them a thank you for their business!

 

GROUP BY SQL Skills Answer #3

Let’s diagnose the problem.

  1. If we’re gathering revenue information, we’ll need to use the payment table.
  2. If we’re getting the top spending customers, we’ll need to GROUP all transactions BY customer ID
  3. To see the top 5 paying customers, we’ll want to ORDER the results BY the SUM of payment amount.

Considering the above, here’s our code:

SELECT customer_id, SUM(amount)
FROM payment
GROUP BY customer_id
ORDER BY SUM(amount) DESC
LIMIT 5;

DSC_0069 Zach Doty Cover Photo for What is data science

What is Data Science?

Welcome back, campers! It’s been a minute (read:months) since I’ve last posted, and we’re back! (For now.)

Today’s topic, data science: supposedly the latest exploding field, critical to every enterprise.

Why is it important and relevant? The rise of big data has created a relatively untapped treasure trove of insight. However, it’s undeveloped! Further, the tapping of this insight requires a blended skill set which is currently in short supply in the market: the data scientist.

Who and what is a data scientist?

A data scientist is someone who finds new discoveries in data. They investigate hypotheses and look for meaning and knowledge within the data. They visualize the data by creating reports and looking for patterns. What distinguishes a data scientist from a traditional business analyst is the use of algorithms. Algorithms are one of the fundamental tools for data scientists. This requires mathematics knowledge, computer science savvy and domain knowledge.

What does it mean to be a data scientist?

A data scientist may handle open-ended questions such as, "Which customers are more likely to churn?" The data scientist would gather all the data, and run algorithms to find dependable patterns to improve the situation. Seems straightforward, yes? However, there is a range of misconceptions about data science and data scientists. For example, a data scientist may not necessarily be a developer-only or business intelligence analyst-only.

A data scientist will be able to combine both technical know-how and business domain knowledge into mathematics and statistics for maximum effect. That being said, true data scientists are extremely difficult to find and train. However, it may be possible to become a data scientist without expensive and time-consuming degrees, via focused tools and application training.

An oversimplified Venn diagram showing the makeup and value of a data scientist


More Notes on Data Science

When considering data science from a managerial perspective, it's important to understand the current broad allocation of the average data scientist's time. An estimated 60-70% of a data scientist's time is spent assembling and cleaning data, tasks which could be delegated to technical specialists, data integration specialists and so forth. (For example, text mining, SQL queries and so forth.)

If you've followed my site and blog lately, you'll have noticed I've lapsed a bit on posting. I'm trying to get back into sharing my education again, so stay tuned. Things have just been busy lately. 🙂