DSC_0069 Zach Doty Cover Photo for What is data science

What is Data Science?

Welcome back, campers! It’s been a minute (read:months) since I’ve last posted, and we’re back! (For now.)

Today’s topic, data science: supposedly the latest exploding field, critical to every enterprise.

Why is it important and relevant? The rise of big data has created a relatively untapped treasure trove of insight. However, it’s undeveloped! Further, the tapping of this insight requires a blended skill set which is currently in short supply in the market: the data scientist.

Who and what is a data scientist?

A data scientist is someone who finds new discoveries in data. They investigate hypotheses and look for meaning and knowledge within the data. They visualize the data by creating reports and looking for patterns. What distinguishes a data scientist from a traditional business analyst is the use of algorithms. Algorithms are one of the fundamental tools for data scientists. This requires mathematics knowledge, computer science savvy and domain knowledge.

What does it mean to be a data scientist?

A data scientist may handle open-ended questions such as, “Which customers are more likely to churn?” The data scientist would gather all the data, and run algorithms to find dependable patterns to improve the situation. Seems straightforward, yes? However, there are a range of misconceptions about data science and data scientist For example,  a data scientist may not necessarily be a developer-only or business intelligence analyst-only.

A data scientist will be able to combine both technical know-how and business domain knowledge into mathematics and statistics for maximum effect. That being said, true data scientists are extremely difficult to find and train. However, it may be possible to become a data scientist without expensive and time-consuming degrees, via focused tools and application training.

An oversimplified Venn diagram showing the makeup and value of a data scientist

An oversimplified Venn diagram showing the makeup and value of a data scientist

More Notes on Data Science

When considering data science from a managerial perspective, it’s important to understand the current broad allocation of the average data scientist’s time. An estimated 60-705 of a data scientist’s time is spent assembling and cleaning data, tasks which could be delegated to technical specialists, data integration specialists and so forth. (For example, text mining, SQL queries and so forth.)

If you’ve followed my site and blog lately, you’ll noticed I’ve lapsed a bit on posting. I’m trying to get back into sharing my education again, so stay tuned. Things have just been busy lately. 🙂

DSC_0045 Zach Doty Cover Photo Group By SQL Statement Function

GROUP BY SQL Statement

Introducing the GROUP BY SQL Statement in PostgreSQL

‘Ello SQL geeks! Welcome back to our SQL learning journey. We left off with a beginner SQL skills challenge and the aggregate SQL functions: MIN, MAX, AVG and SUM. Today we’re looking at the GROUP BY statement. We’ll learn about this function in PostgreSQL and walk through usage of this handy SQL statement.


About the GROUP BY SQL Statement/Clause

From my simple understanding, GROUP BY functions like a hybrid of the following:

  • SELECT DISTINCT keyword (If used without an aggregate function like SUM), and,
  • An Excel pivot table, rolling up aggregate figures (Count, Sum, Average, etc.) into unique rows

If you’re familiar with Excel Pivot tables, then you’ll recognize here the power of this function.


Let’s take a look at some examples to clarify.


First Look at Using the GROUP BY Function

To better illustrate the power of GROUP BY, we’ll first show its usage without aggregate functions. Consider the following:

If we query the address table of our sample database with a generic SELECT * FROM address; we get back an atrocious 605 rows of data. Aggregate and useless!


In contrast, if we call the GROUP BY function, we’ll get back a cleaner output, with fewer rows – only unique values returned. While this is an incremental improvement to analyzing data, there’s much left to be desired.



What’s missing here? How about that pivot table-esque functionality? This is where the power of using GROUP BY with aggregate functions gets awesome.

Using the GROUP BY SQL Statement with Aggregate Functions

As with most analysis, a single data point or data series rarely holds significant insight value on its own. Let’s drive home that point by leveraging the GROUP BY statement with the SUM aggregate function. Below, we compare the ratings of films in our sample database in aggregate by replacement cost. Perhaps this could serve in-store strategies for loss prevention.

SELECT rating, SUM(replacement_cost)
FROM film
GROUP BY rating;


If we extend this functionality to more real world examples, we could use the following for GROUP BY:

  • Grouping page-level / URL data to roll up clickstream analytics data
  • Large scale analysis of CRM data for customer segmentation analysis
  • Analyzing returns for financial data

The list could (and I’m sure it does) go on.

Let’s take this one step further and reduce potential future workload, by building sort functionality into our query. Below, we add a line to get most expensive ratings to least.

SELECT rating, SUM(replacement_cost)
FROM film
GROUP BY rating
ORDER BY SUM(replacement_cost) DESC;


By the looks of this, no need to guard Land Before Time 8. 🙂

Extending our lesson: you can use the COUNT, AVG, and other aggregate functions to analyze as desired.


Wrap Up

Alright, this was a relatively gentle introduction into more advanced SQL functions. GROUP BY is a rather critical function, so in our next article, we’ll be doing yet another skills challenge. Joy!

Feel free to catch up on our other articles that help you learn PostgreSQL. Also, check out some how-to’s on developing Amazon Alexa skills, and a new series on getting started with Machine Learning. As always, please share with your colleagues and share thoughts in the comments below. Cheers.

SQL Aggregate Functions: Min, Max, Avg and Sum

Welcome back, SQL nerds! We’re back in action on the journey of learning SQL, after a beginner PostgreSQL skills challenge.

We’re reaching the end of basic functions and queries with this article. Based on what we’ve learned so far, we can do basic counting, filtering, sorting and pattern matching against PostgreSQL databases.

If you’re just tuning in, here’s the page on how to learn SQL, and the previous SQL article on the LIKE Statement.

Okay, enough jabber. Let’s jump in. The aggregate functions of MIN, MAX, AVG and SUM are our turning point into more complex SQL queries that involve concepts such as GROUP BY, among others.

At the same time, the functions on their own aren’t super complicated. Because we’re ass-u-ming you’re familiar with the general concepts of minimum, maximum, etc., we’re going to forgo conceptual and syntax explanations for demos.


AVG Aggregate Function

As we level up in SQL, we’re going to do less and less explaining / handholding / screenshots. That said, we’re going to explore our DVD rental data set for a table with a nice numerical component that would make good use of the functions.

We did a SELECT * FROM film LIMIT 15; to get a peek at the columns. For the purposes of this exercise, the replacement_cost column will do nicely.

To get the AVG:

SELECT AVG(replacement_cost) FROM film;


Using ROUND for Decimal Place Control

You’ll notice in the example above that we got 3 decimal places on what’s supposed to be a dollar amount. How do we fix that? Glad you asked. Meet the ROUND function.

We pass in the target value (average of replacement_cost) and mandate the number of decimal places we’d like returned. Below:

SELECT ROUND( AVG(replacement_cost), 2) FROM film;



MIN Aggregate Function

To find the minimum value in a given column:

SELECT MIN(replacement_cost) FROM film;



MAX Aggregate Function

To find the maximum value in a given column:

SELECT MAX(replacement_cost) FROM film;



SUM Aggregate Function

To find the maximum value in a given column:

SELECT SUM(replacement_cost) FROM film;




Alright, that was a relatively quick article! Hopefully this was a reprieve from more involved sections in the past. We should find that as we continue to strengthen our core PostgreSQL  capabilities, these articles and our SQL queries should be easier and easier. If you found this article interesting, you might enjoy a new section on how to get started in machine learning. Cheers.




DSC_0007 Zach Doty Cover Photo for Beginner PostgreSQL Skills Challenge

Beginner SQL Skills Challenge!

Howdy, SQL geeks. Hope this post finds you swell!

Over the past few months, we’ve gained a ton of ground in learning SQL, or at least I have. 🙂

Let’s take a moment to:

  1. Test our knowledge of SQL skills learned thus far
  2. Start seeing SQL queries less as statements of code, and more as real world business challenges

In this article, we’ll have a recap plus three sections:

  • Recap of the training database we’ve been working with
  • General statements of each business problem
  • Hints and thoughts about how to approach each problem
  • Solutions to each problem


Recap: Our Training Database

Our training database is a best/old faithful. We’ve been using the surprisingly popular DVD rental database in a .tar file for our practice database.

Contained within this databases are various tables with fictitious information, including: customer information, film production information, business/pricing information and so forth.

In our challenges, we’ll execute various SQL queries to extract pieces of insight for business tasks. It’s assumed in this article that you’ve installed PostgreSQL via pgAdmin and have followed this article series so far, using the DVD rental training database.

Without further ado, let’s begin.


The SQL Challenges

Alright, here we go:

  1. How rentals were returned after July 17, 2005?
  2. How many actors have a last name that starts with the letter A?
  3. How many unique districts are represented in the customer database?
  4. Can you return the actual list of districts from challenge #3?
  5.  How many films have a rating of R and a replacement cost between $5 and $15?
  6. How many films have the word Truman somewhere in the title?


How to Approach the Challenges

Right, then. In this section, we’ll add some color commentary (read: hints) to our challenges. This should help you understand the mechanics of the solution, while ensuring you can’t see the answers all in one screen. 😉


Challenge 1: Rentals Returned After July 17, 2005

As with all challenges, a problem well stated is half (or more) solved. So let’s look at the high levels of the ask, and work our way down. We need information on rentals, so this means we’ll probably be querying the rentals table.

We would want to first examine the rentals table in a concise manner by doing:


Once you’ve surveyed the table, we really only need one column returned (pun not intended) and we only want the sum figure of returns, where (HINT!!) the return date was after (think a logical operator here) July 17, 2005.


Challenge 2: Actors That Have a Last Name Beginning with “A”

Like our first challenge, let’s work from the “top down”. We need actor information, so querying the actor table would be a great place to start. Similar to last time, we need a count of values matching a condition. The difference versus challenge #1  is we need to find match a pattern like or such as an actor’s last name that begins with the letter A.


Challenge 3: Number of Unique Districts in the Customer Database

The title and description could cause some confusion here. You may need to do some basic SELECT * FROM table_name LIMIT X; queries to make sure you’ve got the right table. Once you do, we’re looking for an amount of distinct values in the database. Order of operations matters.


Challenge 4: Returning the Actual Lists of Districts from Challenge #3

Not much to hint at here – getting challenge #3 is the key here. You’ll really only be simplifying the correct query in challenge #3 to get the correct answer here.


Challenge 5: Cheap, (Mildly) Naughty Films

This one might be the longest query yet in this challenge. So we’re looking up film information, thus should know which table to query. We’re returning a value where a certain rating must be returned, and (HINT) we need to layer in one more lens of qualification. That lens dictates we specify a range between (cough, hint!) two values.


Challenge 6: Where in the World are Films Containing “Truman”?

This challenge is more of a recency test than retention of older concepts. You’ll need to employ pattern matching again for this business challenge/query to find titles that have some match like Truman in the title.



Challenge Solutions

Is that your final answer? Below are the queries, with screenshots of what I did.


Solution 1: Rentals Returned After July 17, 2005

SELECT COUNT(return_date) FROM rental
WHERE return_date > ‘2005-07-17’;



Solution 2: Actors That Have a Last Name Beginning with “A”

WHERE last_name LIKE ‘A%’;



Solution 3: Number of Unique Districts in the Customer Database

SELECT COUNT(DISTINCT(district)) FROM address;



Solution 4: Returning the Actual Lists of Districts from Challenge #3

SELECT DISTINCT(district) FROM address;



Solution 5: Cheap, (Mildly) Naughty Films

WHERE rating = ‘R’
AND replacement_cost BETWEEN 5 AND 15;



Solution 6: Where in the World are Films Containing “Truman”?

WHERE title LIKE ‘%Truman%’;



Wrap Up

Well done for completing these challenges! You shall indeed pass. 🙂

Soon, we’ll be covering aggregate SQL functions, such as MIN, MAX, AVG and SUM.

If you’re just joining us, here’s a running list of our articles so to date (4/16/2017):


DSC_0064 Zach Doty Unsupervised Machine Learning Intro Cover Photo

Unsupervised Learning Introduction: Machine Learning Essentials

Howdy, machine learning students! Today we’re going to introduce the concept of unsupervised machine learning algorithms.

Quick Recap: Supervised Learning

Before we jump in, let’s quickly recap our last article introducing supervised machine learning algorithms. This will give us the appropriate context for unsupervised learning.

In supervised machine learning problems, we supply pre-labeled data to the algorithm. By supplying data that’s already correctly labeled, we ask the algorithm to further predict (regression) or label (classification) new data.

2017-04-11-004-Multiple-Input-Classication-Machine Learning


Unsupervised Machine Learning = Unlabeled Data

The most immediate and prominent difference  for unsupervised learning is the data. Above, we gave the algorithm “a boost” by supplying the intended “right” answers in the data. Below, in an unsupervised machine learning problem, there are no right answers…yet.


We’ve supplied the algorithm with data in the problem, but it’s provided without labels or “answers”. We are mandating that the algorithm discover structure and infer patterns/labels on its own. We could also compare the above example to a clustering problem.

So in unsupervised learning, we supply a large amount of unlabeled data, without explicitly identified form or structure. We ask the algorithm to come up with ideas of structure and segmentation on its own.

Some additional applications of unsupervised learning could include:

  • Market segmentation of massive transaction data
  • Large scale social networking data
  • Astronomical data analysis
  • Large scale market data
  • Mass audio/voice analysis
  • Large scale gene clustering


Wrap Up

That was a bit of a quick one! The challenge with some these technical subject matter areas is sometimes we have limited room to run before going off into the technical weeds. This is one of those areas. Next, we’ll be covering some key concepts in the areas of machine learning model representation, cost function and parameter learning. Don’t worry too much about those yet, we’ll take it step by step. 🙂
As always, feel free to follow my other journeys of learning PostgreSQL, learning how to develop Amazon Alexa Skills, learning how to get started in algorithmic trading, JavaScript for beginners…and more to come soon! Cheers.

DSC_0002 Zach Doty Cover Photo for Interactive Alexa Quiz Skill Development

Building an Interactive Quiz Alexa Skill, Part 1

Hello Alexa geeks! Welcome back to our journey of learning how to develop Amazon Alexa Skills for the Echo and more. Last time, we completed the build process for our first simple “fact-dispensing” Alexa Skill.

In this article, we’ll start the process for a skill that accepts user input in the form of a quiz, fun! If you recall from our first skill, there are two parts to skill development:

  1. The Skill service development, in AWS (Lambda)
  2. The Alexa Skill interface details through the Amazon / Alexa Developer Console




Getting Started in AWS Lambda

You’ll notice as we progress from our early articles, there will be less detail paid to more basic instructions, such as our first! First, log in to the Amazon AWS portal.

Navigate to the Lambda service. If you’re the casual developer just working in this course, odds are the Lambda link will be near the top of screen under “Recently Visited Services”. Once you’ve clicked through, click, “Create a Lambda Function”.


On the next screen, you should see something like “Select blueprint” (Note: at the rate of change Amazon has been pursuing, this screen could change, even in a matter of weeks!) Click the “Blank Function” option, we’re starting this one from scratch!


The next screen should be, “Configure triggers”. Click inside the gray dash-outlined box, and select, “Alexa Skills Kit” from the dropdown menu. Click next!



AWS Lambda Function Configuration for Alexa

Now we should be able to configure the basics of our function. Enter the following:

  • Function name
  • Description

The default runtime environment should be Node.js 4.3. If not, change it to Node.js 4.3.

(Note: Amazon just introduced support for Node.js 6.10, so that may be the preferred format going forward!) Will try to provide an article update, should that be the case.


Onward! Now, we need to upload some code to this burgeoning success. Throwback time, do you remember the files we downloaded in one of the first articles? Time to go back to them again. In your folder of numbered skill templates, go to “2-reindeerGames”, “src” folder and safely open the index.js file in your text editor of choice.


Copy and paste (replacing all previous code) into the code window that should appear.  This assumes you’ve selected the Code entry type of “Edit code inline” for the Lambda function code. As we work on more increasingly more advanced skills, we will likely use the zip upload feature to accommodate additional code resources. The astute will note we’ve merely copied and pasted code here. Yes, we’ll go back and customize soon. 🙂

Beneath the code window, leave the index.handler intact, select an existing role option in the Role dropdown menu, and use the role we previously created. Leave the other settings as-is, click the “Next” button to review details, and click, “Create Function”!


Be sure you take note / record the ARN in the upper right-hand corner, as we’ll need that in our forthcoming Skill Interface development section.


That’s the first part! I don’t know about you, but this is getting easier as I go. We’ll next cover the skill interface and customization to make it your own skill. If this is your first article, be sure to check out the running stable of articles on how to learn Amazon Alexa skill development. Also, there’s a growing body of work on how to learn PostgreSQL, and some fledgling articles on learning algorithmic trading, for good measure.

Share your experience, thoughts and feedback in the comments below. Don’t be a stranger, help your friends along in Alexa Skill development and share with them. Cheers!







Getting Started with Alexa Development 02: Signing Up to Alexa Development Portal

Welcome back to our journey in learning how to program Amazon Alexa Skills via Echo voice search. In the previous article, we walked through the process of setting up an Amazon Web Services (AWS) account. Today, we’ll set up an account at the Alexa Development portal, a distinct entity from the AWS portal.


Without further ado, let’s jump in. Go to


You should land on something like the above screen. Click on the “Sign In” button, you can create a new account from this screen if you need.

Important Note: If you already have an account (regular old Amazon shopping account), use those credentials to log in.

Obviously, if you’re a returning Alexa Development Portal user, you can skip the account creation process shown below. If you’re creating a new account, you’ll need to fill out a screen that will likely resemble the below, and click “Save and Continue” when you’ve finished.


Next, you should be presented with an App Distribution and Services Agreement screen. Be sure to give it a quick read. If you want to use the services, then you’ll need to agree by clicking save and continue. 🙂

The final registration step addresses payments and whether you plan to monetize the apps you develop. For the purposes of my usage, and this learning, I checked “No” to both options before proceeding.


Once you finish that step, you should find yourself in the Amazon Developer Console! Good thing we got the hard material out of the way first, huh?

This should wrap up a pretty quick introductory section for setups. Feel free to visit my previous article on getting set up with AWS, or go to my learning home page on how to start developing Alexa Skills. Thanks and see you at the next article!

DSC_0034 Zach Doty Cover Photo How to Learn SEO in Five Minutes

How to Explain SEO in 5 Minutes

Today, you’re going to learn how to explain SEO to anyone in five minutes! Is it a comprehensive, end-all, be-all example? No. Will it help someone quickly understand the basics of a complex practice? Yes. Let’s jump in. This is, after all, a five minute guide!

Starting Point: SEO is not a black box.

For context, I often hear that SEO is a “black box” and it’s anyone’s guess as to how you rank in search engines. I’m always happy to hear that! It means we have a chance to explain what SEO is really about. Google isn’t a black box. Rather, its algorithms and search engines are so complex at a large scale, it’s difficult to comprehend on a daily basis.
With introductions out of the way, here’s how you do it:
A Job Application
Consider SEO like a job application. You have a resume, references, you deal with a recruiter and a hiring manager. And you want a great career!
In SEO, you have a website, backlinks, the search engine and the user. You want to be found and loved!

Your Resume
Your resume is your website. You control the content and tailor the experience to each opportunity. You balance who you are and what you do.
Your website balances your brand story and products or services. You target keywords to pages of your site, just like your jobs are targeted to certain functions and skill sets.
Which resume would you choose to interview for a Marketing Director?

A. “I’m Leroy, I’m awesome. I’m really cool, I enjoy long walks on the beach and like Ethiopian food.”

B. “Kevin is a seasoned marketing professional who has managed and implemented effective marketing promgrams.”

By the same token, which website would you do business with? One with “high quality products and synergistic solutions”? Or a site that speaks in plain, compelling English?
Your References
Your references are your backlinks, or 3rd party websites linking to your own. You can influence these endorsements and relationships, but not necessarily control them. On both sides of this comparison, quality trumps quantity.

Which of the following references holds the most weight for a candidate?

A. “Little Michael is my favorite nephew, he eats a lot at Thanksgiving, you should hire him.”

B. “Jake paid me some money to recommend him. I guess he’s cool.”

C. “Ryan took our marketing department to the next level with smart strategy. He works well with the team implementing our campaigns.”

Ryan is the best choice. A bunch of low quality or fake recommendations will do a job seeker little good if not harm. By the same token, it’s better to have a few excellent backlinks than a bunch of spammy or fake endorsements.
The ultimate goal
When you’re a job seeker, interviews are great. But you really want the career! Likewise, it’s great to rank well in the search engines, but you want new, loyal customers that convert on your website.

Above all else, optimize your website for your users. The search engine results will follow.

Where to next?

That’s how to explain and understand SEO in five minutes! Obviously, there’s much, much more to learn. However, this will get you on the right path. Dive deeper into SEO with a fantastic Beginner’s Guide from Moz.