Google PageSpeed Insights on Command Line (CLI)

Welcome back to more in adventures in Command Line! Previously, we were looking at running Google Lighthouse programmatically.

Today, we’re going to explore a similar exercise: running Google PageSpeed Insights on the Windows Command Line (CLI)! Running PageSpeed Insights on the command line was probably the easiest self-coding experience I’ve had to date. (More on the hard part later. 🙂 )

3 Prep Steps for Running PageSpeed Insights on Command Line

Unlike Screaming Frog Command Line automation, there are very few parameters or options to mess with for running data on a single URL. To run a single URL on PageSpeed Insights from Command Line:

  • Ensure CURL works in your command line (test CURL and check that the HTML source code is returned)
    • Most recent Windows machines should already have CURL ready out of the box for CLI usage
  • Visit the Google PageSpeed Insights API documentation page
    • Get an API key (free) and create a project for credentials (also free) if needed
  • Copy the code sample into a code editor or text editor

Running Your First URL on PageSpeed Insights CLI

Is CLI for PageSpeed Insights easy? Yes!

Effortless? No.

Here’s the out-of-the-box code sample from Google:


If you try to run that out of the box on Windows, you’ll likely run into one or more of these errors:

  • 429 error: Too many requests / exhausted
  • Invalid syntax: Command(s) not recognized

First, you’ll need to append your API key to the end of the URL.
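If it helps to see how the pieces fit together, here’s a minimal Python sketch of building that request URL. The v5 runPagespeed endpoint is the real one; example.com and the key value are placeholders:

```python
from urllib.parse import urlencode

# PageSpeed Insights API v5 endpoint
BASE = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

def psi_request_url(page_url, api_key=None):
    """Build the GET request URL; the key parameter is appended at the end."""
    params = {"url": page_url}
    if api_key:
        params["key"] = api_key
    return BASE + "?" + urlencode(params)

print(psi_request_url("https://example.com/", "yourAPIKeyHere"))
# -> https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url=https%3A%2F%2Fexample.com%2F&key=yourAPIKeyHere
```

Note that urlencode percent-encodes the page URL for you, which is one less thing to worry about than when typing the raw URL into CMD.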


Here’s what did the trick for me: placing escape characters in front of any special operators. Specifically, the ampersand (&) and equals (=) signs have special functions in CMD that need to be escaped with a caret (^). For example:

curl https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url^=https://example.com/^&key^=yourAPIKeyHere

Great, CLI Works! What’s next?

If everything works, you should see a flurry of JSON code scroll through the command prompt window – hooray! But here’s the rub: what do you do with this information?

Some extra code will help us capture the output into a file:

curl https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url^=https://example.com/^&key^=yourAPIKeyHere -o fileName.json

Adding -o for output and specifying a file name will show a download progress table in CLI while the request runs. After the file has downloaded (sample runs took about 18 seconds), you should be able to see it in the default CLI directory in Windows File Explorer. (Typically C:\Users\yourUserName)

Stay tuned for more updates on how to scale this process across multiple URLs and extract insights from the JSON file!

Actually Running Google Lighthouse Programmatically

Are you ready to run Google Lighthouse programmatically? This challenge has weighed on me for years. But with the help of some resources and friends, I’d love to show you a start-to-finish guide of truly running Google Lighthouse programmatically.

Programmatic Means Prep Work

First, my personal machine runs Windows 10, so my apologies to Mac & Linux users out there. Here’s what you need to get started, and huge shout-out to Upbuild for cutting the first part of the path here.

  • Install Node:
  • After installation, search your start menu for “Command Prompt” or CMD
  • Assuming Node was installed correctly, and your command line is correctly configured, the following line in Command Prompt should install a Batch Lighthouse utility, a splendid bit of programming by Mike Stead
npm install lighthouse-batch -g

With this first round of prep, you can go get a list of URLs! For example, you might use Screaming Frog Command Line (CLI) automation for some pages.

JSON Files: Step 1 of Running Google Lighthouse Programmatically

The fastest way to run Lighthouse Batch programmatically is to use the -s method, shown below. Essentially the -s option allows you to paste in a comma-separated list of URLs to run.

Whether you’re working in Screaming Frog, or getting Google Search Console export data, you’ll want to use Convert.Town to transform column data into comma-separated values. (Hat tip again to Upbuild for this find.)
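If you’d rather skip the web tool, the column-to-comma-separated transform is a one-liner in most languages. A quick Python sketch (the URLs are placeholders):

```python
# A column of URLs (one per line, as pasted from a Screaming Frog or
# Search Console export), transformed into the comma-separated list -s expects.
column = """https://example.com/
https://example.com/about
https://example.com/contact"""

urls = ",".join(line.strip() for line in column.splitlines() if line.strip())
print(urls)
# https://example.com/,https://example.com/about,https://example.com/contact
```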

When you’re ready, open up command line, and let’s get Google Lighthouse running programmatically!

Here’s what your script should look like:

lighthouse-batch -s https://example.com/,https://example.com/about,https://example.com/contact

Hit enter, and watch it go! You may notice that Command Line takes a few seconds to spool everything up before it starts running. After a few weeks of field-testing, I’ve been able to run up to 110 URLs consistently at a time without tripping an error.

You’ll see lots of wonderful code scrolling through the window, and after it finishes, you should have some JSON files in C:\Users\[YOUR USERNAME]\report\lighthouse

The Tricky Part of Running Google Lighthouse Programmatically

So if I ended this article here, it would be a poor summary of an Upbuild blog post. However! I found that while Google Sheets was indeed a brilliant solve, it didn’t meet my needs.

Here’s what I encountered in Google Sheets:

  • I may have broken some quotas for running scripts 🙂
  • The sheet took a long time to refresh queries
  • Because Google Sheets is less stable for large amounts of interactive data, it was slow to work with
  • Because the queries auto-refreshed, it wasn’t super handy for having a snapshot in time for later manipulation
  • Also, not being able to work with data locally (vs an FTP) didn’t go with my personal workflow very well

Again, huge ups to Upbuild for putting way more effort into this than I ever did! I just found that my needs diverged from the use case. Here’s what I did differently.

Larger Lighthouse Scale? Back to JSON

Here’s my bone to pick with trying to run Lighthouse programmatically: JSON files make no sense to me as an SEO strategist and practitioner. (I get that JSON is great for developers for many reasons, but I’m not them.)

Sadly, Lighthouse only offers JSON or HTML reports, as far as I know. (If you know otherwise, please reach out! Would love to buy you a coffee. 🙂 ) So, that means you have to embrace the JSON.

I need tabular data, specifically, Excel exports to share with marketing and development teams for fixes from the report. So here’s where we are: we have a bunch of JSON files sitting on the hard drive.

This means a command line batch script might be a good fit. Is it possible? How do we do it?

Command Line Batch Scripting to the Rescue!

In very simple terms:

  • We have information in File A (JSON) that needs to be extracted, and
  • Transformed into tabular (CSV) form, and
  • Both executed in a programmatic fashion

I unsuccessfully tried a number of manual batch script options, but if we go looking, there’s a solution to get us on our way! Meet JQ, a command line processor for JSON. In particular, JQ can extract data from a JSON file and transform it into a CSV!

If we mesh JQ with some traditional batch scripting, we can achieve the requirements above.

A Little More Prep Work to Get Programmatic

You can download JQ here or at the link above (same URL). There is some attention to detail required here. In order to make JQ work properly in Command Line / Batch Script, you’ll need to copy the address of where you installed JQ and add it to your Path. Here’s a quick video showing you how to do it on Windows.

Once you have JQ successfully downloaded, installed and tested, it’s time to begin some scripting.

Programmatic Magic With Batch Script

Here’s both the fun and challenging part! In order to achieve a successful result, we need to lay out how our script will work in “pseudocode”. (Using pseudocode very loosely here, as instructions for how the program should function.) Pseudocode follows.

Pseudocode for Programmatic Lighthouse Batch Script

For each JSON file (Sample JSON files attached) in directory C:\Users\yourusername\report\lighthouse, perform the following:

  • Extract the keys and values from this sub-section of each json file: audits > uses-rel-preload > details > items (We’re looking for url and wastedMS, and the values underneath)
  • Convert the extracted JSON data to a CSV, and place it in newly created folder named “uses-rel-preload” (without quotation marks) within C:\Users\yourusername\report\lighthouse. The CSV file should be named the same as the JSON file name (without the .json file extension)
  • Within the newly created CSV, insert two new columns to the left of “url” and “wastedMS”. Title the new column A “auditItem” and the new column B “sourceURL”.
  • In column A, populate “uses-rel-preload” (without quotation marks) into each row where there are values in Columns C & D.
  • In column B, populate the CSV file name for each row where there are values in Columns C & D.
  • If there are no values in the JSON file from Step 1, create a new CSV with the JSON file name (no JSON file extension) in the new folder under C:\Users\yourusername\report\lighthouse (from step 2), with first column header “url”, second column header “wastedMS” and a single row below, with both cells containing “N/A”.
  • Within the newly created CSV, insert two new columns to the left of “url” and “wastedMS”. Title the new column A “auditItem” and the new column B “sourceURL”.
  • In column A, populate “uses-rel-preload” (without quotation marks) into the first blank row.
  • In column B, populate the CSV file name into the first blank row.
  • Repeat/loop this action for every JSON file in the starting directory, but do not create a new folder in Step 2 for each following run, continue to use the same “uses-rel-preload” folder.
  • After the loop has completed for all files in the starting directory, Change directories to the new uses-rel-preload folder in Step 2. (C:\Users\yourusername\report\lighthouse\uses-rel-preload)
  • In that directory, combine all the CSV files into a new file titled uses-rel-preload.csv, keeping the header row from the first file only (skip the header row in the 2nd and subsequent CSV files).
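If batch scripting isn’t your speed, the per-file steps above can also be sketched in Python. This is a rough sketch, not the script I actually ran: the report structure (audits > uses-rel-preload > details > items) and folder layout follow the pseudocode, and it leaves out the final merge step:

```python
import csv
import json
from pathlib import Path

def extract_audit(report_dir, audit="uses-rel-preload"):
    """For each Lighthouse JSON report in report_dir, write a per-file CSV of
    the audit's items (url, wastedMs), or an N/A row when the audit is empty."""
    report_dir = Path(report_dir)
    out_dir = report_dir / audit
    out_dir.mkdir(exist_ok=True)
    for jf in sorted(report_dir.glob("*.json")):
        data = json.loads(jf.read_text(encoding="utf-8"))
        details = data.get("audits", {}).get(audit, {}).get("details") or {}
        items = details.get("items") or []
        with open(out_dir / (jf.stem + ".csv"), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["auditItem", "sourceURL", "url", "wastedMs"])
            if not items:
                # Fallback row for pages with no recommendations
                writer.writerow([audit, jf.name, "N/A", "N/A"])
            for item in items:
                writer.writerow([audit, jf.name, item.get("url"), item.get("wastedMs")])
```

Pointed at C:\Users\yourusername\report\lighthouse, this should produce the same uses-rel-preload folder of CSVs described above.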

Batch Script for Programmatic Lighthouse Data Extraction

Alright, fire up your favorite text or code editor, and customize this to your needs! Personally, I call the file via Command Prompt. Also, special thanks to Albin for his help making the JQ and CSV transformation work.

:: Housekeeping and settings
@echo off
:: Enable replacing variables with values in the for loop
setlocal enabledelayedexpansion
:: Clear the screen
cls
:: Specify the working directory where the loop needs to perform
cd "C:\Users\yourusername\report\lighthouse"
:: For all the items (declare the working variable) in the set: active directory previously specified, all files matching the *.json wildcard
for /f "tokens=*" %%A in ('dir /b /a-d *.json') do (

	echo %%A
	set inputfile=%%A
	echo !inputfile!
	set inputfilename=%%~nA
	del !inputfilename!.csv_temp 2>nul
	del !inputfilename!.csv 2>nul

	REM Parse the JSON with jq, extracting the specific details from the audit items and recommendations
	REM Call JQ and a filter file that transforms the data into a temporary CSV file
	type !inputfile! | jq .audits.\"uses-rel-preload\".details.items | jq -r -f filter.jq>>!inputfilename!.csv_temp

	REM Create the header row of the CSV
	echo auditItem,sourceURL,url,wastedMs>!inputfilename!.csv

	REM Check for empty files and provide fallback values if there are no recommendations
	set size=0
	for /f %%x in ("!inputfilename!.csv_temp") do set size=%%~zx
	if not !size! gtr 0 echo uses-rel-preload,!inputfile!,N/A,N/A>>!inputfilename!.csv

	REM For all data lines, strip quotes and add the audit and source columns
	for /F "skip=1 tokens=1,2 delims=," %%i in (!inputfilename!.csv_temp) do (
		set url=%%i
		set wastedms=%%j
		set url=!url:"=!
		set wastedms=!wastedms:"=!
		echo uses-rel-preload,!inputfile!,!url!,!wastedms!>>!inputfilename!.csv
	)

	REM Clear the temporary CSV file out of the working directory
	del !inputfilename!.csv_temp 2>nul
)

:: Make a new folder to house the files
mkdir uses-rel-preload

:: Move all CSV files into the audit type folder
move *.csv "uses-rel-preload"

:: Change the working directory
cd "C:\Users\yourusername\report\lighthouse\uses-rel-preload"

:: Start merging the individual files into a single CSV for the Lighthouse audit
:: Set a counter to 1
set cnt=1

:: For each file that matches *.csv, do the following loop
for %%i in (*.csv) do (
	if !cnt!==1 (
		REM First pass: push the entire file, header included, into uses-rel-preload.csv - this also creates uses-rel-preload.csv
		for /f "delims=" %%j in ('type "%%i"') do echo %%j >> uses-rel-preload.csv
	) else if %%i NEQ uses-rel-preload.csv (
		REM Later passes: push the file without its header row into uses-rel-preload.csv
		for /f "skip=1 delims=" %%j in ('type "%%i"') do echo %%j >> uses-rel-preload.csv
	)
	REM Increment the counter by 1
	set /a cnt+=1
)

Additionally, here’s the filter file referenced in the code.

def tocsv:
    (map(keys) | add | unique) as $cols
    | map(. as $row | $cols | map($row[.])) as $rows
    | $cols, $rows[]
    | @csv;

tocsv


Ok! There’s a lot going on. Let’s make sure a few important details aren’t missed:

  • This assumes that you left the JSON files in the default export location, C:\Users\yourusername\report\lighthouse
  • This also assumes that you know how to run a batch script.
  • It’s not necessary per se, but for simplicity, I highly recommend you run and save the batch script files and resources all in the same directory as cited above
    • You’ll just have to be thoughtful about file path references if you spread resources/calls out.

The Output

Here’s what you should have:

  • The JSON files in the original directory
  • A new folder with the CSV files that contain URLs referenced in the Lighthouse uses-rel-preload report
    • The rows will contain N/A if the audit was empty, or had no results
  • Also within the folder should be a report-level CSV combining all the per-page CSVs, with all preload recommendations for all pages you ran merged into one sitewide file! Hooray!

Limitations on this Programmatic Lighthouse Approach

The astute will notice that this approach only covers one audit! Therein lies the downside to this approach: you’ll need to customize and program out additional code/files for other audits.

Assuming that you program out all the audits, you’ll want to create an audit summary folder within the first batch script file. This can be done with the mkdir directive in the script.

Additionally, for each audit, you’ll want to copy the single overarching audit file into the audit summary folder. Using a script similar to the last half of the above, you can combine all the audits into a single CSV that can easily be manipulated or ingested.
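That audit-summary merge is the same idea as the per-page merge: keep the first header row, drop the rest. A hedged Python sketch (folder and file names are placeholders, not the batch script itself):

```python
import csv
from pathlib import Path

def combine_csvs(folder, out_name="combined-audits.csv"):
    """Merge every CSV in folder into one file, keeping only the first header row."""
    folder = Path(folder)
    rows, header = [], None
    for f in sorted(folder.glob("*.csv")):
        if f.name == out_name:
            continue  # don't re-ingest the combined file itself
        with open(f, newline="") as fh:
            file_rows = list(csv.reader(fh))
        if not file_rows:
            continue
        if header is None:
            header = file_rows[0]
            rows.append(header)
        rows.extend(file_rows[1:])
    with open(folder / out_name, "w", newline="") as fh:
        csv.writer(fh).writerows(rows)
```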

Google Search Console: Download & Export All Website Data

Update! I’ve done a bit more work recently, and have included an R script at the end of this post that will let you mechanically pull up to 16 months of data, month by month, and also get a combined Excel file. It’s mechanical in the sense that I haven’t yet worked out automating date lookups, or expanding this beyond one website at a time. However, if you need a lot of data in a hurry, with minimal work…this should help you!

Another day, another blog post? This is a rarity for sure. At the moment, I’m starting to hit my stride on getting a few old items done in R. 🙂 If I keep writing more blog posts, I’ll need to start taking more photos again for the feature covers!

Quickly sharing how I was able to use R and Google Search Console to download & export all of my website’s search analytics data.

Why Use R to Download & Export all Search Console Data?

Even for the novice SEO, this is likely a well-tread frustration. Valuable data to make critical business decisions is sitting in Google Search Console. But you can’t access it at scale!

It’s either hidden behind the abysmal UI export limit, or you have to do so much filtering that it’s impossible to scale the data extraction.

SearchConsoleR: Easy Google Search Console API for Unlimited Data Access

Similar to my previous post (How to Bulk Submit Sitemap URLs in Google Search Console), this article looks at the splendid SearchConsoleR package in R. See the link above for prerequisites.

The code to get all your data in Google Search Console is really quite brief. I’ve listed it below, as well as a link to the GitHub gist.

Note that if it’s your first time using RStudio, you’ll need to use install.packages() to install the necessary dependencies.

# Load packages
library(searchConsoleR)
library(writexl)

# Authorize & choose Google profile
scr_auth()

# Specify website --- CLIENT
website <- ""
dimensions <- c("page", "query")

ttl_queries <- search_analytics(website, 
                   "2019-08-01", "2019-08-31", 
                   c("query", "page"), 
                   searchType="web", rowLimit = 100000)

# Write the data frame to an XLSX file
write_xlsx(ttl_queries, "ttl_queries.xlsx")

If the code has run correctly, you should have a (potentially) massive Excel file at your disposal! The possibilities are endless – data crunching in Excel, piping to a database or some other visualization software. Enjoy!

Update: A Mechanical Download of All Google Search Console Data for up to 16 months

You can also find this script in a Github Gist here.

# Load packages #
library(searchConsoleR)
library(writexl)

# Authorize & choose Google profile #
scr_auth()

# Specify website --- Client Code or Line of Business Code Here #
client <- "CLCO"
website <- ""

############ SPECIFYING THE MONTHS ############
start1 <- "2020-04-01"
end1 <- "2020-04-30"

start2 <- "2020-05-01"
end2 <- "2020-05-31"

start3 <- "2020-06-01"
end3 <- "2020-06-30"

start4 <- "2020-07-01"
end4 <- "2020-07-31"

start5 <- "2020-08-01"
end5 <- "2020-08-31"

start6 <- "2020-09-01"
end6 <- "2020-09-30"

start7 <- "2020-10-01"
end7 <- "2020-10-31"

start8 <- "2020-11-01"
end8 <- "2020-11-30"

start9 <- "2020-12-01"
end9 <- "2020-12-31"

start10 <- "2021-01-01"
end10 <- "2021-01-31"

start11 <- "2021-02-01"
end11 <- "2021-02-28"

start12 <- "2021-03-01"
end12 <- "2021-03-31"

start13 <- "2021-04-01"
end13 <- "2021-04-30"

start14 <- "2021-05-01"
end14 <- "2021-05-31"

start15 <- "2021-06-01"
end15 <- "2021-06-30"

start16 <- "2021-07-01"
end16 <- "2021-07-31"

# # # Start mo 1

mo1_queries <- search_analytics(
  siteURL = website, 
  startDate = start1,
  endDate = end1,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

# Dynamically get the number of rows from the GSC export so the created vectors match number of rows
rows <- nrow(mo1_queries)

# Make a vector that populates the client code
clientName <- rep(client,rows)

# Make a vector that populates the start date for Month filtering, segmenting, pivoting
date <- rep(start1,rows)

# Make a data frame from the created vectors
mo1_queries_merged_1 <- data.frame(clientName,date,mo1_queries)

############### QA CHECKPOINT ######################
# Write the data frame to an XLSX file
write_xlsx(mo1_queries_merged_1, "mo1_queries_1.xlsx")

# # # END mo 1

# # # Start mo 2

mo2_queries <- search_analytics(
  siteURL = website, 
  startDate = start2,
  endDate = end2,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo2_queries)
clientName <- rep(client,rows)
date <- rep(start2,rows)
mo2_queries_merged_1 <- data.frame(clientName,date,mo2_queries)
write_xlsx(mo2_queries_merged_1, "mo2_queries_1.xlsx")

# # # END mo 2

# # # Start mo 3

mo3_queries <- search_analytics(
  siteURL = website, 
  startDate = start3,
  endDate = end3,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo3_queries)
clientName <- rep(client,rows)
date <- rep(start3,rows)
mo3_queries_merged_1 <- data.frame(clientName,date,mo3_queries)
write_xlsx(mo3_queries_merged_1, "mo3_queries_1.xlsx")

# # # END mo 3

# # # Start mo 4
mo4_queries <- search_analytics(
  siteURL = website, 
  startDate = start4,
  endDate = end4,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo4_queries)
clientName <- rep(client,rows)
date <- rep(start4,rows)
mo4_queries_merged_1 <- data.frame(clientName,date,mo4_queries)
write_xlsx(mo4_queries_merged_1, "mo4_queries_1.xlsx")

# # # END mo 4

# # # Start mo 5

mo5_queries <- search_analytics(
  siteURL = website, 
  startDate = start5,
  endDate = end5,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo5_queries)
clientName <- rep(client,rows)
date <- rep(start5,rows)
mo5_queries_merged_1 <- data.frame(clientName,date,mo5_queries)
write_xlsx(mo5_queries_merged_1, "mo5_queries_1.xlsx")

# # # END mo 5

# # # Start mo 6

mo6_queries <- search_analytics(
  siteURL = website, 
  startDate = start6,
  endDate = end6,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo6_queries)
clientName <- rep(client,rows)
date <- rep(start6,rows)
mo6_queries_merged_1 <- data.frame(clientName,date,mo6_queries)
write_xlsx(mo6_queries_merged_1, "mo6_queries_1.xlsx")

# # # END mo 6

# # # Start mo 7

mo7_queries <- search_analytics(
  siteURL = website, 
  startDate = start7,
  endDate = end7,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo7_queries)
clientName <- rep(client,rows)
date <- rep(start7,rows)
mo7_queries_merged_1 <- data.frame(clientName,date,mo7_queries)
write_xlsx(mo7_queries_merged_1, "mo7_queries_1.xlsx")

# # # END mo 7

# # # Start mo 8

mo8_queries <- search_analytics(
  siteURL = website, 
  startDate = start8,
  endDate = end8,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo8_queries)
clientName <- rep(client,rows)
date <- rep(start8,rows)
mo8_queries_merged_1 <- data.frame(clientName,date,mo8_queries)
write_xlsx(mo8_queries_merged_1, "mo8_queries_1.xlsx")

# # # END mo 8

# # # Start mo 9

mo9_queries <- search_analytics(
  siteURL = website, 
  startDate = start9,
  endDate = end9,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo9_queries)
clientName <- rep(client,rows)
date <- rep(start9,rows)
mo9_queries_merged_1 <- data.frame(clientName,date,mo9_queries)
write_xlsx(mo9_queries_merged_1, "mo9_queries_1.xlsx")

# # # END mo 9

# # # Start mo 10

mo10_queries <- search_analytics(
  siteURL = website, 
  startDate = start10,
  endDate = end10,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo10_queries)
clientName <- rep(client,rows)
date <- rep(start10,rows)
mo10_queries_merged_1 <- data.frame(clientName,date,mo10_queries)
write_xlsx(mo10_queries_merged_1, "mo10_queries_1.xlsx")

# # # END mo10

# # # Start mo 11

mo11_queries <- search_analytics(
  siteURL = website, 
  startDate = start11,
  endDate = end11,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo11_queries)
clientName <- rep(client,rows)
date <- rep(start11,rows)
mo11_queries_merged_1 <- data.frame(clientName,date,mo11_queries)
write_xlsx(mo11_queries_merged_1, "mo11_queries_1.xlsx")

# # # END mo 11

# # # Start mo 12

mo12_queries <- search_analytics(
  siteURL = website, 
  startDate = start12,
  endDate = end12,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo12_queries)
clientName <- rep(client,rows)
date <- rep(start12,rows)
mo12_queries_merged_1 <- data.frame(clientName,date,mo12_queries)
write_xlsx(mo12_queries_merged_1, "mo12_queries_1.xlsx")

# # # END mo 12

# # # Start mo 13

mo13_queries <- search_analytics(
  siteURL = website, 
  startDate = start13,
  endDate = end13,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo13_queries)
clientName <- rep(client,rows)
date <- rep(start13,rows)
mo13_queries_merged_1 <- data.frame(clientName,date,mo13_queries)
write_xlsx(mo13_queries_merged_1, "mo13_queries_1.xlsx")

# # # END mo 13

# # # Start mo 14

mo14_queries <- search_analytics(
  siteURL = website, 
  startDate = start14,
  endDate = end14,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo14_queries)
clientName <- rep(client,rows)
date <- rep(start14,rows)
mo14_queries_merged_1 <- data.frame(clientName,date,mo14_queries)
write_xlsx(mo14_queries_merged_1, "mo14_queries_1.xlsx")

# # # END mo 14

# # # Start mo 15

mo15_queries <- search_analytics(
  siteURL = website, 
  startDate = start15,
  endDate = end15,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo15_queries)
clientName <- rep(client,rows)
date <- rep(start15,rows)
mo15_queries_merged_1 <- data.frame(clientName,date,mo15_queries)
write_xlsx(mo15_queries_merged_1, "mo15_queries_1.xlsx")

# # # END mo 15

# # # Start mo 16

mo16_queries <- search_analytics(
  siteURL = website, 
  startDate = start16,
  endDate = end16,
  dimensions = c("query", "page"),
  rowLimit = 1000000
)

rows <- nrow(mo16_queries)
clientName <- rep(client,rows)
date <- rep(start16,rows)
mo16_queries_merged_1 <- data.frame(clientName,date,mo16_queries)
write_xlsx(mo16_queries_merged_1, "mo16_queries_1.xlsx")

# # # END mo 16

total_queries_by_page <- rbind(mo1_queries_merged_1,mo2_queries_merged_1,mo3_queries_merged_1,mo4_queries_merged_1,mo5_queries_merged_1,mo6_queries_merged_1,mo7_queries_merged_1,mo8_queries_merged_1,mo9_queries_merged_1,mo10_queries_merged_1,mo11_queries_merged_1,mo12_queries_merged_1,mo13_queries_merged_1,mo14_queries_merged_1,mo15_queries_merged_1,mo16_queries_merged_1)

write_xlsx(total_queries_by_page, "total_queries_by_page.xlsx")
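On the manual date bookkeeping above: the month boundaries can be computed rather than hand-typed. Here’s a Python sketch of the date math (not wired into the R workflow; the April 2020 start is just this example’s):

```python
import calendar
from datetime import date

def month_bounds(year, month):
    """Return (first_day, last_day) of a month as ISO date strings."""
    last = calendar.monthrange(year, month)[1]
    return (date(year, month, 1).isoformat(),
            date(year, month, last).isoformat())

# 16 consecutive months starting April 2020
months = [(2020 + (3 + i) // 12, (3 + i) % 12 + 1) for i in range(16)]
bounds = [month_bounds(y, m) for y, m in months]
print(bounds[0])   # ('2020-04-01', '2020-04-30')
print(bounds[-1])  # ('2021-07-01', '2021-07-31')
```

calendar.monthrange handles month lengths and leap years, which removes the most error-prone part of maintaining the start/end variable pairs by hand.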

How to Bulk Submit Sitemap URLs to Google Search Console

Well, howdy! As with many of my posts lately, I find myself saying that it’s been a while since I’ve posted something. It’s true!

Today, I’d like to walk you through how to bulk submit sitemap URLs to Google Search Console using R and RStudio. Before we begin the process of bulk submitting sitemap URLs, you’ll need a few prerequisites:

Prerequisites to Bulk Sitemap URL Submission: R Setup

  • Download and install the appropriate version of R for your system:
  • For Windows users, search for “rtools”, then download and install it from the official site.
    • Mac/other OS users may need to do the same for their respective OS. However, my primary machine is Windows (described above) and I’ve been able to use RStudio without issue on a Mac, without installing RTools. Doesn’t mean you don’t need it. ¯\_(ツ)_/¯
  • Download and install RStudio from the official site:

SearchConsoleR & Bulk Sitemap Submission in Google Search Console

Bulk submitting XML sitemap URLs in Google Search Console is made possible by the excellent SearchConsoleR package, made by Mark Edmondson. Why R and SearchConsoleR? Approachability.

Many readers may find this method much more approachable than trying to get API projects, API keys, and other confusing configuration details just right to work in Python/Jupyter notebooks. You’ll see what I mean in a minute.

Code for Bulk Sitemap URL Submission in Google Search Console

Below is sample R code for bulk sitemap URL submission in Google Search Console, using R and SearchConsoleR. You can also view my Github gist here.

# Load packages
library(searchConsoleR)

# Authorize & choose Google profile
scr_auth()

# Specify website --- FULL URL
website <- ""

# Submit all manually found sitemaps (example sitemap paths -- swap in your own)
add_sitemap(website, paste0(website, "/sitemap.xml"))
Sys.sleep(1)
add_sitemap(website, paste0(website, "/sitemap-posts.xml"))
Sys.sleep(1)


The code is fairly straightforward, and sadly, not automated. I’ll be working on an automated, scalable way of submitting all Sitemap URLs for all websites in a given Search Console account.

The one custom touch I’d note here is the one-second system sleep, to proactively avoid running afoul of rate limits if you’re submitting a lot of URLs.

JOINS in PostgreSQL

Welcome back, SQL-ites! Today, we’re going to take a significant leap forward in our journey of learning PostgreSQL.

A quick recap on where we’ve been recently:

In more broad terms, most of our PostgreSQL training so far has focused on how to best select and segment our data. However, there is much more to be accomplished in big data manipulation. Enter JOINS.


About JOINS & PostgreSQL

JOINS help unlock the foundation of most SQL databases, which are known as relational databases. Specifically, when we’re able to triangulate multiple data sources together, combined with sorting, segmentation and some intuition, we can get some really powerful stuff.

So what does JOINS do? It allows you to relate data in one table to the data in other tables.

Think about some of our practice runs in PostgreSQL so far. We’ve looked at DVD rentals, customers, transactions, IDs, and so on. What if we wanted to link customer names with IDs in more detail, or transaction IDs with film names, or all of the above? Weaving data into an interactive story could allow us to see really cool things, such as which customers prefer certain genres, or perhaps unveil renting habits for a timely promotional offer to smooth out our top-line revenue.


BASIC JOINS Syntax in PostgreSQL

As with many of our learning adventures, there is complexity and there are layers to our education. 🙂

First, it’s important to note that there are actually several types of JOINS, so this isn’t just an amorphous, blob-lump function. Here are some of the important kinds:

  1. INNER JOIN
  2. OUTER JOIN (LEFT, RIGHT & FULL)
  3. self-join

To start things out, we’re going to look at INNER JOIN first.

INNER JOINS Concepts for PostgreSQL

We’re going to jump headlong into this, at the expense of immediate clarity. Joy! Let’s say we have two tables which we’ll relate data between.



Above, we’ve pulled out some key points of interest between our tables, unique columns, and more importantly, a shared key (column) of customer_id between the two tables. We’ll explain more about keys in a minute. Ish.

We’ll use the shared column of (example) customer_id to further mash up the data. Let’s take a look at syntax now.


INNER JOINS Basic Syntax

If things aren’t 100% connected for you conceptually, I completely understand. Please do bear with me, as seeing some examples may help “close the loop”. Here’s what an INNER JOINS query might look like in PostgreSQL:


SELECT A.pka, A.c1, B.pkb, B.c2

FROM A

INNER JOIN B ON A.pka = B.fka;


Let’s break this down a bit more.

The first line is extremely similar to what we’ve been doing all along. Calling in data. The subtle difference here is that we’re first including the respective table names. So, it’s tableA.column, tableB.column, tableB.column2.

Next, we specify the primary table we’re first selecting from, per usual. Then we specify the “donor” table that will be combined with the primary table via INNER JOIN. Further, we make this work by using the ON keyword which specifies the shared key columns. 🙂


More Explanation on the Inner Workings of JOINS (Pun Not Intended)

This has been a bit of a tricky concept for me to learn so far, so we’re going to walk through the nuts and bolts of JOINS as operated by PostgreSQL.

When JOINS is executed, PostgreSQL takes every row in our “Table A” and goes through Table B, looking for rows that match our specified condition of A.pka = B.fka. So, combining our example and layman’s terms, the JOINS function looks for common occurrences of the customer_id so that additional, previously disparate data from separate tables may be awesomely fused together. 🙂

If a match is found between the two tables, it combines columns of both A and B rows into one row, and adds that newly combined row to the data set you get back.
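That row-matching behavior can be sketched in a few lines of Python. This is a mental model only (not how PostgreSQL actually executes joins), with tiny made-up stand-ins for our customer and payment tables:

```python
# Two tiny stand-in tables (rows as dicts; customer_id is the shared key)
table_a = [{"customer_id": 1, "name": "Ada"},
           {"customer_id": 2, "name": "Bob"}]
table_b = [{"customer_id": 1, "amount": 9.99},
           {"customer_id": 3, "amount": 4.99}]

# For every row in A, keep each row in B that matches the join condition,
# and fuse the two rows' columns into one combined row.
joined = [{**a, **b} for a in table_a for b in table_b
          if a["customer_id"] == b["customer_id"]]
print(joined)
# [{'customer_id': 1, 'name': 'Ada', 'amount': 9.99}]
```

Notice that rows without a match on either side (Bob, and the 4.99 payment) simply don’t appear, which is exactly the INNER JOIN behavior the Venn diagram below illustrates.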


Pardon the Interruption: PostgreSQL “life” hacks for JOINS

A couple of quick items that may prove useful to you as we get more familiar with joins.

  1. Sometimes your tables may share column names (Analytics data, anyone? Can I get an amen?), so you may need to use the syntax table_name.column_name for clarity.
  2. If you or someone you love has created excessively long table names, you can introduce a shorter alias (via the AS statement) for the table, and use accordingly.


Hey, I Thought We Were Going to Learn About INNER JOIN?

We are! And what’s better, we’re going to use a Venn diagram to explain it. 🙂 #DataScience.

Okay, the INNER JOIN PostgreSQL clause returns rows in Table A that have corresponding rows in Table B. Super scientific visualization below.


The overlapped area above are the results that will be returned to us.


INNER JOINS PostgreSQL Code Examples

Putting all the pieces together, here’s our first INNER JOINS code sample:


SELECT customer.customer_id, first_name, last_name, amount, payment_date

FROM customer

INNER JOIN payment ON payment.customer_id = customer.customer_id;


And below, what we get back from pgAdmin.



Okay, a few notes. We went out of our way to say that overlapping column names need table names for distinction. Here's what we didn't say: you don't need to specify table names for unique columns. Above, you can see that we've nicely meshed the customer ID together with both customer information and payment information. Awesome!
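That point can be sketched quickly. Here, first_name and amount are assumed to be columns unique to the customer and payment tables, respectively, so they need no prefix, while the shared customer_id must be qualified:

```sql
-- customer_id exists in both tables, so it must be qualified;
-- first_name and amount each live in only one table, so no prefix needed.
SELECT customer.customer_id, first_name, amount
FROM customer
INNER JOIN payment ON payment.customer_id = customer.customer_id;
```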

I know this is a bit of a rough stop, but I'm really trying to work on getting more sleep these days. So it's 12:30am, and I've got to head to bed. We'll be back with updates on this post and more progress forward. Cheers!

Update – 9/16/2017 – Add’l INNER JOINS Examples

Welcome back again, SQL-ites! We're back with a weekend update from our first look at INNER JOINS in PostgreSQL. Above, we left off with a basic example of INNER JOINS. To quickly recap, we might think of JOINS as basic data composition: in marketing and business, useful data often begins in silos, so it must be combined into a useful form.

Our next example of INNER JOINS is a slight modification to the first full example. We have the same basic INNER JOINS syntax, but have added ORDER BY customer.customer_id at the end, to clean up the presentation of our data a bit.

SELECT customer.customer_id
FROM customer
INNER JOIN payment ON payment.customer_id = customer.customer_id
ORDER BY customer.customer_id ASC;




Let’s continue to add modifications to our basic INNER JOINS statement. Below, we harness the power of both combination and selection to specify joined results, but for a specific customer ID. A use case for this might involve isolating or routing specific data after it’s been joined, for reporting purposes.

SELECT customer.customer_id
FROM customer
INNER JOIN payment ON payment.customer_id = customer.customer_id
WHERE customer.customer_id = 2;



JOINS Foundations: The AS PostgreSQL Statement


What’s up SQL people? We’re back, and better than ever, in our foray to learn PostgreSQL. Since we’ve completed some intermediate skills challenges & learned GROUP BY, it’s time to examine JOINS.

Before we examine JOINS, there’s a key foundation piece we must cover first: the AS statement in PostgreSQL. Let’s jump in.


About the AS Statement

The AS statement in PostgreSQL lets us give a table or a column a temporary alias, almost like a variable, for easier manipulation.

It’s a simple statement, so let’s see it in action.


1. Basic AS Statement Example

Our most basic example is a basic query where perhaps a column wasn’t named to our liking. Consider the following.

SELECT rental_rate AS film_cost
FROM film;


Great for an introductory example, but not inherently useful. Read on as we apply the AS statement more deeply.

2. Semi-Intermediate AS Statement Example

Let's look at an example that's a bit more engaged. For instance, if we use aggregate functions, the column output doesn't have a clean name attached to it. But no longer! The AS statement allows us to rename the summation output of a GROUP BY statement to something we'll recognize.


SELECT customer_id, SUM(amount) AS customer_ltv
FROM payment
GROUP BY customer_id
ORDER BY customer_ltv DESC;




This is something more useful for intermediate PostgreSQL work!


Wrap Up

We aren’t spending much further time here since this is a simple application and the JOINS statement is the function we’re truly after. If you’re just joining this series, check out our home page on how to learn PostgreSQL.


PostgreSQL Intermediate Skills Test

Hey there, SQL-ites. Time for another (yes, another!) PostgreSQL skills challenge. It’s our last stop before moving into more intermediate and advanced PostgreSQL skills, such as JOINS.

If you’re just joining this series on SQL, we previously covered the HAVING statement & GROUP BY. Here’s the home page of our journey in learning PostgreSQL. Today, we do a more rounded knowledge check of older PostgreSQL statements, such as SELECT WHERE, COUNT & ORDER BY.

We have 3 problems, and provide the winning code for each, based on our sample database. In the past, I’ve given more explanation to the problems, but because I’m trying to get some traction myself in moving forward, we’ll only have problem & solution laid out here.

Alright, let’s go!

1. Return the customer IDs of customers who have spent more than $110 with the staff member who has an ID of 2.

The answer should be customers 187 and 148.

SELECT customer_id, SUM(amount)
FROM payment
WHERE staff_id = 2
GROUP BY customer_id
HAVING SUM(amount) > 110;


2. How many films begin with the letter J?

The answer should be 20.


SELECT COUNT(*)
FROM film
WHERE title LIKE 'J%';


3. What customer has the highest customer ID number whose name starts with an ‘E’ and has an address ID lower than 500?

The answer is Eddie Tomlin.

SELECT customer_id, first_name, last_name, address_id
FROM customer
WHERE address_id < 500 AND first_name LIKE 'E%'
ORDER BY customer_id DESC;
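Since we only want the single highest customer ID, one optional refinement (not part of the original solution) is to cap the sorted results with LIMIT:

```sql
-- Same query, but LIMIT 1 returns only the top row of the sorted results.
SELECT customer_id, first_name, last_name, address_id
FROM customer
WHERE address_id < 500 AND first_name LIKE 'E%'
ORDER BY customer_id DESC
LIMIT 1;
```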


We’ll see you on the other side soon, for some JOINS awesomeness. 🙂


The HAVING Clause in PostgreSQL

Howdy SQL-ites! Welcome back to our next step in learning SQL. After a long hiatus, we recently stumbled our way through some intermediate SQL statement challenges, namely on GROUP BY.

Today, we’re back in the flow, and learning about the HAVING clause in PostgreSQL.


About the HAVING Clause in PostgreSQL

The HAVING clause is most often seen as a helper to the GROUP BY statement. GROUP BY was pretty awesome, right? How might we improve upon it?

Let’s consider two concepts:

  1. Our occasional parallels of PostgreSQL to Excel, and,
  2. Our previous intermediate challenge example.

Relating PostgreSQL’s GROUP BY to Excel

If we hearken back to our first encounter with GROUP BY, we compared GROUP BY to pivot tables. Specifically, if we have a big data table that records data of recurring themes, e.g., your customer base, it can be helpful to work with the data aggregated and assembled, rather than raw.

However, GROUP BY is only an intermediate step in data analysis. If we think back to the final challenge in the last post, we had to limit the results to 5. Even after we group data, it often still needs to be segmented before it's truly useful for analysis.

Thus, we need an additional method for winnowing down our GROUP BY results.


Meet the HAVING Clause

So, about the HAVING clause. It's most often used in tandem with GROUP BY to filter out data rows that don't meet a certain condition. Think of it as similar to the WHERE clause, just as an accompaniment to GROUP BY.


Let’s take a look at basic syntax:

SELECT column, aggregate(column_2)

FROM table

GROUP BY column

HAVING condition;


Off the bat, we should note this is extremely familiar if we've covered GROUP BY. The differentiation is the additional HAVING condition at the end. This condition could be something like HAVING SUM(column_2) < 50.

The Difference Between WHERE and HAVING

Some of you sharp folks may want to know, “what’s the difference between WHERE and HAVING?” That would be an excellent question. Here’s the difference:

The WHERE clause sets conditions on individual rows, before the GROUP BY clause has been applied. The HAVING clause specifies conditions on grouped rows, created by the GROUP BY clause.
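To make that order of operations concrete, here's a quick sketch against the payment table from our sample database (the staff_id = 2 filter is just an illustrative condition):

```sql
-- WHERE filters individual rows *before* grouping;
-- HAVING filters the grouped rows *after* GROUP BY runs.
SELECT customer_id, SUM(amount)
FROM payment
WHERE staff_id = 2               -- row-level condition
GROUP BY customer_id
HAVING SUM(amount) > 110;        -- group-level condition
```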

Let’s run some examples.


Using the HAVING Clause in PostgreSQL

Here’s our first example, very similar to our previous skills challenge:

SELECT customer_id, SUM(amount)
FROM payment
GROUP BY customer_id
HAVING SUM(amount) > 150;

Above, we have added another segmentation layer with the HAVING clause.

You can see we’re pulling information about how much our customers have paid us. Further, we specify that we only want to see customers with a lifetime purchase amount of greater than $150.


Let’s look at another example of the HAVING clause. Say for example, we want to know which store has served the most customers. Below, we’ll execute the following code:

SELECT store_id, COUNT(customer_id)
FROM customer
GROUP BY store_id
HAVING COUNT(customer_id) >275;

Above, we've selected both the store and customer ID columns from the customer table. Further, we group by the store ID, because we want store-level data, but we only want to see the stores which have served more than 275 customers. Below, we can see only one store has done so. 🙂


Combining Usage of the WHERE & HAVING Clauses

We mentioned earlier that the WHERE and HAVING clauses differ somewhat in function, but mostly in order of execution. Here's what we didn't say: you can actually use them in tandem. There's a great theoretical use case for this; unfortunately our sample database is a bit small, but here goes.

Let's think about the film table (used in previous examples). Perhaps we want to analyze the films by rental rate, but only films with certain ratings. For example, perhaps we're no longer interested in carrying NC-17 films, but still want an aggregated view of how much each film rents for on average, by rating. Additionally, we want to see which ratings, if any, have an average rental rate of less than $3.

Here’s the code we would use:

SELECT rating, ROUND(AVG(rental_rate), 2)
FROM film
WHERE rating IN ('R', 'G', 'PG', 'PG-13')
GROUP BY rating
HAVING AVG(rental_rate) < 3;



Wrap Up

Alright, that concludes our section for today. It feels good to be back. 🙂

Hopefully you found this section on the HAVING PostgreSQL statement useful. If you need to backtrack or further explore, here are some useful/recent links:


Intermediate SQL Skills Challenge: GROUP BY

Hey there SQL-ites! Wow, it’s been awhile since I’ve last posted. Work has been crazy busy again, and just life in general. Sure feels good to be back, learning again! I’m daringly dropping straight back into my learnings from where we left off…in April!? Crazy.

Anyway, let’s get back to brass tacks. Before life and work got really crazy for me, we covered:

Another quick recap note, we’ve been using the famous DVD rental training database for our work. On to the good stuff.


GROUP BY SQL Skills Challenge #1

Let's say it's time for quarterly reviews, who doesn't love evaluations? ¯\_(ツ)_/¯ Implement your knowledge of GROUP BY against the following problem:

We need to figure out how much revenue our employees have realized from DVD rentals, and how many transactions each handled.


GROUP BY SQL Skills Answer #1

Let's talk through the problem and dissect it before presenting code. "A problem well-stated is a problem half-solved" – paraphrase of some smart person.

  1. We’re talking about revenue, so we’ll need to be dealing with the payment table.
  2. We’re evaluating employees (staff), SUM of revenue, and COUNT of transactions.
  3. If we’re aggregating this data, we’re GROUPing BY employee.
  4. We’re also ORDERing the employees BY who handled the most transactions.

That said, here’s the code:

SELECT staff_id, SUM(amount), COUNT(amount)
FROM payment
GROUP BY staff_id
ORDER BY COUNT(amount) DESC;

…with our results!


GROUP BY SQL Skills Challenge #2

Well done on your challenge! Here’s the second:

It’s time to do inventory, flashbacks of retail and restaurant wonder for all of us. 🙂

In the name of forecasting and planning, we need to figure out the average replacement cost of our movies, by rating.


GROUP BY SQL Skills Answer #2

Ok, let’s walk through our problem.

  1. We need to use the film table here, since we're gathering information on the movies.
  2. We’re GROUPing our films BY rating
  3. We’re using an aggregate function to determine the average replacement cost of each rating.

Drum roll, here’s a winning code snippet:

SELECT rating, AVG(replacement_cost)
FROM film
GROUP BY rating
ORDER BY AVG(replacement_cost) DESC;

With the output:


Are there more challenges we should be covering? Yes. However, I’m trying to do better about getting more sleep these days. Unlike past SQL articles, it’s still (barely) before midnight. So we’ll take a quick breather, possibly update this article, but definitely keep moving forward. Cheers!


Update- 8/20/2017 —

GROUP BY SQL Skills Challenge #3!

Alright SQL-ites. After getting some rest, I’ve regrouped a few days later to cover the last challenge:

From our database, we want to get the customer ID’s of the top 5 customers, by money spent, so we can send them a thank you for their business!


GROUP BY SQL Skills Answer #3

Let’s diagnose the problem.

  1. If we’re gathering revenue information, we’ll need to use the payment table.
  2. If we’re getting the top spending customers, we’ll need to GROUP all transactions BY customer ID
  3. To see the top 5 paying customers, we’ll want to ORDER the results BY the SUM of payment amount.

Considering the above, here’s our code:

SELECT customer_id, SUM(amount)
FROM payment
GROUP BY customer_id
ORDER BY SUM(amount) DESC
LIMIT 5;


What is Data Science?

Welcome back, campers! It’s been a minute (read:months) since I’ve last posted, and we’re back! (For now.)

Today’s topic, data science: supposedly the latest exploding field, critical to every enterprise.

Why is it important and relevant? The rise of big data has created a relatively untapped treasure trove of insight. However, it’s undeveloped! Further, the tapping of this insight requires a blended skill set which is currently in short supply in the market: the data scientist.

Who and what is a data scientist?

A data scientist is someone who finds new discoveries in data. They investigate hypotheses and look for meaning and knowledge within the data. They visualize the data by creating reports and looking for patterns. What distinguishes a data scientist from a traditional business analyst is the use of algorithms. Algorithms are one of the fundamental tools for data scientists. This requires mathematics knowledge, computer science savvy and domain knowledge.

What does it mean to be a data scientist?

A data scientist may handle open-ended questions such as, "Which customers are more likely to churn?" The data scientist would gather all the data, and run algorithms to find dependable patterns to improve the situation. Seems straightforward, yes? However, there are a range of misconceptions about data science and data scientists. For example, a data scientist is not necessarily only a developer, nor only a business intelligence analyst.

A data scientist will be able to combine both technical know-how and business domain knowledge into mathematics and statistics for maximum effect. That being said, true data scientists are extremely difficult to find and train. However, it may be possible to become a data scientist without expensive and time-consuming degrees, via focused tools and application training.

An oversimplified Venn diagram showing the makeup and value of a data scientist


More Notes on Data Science

When considering data science from a managerial perspective, it's important to understand the current broad allocation of the average data scientist's time. An estimated 60-70% of a data scientist's time is spent assembling and cleaning data, tasks which could be delegated to technical specialists, data integration specialists and so forth. (For example, text mining, SQL queries and so forth.)

If you've followed my site and blog lately, you'll have noticed I've lapsed a bit on posting. I'm trying to get back into sharing my education again, so stay tuned. Things have just been busy lately. 🙂