Let’s say you’ve decided to listen to every episode of a podcast starting with episode one. Here is an example of how you might go about saving to your machine and descriptively naming all the episodes listed in the RSS feed in one automated shot.
You’ll need:
- cURL1
- some flavor of Awk2
- a Bash-like shell
- an RSS/Atom3|4 feed that includes URLs to all the files you want
Wget5 will work in place of cURL, and any structured plaintext file containing URIs to your files of interest will do as long as you’re willing to play around with your Awk program.
Regarding portability, the example program described in this post is written for GNU utilities6 like gawk and GNU date, so make any changes needed for your system. For example, most of the equivalent utilities found on BSD systems require short options and may even lack certain options. Platform-dependent utilities notwithstanding, dash,7 bash,8 zsh9 and probably any other shell that claims a superset of the Bourne shell’s syntax will have no trouble correctly interpreting the script that follows.
Today I’m working with the RSS feed10 provided by Platicando en Católico11 mostly because it’s a bit messy and hasn’t been pruned of old episodes. Even if your feed only keeps the n most recent of its N total entries, the ideas laid out here should apply to whatever flavor of Internet markup language you’re working with.
Refer to the following as you read:
- `pcdl.sh`,12 an executable shell script that calls Awk and cURL
- `extract.awk`,13 the set of instructions from which Awk reads
Do not expect any parameter assignment provided by these programs to work compatibly with your system. Such assignments may include:
- sensible defaults that work on my system (e.g. `dash`)
- judgment calls suited to my own tastes (e.g. `IFS="@"`)
- demonstrative placeholders unlikely to work on anyone’s system (e.g. `/home/bob`)
Set the environment
Start by assigning some strings to variables at the top of your script for easy future access.
`wdir` stores the absolute path to your preferred “working directory” where files generated by the program will appear. I recommend a safe place where mistakes can be made and the executing user has write permissions.
`feed_url` locates your RSS feed of interest on the Internet. If you plan to run your script many times, you may prefer to download and store a copy on your machine in a preliminary step. On the other hand, fetching your feed with each execution may help clarify your input source, ease adaptation for use with other feeds, and keep your data up-to-date with the latest entries.
Finally, `tlength` specifies the number of words to be stripped from an episode’s title and used in the downloaded episode’s corresponding filename. In other words, each file downloaded will have as the title portion of its filename the first `tlength` consecutive words from the `<title>` element of its RSS entry.
If you plan to fetch your feed from within `pcdl.sh`, consider passing to cURL the options `--silent` and `--show-error` to avoid terminal clutter without silencing any failures that may occur at this stage.
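Put together, the top of `pcdl.sh` might look something like this minimal sketch, with the fetched copy saved as `feed`; the path, URL, and word count below are placeholder assumptions, not values that will work anywhere as-is:

```sh
#!/bin/sh
# Placeholder values -- substitute your own.
wdir="/home/bob/downloads"                      # writable working directory
feed_url="https://example.com/podcast/feed.xml" # hypothetical feed URL
tlength=5                                       # words of the title kept in filenames

# Fetch the feed quietly, but still surface any errors.
curl --silent --show-error "$feed_url" > "$wdir/feed"
```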
Transform/extract
The goal of processing `feed` with Awk is to extract the URL and title of each episode and pass them to cURL for download. Naturally, any line in `feed` containing neither the title nor the URL of some episode gets discarded. The strings that pass through our Awk script accumulate in `list.tmp`, a single file whose lines each correspond to one episode of the podcast and whose values cURL will take as arguments in the batch-download stage of the process. Let’s take a closer look at our Awk extraction script, `extract.awk`.
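Assuming the variables set earlier, the hand-off from shell to Awk might look like this sketch (the output eventually lands in `list.tmp`, after the post-processing described under Loose ends):

```sh
gawk --assign tlength="$tlength" -f extract.awk "$wdir/feed"
```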
I introduce `tlength` to `extract.awk`’s namespace using gawk’s `--assign` option. First, the shell expands `$tlength` to the value specified during Set the environment. Next, gawk assigns that value to the name `tlength` before the program starts executing, making it available throughout `extract.awk`, `BEGIN` clause included. (Variable assignments passed as ordinary command-line arguments, by contrast, are not yet visible when `BEGIN` runs; `--assign` avoids that pitfall.) To distinguish the shell-supplied `tlength` from the Awk variable used in the body of `extract.awk`, I copy its value in the `BEGIN` preamble to a new name stylized with all caps: `TLENGTH`.
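In `extract.awk`, that hand-off might look like this assumed preamble:

```awk
# extract.awk -- assumed preamble
BEGIN {
    TLENGTH = tlength  # tlength arrives via gawk --assign; copy it to an Awk-styled name
}
```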
Each episode listed in `feed` corresponds to an episode title and a URL pointing to an audio file. `extract.awk`’s first procedure acts on lines containing the XML element tag `<title>`, which encloses the title of a given episode. All of the other lines in `feed` are examined for a matching string and passed along to the next procedure untouched when no match is found. In either case, the resulting string is then examined by the next procedure, which only acts on lines containing `.mp3` somewhere in the line. Each of these lines contains a single URL pointing to the audio file of a unique episode. Fortunately in my case, `feed` is structured such that no line matches both patterns, and every line containing a `<title>` element is separated from the next `<title>` element by a line containing its corresponding audio file URL. This lets us pair titles with URLs by reading until a title is matched, reading until a URL is matched, pairing the two, and repeating the process for the next episode. While most feeds follow this general structure, you should closely examine your feed to confirm before moving forward.
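In outline, then, the body of `extract.awk` pairs the two patterns like this skeleton, under the structural assumptions just described:

```awk
/<title>/ {
    # Strip the tags and hold on to the bare episode title.
}
/\.mp3/ {
    # Extract the URL (and, later, the date) and print them
    # alongside the most recently captured title.
}
```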
Beginning with the title string, use the `sub` function to replace everything in the line up to and including an opening `<title>` tag with the empty string: `""`. To clarify, substituting with the empty string effectively deletes. Call the function again to replace everything from the closing `</title>` tag through the end of the line with the empty string. What remains of the line should be the title of the episode alone, which can and should be verified by printing the line as a test.
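A minimal sketch of that first rule, assuming the tags and the title share one line:

```awk
/<title>/ {
    sub(/^.*<title>/, "")    # delete everything through the opening tag
    sub(/<\/title>.*$/, "")  # delete the closing tag through end of line
    # print $0               # uncomment to verify the bare title
}
```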
If you would prefer that the entire title be included in each filename of your downloaded episodes, the `for` loop here can be skipped and `TLENGTH` omitted entirely. For the sake of legibility in a file browser, I set `tlength` to 4 or 5 in the calling shell script so that each filename describes some aspect of the episode, using the first several words of the title, without growing out of control in length. In plain English, the loop starts with a variable `title` that stores the first word of the title, retrieves the next whitespace-separated word from the full title, appends that word to whatever string was assigned to `title` thus far, and repeats the process until `title` contains the desired number of words or until the entire original title string is used. If the titles stripped out of `feed` were not separated by whitespace, it would be necessary to modify gawk’s field separator by reassigning the appropriate separating character to `FS` from within our `BEGIN` procedure.
At this point, our Awk program is capable of modifying `title` in a number of ways. In the interest of clean, filesystem-compliant filenames, I omit punctuation using the `gsub` function and replace all uppercase characters with their corresponding lowercase characters by calling `tolower`. Unlike `sub`, `gsub` continues to make substitutions through a line even after its first substitution has been made. For example, if `gsub` encounters a title containing two commas, it will strip both commas from the title. Leave the filenames’ finishing touches to the shell and proceed to extract cURL’s key ingredient: each episode’s file URL.
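Continuing inside the same rule, one way to apply both touches; this is a sketch, and `[[:punct:]]` removes every punctuation character, which may be broader than you want:

```awk
/<title>/ {
    # ...continuing after the loop above...
    gsub(/[[:punct:]]/, "", title)  # strip all punctuation, not just the first match
    title = tolower(title)          # lowercase for tidy, uniform filenames
}
```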
As a bonus, `feed` provides each episode’s date of publication through the `<pubDate>` element in the same lines containing related file URLs. This allows us to extract each episode’s publication date and name each episode’s file accordingly for easy chronological sorting. `feed` presents these dates in the rather formal `Day, dd Mon yyyy` format, so I trim them and, back out in the shell, convert them to integers whose order corresponds with time. For now, POSIX character classes14 like `[[:alpha:]]` and `[[:digit:]]` allow us to strategically eat away everything around the date itself. The URL’s distinct and consistent format makes its extraction simple, so let’s move on to choosing a field separator for `list.tmp`.
Note that, instead of replacing the text between the date and the URL with the empty string or whitespace, I chose `@` to serve as a general-use field separator going forward. This is a purely stylistic choice, which ultimately needs to be communicated to the shell in one way or another, and acts as a kind of disambiguator between the various interpretations of whitespace and field separation. Any ASCII separator character will do as long as it does not appear in any of `list.tmp`’s values, or can otherwise be effectively escaped (e.g. protected by double-quotes or a backslash) wherever it does appear.
Recall that gawk won’t output anything without explicit instructions. Since every episode’s publication date and URL extracted in the second procedure actually correspond to the most recent value assigned to `title`, I can conveniently print all three parameters in one command, separated by the `@` character. Don’t expect every podcast’s RSS feed to behave similarly. With the right Awk script in place, we produce a stream of output with two to three fields of data on each line: the title of the episode, the URL of its audio file, and optionally the date of its publication. Alternatively, using just publication dates or some other means of naming episode files works well enough in plenty of contexts. The order in which gawk prints columns is not important so long as you know the order when the time comes to call `read`.
Loose ends
My episode titles include diacritics, so I pipe the episode data to a program called iconv15 that replaces each Unicode16 character with its closest available ASCII counterpart before redirecting the resulting output to a file called `list.tmp`. Make sure to take care of any desired post-Awk processing at this stage and check `list.tmp` for oddities, inaccuracies or clutter that don’t belong in your data. If you would like to view the processed data as it is sent to `list.tmp` in real time, then pipe to `tee list.tmp`17 instead of using the `>` operator. This will print your data to standard output as it simultaneously saves to `list.tmp`.
Eventually, we use `read` to take in one line (i.e. one episode) at a time, assign its contents to respective variables, and reference our data through those variables to carry out several instructions, repeating the process for every line of data. If you chose a character or string other than whitespace as your data’s field separator, you must inform the shell of your choice. Otherwise, `read` will assign to these key variables strings of consecutive non-whitespace characters instead of the meaningful values you carefully extracted from `feed`. The Internal Field Separator, a special shell variable “used when the shell splits words as part of expansion,”18 can be temporarily set to an arbitrary character in case you want `read` to treat certain strings as single arguments even when they contain whitespace. Before you go modifying special variables, look into your own shell’s `IFS` implementation, scoping rules, and any possible side effects of which you should be aware.
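In most Bourne-like shells, prefixing the assignment to the `read` command keeps the modified `IFS` local to that one invocation, as in this sketch:

```sh
while IFS="@" read -r date url title; do
    : # work with "$date", "$url" and "$title" here
done < "$wdir/list.tmp"
```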
Depending on your podcast archive’s size, the whole download process could take a while, so here is a basic reporting strategy for maintaining some idea of your download progress. The variable `i` is a counter that starts at 1 and increments every time an episode is downloaded, communicating how many episodes have been downloaded so far at any given time. The number of episodes listed in `feed` equals the line count of `list.tmp`, allowing `wc`19 or a preferred alternative to assign to the constant `n` the total number of episodes that should be downloaded when the process is finished. Consequently, `[i/n]` represents the share of all episodes listed in `list.tmp` that have already been downloaded.
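A sketch of that bookkeeping, placed before the loop; `wc -l` prints only the count when it reads standard input rather than a named file:

```sh
n=$(wc -l < "$wdir/list.tmp")  # total episodes to be downloaded
i=1                            # incremented after each completed download
```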
Execution
All the pieces are in place to start saving downloaded episodes to your filesystem. Before you do, you may want to take note of your storage space usage and keep an eye on it as you download so that a full disk doesn’t halt your progress partway.
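One quick way to take that reading, as a sketch; note that `df` reports usage for the whole filesystem holding your working directory, not the directory alone:

```sh
df -h "$wdir"  # human-readable free space where downloads will land
```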
The `while` loop that follows works a lot like an Awk program, parsing one line of input at a time, assigning its contents to variables, and then doing something before moving on to the next line of its input file.
`read` is a shell built-in20 that takes as arguments a list of names, parses a line from standard input according to the Internal Field Separator, assigns each field’s value to the corresponding name from the supplied list, and returns `0` (i.e. success or true) until it reaches the end of the file and every line has been read. While the order in which your fields appear in `list.tmp`—the order supplied to gawk’s `printf` function during Transform/extract—does not matter per se, every line must use the same order, and that order must match the order of names supplied to `read` according to how they are referenced inside the loop. Using descriptive names like `date`, `url` and `title` may prevent many of the ordering mistakes possible at this stage.
Keep in mind that each iteration of your `while` loop supplies a line of text to `read`, not from a filename supplied as an argument, but from standard input. In practice, this counterintuitively places the input redirection operator `<` outside of the loop and after its closing `done` keyword. Think of `read` as a controlling expression with a truth value that gets served to `while` every time a line is read from standard input, so that `while` can determine whether to run its body statement again. The expression and statement that follow `while`, regardless of their complexity, syntactically belong together. Any redirection operator or pipe thrust between them will sever the complete statement.
I want my filenames to be ordered chronologically when sorted by name. Fortunately, the dates provided in `feed` are already formatted such that GNU date can read them. Specifically, they can be passed via `date`’s `--date` option and converted to any format required using a few of `date`’s many format sequences. For readability, I store the newly formatted date in a separate variable called `fdate`.
With every ingredient we need to make cURL sing, let’s talk options. I tend to throw the kitchen sink at the versatile download tool, but for some this configuration will still feel conservative. Choose your own adventure.
If you want to see download progress out of one hundred percent for each file, don’t use `--silent` or `--show-error`. If the directory in which you would like your downloads to land already exists, you won’t need `--create-dirs`. `--location`, which follows redirects, is rarely necessary and perhaps a security risk. The only absolute essential here is `--output`, which not only allows you to name each file using your own carefully crafted format, but more importantly instructs cURL against its default behavior, which is to stream each file to standard output without saving it anywhere. Compressed audio binaries belong in your media player as audio, not in your terminal as text! Before putting cURL to work, double-check your output location and basename format so that you don’t accidentally clobber anything. If you didn’t bother to scrape any descriptive title data during Transform/extract, consider passing `--remote-name`, which preserves each file’s original server-provided name, as an alternative to `--output`. Before you do, comb through these filenames in case they use invalid characters or grow unreasonably long.
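Assembled from the options above in kitchen-sink configuration, the download command inside the loop might look like this sketch; the output path format mirrors the example report below and is my assumption:

```sh
curl --silent --show-error --location --create-dirs \
    --output "$wdir/$fdate - $title.mp3" "$url"
```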
When a file finishes downloading, `pcdl.sh` prints a line of text with overall progress as well as the completed episode’s full pathname. Using my input feed as an example, these progress-report lines should look something like the following output:
```
[1/164] /home/bob/downloads/20220725 - vero brunkow y la vida.mp3
[2/164] /home/bob/downloads/20220718 - jose miguel y la vida.mp3
[3/164] /home/bob/downloads/20220711 - luis manuel bravo y la.mp3
[4/164] /home/bob/downloads/20220704 - bubu garcia y vivir en.mp3
[5/164] /home/bob/downloads/20220531 - rodrigo guerra y la claridad.mp3
```
This way, we can spot mistakes and get an idea of how much time the remaining downloads will take without the need for busy walls of text. To store a copy of your progress report in a file, either pipe the relevant output to `tee --append` or redirect it using the `>>` operator. In either case, it is important to append rather than truncate: use `tee`’s `--append` flag or `>>` in place of `>`, or you will overwrite your report file on each iteration of your `while` loop. If you have included a counter in your progress-report lines, remember to increment `i` before the next iteration begins. `pcdl.sh` uses arithmetic expansion, but that’s probably not the most portable solution.
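Inside the loop, the report line and counter update might look like this sketch; the log filename is a placeholder:

```sh
# Print progress to the terminal and append a copy to a log file.
printf '[%s/%s] %s\n' "$i" "$n" "$wdir/$fdate - $title.mp3" \
    | tee --append "$wdir/report.txt"
i=$((i + 1))  # arithmetic expansion; POSIX, but test in your own shell
```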
This brings us to a reminder that you probably don’t need: run tests! If you’re not sure what a command does, try it out in isolation. If your script is broken but you can’t tell where, use `echo` between instructions so that you can see where data is flowing—and not flowing. Play it safe: print out programmatically generated names and inspect them against your expectations before you send them off to filesystem-altering power tools. The real puzzle is to write expressions that work without caveat against your entire RSS feed, but you may surprise yourself with how quickly and easily you are able to whip up a few lines of data-extracting Awk with a bit of practice.