Bulk downloading from an RSS feed using Awk and cURL

Let’s say you’ve decided to listen to every episode of a podcast, starting with episode one. Here is one way to save every episode listed in the RSS feed to your machine, with descriptive filenames, in one automated shot.

You’ll need cURL,[1] gawk[2] and the URL of an RSS[3] or Atom[4] feed to pull from.

Wget[5] will work in place of cURL, and any structured plaintext file containing URIs to your files of interest will do as long as you’re willing to play around with your Awk program.

Regarding portability, the example program described in this post is written for GNU utilities[6] like gawk and GNU date, so make any changes needed for your system. For example, most of the equivalent utilities found on BSD systems require short options and may even lack certain options. Platform-dependent utilities notwithstanding, dash,[7] bash,[8] zsh[9] and probably any other shell that claims a superset of the Bourne shell’s syntax will have no trouble correctly interpreting the script that follows.

Today I’m working with the RSS feed[10] provided by Platicando en Católico,[11] mostly because it’s a bit messy and hasn’t been pruned of old episodes. Even if your feed only keeps the n most recent of its entries/episodes, the ideas laid out here should carry over, and they apply to whatever flavor of Internet markup language you’re working with.

Refer to the following as you read: the shell script pcdl.sh[12] and the Awk program extract.awk.[13]

Do not expect every parameter assignment made in these programs to work compatibly with your system; adjust paths, formats and option names as needed.

Set the environment

Start by assigning some strings to variables at the top of your script for easy future access.

wdir stores the absolute path to your preferred “working directory” where files generated by the program will appear. I recommend a safe place where mistakes can be made and the executing user has write permissions.

feed_url locates on the Internet your RSS feed of interest. If you plan to run your script many times, you may prefer to download and store a copy on your machine in a preliminary step. On the other hand, fetching your feed with each execution may help clarify your input source, ease adaptation for use with other feeds, and keep your data up-to-date with the latest entries.

Finally, tlength specifies the number of words to be stripped from an episode’s title and used in the downloaded episode’s corresponding filename. In other words, each file downloaded will have as the title portion of its filename the first tlength consecutive words from the <title> element of its RSS entry.

If you plan to fetch your feed from within pcdl.sh, consider passing to cURL the options --silent and --show-error to avoid terminal clutter without silencing any failures that may occur at this stage.
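
A sketch of what the top of pcdl.sh might look like under these assumptions; the paths, the name of the saved copy (feed) and the value of tlength are all illustrative:

    #!/bin/sh
    wdir="$HOME/downloads"                          # working directory with write permission
    feed_url="https://feeds.captivate.fm/catolico/" # RSS feed of interest
    tlength=5                                       # words of each title to keep in filenames

    # fetch a fresh copy of the feed on every run; --silent hides the progress
    # meter while --show-error still reports any failures
    curl --silent --show-error "$feed_url" > "$wdir/feed"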

Transform/extract

The goal of processing feed with Awk is to extract the URL and title of each episode and pass them to cURL for download. Naturally, any line in feed containing neither the title nor the URL of some episode gets discarded. The strings that pass through our Awk script accumulate in list.tmp, a single file in which each line corresponds to one episode of the podcast and supplies the values cURL will take as arguments in the batch-download stage. Let’s take a closer look at our Awk extraction script, extract.awk.

I introduce tlength to extract.awk’s namespace using gawk’s --assign option. First, the shell expands $tlength to the value specified during Set the environment. Next, gawk assigns that value to the Awk variable tlength before extract.awk begins executing, giving it global scope throughout the program, including inside the BEGIN clause. Within that BEGIN clause I copy its value to a second variable used in the body of extract.awk, stylized in all caps as TLENGTH to distinguish it from the name handed in by the shell.
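
Roughly, the hand-off looks like this (a sketch, assuming extract.awk sits alongside pcdl.sh and the feed copy is named feed):

    # pcdl.sh: the shell expands $tlength; --assign binds that value to the
    # Awk-level name tlength before extract.awk begins executing
    gawk --assign tlength="$tlength" -f extract.awk "$wdir/feed"

    # extract.awk: copy the value handed in by the shell to an all-caps name
    BEGIN { TLENGTH = tlength }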

Each episode listed in feed corresponds to an episode title and a URL pointing to an audio file. extract.awk’s first procedure acts on lines containing the XML element tag <title>, which encloses the title of a given episode; lines that don’t match pass to the next procedure untouched. That next procedure acts only on lines containing .mp3 somewhere in the line, and each of those lines contains a single URL pointing to the audio file of a unique episode. Fortunately in my case, feed is structured such that no line matches both patterns, and every line containing a <title> element is followed, before the next <title> element appears, by a line containing its corresponding audio file URL. This lets us pair titles up with URLs by reading until a title is matched, reading until a URL is matched, pairing the two, and repeating the process for the next episode. Many feeds follow this general structure, but examine yours closely to confirm before moving forward.

Beginning with the title string, use the sub function to replace everything in the line up to and including an opening <title> tag with the empty string: "". To clarify, substituting with the empty string effectively deletes the matched text. Call the function again to replace everything beginning with and including the closing </title> tag, through the end of the line, with the empty string. What remains of the line should be the title of the episode alone, which can and should be verified by printing the line as a test.
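
A sketch of what that first procedure in extract.awk might look like, assuming each episode’s <title> element sits on its own line:

    /<title>/ {
        sub(/.*<title>/, "")     # delete everything up to and including the opening tag
        sub(/<\/title>.*/, "")   # delete the closing tag and everything after it
        # print                  # uncomment to verify that only the title text remains
    }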

If you would prefer that the entire title be included in each filename of your downloaded episodes, the for loop here can be skipped and TLENGTH omitted entirely. For the sake of legibility in a file browser, I set tlength to 4 or 5 in the calling shell script so that each filename describes some aspect of the episode—the first few words of the title—without growing out of control in length. In plain English, the loop starts with a variable title that stores the first word of the title, retrieves the next whitespace-separated word from the full title, appends that word to whatever string was assigned to title thus far, and repeats the process until title contains the desired number of words or until the entire original title string is used. If the words in the titles stripped out of feed were not separated by whitespace, it would be necessary to modify gawk’s input field separator by reassigning to FS the appropriate separating character from within the BEGIN procedure.
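
Continuing inside that same rule, the loop might be sketched as follows; after the substitutions above, $0 holds only the title text, so gawk’s default whitespace splitting yields one word per field:

    title = $1                                   # first word of the title
    for (i = 2; i <= TLENGTH && i <= NF; i++)
        title = title " " $i                     # append words until TLENGTH is reached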

At this point, our Awk program is capable of modifying title in a number of ways. In the interest of clean filesystem-compliant filenames, I omit punctuation using the gsub function and replace all uppercase characters with their corresponding lowercase character by calling tolower. Unlike sub, gsub continues to make substitutions through a line even after its first substitution has been made. For example, if gsub encounters a title containing two commas, it will strip both commas from the title. Leave the filenames’ finishing touches to the shell and proceed to extract cURL’s key ingredient: each episode’s file URL.
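
Still inside the title rule, those two cleanup steps might look like this:

    gsub(/[[:punct:]]/, "", title)   # strip every punctuation character, not just the first
    title = tolower(title)           # lowercase everything for tidy, consistent filenames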

As a bonus, feed provides each episode’s date of publication through the <pubDate> element in the same lines containing related file URLs. This allows us to extract each episode’s publication date and name each episode’s file accordingly for easy chronological sorting. feed presents these dates in the rather formal Day, dd Mon yyyy format, so I trim them here and, back out in the shell, convert them to integers whose order corresponds with time. For now, POSIX character classes[14] like [[:alpha:]] and [[:digit:]] let us strategically eat away everything around the date itself. The URL’s distinct and consistent format makes its extraction simple, so let’s move on to choosing a field separator for list.tmp.

Note that, instead of replacing the text between the date and the URL with the empty string or whitespace, I chose @ to serve as a general-use field separator going forward. This is a purely stylistic choice that ultimately needs to be communicated to the shell one way or another; it simply disambiguates field separation from the whitespace that already appears inside titles and dates. Any ASCII character will do as a separator as long as it does not appear in any of list.tmp’s values, or can otherwise be effectively canceled (e.g. protected by double-quotes or a backslash) wherever it does appear.

Recall that gawk won’t output anything without explicit instructions. Since the publication date and URL extracted in the second procedure always correspond to the most recently assigned title, I can conveniently print all three parameters in one command, separated by the @ character. Don’t expect every podcast’s RSS feed to behave similarly. With the right Awk script in place, we produce a stream of output with two to three fields of data on each line: the title of the episode, the URL of its audio file and, optionally, the date of its publication. Alternatively, using just publication dates or some other means of naming episode files works well enough in plenty of contexts. The order in which gawk prints the fields is not important so long as you know that order when the time comes to call read.
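
Put together, a rough sketch of the second procedure follows. The patterns assume, as in my feed, that the <pubDate> element and the enclosure’s url="…" attribute share each .mp3 line, and that nothing meaningful follows the .mp3 extension; check your own feed and adjust:

    /\.mp3/ {
        # trim the publication date down to "dd Mon yyyy":
        # drop everything through the weekday, then drop the time onward
        date = $0
        sub(/.*<pubDate>[[:alpha:]]+, /, "", date)
        sub(/ [[:digit:]]+:.*/, "", date)

        # isolate the audio file URL from the enclosure attribute
        url = $0
        sub(/.*url="/, "", url)
        sub(/\.mp3.*/, ".mp3", url)   # drops any trailing query string; keep it if your host needs one

        # one line per episode: date, URL and the most recently captured
        # title, separated by "@"
        printf "%s@%s@%s\n", date, url, title
    }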

Loose ends

My episode titles include diacritics, so I pipe the episode data to a program called iconv[15] that replaces each Unicode[16] character with its closest available ASCII counterpart before redirecting the resulting output to a file called list.tmp. Take care of any desired post-Awk processing at this stage and check list.tmp for oddities, inaccuracies or clutter that don’t belong in your data. If you would like to view the processed data as it is sent to list.tmp in real time, pipe to tee list.tmp[17] instead of using the > operator. This will print your data to standard output as it simultaneously saves to list.tmp.
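
The hand-off in pcdl.sh might then read roughly like this; the encoding names follow GNU iconv’s usual spellings, and //TRANSLIT asks for the closest available approximation:

    gawk --assign tlength="$tlength" -f extract.awk "$wdir/feed" \
        | iconv --from-code=UTF-8 --to-code=ASCII//TRANSLIT \
        | tee "$wdir/list.tmp"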

Eventually, we use read to take in one line (i.e. one episode) at a time, assign its contents to respective variables, and reference our data through those variables to carry out several instructions, repeating the process for every line of data. If you chose a character or string other than whitespace as your data’s field separator, you must tell the shell which one; otherwise, read will assign to these key variables strings of consecutive non-whitespace characters instead of the meaningful values you carefully extracted from feed. The Internal Field Separator, a special shell variable "used when the shell splits words as part of expansion,"[18] can be temporarily set to an arbitrary character in case you want read to treat certain strings as single arguments even when they contain whitespace. Before you go modifying special variables, look into your own shell’s IFS implementation, scoping rules, and any side effects of which you should be aware.

Depending on your podcast archive’s size, the whole download process could take a while, so here is a basic reporting strategy for keeping some idea of your download progress. The variable i is a counter that starts at 1 and increments every time an episode is downloaded, communicating how many episodes have been downloaded so far at any given time. The number of episodes listed in feed equals the line count of list.tmp, so wc[19] or a preferred alternative can assign to the constant n the total number of episodes that should have been downloaded by the time the process finishes. Consequently, [i/n] represents the share of all episodes listed in list.tmp that have already been downloaded.
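
In pcdl.sh, that bookkeeping can be as simple as the following sketch:

    n=$(wc -l < "$wdir/list.tmp")   # total number of episodes to download
    i=1                             # number of the episode currently being downloaded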

Execution

All the pieces are in place to start saving downloaded episodes to your filesystem. Before you do, you may want to take note of your storage space usage and keep an eye on it as you download so that a full disk doesn’t halt your progress partway.

The while loop that follows works a lot like an Awk program, parsing one line of input at a time, assigning its contents to variables, and then doing something before moving on to the next line of its input file.

read is a shell built-in[20] that takes as arguments a list of names, parses a line from standard input according to the Internal Field Separator, assigns each field’s value to the corresponding name from that list, and returns 0 (i.e. success, or true) until it reaches the end of the file and every line has been read. While the order in which your fields appear in list.tmp—the order supplied to gawk’s printf function during Transform/extract—does not matter per se, every line must use the same order, and that order must match the order of the names supplied to read according to how they are referenced inside the loop. Using descriptive names like date, url and title may prevent many of the ordering mistakes possible at this stage.

Keep in mind that each iteration of your while loop supplies a line of text to read, not from a filename supplied as an argument, but from standard input. In practice, this counterintuitively places the input redirection operator < outside of the loop and after its closing done keyword. Think of read as a controlling expression with a truth value that gets served to while every time a line is read from standard input, so that while can determine whether to run its body statement again. The expression and statement that follow while, regardless of their complexity, syntactically belong together. Any redirection operator or pipe thrust between them will sever the complete statement.
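
A skeleton of the loop, using the @ separator chosen earlier; prefixing the IFS assignment to read is one common way to scope the change to that single command, and the variable names and their order are mine, so match them to whatever extract.awk prints:

    while IFS=@ read -r date url title
    do
        # ...format the date, build the filename, call cURL (sketched below)...
        :
    done < "$wdir/list.tmp"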

I want my filenames to be ordered chronologically when sorted by name. Fortunately, the dates provided in feed are already formatted such that GNU date can read them. Specifically, they can be passed via date’s --date option and converted to any format required using a few of date’s many format sequences. For readability, I store the newly formatted date in a separate variable called $fdate.
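
For instance, to turn a date like "25 Jul 2022" into the sortable prefix seen in the output below (a sketch using GNU date’s format sequences):

    fdate=$(date --date="$date" +%Y%m%d)   # e.g. "25 Jul 2022" -> "20220725"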

With every ingredient we need to make cURL sing, let’s talk options. I tend to throw the kitchen sink at the versatile download tool, but for some this configuration will still feel conservative. Choose your own adventure.

If you want to see download progress out of one hundred percent for each file, don’t use --silent or --show-error. If the directory where you would like your downloads to land already exists, you won’t need --create-dirs. --location, which follows redirects, is rarely necessary and perhaps a security risk. The only absolute essential here is --output, which not only allows you to name each file using your own carefully crafted format, but more importantly overrides cURL’s default behavior, which is to stream each file to standard output without saving it anywhere. Compressed audio binaries belong in your media player as audio, not in your terminal as text! Before putting cURL to work, double-check your output location and basename format so that you don’t accidentally clobber anything. If you didn’t bother to scrape any descriptive title data during Transform/extract, consider passing --remote-name, which preserves each file’s original server-provided name, as an alternative to --output. Before you do, comb through these filenames in case they contain invalid characters or exceed your filesystem’s limits on filename length.
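
One possible invocation from inside the loop, with the kitchen-sink options discussed above; the downloads subdirectory and the filename format are my own choices, so trim to taste:

    file="$wdir/downloads/${fdate} - ${title}.mp3"
    curl --silent --show-error --location --create-dirs \
        --output "$file" "$url"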

When a file finishes downloading, pcdl.sh prints a line of text with overall progress as well as the completed episode’s full pathname. Using my input feed as an example, these progress-report lines should look something like the following output:

[1/164] /home/bob/downloads/20220725 - vero brunkow y la vida.mp3
[2/164] /home/bob/downloads/20220718 - jose miguel y la vida.mp3
[3/164] /home/bob/downloads/20220711 - luis manuel bravo y la.mp3
[4/164] /home/bob/downloads/20220704 - bubu garcia y vivir en.mp3
[5/164] /home/bob/downloads/20220531 - rodrigo guerra y la claridad.mp3

This way, we can spot mistakes and get an idea of how much time the remaining downloads will take without the need for busy walls of text. To store a copy of your progress report in a file, either pipe the relevant output to tee --append or redirect it with >> rather than >; otherwise, you will overwrite your report file on each iteration of your while loop. If you have included a counter in your progress-report lines, remember to increment i before the next iteration begins. pcdl.sh uses arithmetic expansion, but that’s probably not the most portable solution.
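
The report line and the counter bump might then be sketched like so; report.txt is an illustrative name:

    printf '[%s/%s] %s\n' "$i" "$n" "$file" | tee --append "$wdir/report.txt"
    i=$((i + 1))   # arithmetic expansion; see the portability caveat above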

This brings us to a reminder that you probably don’t need: run tests! If you’re not sure what it does, try it out in isolation. If your script is broken but you can’t tell where, use echo between instructions so that you can see where data is flowing—and not flowing. Play it safe: print out programmatically generated names and inspect them against your expectations before you send them off to filesystem-altering power tools. The real puzzle is to write expressions that work without caveat against your entire RSS feed, but you may surprise yourself with how quickly and easily you are able to whip up a few lines of data-extracting Awk with a bit of practice.

  1. https://curl.se/ 
  2. https://www.gnu.org/software/gawk/ 
  3. https://www.rssboard.org/rss-specification 
  4. https://datatracker.ietf.org/doc/html/rfc5023 
  5. https://www.gnu.org/software/wget/wget.html 
  6. https://www.gnu.org/software/software.html 
  7. https://git.kernel.org/pub/scm/utils/dash/dash.git 
  8. https://www.gnu.org/software/bash/bash.html 
  9. https://zsh.org/ 
  10. https://feeds.captivate.fm/catolico/ 
  11. http://platicandoencatolico.com/ 
  12. pcdl.sh 
  13. extract.awk 
  14. https://www.gnu.org/software/emacs/manual/… 
  15. https://www.gnu.org/software/libc/manual/… 
  16. https://www.unicode.org/main.html 
  17. https://www.gnu.org/software/coreutils/manual/… 
  18. https://www.gnu.org/software/bash/manual/… 
  19. https://www.gnu.org/software/coreutils/manual/… 
  20. https://pubs.opengroup.org/onlinepubs/9699919799/…