Rake Part 2: File Lists


In the last episode, we wrote this Rakefile. It automates building three Markdown files into HTML files.

task :default => :html
task :html => %W[ch1.html ch2.html ch3.html]

rule ".html" => ".md" do |t|
  sh "pandoc -o #{t.name} #{t.source}"
end

We really don’t want to have to edit this file every time we add a new file to process, though. Instead, we’d like the Rakefile to find the files to be built automatically.

To give us something to experiment with, I’ve set up a sample project directory. It contains four Markdown chapter files and one appendix file in a subdirectory, all of which should be built into HTML files. It also has some other stuff which we don’t want to build. There’s a ~ch1.md file which is some kind of temporary file left behind by an editor. And there’s a scratch directory, the contents of which should be ignored.

$ tree
.
├── ~ch1.md
├── ch1.md
├── ch2.md
├── ch3.md
├── ch4.markdown
├── scratch
│   └── test.md
├── subdir
│   └── appendix.md
└── temp.md

This project is under Git revision control. If we tell Git to list the files it knows about, we see a subset of the files from before. Notably missing is a file called temp.md, which has not been registered with Git and probably never will be. It too should be left out of the list of files to build.

$ git ls-files
ch1.md
ch2.md
ch3.md
ch4.markdown
scratch/test.md
subdir/appendix.md
~ch1.md

In order to automatically discover just the files which should be built, we turn to Rake file lists. Let’s explore what file lists are, and what they are capable of.

To create a file list, we use the subscript operator on the Rake::FileList class, passing in a list of strings representing files.

require 'rake'
files = Rake::FileList["ch1.md", "ch2.md", "ch3.md"]
files # => ["ch1.md", "ch2.md", "ch3.md"]

So far this isn’t very exciting. But we’re just getting started. Rather than listing files individually, we can pass a FileList a shell glob pattern. Let’s give it the pattern *.md.

require 'rake'
Dir.chdir "project"
files = Rake::FileList["*.md"]
files # => ["ch1.md", "temp.md", "ch3.md", "ch2.md", "~ch1.md"]

Now we start to see the power of a FileList. But this isn’t quite the list of files we want. It contains some files we don’t care about, and it’s missing some files we do want.

We’ll address the missing files first. We add a *.markdown pattern to find files which use the long-form extension.

require 'rake'
Dir.chdir "project"
files = Rake::FileList["*.md", "*.markdown"]
files # => ["ch1.md", "temp.md", "ch3.md", "ch2.md", "~ch1.md", "ch4.markdown"]

But we’re still missing the appendix file. To fix this, we change the glob patterns to match any level in the project directory tree.

require 'rake'
Dir.chdir "project"
files = Rake::FileList["**/*.md", "**/*.markdown"]
puts files 

# >> ch1.md
# >> temp.md
# >> ch3.md
# >> ch2.md
# >> scratch/test.md
# >> ~ch1.md
# >> subdir/appendix.md
# >> ch4.markdown

Now we’ve found all four chapters and the appendix, but we’ve picked up a lot of junk along the way. Let’s start winnowing down the list of files. For this, we’ll use exclusion patterns.

We start by ignoring files that begin with a ~ character.

require 'rake'
Dir.chdir "project"
files = Rake::FileList["**/*.md", "**/*.markdown"]
files.exclude("~*")
puts files 

# >> ch1.md
# >> temp.md
# >> ch3.md
# >> ch2.md
# >> scratch/test.md
# >> subdir/appendix.md
# >> ch4.markdown

Next we’ll ignore files in the scratch directory. Just to demonstrate that it’s possible, we’ll use a regular expression for this exclusion instead of a shell glob.

require 'rake'
Dir.chdir "project"
files = Rake::FileList["**/*.md", "**/*.markdown"]
files.exclude("~*")
files.exclude(/^scratch\//)
puts files 

# >> ch1.md
# >> temp.md
# >> ch3.md
# >> ch2.md
# >> subdir/appendix.md
# >> ch4.markdown
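To try these patterns without the sample project, here is a minimal sketch that recreates the same tree in a temporary directory and applies the two exclusions used so far. The with_sample_project and list_sources helpers and the SAMPLE_PATHS constant are invented for this illustration and are not part of Rake. The result is sorted so that it does not depend on the order in which the filesystem returns entries.

```ruby
require 'rake'
require 'tmpdir'
require 'fileutils'

# Paths mirroring the sample project; the file contents do not matter here.
SAMPLE_PATHS = %w[
  ~ch1.md ch1.md ch2.md ch3.md ch4.markdown
  scratch/test.md subdir/appendix.md temp.md
]

# Hypothetical helper: create the sample tree in a temp dir and yield its path.
def with_sample_project
  Dir.mktmpdir do |root|
    SAMPLE_PATHS.each do |path|
      full = File.join(root, path)
      FileUtils.mkdir_p(File.dirname(full))
      FileUtils.touch(full)
    end
    yield root
  end
end

# Hypothetical helper: glob for Markdown files, then apply both exclusions.
def list_sources(root)
  Dir.chdir(root) do
    files = Rake::FileList["**/*.md", "**/*.markdown"]
    files.exclude("~*")          # shell glob exclusion
    files.exclude(/^scratch\//)  # regular expression exclusion
    files.to_a.sort              # resolve the list while still inside root
  end
end

result = with_sample_project { |root| list_sources(root) }
puts result
```

Note that temp.md is still in the result at this stage, since nothing has excluded it yet.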

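We’ve still got the file temp.md hanging around. As we saw before, this file isn’t registered with Git. We’d like to make an exclusion rule that says to ignore any non-Git-controlled file. To do this, we pass a block to .exclude; any file for which the block returns true is dropped from the list. Inside, we put an incantation which determines whether Git is aware of the file: git ls-files prints nothing for a path it doesn’t track, so an empty result means the file should be excluded.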

require 'rake'
Dir.chdir "project"
files = Rake::FileList["**/*.md", "**/*.markdown"]
files.exclude("~*")
files.exclude(/^scratch\//)
files.exclude do |f|
  `git ls-files #{f}`.empty?
end
puts files 

# >> ch1.md
# >> ch3.md
# >> ch2.md
# >> subdir/appendix.md
# >> ch4.markdown

This filters out the temp file, and finally we are left with the list of just the files we care about.
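The block above starts one Git process per candidate file and interpolates the file name into a shell command unescaped. As a possible alternative, the following sketch asks for the tracked files once and checks membership in a Set. The exclude_untracked helper is hypothetical, and the tracked list is hard-coded here so the example runs outside a repository; in a Rakefile it would presumably come from a single git ls-files call.

```ruby
require 'rake'
require 'set'

# Hypothetical helper: drop every entry that is not in tracked_paths.
def exclude_untracked(file_list, tracked_paths)
  tracked = tracked_paths.to_set
  file_list.exclude { |f| !tracked.include?(f) }
end

# Assumption: in a real Rakefile this list would be built with something like
#   tracked_paths = `git ls-files`.split("\n")
# Here it is copied from the git ls-files output shown earlier.
tracked_paths = %w[
  ch1.md ch2.md ch3.md ch4.markdown
  scratch/test.md subdir/appendix.md ~ch1.md
]

# Literal names do not need to exist on disk, which keeps the sketch portable.
files = Rake::FileList["ch1.md", "ch2.md", "temp.md", "subdir/appendix.md"]
exclude_untracked(files, tracked_paths)
puts files
```

Only temp.md is removed, because it is the only candidate missing from the tracked list. Splitting on newlines assumes file names that Git prints unquoted; names with unusual characters would need more careful parsing.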

Next we update the code to make the FileList definition a little more self-contained. We change from the subscript shorthand to FileList.new, and pass a block to the constructor. The FileList will yield itself to this block, which means we can set up all of our exclusions inside the block.

require 'rake'
Dir.chdir "project"
files = Rake::FileList.new("**/*.md", "**/*.markdown") do |fl|
  fl.exclude("~*")
  fl.exclude(/^scratch\//)
  fl.exclude do |f|
    `git ls-files #{f}`.empty?
  end
end
puts files 

# >> ch1.md
# >> ch3.md
# >> ch2.md
# >> subdir/appendix.md
# >> ch4.markdown

We need to make one more change to our list of files before we can return to our Rakefile. In the Rakefile what we needed was a list of the files to be built, not the source files that correspond to them. To convert our list of input files to a list of output files, we use the #ext method. We give it a .html file extension, and it returns a new list of files with all of the original Markdown extensions replaced with .html.

require 'rake'
Dir.chdir "project"
files = Rake::FileList.new("**/*.md", "**/*.markdown") do |fl|
  fl.exclude("~*")
  fl.exclude(/^scratch\//)
  fl.exclude do |f|
    `git ls-files #{f}`.empty?
  end
end
puts files.ext(".html")

# >> ch1.html
# >> ch3.html
# >> ch2.html
# >> subdir/appendix.html
# >> ch4.html
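A small sketch of #ext in isolation may help here. It uses literal file names so that nothing needs to exist on disk, and it also shows one caveat worth keeping in mind: two sources that differ only in their extension map to the same target. The notes.md and notes.markdown names are made up for the illustration.

```ruby
require 'rake'

sources = Rake::FileList["ch1.md", "ch4.markdown", "subdir/appendix.md"]

# ext returns a new list; whatever extension a name had is replaced.
targets = sources.ext(".html")
puts targets

# The original list is left unchanged.
puts sources

# Rake also adds #ext to String, which is what the list calls per entry.
single = "ch1.md".ext(".html")
puts single

# Caveat: sources differing only by extension collide on one target name.
clash = Rake::FileList["notes.md", "notes.markdown"].ext(".html")
puts clash.to_a.uniq.size
```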

Now we’re ready to come back to our Rakefile. We replace our hardcoded list of target files with the FileList we just built.

Since we are now supporting Markdown files with either a .md or .markdown extension, we have to make one more change to tell Rake it can build an HTML file for either one. For now, we’ll do this by simply duplicating the rule. In the future we’ll look at a way to avoid this duplication.

source_files = Rake::FileList.new("**/*.md", "**/*.markdown") do |fl|
  fl.exclude("~*")
  fl.exclude(/^scratch\//)
  fl.exclude do |f|
    `git ls-files #{f}`.empty?
  end
end

task :default => :html
task :html => source_files.ext(".html")

rule ".html" => ".md" do |t|
  sh "pandoc -o #{t.name} #{t.source}"
end

rule ".html" => ".markdown" do |t|
  sh "pandoc -o #{t.name} #{t.source}"
end

When we run rake, we can see that it builds all the right HTML files:

$ rake
pandoc -o ch1.html ch1.md
pandoc -o ch2.html ch2.md
pandoc -o ch3.html ch3.md
pandoc -o subdir/appendix.html subdir/appendix.md
pandoc -o ch4.html ch4.markdown
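As for the duplicated rules, one possible way to collapse them is sketched below. It relies on the fact that a Rake rule’s dependency can be a proc, which Rake calls with the target name to find the source. This is only an illustration under stated assumptions, not necessarily the approach a later episode takes; the source_for helper and the SOURCE_FILES constant are hypothetical names.

```ruby
require 'rake'

# Literal names keep the sketch independent of the files on disk; in the
# Rakefile this would be the source_files list built above.
SOURCE_FILES = Rake::FileList["ch1.md", "ch4.markdown", "subdir/appendix.md"]

# Hypothetical helper: find the source whose name maps onto the target.
# It returns nil when nothing matches, so it assumes Rake is only ever
# asked for targets derived from SOURCE_FILES.
def source_for(target, sources = SOURCE_FILES)
  sources.detect { |src| src.ext(".html") == target }
end

# One rule covers both extensions: the lambda looks up the matching source.
rule ".html" => ->(target) { source_for(target) } do |t|
  sh "pandoc -o #{t.name} #{t.source}"
end

puts source_for("ch4.html")
```

Defining the rule does not run pandoc; the command only runs when Rake is asked to build an .html target.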

I think that’s enough Rake for today. Happy hacking!


5 comments

  1. Thanks for the great tutorial. I think a more robust version of
    sh "pandoc -o #{t.name} #{t.source}"
    would be
    sh "pandoc -o #{Shellwords.escape(t.name)} #{Shellwords.escape(t.source)}"
    to cope with special (shell-interpreted) characters in the names.

    I don’t think t.name and t.source are shell-escaped. For some reason, lib/rake/application.rb requires shellwords, but Rake never actually uses that library (at least a quick ack through the source code would indicate so).
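The suggestion in this comment can be sketched with Ruby’s standard Shellwords library. The pandoc_command helper and the file names are invented for the illustration; the command string is only built and printed, not executed.

```ruby
require 'shellwords'

# Hypothetical helper: build the pandoc command with both paths escaped,
# so spaces and other shell metacharacters stay inside a single argument.
def pandoc_command(target, source)
  "pandoc -o #{Shellwords.escape(target)} #{Shellwords.escape(source)}"
end

cmd = pandoc_command("my notes.html", "my notes.md")
puts cmd

# Splitting the string the way a shell would recovers the original names.
words = Shellwords.split(cmd)
puts words.inspect
```

Another option is to pass the command to sh as separate arguments (sh "pandoc", "-o", t.name, t.source), which, as with Kernel#system, should bypass the shell entirely.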
