A Blog by Hangindev

Generate Sitemap and RSS for Vercel Next.js App with Dynamic Routes

Next.js blog with RSS

Recently, I had an interesting problem to solve: generating a sitemap.xml and an RSS feed endpoint for this blog, which is built with Next.js and uses Sanity as the content backend. The post pages are dynamic routes that fetch content from Sanity for pre-rendering. The tricky part is keeping the sitemap and RSS in sync with the dynamically generated pages. In this post, I will share how I solved it by generating the files with a postbuild script and configuring Vercel Routes to handle the requests.

The solution I took is specific to my setup, in which I deploy my serverless Next.js app to the Vercel platform, but the concept should transfer to your use case.

The "Problem"

There is no API in Next.js that generates a sitemap or RSS feed based on the compiled build results. The commonly suggested solutions only work without dynamic routes.

Requirements

  • The generation of the files (sitemap.xml and feed.json) should be automated in the build process.
  • The generation should be based on the pre-rendered HTML files, since the content stored in my CMS needs to be serialized into React components (the same applies if you are using MDX) and I don't want to duplicate the work that Next.js has already done for me.
  • The files should be built ahead of requests. That means I am not using Next.js API routes to perform the task, even though that would be possible and valid.

Solution

Overview

I created a postbuild script that searches for the pre-rendered HTML files inside the output pages folder and parses them with cheerio to extract data. The data is then used to create sitemap.xml and feed.json, which are written to the output static folder. Lastly, I configured the vercel.json file to route requests for /sitemap.xml and /feed.json to the corresponding files in the static folder. Below are some implementation details:

Add a postbuild script in package.json:

It will be run automatically by npm or yarn after the build script is executed. (All scripts support pre and post hooks. Learn more about that here.)

{
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "start": "next start",
    "postbuild": "node ./scripts/postbuild"
  },
  "...": "..."
}

Create /scripts/postbuild.js

I will break it down piece by piece. First the main function:

function main() {
  const pagesDir = './.next/serverless/pages';
  const pageFiles = getPageFiles(pagesDir);
  buildRss(pageFiles, pagesDir);
  buildSiteMap(pageFiles);
}

The output pages directory in the Vercel environment is located at ./.next/serverless/pages. If you want to test in a local environment, the output pages directory is located at ./.next/server/static/${buildId}/pages, and you can find the buildId in ./.next/BUILD_ID.

The getPageFiles function simply collects all the HTML files (excluding 404.html) in the output pages directory.

const fs = require('fs');
const path = require('path');

function getPageFiles(directory, files = []) {
  const entries = fs.readdirSync(directory, { withFileTypes: true });
  entries.forEach(entry => {
    const absolutePath = path.resolve(directory, entry.name);
    if (entry.isDirectory()) {
      // wow recursive 🐍
      getPageFiles(absolutePath, files);
    } else if (isPageFile(absolutePath)) {
      files.push(absolutePath);
    }
  });
  return files;
}

function isPageFile(filename) {
  return (
    path.extname(filename) === '.html' && !filename.endsWith('404.html')
  );
}

After collecting the absolute paths of all HTML files, I pass them to buildRss and buildSiteMap and use cheerio to parse the content. (Learn more about cheerio.) It is very unlikely that you can use the following code without modification, because how I use cheerio to extract the data depends on the HTML structure of my React components.

const cheerio = require('cheerio');

function buildRss(pageFiles, pagesDir) {
  // use the reduce method to collect all RSS data
  const rssData = pageFiles.reduce(
    (data, file) => {
      // the pathname is the relative path from '/pages' to the HTML file
      const pathname = path.relative(pagesDir, file).slice(0, -'.html'.length);
      // collect all RSS top-level info from the index page
      if (pathname === 'index') {
        const htmlString = fs.readFileSync(file, 'utf8');
        const $ = cheerio.load(htmlString);
        data.title = $('title').text();
        data.home_page_url = $(`meta[property='og:url']`).attr('content');
        data.feed_url = $(
          `link[rel='alternate'][type='application/json']`
        ).attr('href');
        data.description = $(`meta[name='description']`).attr('content');
        data.icon = $(`link[sizes='512x512']`).attr('href');
        data.favicon = $(`link[sizes='64x64']`).attr('href');
      }
      // only add to RSS if the pathname is '/blog/*'
      if (pathname.startsWith('blog')) {
        const htmlString = fs.readFileSync(file, 'utf8');
        const $ = cheerio.load(htmlString);
        // remove the placeholder image for lazy loading images
        $(`#Content img[aria-hidden='true']`).remove();
        data.items.push({
          url: $(`meta[property='og:url']`).attr('content'),
          id: pathname.substring('blog/'.length),
          content_html: $('#Content').html(),
          title: $('article h1').text(),
          summary: $(`meta[name='description']`).attr('content'),
          image: $(`meta[property='og:image']`).attr('content'),
          banner_image: $(`meta[property='og:image']`).attr('content'),
          date_published: $('time').attr('datetime'),
          author: {
            name: $(`a[rel='author']`).text(),
            url: $(`a[rel='author']`).attr('href'),
            avatar: $(`img#Avatar`).attr('src'),
          },
        });
      }
      return data;
    },
    {
      version: 'https://jsonfeed.org/version/1',
      items: [],
    }
  );
  // sort the items by the publishing date, newest first
  rssData.items.sort(byDateDesc);
  // write to the output static folder
  fs.writeFileSync(
    path.join('./.next/static', 'feed.json'),
    JSON.stringify(rssData, null, 2)
  );
}

// comparator for the sort above: most recently published item first
function byDateDesc(a, b) {
  return new Date(b.date_published) - new Date(a.date_published);
}
function buildSiteMap(pageFiles) {
  // I am using the Open Graph URL tag as the url,
  // but you can simply concat the base URL with the relative path
  const urls = pageFiles.map(file => {
    const htmlString = fs.readFileSync(file, 'utf8');
    const $ = cheerio.load(htmlString);
    return $(`meta[property='og:url']`).attr('content');
  });
  const sitemap = `
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xmlns:mobile="http://www.google.com/schemas/sitemap-mobile/1.0"
  xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
  xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  ${urls
    .map(
      url => `
  <url>
    <loc>${url}</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
`
    )
    .join('')}
</urlset>
`;
  fs.writeFileSync(path.join('./.next/static', 'sitemap.xml'), sitemap);
}
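One caveat with interpolating URLs directly into the template string: XML requires special characters to be escaped, so a URL containing, say, an ampersand in a query string would make the sitemap invalid. A minimal escaper for the loc values, shown as a sketch rather than part of the original script:

```javascript
// Escape the five XML special characters so a URL can be
// safely embedded inside a <loc> element. '&' must be replaced
// first so the entities introduced by later steps are not re-escaped.
function escapeXml(value) {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/'/g, '&apos;')
    .replace(/"/g, '&quot;');
}
```

With this in place, the sitemap template would use `${escapeXml(url)}` instead of `${url}`.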

Finally, here is the vercel.json routing configuration, which you can learn more about here.

{
  "routes": [
    { "src": "/sitemap.xml", "dest": "/_next/static/sitemap.xml" },
    { "src": "/feed.json", "dest": "/_next/static/feed.json" }
  ]
}

You can find the complete code in this gist.

Wrap up

So there you have it: my hack (or not?) for putting together sitemap and RSS generation after countless rounds of trial and error and googling. It relies on knowing the location of the output files in the Vercel environment, which may change as Next.js and Vercel evolve, so keep an eye on it. This might not be the most straightforward method, so please let me know about your implementation. Thank you for reading. Until next time! 👋