Better Search

2023-02-20 11:07 by Staff

It's technically been live now for a few months, but we're happy to officially announce the successful roll-out of a new search mechanism that allows full-text search for all transcripts site-wide. At the top of the homepage and each podcast's list page, you should now see a search bar that allows quick retrieval of text matches on each page's title and body content.

Providing an affordable full-text search system to index hundreds of thousands of documents, without the use of an expensive dedicated server, is no simple feat. To minimize costs while keeping search performance high, we tried a variety of client-side search tools, like Lunr.js, but these either required building and re-uploading massive indexes every day or ran too slowly in the browser given the size of our dataset. Most of these tools were designed to index a simple blog with a few dozen pages, not hundreds of thousands.

We came close to using Sql.js, which worked fairly well in the browser, even if it is still a bit experimental, but its lack of sharding required rebuilding large indexes every day, which wasn't practical.

Eventually, we settled on Pagefindfs, which runs quickly and supports sharding. That lets us build segmented indexes and upload only the segments affected by the most recent document changes, which saves a lot of money and bandwidth.
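
For the curious, here is a rough sketch of how the "only upload what changed" idea can work with a sharded index. This is not our actual deployment code, and it uses no Pagefindfs API. It assumes only that the index build leaves a directory of shard files on disk, and it compares content hashes against a manifest saved from the previous build. The function names and the manifest layout are made up for illustration.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(index_dir: Path) -> dict[str, str]:
    """Map each shard's relative path to its content hash.

    Assumption: every file under index_dir is an index shard.
    """
    return {
        path.relative_to(index_dir).as_posix(): hash_file(path)
        for path in sorted(index_dir.rglob("*"))
        if path.is_file()
    }


def changed_shards(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Shards that are new or whose content differs from the last build."""
    return sorted(
        name for name, digest in current.items() if previous.get(name) != digest
    )


def removed_shards(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Shards that existed in the last build but are gone now."""
    return sorted(set(previous) - set(current))
```

After each build, you would call build_manifest on the fresh index directory, upload only the files named by changed_shards, delete the ones named by removed_shards, and save the new manifest for the next run. The savings depend on the indexer keeping untouched shards byte-for-byte identical between builds, which is the property that made a sharded index attractive to us in the first place.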

The only limitation is that some of the more advanced search features, like filtering by publication date or author, require that each transcript page be explicitly tagged with metadata in a specific Pagefindfs syntax. Since we've been building pages for years, long before we ever knew about Pagefindfs, most pages don't contain those tags. Going forward, new pages will contain Pagefindfs tags, and given time, we might retroactively parse our old documents and insert the appropriate tags.