Standard Search: a search engine for standard.site posts

31 January 2026

The other day I put a search engine online which attempts to index all articles in the Atmosphere that conform to the standard.site lexicon: Standard Search.

In practice this mostly covers people blogging on sites like pckt.blog or leaflet.pub but also independent websites like mine that have gone out of their way to integrate with ATProto. Adoption of the lexicon is growing quickly: when I first launched the search engine a few days ago there were around 3900 indexed documents and today there are 4122.

This was a relatively easy job as far as search engines go. Between the ATProto firehose and the relay's ability to list all repos with a particular collection, there is no great technical hurdle to finding all the standard.site records that exist on the network, and the process is entirely automated by tap. This means that no crawling is required for discovery.

Tap is a standalone service that's designed to be consumed over an HTTP API, which means you can use it from basically any programming environment. Since I'm a fan of Rust I previously built the library tapped. This hides the fiddly details of running tap, making requests and parsing the JSON events.

The search and indexing part wasn't particularly tough either because I could use tantivy. This is basically a pure Rust equivalent of Lucene and it works incredibly well. It even handles generating the snippets of text that highlight where the keywords appear. The existing AT blogging platforms all seem to have plaintext fields in their content data, so I extract and index those. For documents that don't include any content in the AT record, I scrape the HTML and run it through dom_smoothie to get the text.

The main part that doesn't come off-the-shelf is verification. Standard.site has a two-part verification check. The AT record points to the page's URL, and the HTML at that URL needs to have a <link> tag pointing back to that same AT record to prove ownership. There's also a publication record which is verified against a .well-known URL. So to rule out forged records, a bit of scraping is required for each discovered post. Of course, at this early stage nobody (to my knowledge) is actually trying to forge anything. Verification failures mostly occur because someone didn't realise that they needed to add <link> tags. (If you want to check your own implementation, I built an online validator.)
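
To make the link-back half of that check concrete, here is a minimal sketch in std-only Rust. It is not the code the search engine runs. The function names are made up for illustration, the rel value passed in by the caller is an assumption that should be checked against the standard.site documentation, and the tag scanning is deliberately naive (a real implementation would use a proper HTML parser and would also fetch the page itself).

```rust
/// Collect the raw text of every `<link ...>` tag in the page.
/// Naive on purpose: it does not cope with `>` inside attribute values.
fn link_tags(html: &str) -> Vec<&str> {
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = html.to_ascii_lowercase();
    let mut tags = Vec::new();
    let mut from = 0;
    while let Some(pos) = lower[from..].find("<link") {
        let start = from + pos;
        match lower[start..].find('>') {
            Some(len) => {
                tags.push(&html[start..start + len + 1]);
                from = start + len + 1;
            }
            None => break,
        }
    }
    tags
}

/// Read a quoted attribute value out of one tag. `name` must be lowercase ASCII.
fn attr_value<'a>(tag: &'a str, name: &str) -> Option<&'a str> {
    let lower = tag.to_ascii_lowercase();
    let mut from = 0;
    while let Some(pos) = lower[from..].find(name) {
        let start = from + pos;
        let after = start + name.len();
        from = after;
        // Require leading whitespace so "href" does not match "data-href".
        let at_boundary = start > 0 && lower.as_bytes()[start - 1].is_ascii_whitespace();
        if !at_boundary {
            continue;
        }
        let Some(rest) = tag[after..].trim_start().strip_prefix('=') else {
            continue;
        };
        let rest = rest.trim_start();
        for quote in ['"', '\''] {
            if let Some(body) = rest.strip_prefix(quote) {
                return body.find(quote).map(|end| &body[..end]);
            }
        }
    }
    None
}

/// True if the page has a <link> with the given rel whose href is exactly
/// the AT URI of the record that claimed this page.
fn links_back(html: &str, rel: &str, at_uri: &str) -> bool {
    link_tags(html).iter().any(|tag| {
        attr_value(tag, "rel") == Some(rel) && attr_value(tag, "href") == Some(at_uri)
    })
}
```

A caller would fetch the URL named in the AT record, then pass the response body, the expected rel value and the record's own AT URI to links_back. The record only counts as verified if that returns true.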

Then there is the matter of validating the standard.site records themselves. Technically I don't need to do this: if I'm able to parse out the fields I need and the scraping checks succeed, then that would be enough to index the documents. However, in the spirit of encouraging correctness the search engine is actually pretty pedantic. It uses jacquard-lexicon to validate the record, which confirms that it matches the lexicon both structurally (having the right fields) and in terms of constraints. In practice these constraints are sometimes not observed, particularly the maximum number of graphemes in a post's description field (300). Documents that don't conform exactly are excluded from the index.
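
The difference between a structural check and a constraint check can be sketched roughly as follows. This is an illustration in std-only Rust, not how jacquard-lexicon works. The types and names are hypothetical, treating title as a required field is an assumption, and because the standard library has no grapheme segmentation the sketch counts Unicode scalar values instead. That count is never smaller than the grapheme count, so this stand-in can reject a description that a true grapheme count would accept, but it never accepts one that is too long.

```rust
/// Why a record was refused (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Rejection {
    /// Structural failure: a field we treat as required is absent.
    MissingTitle,
    /// Constraint failure: the field exists but breaks its length limit.
    DescriptionTooLong { count: usize },
}

/// The handful of fields this sketch looks at, already parsed out of the record.
struct DocumentRecord<'a> {
    title: Option<&'a str>,
    description: Option<&'a str>,
}

/// The limit on the description field mentioned above.
const MAX_DESCRIPTION_GRAPHEMES: usize = 300;

fn check(record: &DocumentRecord) -> Result<(), Rejection> {
    // Structural: the field has to be present at all.
    if record.title.is_none() {
        return Err(Rejection::MissingTitle);
    }
    // Constraint: the field is optional, but if present it must fit the limit.
    if let Some(description) = record.description {
        // Stand-in for a grapheme count: Unicode scalar values (see lead-in).
        let count = description.chars().count();
        if count > MAX_DESCRIPTION_GRAPHEMES {
            return Err(Rejection::DescriptionTooLong { count });
        }
    }
    Ok(())
}
```

A pedantic indexer would skip any record for which check returns an error. A more lenient one could index it anyway and just log the rejection.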

One unknown at this stage is when there will be problems with spam or otherwise problematic content showing up in the index. I speculate that strict verification might keep a lid on this for a while. It's a little bit fiddly to get it right on your own and if a spammer is using a hosted platform they're likely to get kicked off anyway. I'll decide what to do if and when that happens.

One mildly embarrassing thing: when I announced this I didn't realise that there already was a search engine for standard.site posts: pub-search.waow.tech by @zzstoatzz.io. That's no reason not to have a go at making my own, but it would have been nice of me to acknowledge that it wasn't a new concept.

Now that it's there, there are more things I could play with when I have time: language detection and filtering, vector search, things like that. We'll see what happens.

Serious Computer Business Blog by Thomas Karpiniec