Programmatic SEO at Scale: What a 40,000-Page News Site Taught Us

Most advice about programmatic SEO quietly assumes you have a few hundred pages. Push past ten thousand and the playbook starts to crack. We spent the last stretch deep in the internals of a 40,000-page news site, and the lessons were not the ones the blog posts promised.

Crawl budget is a real budget

On a small site Google crawls everything and you never think about it. On a large one, attention is rationed. Google samples your site, decides how much it trusts the domain, and crawls accordingly. If most of your pages are thin or near-duplicate, that sample looks weak and the crawl rate drops for the whole property. The fix is not more pages. It is fewer, stronger ones, and a clear internal structure that tells the crawler what actually matters.

Your sitemap can quietly time out

Here is a failure mode almost nobody writes about: a sitemap that takes 25 seconds to generate. Plugins build them on the fly, and on a huge archive that generation crawls. Googlebot gives a sitemap a few seconds before it moves on. We watched child sitemaps sit untouched for a month because the fetch kept timing out. Switching to static XML files served straight off disk cut the fetch to under a second, and Google pulled them immediately.
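One way to get sitemaps off the request path is to write them to disk whenever content changes and let the web server hand out the files. The following is a minimal sketch of that idea, not the setup we ran: the function name, file names, and output layout are made up for illustration, and the only outside fact it relies on is the sitemaps.org limit of 50,000 URLs per file.

```python
from pathlib import Path
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit in the sitemaps.org protocol


def write_static_sitemaps(urls, out_dir, base_url, chunk_size=MAX_URLS_PER_FILE):
    """Write child sitemaps plus an index file to out_dir; return child file names.

    File names (sitemap-N.xml, sitemap_index.xml) are hypothetical choices.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    names = []

    # One child sitemap per chunk of URLs.
    for start in range(0, len(urls), chunk_size):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls[start:start + chunk_size]:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        name = f"sitemap-{len(names) + 1}.xml"
        ET.ElementTree(urlset).write(
            str(out / name), encoding="utf-8", xml_declaration=True
        )
        names.append(name)

    # The index file points at every child sitemap by absolute URL.
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for name in names:
        loc = ET.SubElement(ET.SubElement(index, "sitemap"), "loc")
        loc.text = f"{base_url.rstrip('/')}/{name}"
    ET.ElementTree(index).write(
        str(out / "sitemap_index.xml"), encoding="utf-8", xml_declaration=True
    )
    return names
```

Run something like this from a publish hook or a scheduled job, point the web server at the output directory, and the crawler never waits on a database query.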

Thin pages drag the whole domain down

The single biggest category in the “not indexed” report was “Crawled - currently not indexed”: Google fetched the page and declined to index it. These are pages that exist but earn nothing: duplicated language variants, tag archives with three posts, auto-generated stubs. Each one is a small vote against the domain. Pruning them, or merging them into real hubs, did more for indexation than any on-page tweak.
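Finding those pages is mostly a filtering job. Here is a rough sketch of a first pass, assuming you can export a page inventory with a page type, a word count, and a post count for archives. The `Page` fields and both thresholds are illustrative assumptions, not numbers from our audit, and the output is a review list, not a delete list.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    kind: str            # hypothetical labels: "article", "tag_archive", "stub"
    word_count: int
    post_count: int = 0  # only meaningful for archive pages


def is_thin(page, min_words=300, min_posts=5):
    """Flag a page as thin. Both thresholds are arbitrary starting points."""
    if page.kind == "tag_archive":
        # An archive is judged by how many posts it lists, not by its own text.
        return page.post_count < min_posts
    return page.word_count < min_words


def split_by_thinness(pages, **thresholds):
    """Return (keep, review): pages that pass and pages to prune or merge."""
    keep, review = [], []
    for page in pages:
        (review if is_thin(page, **thresholds) else keep).append(page)
    return keep, review
```

A human still decides what happens to each flagged URL: prune it, noindex it, or fold it into a hub.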

Topic depth beats raw volume

A site that covers one area thoroughly tends to outrank one that covers fifty areas shallowly. A focused Bitcoin news section with consistent, interlinked coverage builds authority that scattered one-off posts never will. Cluster your content, link within the cluster, and give each hub a clear job.
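"Link within the cluster" is easy to say and easy to get wrong at scale, so it helps to check it mechanically. The sketch below assumes you already have a crawl export as a mapping from each URL to the set of URLs it links to. The function name and the shape of the report are made up for illustration.

```python
def cluster_link_gaps(hub, members, links):
    """Report missing hub-to-member and member-to-hub links in one cluster.

    links maps a URL to the set of URLs it links out to (an assumed format).
    """
    hub_out = links.get(hub, set())
    return {
        # Cluster pages the hub never links to.
        "hub_missing_links_to": sorted(m for m in members if m not in hub_out),
        # Cluster pages that never link back to the hub.
        "members_missing_link_to_hub": sorted(
            m for m in members if hub not in links.get(m, set())
        ),
    }
```

Two empty lists mean the hub and its articles point at each other. Anything else is a to-do list.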

What we would tell our past selves

Audit before you publish. Know your indexed-versus-submitted ratio before adding a single URL. Serve sitemaps as static files once you cross a few thousand pages. And treat every thin page as a liability, not an asset, because to a search engine rationing its attention, that is exactly what it is.
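The indexed-versus-submitted ratio is simple division, but tracking it per sitemap or per section shows where the weak pages live. This is a small sketch under stated assumptions: the counts are typed in by hand from whatever report you use, the section names are placeholders, and the 0.5 cutoff is an arbitrary example, not a benchmark.

```python
def indexation_ratio(submitted, indexed):
    """Share of submitted URLs that ended up indexed, between 0 and 1."""
    if submitted <= 0:
        raise ValueError("submitted must be a positive count")
    return indexed / submitted


def weakest_sections(counts, threshold=0.5):
    """Return (section, ratio) pairs below threshold, worst first.

    counts maps a section name to a (submitted, indexed) tuple.
    """
    ratios = {
        name: indexation_ratio(submitted, indexed)
        for name, (submitted, indexed) in counts.items()
    }
    below = [(name, ratio) for name, ratio in ratios.items() if ratio < threshold]
    return sorted(below, key=lambda pair: pair[1])
```

If a section sits well below the rest of the site, fix or prune it before publishing more into it.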

Scale is not a content problem. It is an architecture problem wearing a content costume. Get the structure right and volume helps you; get it wrong and volume buries you.