The Columnar Index Is Now the URL Index
We have renamed the Columnar Index to the URL Index, to be clearer about its purpose and to pave the way for more datasets in a columnar format.
commoncrawl.org/blog/the-col…
Introducing the AI Visibility Audit
A free guide for SEOs and GEOs on how to check whether AI systems can actually reach a site, and how to stay visible in the crawl that trains them.
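One piece of such a check can be sketched as a robots.txt test for Common Crawl's crawler. This is a minimal sketch, not the guide's own method: it relies on Python's standard `urllib.robotparser`, assumes the crawler is matched by the user-agent token `CCBot`, and the helper name `ccbot_allowed` and the example.com rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def ccbot_allowed(robots_txt: str, url: str, agent: str = "CCBot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt content, used only for illustration.
EXAMPLE_ROBOTS = """\
User-agent: CCBot
Disallow: /private/
"""

if __name__ == "__main__":
    for url in (
        "https://example.com/blog/post.html",
        "https://example.com/private/report.html",
    ):
        print(url, ccbot_allowed(EXAMPLE_ROBOTS, url))
```

To audit a real site, the text of its own robots.txt would be passed in place of `EXAMPLE_ROBOTS`. A robots.txt rule is only one possible barrier; firewalls or bot-blocking services can also stop a crawler and are not covered by this sketch.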
Under-represented languages deserve better tools! On June 4th, the Common Crawl Foundation and Mozilla Data Collective will host a webinar to test language identification for the languages you care about.
RSVP and join speakers Laurie Burchell and Pedro Ortiz Suarez from the Common Crawl Foundation and Kostis Saitas Zarkias and Robert Pugh from Mozilla Data Collective for a truly hands-on session.
Thursday, June 4th, 6 PM CEST | 12 PM ET | 9 AM PDT. Register via Zoom: zoom.us/meeting/register/ilR…
May 2026 Crawl Archive Now Available
We are happy to announce the release of the May 2026 crawl archive, consisting of 2.16 billion web pages, or 365.56 TiB of uncompressed content.
As an early experiment in distributing Common Crawl data through another channel, the April 2026 crawl archive is now available in a Hugging Face Storage Bucket, alongside its existing home on AWS S3.
You can now build directly on Common Crawl from the browser
Browsers can now fetch Common Crawl data directly, no backend needed. Build SQL explorers, snapshot viewers and diff tools as static pages.
Have you ever seen a user agent named "CCBot"?
If so, you were visited by @CommonCrawl, a non-profit that crawls the internet and publishes a 10-petabyte open-source dataset.
I think it's beautiful that humanity shares this data.
It means that anyone with minimal resources has access to the data required to build their own AI models.
It also means the entire internet doesn't have to be crawled thousands of times, once for each research project, which saves large amounts of bandwidth and resources.
Our April 2026 Crawl Archive and corresponding Web Graph are now available.
The April 2026 crawl consists of 2.19 billion web pages (or 379.2 TiB of uncompressed content). Captures are from 43.2 million hosts or 35.4 million registered domains and include 660.5 million new URLs, not visited in any of our prior crawls.
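The page counts and sizes in these release notes imply an average uncompressed size per captured page. A small sketch of that arithmetic follows, assuming 1 TiB = 2^40 bytes and one billion = 10^9; the function name `avg_page_kib` is hypothetical, and the result is only as precise as the rounded figures quoted in the announcements.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def avg_page_kib(pages_billion: float, size_tib: float) -> float:
    """Average uncompressed size per page, in KiB."""
    total_bytes = size_tib * TIB
    pages = pages_billion * 1e9
    return total_bytes / pages / 1024


if __name__ == "__main__":
    # Figures as quoted in the April 2026 and May 2026 announcements.
    crawls = (("April 2026", 2.19, 379.2), ("May 2026", 2.16, 365.56))
    for label, pages_billion, size_tib in crawls:
        print(f"{label}: {avg_page_kib(pages_billion, size_tib):.1f} KiB per page")
```

Both crawls work out to somewhere between 180 and 190 KiB of uncompressed content per page on average.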