Common Crawl is a non-profit foundation dedicated to the Open Web.

Joined February 2010
62 Photos and videos
The Columnar Index Is Now the URL Index We have renamed the Columnar Index to the URL Index, to be clearer about its purpose and to pave the way for more datasets in a columnar format. commoncrawl.org/blog/the-col…
1
3
215
Under-represented languages deserve better tools! On June 4th, The Common Crawl Foundation and Mozilla Data Collective will host a webinar to test language identification for the languages you care about.
1
3
532
RSVP and join speakers Laurie Burchell and Pedro Ortiz Suarez from the Common Crawl Foundation and Kostis Saitas Zarkias and Robert Pugh from Mozilla Data Collective for a truly hands-on session. Thursday, June 4th 6 PM CEST | 12 PM ET | 9 AM PDT Register via Zoom: zoom.us/meeting/register/ilR…
1
457
May 2026 Crawl Archive Now Available We are happy to announce the release of the May 2026 crawl archive, consisting of 2.16 billion web pages, or 365.56 TiB of uncompressed content. 📷
1
1
5
541
You can now build directly on Common Crawl from the browser Browsers can now fetch Common Crawl data directly, no backend needed. Build SQL explorers, snapshot viewers and diff tools as static pages. 📷
2
1
2
371
Common Crawl Foundation retweeted
Have you ever seen a user agent named "CCBOT"? If so, you were visited by @CommonCrawl, a non-profit that crawls the internet and publishes a 10 petabytes open-source dataset. I think it's beautiful that humanity shares this data. It means that anyone with minimal resources has the access to data required to build their own AI models. It also means we don't have to crawl the entire internet thousands of times for each research, saving large amounts of bandwidth and resources.
2
3
8
580
Our April 2026 Crawl Archive and corresponding Web Graph are now available. The April 2026 crawl consists of 2.19 billion web pages (or 379.2 TiB of uncompressed content). Captures are from 43.2 million hosts or 35.4 million registered domains and include 660.5 million new URLs, not visited in any of our prior crawls.
1
3
455
📷 April 2026 Crawl Announcement 📷 April 2026 Web Graph Announcement 📷 Crawl Statistics 📷 Web Graph Statistics Live long and prosper!
2
220
📷 April 2026 Crawl Announcement 📷 April 2026 Web Graph Announcement 📷 Crawl Statistics 📷 Web Graph Statistics
53