Common Crawl Foundation

Common Crawl Foundation

@CommonCrawl

Jun 4

The Columnar Index Is Now the URL Index
We have renamed the Columnar Index to the URL Index, to be clearer about its purpose and to pave the way for more datasets in a columnar format. commoncrawl.org/blog/the-col…

Common Crawl Foundation

@CommonCrawl

Jun 1

Introducing the AI Visibility Audit
A free guide for SEOs and GEOs on how to check whether AI systems can actually reach a site, and how to stay visible in the crawl that trains them.

Common Crawl Foundation

@CommonCrawl

Jun 1

commoncrawl.org/blog/introdu… Link to PDF: cdn.prod.website-files.com/6…

Common Crawl Foundation

@CommonCrawl

May 27

Under-represented languages deserve better tools! On June 4th, the Common Crawl Foundation and Mozilla Data Collective will host a webinar to test language identification for the languages you care about.

Common Crawl Foundation

@CommonCrawl

May 27

RSVP and join speakers Laurie Burchell and Pedro Ortiz Suarez from the Common Crawl Foundation, and Kostis Saitas Zarkias and Robert Pugh from Mozilla Data Collective, for a truly hands-on session.
Thursday, June 4th: 6 PM CEST | 12 PM ET | 9 AM PDT
Register via Zoom: zoom.us/meeting/register/ilR…

Welcome! You are invited to join a meeting: Text Language Identification (LID) with CommonCrawl and Mozilla Data Collective. After registering, you will receive a confirmation email about joining the...

Common Crawl Foundation

@CommonCrawl

May 25

May 2026 Crawl Archive Now Available
We are happy to announce the release of the May 2026 crawl archive, consisting of 2.16 billion web pages, or 365.56 TiB of uncompressed content.

Common Crawl Foundation

@CommonCrawl

May 25

commoncrawl.org/blog/may-202…

Common Crawl Foundation

@CommonCrawl

May 21

As an early experiment in distributing Common Crawl data through another channel, the April 2026 crawl archive is now available in a Hugging Face Storage Bucket, alongside its existing home on AWS S3.

Common Crawl Foundation

@CommonCrawl

May 21

commoncrawl.org/blog/april-2…

Common Crawl - Blog - April 2026 Crawl Archive Now Available in a Hugging Face Storage Bucket

Common Crawl Foundation

@CommonCrawl

May 7

You can now build directly on Common Crawl from the browser
Browsers can now fetch Common Crawl data directly, no backend needed. Build SQL explorers, snapshot viewers and diff tools as static pages.

Common Crawl Foundation

@CommonCrawl

May 7

commoncrawl.org/blog/you-can…

Common Crawl Foundation retweeted

Tristan Rhodes

@tristanbob

May 3

Have you ever seen a user agent named "CCBOT"? If so, you were visited by @CommonCrawl, a non-profit that crawls the internet and publishes a 10-petabyte open-source dataset. I think it's beautiful that humanity shares this data. It means that anyone with minimal resources has access to the data required to build their own AI models. It also means we don't have to crawl the entire internet thousands of times, once for each research project, saving large amounts of bandwidth and resources.

Common Crawl Foundation

@CommonCrawl

Apr 30

Our April 2026 Crawl Archive and corresponding Web Graph are now available. The April 2026 crawl consists of 2.19 billion web pages (or 379.2 TiB of uncompressed content). Captures are from 43.2 million hosts or 35.4 million registered domains and include 660.5 million new URLs, not visited in any of our prior crawls.

Common Crawl Foundation

@CommonCrawl

Apr 30

📷 April 2026 Crawl Announcement 📷 April 2026 Web Graph Announcement 📷 Crawl Statistics 📷 Web Graph Statistics Live long and prosper!

Common Crawl Foundation

@CommonCrawl

Apr 30

Sorry, now with the actual links. commoncrawl.org/blog/april-2… blog.commoncrawl.org/blog/ho… commoncrawl.github.io/cc-cra… commoncrawl.github.io/cc-web…

Common Crawl Foundation

@CommonCrawl

Apr 30

📷 April 2026 Crawl Announcement 📷 April 2026 Web Graph Announcement 📷 Crawl Statistics 📷 Web Graph Statistics

Common Crawl Foundation

@CommonCrawl

Apr 6

April 2026 Common Crawl Newsletter

Common Crawl Foundation

@CommonCrawl

Apr 6

commoncrawl.org/blog/april-2…

Check out our newsletter for April 2026, with updates on what we've been up to.

Common Crawl Foundation retweeted

Financial Times

@FT

Mar 20

Mistral CEO: AI companies should pay a content levy in Europe ft.trib.al/hKU8k0g | opinion

A revenue-based charge would protect the livelihoods of copyright holders and bring legal certainty
