I finished the data scrape of all local US news sources.
I built the list of sources, covering all newspapers, and then scraped each site 10 pages deep for news articles.
Now I have a dataset of newspaper articles with about 1 million rows.
The problem is that the news scraper failed on some sources, and these will need to be excluded from production_news.
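
Here is a rough sketch of how that exclusion could look. Everything in it is an assumption for illustration: the rows are plain dicts with hypothetical `source` and `text` keys, and a source counts as "failed" when it produced fewer than `min_articles` rows with non-empty text. The real failure check may well be different.

```python
from collections import Counter


def failed_sources(rows, all_sources, min_articles=1):
    """Return the sources with fewer than min_articles non-empty articles.

    Assumption: each row is a dict with hypothetical 'source' and 'text' keys.
    """
    counts = Counter(
        row["source"] for row in rows if (row.get("text") or "").strip()
    )
    return {source for source in all_sources if counts[source] < min_articles}


def exclude_sources(rows, bad_sources):
    """Keep only the rows whose source is not in bad_sources."""
    return [row for row in rows if row["source"] not in bad_sources]


# Tiny made-up example: 'b' returned an empty article, 'c' returned nothing.
rows = [
    {"source": "a", "text": "Council approves budget"},
    {"source": "b", "text": ""},
    {"source": "a", "text": "Storm closes schools"},
]
bad = failed_sources(rows, all_sources=["a", "b", "c"])
clean_rows = exclude_sources(rows, bad)
```

Comparing against the full source list matters here, because a source that failed completely leaves no rows behind to count.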
I am looking at different options for efficient deduplication, and so far I like
"datasketch", which is described as "datasketch gives you probabilistic data structures that can process and search very large amount of data super fast, with little loss of accuracy."
We shall see!
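
To get a feel for the kind of probabilistic structure involved, here is a minimal MinHash sketch in plain Python. It is not datasketch's code, just the underlying idea written with the standard library: each article becomes a short signature, and the fraction of matching signature slots estimates the Jaccard similarity of the two articles' word shingles. The function names, the 3-word shingles, the 64 hash functions and the 0.8 threshold are all assumptions picked for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(item, seed):
    """Hash a string to a 64-bit integer; the seed selects the hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash value of the set under each of num_perm hashes."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature slots that agree, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def dedupe(texts, threshold=0.8, num_perm=64):
    """Return indices of texts to keep, dropping near-duplicates of earlier ones."""
    kept = []  # (index, signature) pairs for the texts kept so far
    for i, text in enumerate(texts):
        sig = minhash_signature(shingles(text), num_perm)
        if all(estimate_jaccard(sig, other) < threshold for _, other in kept):
            kept.append((i, sig))
    return [i for i, _ in kept]


articles = [
    "City council approves new budget for road repairs downtown",
    "City council approves new budget for road repairs downtown",
    "High school football team wins the state championship game",
]
print(dedupe(articles))  # prints [0, 2]: the exact repeat at index 1 is dropped
```

The loop in `dedupe` compares every article against every kept article, which would be far too slow for about 1 million rows. datasketch provides `MinHash` and `MinHashLSH` classes, and an LSH index is the piece that avoids the all-pairs comparison by only looking up likely matches.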