This dataset contains 2,688,878 news articles and essays from 27 American publications, spanning January 1, 2016 to April 2, 2020. It is an expanded edition of the original All the News dataset on Kaggle, which was compiled in early 2017. While the original dataset contains more than 100,000 articles, the new dataset's greater size and breadth make it more broadly applicable, both for training language models and for studying a wider selection of media.
Each row contains the following data:
The yearly breakdown of articles is as follows:
Representation of publications in the data is as follows:
|Publication|Articles|
|---|---|
|The New York Times|252259|
Articles were scraped with Python by following each publication's sitemap, with a few exceptions (such as Vox) that relied on RSS feeds instead. Scraping ended on April 2, 2020.
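As a rough illustration of sitemap-based collection, the sketch below pulls article URLs out of a sitemap document and keeps those whose `<lastmod>` date falls inside the dataset's date range. It is not the scraper used to build this dataset: the function name `article_urls`, the sample XML, and the `example.com` URLs are hypothetical, and using `<lastmod>` as a stand-in for publication date is an assumption that will not hold for every publication.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def article_urls(sitemap_xml, start, end):
    """Return <loc> URLs whose <lastmod> date lies in [start, end].

    Hypothetical helper: entries without a parseable <lastmod> are skipped.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc or not lastmod:
            continue
        try:
            # <lastmod> may be a date or a full timestamp; keep the date part.
            modified = date.fromisoformat(lastmod[:10])
        except ValueError:
            continue
        if start <= modified <= end:
            urls.append(loc)
    return urls


# Made-up sitemap fragment, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/2016/01/05/a</loc><lastmod>2016-01-05</lastmod></url>
  <url><loc>https://example.com/2015/12/30/b</loc><lastmod>2015-12-30T10:00:00Z</lastmod></url>
  <url><loc>https://example.com/2020/04/02/c</loc><lastmod>2020-04-02T23:59:00+00:00</lastmod></url>
</urlset>"""

if __name__ == "__main__":
    for url in article_urls(SAMPLE, date(2016, 1, 1), date(2020, 4, 2)):
        print(url)
```

In practice the sitemap would be fetched over HTTP and may be a sitemap index that points to further sitemaps; both steps are left out here to keep the sketch self-contained.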