This dataset contains 2,688,878 news articles and essays from 27 American publications, spanning January 1, 2016 to April 2, 2020. It is an expanded edition of the original All the News dataset on Kaggle, which was compiled in early 2017. While the original dataset contains more than 100,000 articles, the new dataset's greater size and breadth make it more broadly applicable, both for training language models and for studying a wider selection of media.
Each row contains the following data:
date(str): Datetime of article publication.
year(int): Year of article publication.
month(float): Month of article publication.
day(int): Day of article publication.
author(str): Article author, if available. Multiple authors are separated by commas.
title(str): Article title.
article(str): Article text, without paragraph breaks.
url(str): Article URL.
section(str): Section of the publication in which the article appeared, if applicable.
publication(str): Name of the publication in which the article appeared.
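As an illustration of the schema above, here is a minimal sketch that reads rows with Python's standard `csv` module and converts the numeric fields to the listed types. It assumes the data is distributed as a CSV file with a header line matching the field names; the helper names `parse_row` and `read_articles`, the added `authors` list, and the sample values are all hypothetical and not part of the dataset.

```python
import csv
import io


def parse_row(row):
    """Convert one raw CSV row (a dict of strings) to the documented types."""
    parsed = dict(row)
    # year and day are documented as int, month as float; empty cells become None.
    parsed["year"] = int(float(row["year"])) if row["year"] else None
    parsed["month"] = float(row["month"]) if row["month"] else None
    parsed["day"] = int(float(row["day"])) if row["day"] else None
    # Authors are comma-separated; an empty field means no author is available.
    parsed["authors"] = [a.strip() for a in row["author"].split(",") if a.strip()]
    return parsed


def read_articles(handle):
    """Yield typed rows from an open CSV file handle that has a header line."""
    for row in csv.DictReader(handle):
        yield parse_row(row)


# Tiny in-memory sample standing in for the real file (values are made up).
sample = io.StringIO(
    "date,year,month,day,author,title,article,url,section,publication\n"
    '2016-12-09 18:31:00,2016,12.0,9,"Jane Doe, John Roe",Example title,'
    "Example text,https://example.com/a,World,Example Publication\n"
)
rows = list(read_articles(sample))
print(rows[0]["authors"])
```

With the real file, the same generator could be fed an open file handle instead of the in-memory sample. Very long article texts may exceed the `csv` module's default field size limit, which can be raised with `csv.field_size_limit`. Splitting on commas will also break apart any single author name that itself contains a comma.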
The yearly breakdown of articles is as follows:
Representation of publications in the data is as follows:
The New York Times: 252,259
Articles were scraped with Python by following each publication's sitemap, with a few exceptions (such as Vox) that relied on RSS feeds instead. The last day of scraping was April 2, 2020.
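To make the sitemap-based approach concrete, the sketch below pulls article URLs out of a sitemap document using only the standard library. It is not the scraper used to build this dataset, whose code is not shown here; it only assumes that a publication's sitemap follows the sitemaps.org XML format. The function name `extract_urls` and the example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml):
    """Return every <loc> URL found in a sitemap or sitemap index document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Made-up sitemap content standing in for a downloaded file.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/2020/04/01/story-one</loc></url>
  <url><loc>https://example.com/2020/04/02/story-two</loc></url>
</urlset>"""

for url in extract_urls(sample_sitemap):
    print(url)
```

A full scraper would still need to download each sitemap, fetch the listed pages, and extract the title, author, and article text, all of which vary by publication.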