All the News 2.0 — 2.7 million news articles and essays from 27 American publications

Overview

This dataset contains 2,688,878 news articles and essays from 27 American publications, spanning January 1, 2016 to April 2, 2020. It is an expanded edition of the original All the News dataset on Kaggle, which was compiled in early 2017. While the original dataset contains more than 100,000 articles, the new dataset's greater size and breadth make it more broadly applicable, both for training language models and for studying a wider selection of media.

Changelog

7/9/22

  • Removed a Washington Post article with null values
  • Cast day and year values that were stored as str to int
  • Removed Unnamed columns
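
The cleanup steps above could be reproduced roughly as follows. This is a minimal sketch using only Python's standard library, not the script actually used for the dataset. The column names (`year`, `day`, `title`, `article`) and the choice of which fields count as required are assumptions made for illustration.

```python
import csv
import io

# Hypothetical: fields that must be non-empty for a row to be kept.
REQUIRED_FIELDS = ("title", "article")
# Hypothetical: columns that should hold integers.
INT_FIELDS = ("year", "day")


def clean_rows(rows):
    """Yield cleaned copies of the input rows (dicts keyed by column name)."""
    for row in rows:
        # Drop leftover index columns such as "Unnamed: 0".
        cleaned = {k: v for k, v in row.items() if not k.startswith("Unnamed")}
        # Skip rows whose required fields are missing or blank.
        if any(not (cleaned.get(f) or "").strip() for f in REQUIRED_FIELDS):
            continue
        # Cast string values such as "2016" or "2016.0" to int.
        for f in INT_FIELDS:
            cleaned[f] = int(float(cleaned[f]))
        yield cleaned


# Tiny in-memory sample standing in for the real file.
sample = io.StringIO(
    "Unnamed: 0,year,day,title,article\n"
    "0,2016,9,Some title,Some text\n"
    "1,2019.0,12,,\n"
)
cleaned = list(clean_rows(csv.DictReader(sample)))
```

In this sample the second row is dropped because its title and text are blank, and the surviving row loses its `Unnamed: 0` column and gains integer `year` and `day` values.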

Description

Each row contains the following data:

  • Datetime of article publication
  • Year
  • Month
  • Day
  • Author
  • Article title
  • Article text
  • Article URL
  • Section of publication where article was published (where relevant/applicable)
  • Publication name
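
A quick way to explore these fields is to tally rows by year and by publication. The sketch below assumes the data is a CSV file whose header uses `year` and `publication` as column names. Both names, and the rest of the header in the sample, are guesses, and the in-memory sample stands in for the real file.

```python
import csv
import io
from collections import Counter


def tally(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)


# Tiny in-memory stand-in for the real CSV; the column names are assumed.
sample_csv = (
    "date,year,month,day,author,title,article,url,section,publication\n"
    "2016-01-01 00:00:00,2016,1,1,A. Writer,T1,Text 1,http://example.com/1,World,Reuters\n"
    "2016-02-03 00:00:00,2016,2,3,,T2,Text 2,http://example.com/2,,Vox\n"
    "2019-05-06 00:00:00,2019,5,6,B. Writer,T3,Text 3,http://example.com/3,Tech,Reuters\n"
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))

by_year = tally(rows, "year")
by_publication = tally(rows, "publication")
```

To run this against the actual file, the `io.StringIO` object would be replaced with an opened file handle. Because article bodies can be long, the `csv` module's default field size limit may need to be raised with `csv.field_size_limit`.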

The yearly breakdown of articles is as follows:

Year        Count
2016        604511
2017        640493
2018        553588
2019        655456
2020        234830

Representation of publications in the data is as follows:

Publication             Count
Axios                   47815
Business Insider        57953
Buzzfeed News           32819
CNBC                    238096
CNN                     127602
Economist               26227
Fox News                20144
Gizmodo                 27228
Hyperallergic           13551
Mashable                94107
New Republic            11809
New Yorker              4701
People                  136488
Politico                46377
Refinery 29             111433
Reuters                 840094
TMZ                     49595
TechCrunch              52095
The Hill                208411
The New York Times      252259
The Verge               52424
Vice                    101137
Vice News               15539
Vox                     47272
Washington Post         40882
Wired                   20243
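
As a sanity check, the counts in the two tables can be compared with the stated total of 2,688,878 articles. The snippet below simply re-adds the figures as listed above and makes no assumption about the underlying file.

```python
# Stated total from the overview.
TOTAL = 2_688_878

# Yearly counts as listed above.
yearly = {2016: 604511, 2017: 640493, 2018: 553588, 2019: 655456, 2020: 234830}

# Publication counts as listed above.
publications = {
    "Axios": 47815, "Business Insider": 57953, "Buzzfeed News": 32819,
    "CNBC": 238096, "CNN": 127602, "Economist": 26227, "Fox News": 20144,
    "Gizmodo": 27228, "Hyperallergic": 13551, "Mashable": 94107,
    "New Republic": 11809, "New Yorker": 4701, "People": 136488,
    "Politico": 46377, "Refinery 29": 111433, "Reuters": 840094,
    "TMZ": 49595, "TechCrunch": 52095, "The Hill": 208411,
    "The New York Times": 252259, "The Verge": 52424, "Vice": 101137,
    "Vice News": 15539, "Vox": 47272, "Washington Post": 40882,
    "Wired": 20243,
}

yearly_gap = TOTAL - sum(yearly.values())
publication_gap = TOTAL - sum(publications.values())

print(f"Yearly rows account for all but {yearly_gap} articles")
print(f"{len(publications)} publications listed, {publication_gap} articles unaccounted for")
```

Run as written, the yearly counts add up exactly to the stated total. The publication rows listed here cover 26 names and fall 12,577 articles short of it, which suggests that one of the 27 publications is missing from the list above.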

Method

Articles were scraped with Python by following each publication's sitemap, with a few exceptions (such as Vox) that relied on RSS feeds instead. Scraping ended on April 2, 2020.
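
To illustrate the sitemap-based approach, here is a minimal sketch that extracts article URLs from a sitemap document using Python's standard library. It is not the scraper used to build the dataset. The sample XML and the `example.com` URLs are made up, and a real crawl would also need to fetch the sitemap, follow any sitemap index files, and download and parse each page.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the <loc> URL of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Made-up sitemap used only to exercise the function.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/2020/04/01/story-a</loc></url>
  <url><loc>https://example.com/2020/04/02/story-b</loc></url>
</urlset>"""

article_urls = urls_from_sitemap(sample_sitemap)
```

Each URL returned this way would then be requested and parsed to recover the fields listed in the description, such as title, author, and article text.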