The News Archives API is a super simple application programming interface that allows anyone to get news of the past.
What problem does this solve?
In my house, newspapers from historical events are stuffed into my garage cabinets: papers from 9/11, the presidential elections of 2008, ’12, and ’16, the first quarantines of Covid-19, and much more. They’re stored so that in, say, ten years, I can see what the news was on a very important day. Maybe in hundreds of years, someone will come across my former garage and get a primary source on world-changing events from my time (unlikely).
There are a lot of problems with this, though. Stacks of newspapers take up a ton of space in my garage, it’s hard to find a single paper without digging through all of them, and they get lost and destroyed easily.
So I solved this like any bored developer would do… move it to tech!
All the problems stated earlier can be solved… with some new features too:
- It’s automated, so no more doing the work of storing news myself.
- I can store all articles, instead of just a few.
- You can look for articles containing a keyword in their title.
- News can be logged for the public, instead of just me.
- Anyone can embed this into their own application.
Two components make up the system: a news “logger” to save articles, and a server to output the data for the public.
At a specific time of day, the 24 top trending articles from a news source are saved to the cloud. Each article’s title, description, and link are stored.
- A database
- A news source
When a GET request is made to the server, news should be returned.
API endpoints will be designated to get articles from a client-specified time period (day, month, year). An endpoint will also be made to get news with a specified keyword in its title.
Data will be retrieved using SQL queries for the fastest possible response times.
- A web framework
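The endpoint layout described above might look like the following Flask sketch. The route paths and the in-memory stand-in for the database are assumptions for illustration, not the service’s actual code:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# In-memory stand-in for the Rockset collection: one document per day,
# keyed by date. The real service queries Rockset with SQL instead.
DOCS = {
    "2020-11-02": [
        {"title": "example", "description": "...", "link": "..."}
    ],
}

# Hypothetical route names; the article doesn't show the exact URL scheme.
@app.route("/day/<day>")
def get_day(day):
    # Return one day's articles; `day` follows the "yyyy-mm-dd" format.
    return jsonify(DOCS.get(day, []))

@app.route("/month/<month>")
def get_month(month):
    # A month is just a date prefix ("yyyy-mm"), so prefix-match the keys.
    hits = [a for d, arts in DOCS.items() if d.startswith(month) for a in arts]
    return jsonify(hits)
```

A year endpoint works the same way, matching a “yyyy” prefix instead.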
- For a reliable world news source, I used BBC News.
- For a document-oriented, low latency database, I used Rockset.
- For a simple web framework, I used Flask.
- For a GitHub compatible server hosting service, I used Heroku.
- To log news 24/7, I used an AWS EC2 instance.
- To host and open-source code, I used GitHub.
Building the service
At 10 am GMT, news is obtained from BBC News’ RSS feed using my Python package, bbc feeds, and is pushed to a Rockset data collection. Each day has its own document with two fields: “_id”, which holds the articles’ date, and “articles”, a list of JSON objects containing each article’s title, description, and link.
Here’s an example article:
{
    "title": "Covid: First round of US vaccinations to begin on Monday",
    "description": "The Pfizer/BioNTech vaccine was approved earlier this week and doses are being distributed this weekend.",
    "link": "https://www.bbc.co.uk/news/world-us-canada-55289726"
}
Documents are added using Rockset’s Python SDK.
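The document-shaping step can be sketched in a few lines. This is a minimal, illustrative version: the real logger pulls entries with the bbc feeds package and writes the document with the Rockset SDK, neither of which is shown here, and `build_day_document` is a hypothetical helper name:

```python
from datetime import date

def build_day_document(entries, day=None, limit=24):
    """Shape raw feed entries into the one-document-per-day layout above.

    `entries` is a list of dicts with "title", "description", and "link"
    keys -- a stand-in for whatever the RSS client returns.
    """
    articles = [
        {"title": e["title"], "description": e["description"], "link": e["link"]}
        for e in entries[:limit]
    ]
    # "_id" holds the articles' date, e.g. "2020-11-02"
    return {"_id": day or date.today().isoformat(), "articles": articles}
```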
Getting a day’s data
To get data from a day, the client can make the following request:
The date must follow “yyyy-mm-dd” format, so a valid date would be “2020-11-02”.
The following query is used to retrieve data:
Getting a month or year’s data
The following request acquires a month’s data:
The month must follow “yyyy-mm” format, so a valid month would be “2020-11”.
And for a year’s data:
Note that this request will return a ton of data.
The same SQL query is used to retrieve both a month’s and year’s data.
WHERE _id LIKE CONCAT(:time, '%')
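Because “_id” is a “yyyy-mm-dd” string, a month (“2020-11”) or year (“2020”) value selects the right documents by prefix. In Python terms, that LIKE clause reduces to a string-prefix test:

```python
def matches_period(doc_id: str, time: str) -> bool:
    # Equivalent of: WHERE _id LIKE CONCAT(:time, '%')
    return doc_id.startswith(time)
```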
Getting data with a keyword
Getting data with a keyword is a tad more complicated, because separate queries are performed depending on whether the client supplies time period arguments. The client can specify a time frame in which the keyword should appear.
For example, to get data from the month 2020-11 containing the keyword “china”, make the following request:
Results can be limited, too. If a limit is given, the most recent data will be returned.
To get the five most recent mentions of keyword “china”, make this request:
To get all articles with “china” in their titles:
If a time frame isn’t given, the following query is executed:
SELECT n._id, models.a.description, models.a.link, models.a.title
FROM commons.NewsArchives n,
UNNEST (n.articles as a) AS models
WHERE LOWER(models.a.title) LIKE LOWER(CONCAT('%', :keyword, '%'))
ORDER BY _id DESC
If a time frame is given, another condition is added to the WHERE clause:
WHERE LOWER(models.a.title) LIKE LOWER(CONCAT('%', :keyword, '%')) AND n._id LIKE CONCAT(:time, '%')
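A pure-Python mirror of those two keyword queries makes the branching concrete. This is an illustrative sketch over the day-document layout, not the server’s actual code:

```python
def keyword_search(docs, keyword, time=None, limit=None):
    """Filter day-documents by a case-insensitive title keyword,
    optionally restricted to a time prefix, newest first."""
    hits = []
    for doc in docs:
        # Optional branch: AND n._id LIKE CONCAT(:time, '%')
        if time is not None and not doc["_id"].startswith(time):
            continue
        for article in doc["articles"]:
            # WHERE LOWER(title) LIKE LOWER('%keyword%')
            if keyword.lower() in article["title"].lower():
                hits.append({"_id": doc["_id"], **article})
    hits.sort(key=lambda h: h["_id"], reverse=True)  # ORDER BY _id DESC
    return hits[:limit] if limit else hits
```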
And that’s how the API is built!
NOTE: Articles have been logged since October 29, 2020, so you can only retrieve news from dates since then.
With privacy a priority, no requests are logged; only hits are counted. When a request is made to any endpoint, it’s counted and saved to the database. The hit counter’s source code is here.
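As a rough sketch of that idea (an in-memory stand-in, not the real counter, which persists counts to the database):

```python
from collections import Counter

# Tally hits per endpoint; no request details (IPs, parameters) are kept.
hits = Counter()

def count_hit(endpoint: str) -> int:
    """Record one hit for an endpoint and return its running total."""
    hits[endpoint] += 1
    return hits[endpoint]
```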
Developers can use client libraries to access the API easily in their favorite languages. Here are the “official” client libraries:
A demo application can be found here. Here’s a demo of the demo:
How it works
Here is how data from a day is retrieved:
from flask import render_template
import newsarchives

def day_search(day):
day_search() is executed when someone clicks “search” on the demo.
When someone searches with a keyword:
return render_template('demo_search.jinja', news=newsarchives.keyword(keyword, limit=20)['data'])