News

How my news aggregator works

October 30, 2023

I like to read the news, but I find myself returning to the same one or two sites over and over again. I don't like navigating most news sites because they are full of pictures or videos that are intended to keep your intention. I just like reading headlines and clicking on a few stories that interest me.

I decided to aggregate links from various news sources so that I could display them on a text-only page. This idea became the news feature on this website. Every couple of hours, I have a bot crawl various news sites and aggregate any links that it finds. I then display those links to the user.

Web Scraping

Gettings the links is fairly simple. Make an HTTP request, parse the HTML, and get all of the links with the text they display. I'm using BeautifulSoup to parse the HTML.

I do three main checks on the URLs:

  1. I validate them to make sure I got the full URL, not just the path.
  2. I check whether a given URL is already in my database. If not, I add it with the headline and the source.
  3. I do some simple checking to filter out About, Contact, or other miscellaneous pages that don't lead to news.

I also assign a weight to it based on the source and location in the page. We'll talk more about that later.

Sorting

One problem when designing this feature was how to sort the articles. I could sort them chronologically, but then links from the same source tend to bunch together. You can shuffle these before you add them to the database, but then you tend to get some pretty irrelevant articles up at the very top. This also has the undesirable property of burying sources with a few articles and promoting those with many.

To counter this behavior, I assign a weight to each article as I crawl based on the source it comes from and its location on the page. Articles from tops of sources I like are boosted; articles from the bottom of sources I dislike are penalized.

Say we have two articles, and I would like Article A to appear at the top with probability , and Article B with probability . This is simple to solve; choose a random number between 0 and 1. If it's less than , we choose A first. Otherwise, B is first.

This gets a bit more tricky when we have multiple articles, but it's very doable. We can assign intervals according to their weights, choose a random number, remove that article and repeat the process.

This method would work just fine if we knew all of the articles and all of the weights a priori, but I wanted to provide updates as the day progressed. Sure, we could repeat the process every time the data is updated, but that leads to unexpected shuffling. It's quite jarring if you come back to the page multiple times in a day.

A Better Solution

This is when I came across weighted reservoir sampling.

Say we wanted to select the "top" article of the day, but we have to update our decision fairly every time we get a new article. Like before, we want to weight articles from sources we like more. We could select a random number for each article, multiply by the weight, then find the largest number. This is a terrible algorithm, but higher weighted articles are technically more likely to be chosen than lower weighted articles.

It turns out there is a transformation we can do that behaves exactly how we want. Using our weights, weighted reservoir sampling assigns a priority using a random number where The article with the lowest value "wins" and will be at the top of the page.

It turns out that this is a fair selection based on the weights, regardless of the number of articles that get added later.

Comments
jdharman 29w
Wow! Great work Bryan!
Log in to post a comment