Forum Moderators: phranque

How do news aggregator sites automate news collection?

born2run

7:22 am on Dec 14, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi, I have been visiting a popular UK-based news aggregator website, newzit dot com. Basically, they publish links to news articles, like Google News on a smaller scale.

They say they collect news links automatically. Can anyone please let me know how?

I am guessing via some public API service or their own bots? Some tips would be helpful. Thanks!

engine

9:07 am on Dec 14, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If you're planning on experimenting with this yourself, one easy way is to use the sites' RSS feeds and aggregate them into each relevant page.
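As a rough illustration of the RSS approach, here is a minimal sketch using only the Python standard library. The two feeds are made-up RSS 2.0 snippets standing in for real downloads, the function names `parse_rss` and `aggregate` are hypothetical, and every item is assumed to carry a `pubDate`. It merges the feeds, drops duplicate links, and sorts newest first.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up RSS 2.0 snippets standing in for feeds downloaded from real sites.
FEED_A = """<rss version="2.0"><channel><title>Example Site A</title>
<item><title>Story one</title><link>https://a.example/1</link>
<pubDate>Tue, 14 Dec 2021 07:00:00 GMT</pubDate></item>
<item><title>Story two</title><link>https://a.example/2</link>
<pubDate>Tue, 14 Dec 2021 09:30:00 GMT</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Example Site B</title>
<item><title>Story three</title><link>https://b.example/1</link>
<pubDate>Tue, 14 Dec 2021 08:00:00 GMT</pubDate></item>
<item><title>Story two (syndicated)</title><link>https://a.example/2</link>
<pubDate>Tue, 14 Dec 2021 09:30:00 GMT</pubDate></item>
</channel></rss>"""


def parse_rss(xml_text):
    """Return a list of dicts, one per <item> in an RSS 2.0 document."""
    channel = ET.fromstring(xml_text).find("channel")
    source = channel.findtext("title", default="")
    items = []
    for item in channel.findall("item"):
        items.append({
            "source": source,
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            # Assumption: pubDate is present and in the usual RFC 822 form.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return items


def aggregate(feeds):
    """Merge several feeds, keep the first copy of each link, newest first."""
    seen = set()
    merged = []
    for xml_text in feeds:
        for entry in parse_rss(xml_text):
            if entry["link"] in seen:
                continue
            seen.add(entry["link"])
            merged.append(entry)
    merged.sort(key=lambda e: e["published"], reverse=True)
    return merged


if __name__ == "__main__":
    for entry in aggregate([FEED_A, FEED_B]):
        print(entry["published"].isoformat(), entry["source"], entry["link"])
```

Real feeds vary a lot (Atom, missing dates, odd encodings), so an off-the-shelf feed library would likely be more robust than this hand-rolled parser.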

graeme_p

10:39 am on Dec 14, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If the sites have RSS feeds, it's easy. There are a lot of feed aggregators that will do a reasonable job off the shelf: most have names with the word "planet" in them.

I built a news aggregator as part of a niche search engine a few years ago. A lot of sites did not have RSS feeds so we built a bot that was easily configurable for different sites using Scrapy and indexed the data with Solr. The biggest problem was getting blocked by things like Incapsula - many smaller site owners do not know how to configure things like that to allow a particular bot.
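To give a feel for what "easily configurable for different sites" can mean, here is a small sketch of per-site extraction rules. It is not Scrapy and not the bot described above: it uses the standard library's `html.parser` as a stand-in, and the site names, rule keys, and `extract_headlines` function are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical per-site rules: which tag and CSS class wrap each headline link.
SITE_RULES = {
    "site-a.example": {
        "base": "https://site-a.example/",
        "tag": "h2",
        "css_class": "headline",
    },
    "site-b.example": {
        "base": "https://site-b.example/",
        "tag": "div",
        "css_class": "story-title",
    },
}


class HeadlineParser(HTMLParser):
    """Collect links found inside the container described by one site rule.

    Limitation: containers of the same tag nested inside each other are
    not tracked, which is fine for a sketch but not for messy markup.
    """

    def __init__(self, rule):
        super().__init__()
        self.rule = rule
        self.inside = False   # True while inside a matching container
        self.current = None   # the link currently being collected
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == self.rule["tag"] and self.rule["css_class"] in classes:
            self.inside = True
        elif tag == "a" and self.inside and attrs.get("href"):
            url = urljoin(self.rule["base"], attrs["href"])
            self.current = {"url": url, "title": ""}

    def handle_data(self, data):
        if self.current is not None:
            self.current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            # Collapse runs of whitespace in the link text.
            self.current["title"] = " ".join(self.current["title"].split())
            self.results.append(self.current)
            self.current = None
        elif tag == self.rule["tag"]:
            self.inside = False


def extract_headlines(site, html):
    """Apply the rule registered for `site` to a page of HTML."""
    parser = HeadlineParser(SITE_RULES[site])
    parser.feed(html)
    parser.close()
    return parser.results
```

Adding a new source then means adding one entry to `SITE_RULES` rather than writing new code, which is roughly the appeal of a configurable bot.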

born2run

11:49 am on Dec 14, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks. Has nobody used public news API services for collecting news? Or can’t it be done this way?

robzilla

12:15 pm on Dec 14, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Why couldn't it be done that way? An API is just another means of getting the same news. Buy the paper, crawl the news sites, read the RSS feeds, connect to an API: they're all ways to get the same data. Most likely, that (annoying) site uses an API and/or its own bot, but you'd have to ask them to be sure :-)

graeme_p

12:34 pm on Dec 14, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Depends on whether the API matches your needs. Does the API crawl the sources you need? Can you rely on it being available? Does it update fast enough?

engine

5:08 pm on Dec 14, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The other thing about APIs is you're often hampered by limits, so it's worth bearing that in mind.
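One common way to live with API limits is to space requests out on the client side. The sketch below assumes a made-up quota (the numbers are placeholders, not any real service's limits), and the `Throttle` class is hypothetical. The clock and sleep functions are injectable so the logic can be checked without actually waiting.

```python
import time


class Throttle:
    """Space calls evenly so that at most `max_calls` happen per `per_seconds`."""

    def __init__(self, max_calls, per_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = per_seconds / max_calls
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = float("-inf")  # the first call never waits

    def wait(self):
        """Block until the next call is allowed; return the seconds waited."""
        now = self.clock()
        delay = max(0.0, self.next_allowed - now)
        if delay:
            self.sleep(delay)
        self.next_allowed = now + delay + self.min_interval
        return delay


if __name__ == "__main__":
    # Placeholder quota: 2 calls per second.
    throttle = Throttle(max_calls=2, per_seconds=1)
    for n in range(3):
        waited = throttle.wait()
        print(f"request {n} sent after waiting {waited:.2f}s")
```

Calling `throttle.wait()` before each API request keeps the client under the assumed quota. Daily or monthly caps would need a separate counter.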

Many of the best aggregators are quite sophisticated these days.

born2run

8:06 am on Dec 16, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The other thing is, if you crawl the website, won’t the website block the bot?

engine

11:00 am on Dec 16, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Well, yes, and you'll probably find that the major services block most bots.

The RSS feed could be the solution.

born2run

4:49 pm on Dec 18, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And what if the website doesn’t have an RSS feed? How does one automate collecting news from such websites to aggregate into my own website?

not2easy

5:06 pm on Dec 18, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Ask them? They may not be aware that others could share their news. Some sites may offer other options.

graeme_p

5:01 pm on Dec 19, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



They may not block you, depending on how much you crawl. If you are just getting headlines, it may not be that much. If you are crawling the whole site, you probably will get blocked. Follow the directions in robots.txt; robots.txt may also point to sitemaps (XML ones) as an alternative to RSS.
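The robots.txt advice can be checked in code with Python's standard `urllib.robotparser`. In this sketch the robots.txt content, the `news.example` URLs, and the bot name are all made up. A real bot would download the file from the site instead of using a string, and `site_maps()` needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real bot would fetch /robots.txt from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10

Sitemap: https://news.example/sitemap-news.xml
"""

AGENT = "ExampleAggregatorBot"  # hypothetical bot name


def load_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

if __name__ == "__main__":
    for url in ("https://news.example/world/story.html",
                "https://news.example/private/draft.html"):
        print(url, "allowed" if rules.can_fetch(AGENT, url) else "disallowed")
    print("crawl delay:", rules.crawl_delay(AGENT))
    print("sitemaps:", rules.site_maps())
```

The sitemap URLs returned by `site_maps()` can then be fetched and parsed much like a feed when a site has no RSS.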

There are lots of ways to get round bot blockers. There is a whole industry around doing this. The best way is if they know and decide they want to allow your bot.

A lot of news sites have RSS feeds.

mack

10:47 pm on Dec 19, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I experimented with a news aggregation site many years back, around 2004 (ish). It was really an RSS reader that I wrote specifically to deal with news sources. I had a cronjob that would download the URL, title and description every 20 mins (if the feed was modified).
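The "only if the feed was modified" part is typically done with an HTTP conditional GET. The following is a sketch of that idea, not the original 2004 reader: the function names, the user-agent string, and the `state` dictionary are assumptions, and it relies on the server sending `ETag` or `Last-Modified` headers.

```python
import urllib.error
import urllib.request


def conditional_headers(state):
    """Build request headers from the validators saved on the previous run."""
    headers = {"User-Agent": "ExampleFeedReader/0.1"}  # hypothetical name
    if state.get("etag"):
        headers["If-None-Match"] = state["etag"]
    if state.get("last_modified"):
        headers["If-Modified-Since"] = state["last_modified"]
    return headers


def fetch_if_modified(url, state, opener=urllib.request.urlopen):
    """Return the feed body, or None if the server answers 304 Not Modified.

    `state` is a dict kept between runs (for example saved to disk); it is
    updated in place with the latest ETag and Last-Modified values.
    """
    request = urllib.request.Request(url, headers=conditional_headers(state))
    try:
        with opener(request, timeout=30) as response:
            body = response.read()
            state["etag"] = response.headers.get("ETag")
            state["last_modified"] = response.headers.get("Last-Modified")
            return body
    except urllib.error.HTTPError as err:
        if err.code == 304:  # unchanged since the last fetch
            return None
        raise
```

A crontab entry along the lines of `*/20 * * * * python3 fetch_feeds.py` (script name hypothetical) would run such a script every 20 minutes, and a `None` result means there is nothing new to parse.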

Mack.