RSS Scraper is a test application that saves RSS feeds to a database and lets users view and manage the feeds they've added to the system through an API.
Feeds (and their items) are updated asynchronously and periodically by an unattended background task.
There is also a notification service designed to support any notification type; it can be extended, but only email is supported for now.
Technically, it is a Flask application with a PostgreSQL database and Celery with a Redis broker for asynchronous tasks.
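To give a feel for how the pieces fit together, here is a minimal, hypothetical sketch of wiring a periodic update task with Celery beat. The module layout, task body, and helper are illustrative assumptions, not the repo's actual code:

```python
# sketch.py -- a minimal illustration, not the repo's actual module
import os
from celery import Celery

celery = Celery(
    "worker",
    broker=os.environ.get("CELERY_BROKER_URL", "redis://localhost:6379/0"),
)

# Re-run the update task on a fixed interval (cf. SCHEDULE_INTERVAL_SEC below).
celery.conf.beat_schedule = {
    "update-all-feeds": {
        "task": "sketch.update_feeds",
        "schedule": float(os.environ.get("SCHEDULE_INTERVAL_SEC", 10)),
    },
}

def fetch_and_store(feed_url):
    """Placeholder for fetching one feed and upserting its items."""

@celery.task(bind=True, max_retries=int(os.environ.get("MAX_RETRIES", 3)))
def update_feeds(self):
    try:
        for feed_url in []:  # placeholder: iterate over the followed feeds
            fetch_and_store(feed_url)
    except Exception as exc:
        # Retry with a delay, up to MAX_RETRIES attempts.
        raise self.retry(exc=exc, countdown=30)
```

Beat enqueues the task on the interval and the worker executes it, so slow or failing fetches never block the web app.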
Please keep in mind a few shortcuts in my implementation, made to save time because, as mentioned before, I currently have a full-time job:
- I didn't represent all fields of feeds and items; extending the models is easy but takes time.
- I didn't cover all endpoints with tests, but I covered some of them to give you an idea of my approach.
- I didn't have time to finish the tests for the background update service, but the service itself works well. If this turns out to be a decisive point, I can add them later.
This repo provides a quick-start docker-compose file for hassle-free usage. Prerequisites:
- docker engine (https://docs.docker.com/engine/install/)
- docker-compose (https://docs.docker.com/compose/install/)
- Clone repo:
git clone https://github.com/ilyashatalov/rss-scraper.git
- Start application:
docker-compose up -d
Don't worry if the app and celery services show errors right after the command starts; they will come up as soon as the container is built.
It will:
- Pull the Redis and PostgreSQL images from Docker Hub.
- Create a persistent Docker volume for PostgreSQL.
- Build the container for the main app, the Celery worker, and Celery beat.
- Wire the services together.
- Forward ports (port 80 for the main app). The PostgreSQL and Redis ports are forwarded too, but you can disable them by commenting out those lines in the docker-compose file; the application uses the Docker network for these connections by default.
You can provide your own configuration through the docker-compose file by passing it as environment variables to the Docker services (this method has the highest priority). For example:
- app:
SQLALCHEMY_DATABASE_URI=<sql alchemy url>
- celery worker and beat (default values shown):
# SCHEDULER
MAX_RETRIES=3
SCHEDULE_INTERVAL_SEC=10
# Celery
CELERY_BROKER_URL=redis://localhost:6379/0
# Notifier
NOTIFICATION=False
NOTIFICATION_TYPE=email
SMTP_SERVER=
SMTP_PORT=
SMTP_LOGIN=
SMTP_PASSWORD=
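For illustration only, the worker side could consume these variables with plain environment lookups; `load_notifier_config` is a hypothetical name, but the variable names and defaults mirror the block above:

```python
import os

def load_notifier_config():
    """Read notifier settings from the environment, using the defaults above."""
    return {
        "enabled": os.environ.get("NOTIFICATION", "False").lower() == "true",
        "type": os.environ.get("NOTIFICATION_TYPE", "email"),
        "smtp_server": os.environ.get("SMTP_SERVER", ""),
        "smtp_port": int(os.environ.get("SMTP_PORT") or 0),
        "smtp_login": os.environ.get("SMTP_LOGIN", ""),
        "smtp_password": os.environ.get("SMTP_PASSWORD", ""),
    }
```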
The application includes integrated API documentation, available at http://127.0.0.1/apidocs/ once the main app has started.
- Follow a feed:
curl -X POST localhost/feeds/follow -H 'Content-Type: application/json' -d '{"name": "first", "url": "https://feeds.feedburner.com/tweakers/mixed"}'
Result:
{
  "message": {
    "id": 1,
    "name": "first",
    "url": "https://feeds.feedburner.com/tweakers/mixed"
  },
  "success": true
}
If you send this request again, you will get an error (url and name must be unique):
{
  "message": "name or url already exists",
  "success": false
}
- Update items for a feed from the remote source. There are two ways: wait for the automatic update, or force one.
The follow method doesn't pull items during the /feeds/follow request, because doing so could take a long time or fail; the client is expected to trigger the update itself.
curl localhost:5000/feeds/1/update -X POST
Result:
{ "success": true }
- Get items for this feed:
curl localhost:5000/feeds/1/items
Result:
{
  "message": [
    {
      "id": 40,
      "last_updated": "2023-03-10 14:20:33",
      "title": "Videokaart Best Buy Guide - Update maart 2023",
      "unread": true,
      "url": "https://tweakers.net/reviews/10946/videokaart-best-buy-guide-update-maart-2023.html"
    },
    {
      "id": 39,
      "last_updated": "2023-03-10 14:20:33",
      "title": "Tweakers Podcast #259 - Roltelefoons, early access-scams en socialemediaverboden",
      "unread": true,
      "url": "https://tweakers.net/geek/207410/tweakers-podcast-259-roltelefoons-early-access-scams-en-socialemediaverboden.html"
    },
    ...
  ],
  "success": true
}
- All items are unread. Let's mark one as read:
curl -X PATCH localhost/items/1 -d '{"unread": "false"}' -H 'Content-Type: application/json'
Result:
{ "success": true }
- Now we can filter the items by the unread flag:
curl "localhost/feeds/1/items?unread=false"
Result:
{
  "message": [
    {
      "feed_id": 1,
      "id": 1,
      "last_updated": "2023-03-10 15:17:35",
      "title": "Nintendo toont nieuwe trailer The Super Mario Bros. Movie",
      "unread": false,
      "url": "https://tweakers.net/geek/207496/nintendo-toont-nieuwe-trailer-the-super-mario-bros-movie.html"
    }
  ],
  "success": true
}
- Follow another feed and force an update:
curl -X POST localhost/feeds/follow -H 'Content-Type: application/json' -d '{"name": "nu.nl", "url": "http://www.nu.nl/rss/Algemeen"}'
curl localhost/feeds/2/update -X POST
- Read one item from the new feed:
curl -X PATCH localhost/items/41 -d '{"unread": "false"}' -H 'Content-Type: application/json'
- Get all already-read items (from all feeds):
curl "localhost/items?unread=false"
Result:
{
  "message": [
    {
      "feed_id": 2,
      "id": 41,
      "last_updated": "2023-03-10 15:32:32",
      "title": "Kabinet schrapt per direct de laatste coronaregels",
      "unread": false,
      "url": "https://www.nu.nl/coronavirus/6252775/kabinet-schrapt-per-direct-de-laatste-coronaregels.html"
    },
    {
      "feed_id": 1,
      "id": 1,
      "last_updated": "2023-03-10 15:17:35",
      "title": "Nintendo toont nieuwe trailer The Super Mario Bros. Movie",
      "unread": false,
      "url": "https://tweakers.net/geek/207496/nintendo-toont-nieuwe-trailer-the-super-mario-bros-movie.html"
    }
  ],
  "success": true
}
First, you need Redis and PostgreSQL servers, so you can use the provided docker-compose file again. Don't forget to change the variables in the compose file, e.g. the database credentials:
docker-compose up redis db -d
Install the project requirements and pytest:
pip install -r requirements.txt pytest
Configure environment variables
cd app
cp env .env
vim .env
cd worker
cp env .env
vim .env
Start app in dev mode
flask run --reload
Start celery worker and beat
celery -A worker.celery worker -B -l debug
Be careful: the database is cleaned up after every test, so it's better to use a separate database for test runs. That is simple with the dotenv configuration: just pass the URI as an environment variable (it takes priority). But don't forget to create the database first.
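One possible way to wire that default into the test run is an autouse pytest fixture like the hypothetical sketch below; the repo may organise its fixtures differently:

```python
# conftest.py -- hypothetical sketch, not the repo's actual test setup
import os
import pytest

@pytest.fixture(autouse=True)
def test_database(monkeypatch):
    # Respect a URI passed on the command line; otherwise fall back
    # to a dedicated test database that must already exist.
    uri = os.environ.get(
        "SQLALCHEMY_DATABASE_URI",
        "postgresql://rssapp:rssapp@localhost:5432/rssscraper_test",
    )
    monkeypatch.setenv("SQLALCHEMY_DATABASE_URI", uri)
    yield
```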
SQLALCHEMY_DATABASE_URI="postgresql://rssapp:rssapp@localhost:5432/rssscraper_test" pytest -s -W ignore::DeprecationWarning worker/tests.py
Result
====================================== test session starts ==============================
platform darwin -- Python 3.10.1, pytest-7.2.2, pluggy-1.0.0
rootdir: /Users/ish/Git/github.com/ilyashatalov/rss-scraper
collected 5 items
app/tests.py .....
===================================== 5 passed in 0.42s ==================================