Scraping from Arabic news #2016

Draft: wants to merge 6 commits into base: dev

Conversation

RabeaAffan24

Next steps:

  • Fetch the full archive (about 300 pages)
  • Generate JSON (JSON serialisation)
  • Integrate Panet scraping into the Anyway ETL
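
The JSON serialisation step could look roughly like the sketch below. The `ScrapedArticle` class and the `articles_to_json` helper are hypothetical names, and the fields are assumed from the ones visible in the diff; this is not the PR's actual code.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ScrapedArticle:
    # Hypothetical record; field names are assumed from the diff in this PR.
    article_pub_date: str
    abstract: str
    title: str


def articles_to_json(articles: List[ScrapedArticle]) -> str:
    # ensure_ascii=False keeps Arabic text readable instead of \uXXXX escapes.
    return json.dumps([asdict(a) for a in articles], ensure_ascii=False, indent=2)
```

Returning a string rather than writing to a local file leaves the choice of destination (file, database, queue) to the caller.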

@atalyaalon atalyaalon requested review from shaysw and ziv17 November 29, 2021 20:23
@atalyaalon
Collaborator

@BusinessLanguage looks good!
However, right now this code writes to a local file, and I'm not sure we want to merge it that way.
We can merge just to make sure the code is in our repo, but it's not code that will run in prod.
@ziv17 @shaysw any thoughts?

@ziv17
Collaborator

ziv17 commented Nov 30, 2021

Very nice!

Do we want to use these accidents for our reports and infographics?
If yes, we need to add them to our database. To do this, I think we need to:

  • Map the values of entities in the results file to those used in our database (e.g. injury severity, street name, accident severity). For these entities we currently use numeric codes with English names in the code, translated to Hebrew using pybabel.
  • Add the data (accidents, injured, etc.) to our database.
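
As a rough illustration of the mapping step, here is a minimal sketch. The labels and numeric codes are placeholders, not the real values from our database, and `to_severity_code` is a hypothetical helper.

```python
from typing import Dict, Optional

# Placeholder codes: the real values must be taken from the database schema.
INJURY_SEVERITY_CODES: Dict[str, int] = {
    "killed": 1,
    "severe": 2,
    "light": 3,
}


def to_severity_code(label: str) -> Optional[int]:
    # Normalise whitespace and case before the lookup; unknown labels give None.
    return INJURY_SEVERITY_CODES.get(label.strip().lower())
```

Returning `None` for unknown labels lets the caller decide whether to skip the record or flag it for review.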

@atalyaalon atalyaalon marked this pull request as draft January 23, 2022 17:44
abstract: str
title: str

def __init__(self, _article_pub_date, _abstract, _title):
Collaborator

Hi, I would like us to use type hints in function parameters and variables.
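
For example, something along these lines. The class name `Article` is hypothetical, and the publication date is assumed to be a string; the actual types in the PR may differ.

```python
class Article:
    # Hypothetical class name; attribute names follow the diff above.
    article_pub_date: str
    abstract: str
    title: str

    def __init__(self, article_pub_date: str, abstract: str, title: str) -> None:
        # Type hints on parameters document the expected inputs.
        self.article_pub_date = article_pub_date
        self.abstract = abstract
        self.title = title
```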

Comment on lines +32 to +34
for groupNum in range(0, len(match.groups())):
    groupNum = groupNum + 1
    f1.write(match.group(groupNum))
Collaborator

Hi,

  • In Python we use lowercase with underscores for variable and function/method names.
  • I prefer not to reassign the loop variable inside the loop. It is better to use a different variable.
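
One possible rewrite of the loop, as a sketch. The wrapper function `write_groups` and its parameter names are hypothetical; only the loop body corresponds to the lines under review.

```python
import io
import re
from typing import TextIO


def write_groups(match: "re.Match[str]", out: TextIO) -> None:
    # Groups are numbered from 1, so start the range there instead of
    # incrementing the loop variable inside the body.
    for group_num in range(1, len(match.groups()) + 1):
        out.write(match.group(group_num))
```

Iterating directly with `for group_text in match.groups():` would avoid the index altogether.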

response = requests.get(url)
print(response.status_code)

api_key = "AIzaSyD_B16MmHv7mfQNKSanibF_S2ofJgI6Pc0"
Collaborator

Are we OK that the API key is in our code, in a public repo?
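
A common alternative is to read the key from the environment at runtime. The sketch below assumes a hypothetical variable name, `GOOGLE_API_KEY`, and a hypothetical helper, `get_api_key`; neither is part of this PR.

```python
import os


def get_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    # The variable name is an assumption; use whatever the deployment defines.
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable")
    return key
```

A key that has already been pushed to a public repo should be treated as compromised and rotated.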

Author

Still working on this one.
Accidentally pushed it to the PR.

@@ -0,0 +1,9 @@
import requests

Collaborator

Hi, the style we use for file names is lowercase with underscore between words.

Collaborator

@ziv17 ziv17 left a comment

Hi, well done!
See some technical comments below.
How is this code going to be incorporated into our application? I think it's worth a discussion.

@RabeaAffan24
Author

@ziv17 Thank you for your comments. I will address those issues soon.

Regarding your question: the obtained data will be incorporated into the database after the newsflash has been translated (using the Google API), and will then undergo the same process as your mainstream data.

4 participants