Add Internshala scraper for internships and jobs #311
base: main
@@ -0,0 +1,290 @@
```python
from __future__ import annotations

import random
import time
from datetime import datetime, timedelta, date
from typing import Optional
from urllib.parse import urljoin, quote

from bs4 import BeautifulSoup
from bs4.element import Tag

from jobspy.exception import InternshalaException
from jobspy.internshala.constant import headers
from jobspy.internshala.util import (
    find_job_cards,
    parse_location,
    parse_posted_ago,
    parse_stipend,
)
from jobspy.model import (
    JobPost,
    Location,
    JobResponse,
    Country,
    Compensation,
    DescriptionFormat,
    Scraper,
    ScraperInput,
    Site,
    JobType,
)
from jobspy.util import (
    extract_emails_from_text,
    markdown_converter,
    plain_converter,
    create_session,
    create_logger,
)

log = create_logger("Internshala")


class Internshala(Scraper):
    base_url = "https://internshala.com"
    delay = 2
    band_delay = 3

    def __init__(
        self, proxies: list[str] | str | None = None, ca_cert: str | None = None, user_agent: str | None = None
    ):
        super().__init__(Site.INTERNSHALA, proxies=proxies, ca_cert=ca_cert, user_agent=user_agent)
        self.session = create_session(
            proxies=self.proxies,
            ca_cert=ca_cert,
            is_tls=False,
            has_retry=True,
            delay=5,
            clear_cookies=True,
        )
        self.session.headers.update(headers)
        if user_agent:
            self.session.headers["user-agent"] = user_agent
        self.scraper_input: ScraperInput | None = None
        self.country = Country.INDIA

    def scrape(self, scraper_input: ScraperInput) -> JobResponse:
        self.scraper_input = scraper_input
        self.country = Country.INDIA
        job_list: list[JobPost] = []
        seen_urls: set[str] = set()

        hours_limit = scraper_input.hours_old
        posted_cutoff: Optional[datetime] = None
        if hours_limit is not None:
            posted_cutoff = datetime.utcnow() - timedelta(hours=hours_limit)

        results_wanted = scraper_input.results_wanted or 15

        query = (scraper_input.internshala_search_term or scraper_input.search_term or "").strip()
        paths: list[tuple[str, str]]
        if query:
            encoded_query = quote(query.lower(), safe="")
```
Suggested change:
```diff
-            encoded_query = quote(query.lower(), safe="")
+            slug = "-".join(query.lower().split())
+            encoded_query = quote(slug, safe="-")
```
Copilot AI (Dec 26, 2025):
When a request exception occurs on line 111, an InternshalaException is raised, which will immediately terminate the scraping process for all remaining pages and the other listing type (job/internship). This is inconsistent with the error handling for card processing (lines 133-138) which continues on errors. Consider handling request failures more gracefully by breaking only for the current path and allowing the scraper to try the other listing type, or by retrying failed requests.
Suggested change:
```diff
-                raise InternshalaException(str(e))
+                break
```
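The per-path recovery Copilot describes can be sketched as follows. This is a hypothetical standalone sketch, not the PR's actual code: `paths`, `fetch_page`, and `max_pages` are stand-in names for the scraper's listing types, page-fetch call, and pagination loop. A request failure abandons only the current listing type instead of raising and terminating the whole scrape.

```python
def scrape_all(paths, fetch_page, max_pages=3):
    """Collect results across listing types, tolerating per-path failures.

    paths: list of (name, url) tuples, e.g. the internship and job listings.
    fetch_page: callable (url, page) -> list of parsed cards; may raise.
    """
    job_list = []
    for path_name, path_url in paths:
        for page in range(1, max_pages + 1):
            try:
                cards = fetch_page(path_url, page)
            except Exception as e:
                # Log and fall through to the next listing type rather than
                # raising an exception that kills the entire scrape.
                print(f"{path_name} page {page} failed: {e}; skipping rest of this path")
                break  # abandon only the current path
            job_list.extend(cards)
    return job_list
```

With this shape, a network error on the internships path still lets the jobs path run to completion, matching the continue-on-error behavior the card-processing loop already has.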
Copilot AI (Dec 26, 2025):
The use of datetime.utcnow() is deprecated as of Python 3.12 in favor of datetime.now(timezone.utc). While this still works, consider updating to the recommended approach for future compatibility.
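The recommended replacement can be written as a small helper. This is an illustrative sketch (the helper name `posted_cutoff` is an assumption mirroring the PR's variable of the same name): it computes the same cutoff with a timezone-aware UTC datetime instead of the deprecated `datetime.utcnow()`.

```python
from datetime import datetime, timedelta, timezone

def posted_cutoff(hours_old):
    """Return the oldest acceptable posting time as an aware UTC datetime,
    or None when no age filter was requested."""
    if hours_old is None:
        return None
    # datetime.now(timezone.utc) is the non-deprecated equivalent of utcnow()
    return datetime.now(timezone.utc) - timedelta(hours=hours_old)
```

One caveat: comparing this aware datetime against naive datetimes parsed from the site will raise `TypeError`, so parsed posting dates would need to be made UTC-aware as well.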
Copilot AI (Dec 26, 2025):
The parameter name linkedin_fetch_description is misleading when used for Internshala scraping. While reusing this existing parameter may be intentional to avoid adding another parameter, the name suggests it's LinkedIn-specific. Consider documenting this behavior in the PR or README, or consider using a more generic parameter name in future refactoring.
Suggested change:
```diff
 description: str | None = None
+# NOTE: `linkedin_fetch_description` is used as a generic "fetch full description"
+# flag across sites (including Internshala), despite its LinkedIn-specific name.
```
Copilot AI (Dec 26, 2025):
Using hash(job_url) for ID generation can produce negative values and is not guaranteed to be stable across Python runs (hash randomization). Other scrapers in the codebase use explicit job IDs from the site (e.g., "li-{job_id}", "nk-{job_id}"). Consider extracting a stable ID from the job URL path (e.g., the job detail ID) or using a URL-safe hash function like hashlib.md5 or just using the URL path segment as the identifier.
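Both alternatives Copilot mentions can be combined in one helper. This is a hypothetical sketch, not the PR's code: the `in-` prefix is an assumed convention mirroring the `li-`/`nk-` prefixes cited above, and the trailing-numeric-ID URL shape is an assumption about Internshala detail URLs. It prefers a stable ID taken from the URL path and falls back to `hashlib.md5`, which, unlike `hash()`, is deterministic across runs and never negative.

```python
import hashlib
from urllib.parse import urlparse

def stable_job_id(job_url: str) -> str:
    """Derive a stable, non-negative job ID from a detail-page URL."""
    path = urlparse(job_url).path.rstrip("/")
    # Internshala-style detail URLs often end in "-<numeric id>" (assumption).
    last = path.rsplit("-", 1)[-1] if "-" in path else ""
    if last.isdigit():
        return f"in-{last}"
    # Fallback: md5 is stable across processes, unlike the salted built-in hash().
    return "in-" + hashlib.md5(job_url.encode()).hexdigest()[:16]
```

Either branch yields the same ID for the same URL on every run, so deduplication and downstream storage keys stay stable.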
Copilot AI (Dec 26, 2025):
The description fetching logic on line 279 searches for headers containing "about the internship", but this won't work for job postings which may have different header text like "about the job" or "job description". Consider using a more generic pattern or checking for multiple header variations to properly handle both internships and jobs.
Suggested change:
```diff
-        about_header = soup.find(lambda tag: tag.name in ["h2", "h3"] and "about the internship" in tag.get_text(strip=True).lower())
+        patterns = ["about the internship", "about the job", "job description", "about the opportunity"]
+        about_header = soup.find(
+            lambda tag: tag.name in ("h2", "h3")
+            and any(pattern in tag.get_text(strip=True).lower() for pattern in patterns)
+        )
```
Copilot AI (Dec 26, 2025):
Using find_all_next() without any arguments will traverse all following elements in the entire document, which can be inefficient for large HTML pages. Consider using find_next_siblings() or limiting the search with a limit parameter to improve performance.
Suggested change:
```diff
-            for sib in about_header.find_all_next():
+            for sib in about_header.find_next_siblings():
```
The example shows setting
internshala_search_term="software engineer"for a location-based search in San Francisco. However, as noted in the PR description, Internshala is India-focused and ignores the location parameter. This example may confuse users as it suggests Internshala will respect the San Francisco location. Consider clarifying that Internshala will search in India regardless of the location parameter, or provide a separate example for Internshala usage.