Add Internshala scraper for internships and jobs #311
base: main
Conversation
Feat: Implement Internshala scraping
Pull request overview
This PR adds support for scraping job postings and internships from Internshala, an India-focused job board platform. The implementation follows the existing scraper pattern established in the codebase and includes comprehensive documentation updates.
Key changes:
- Added Internshala as a new site option alongside existing job boards (LinkedIn, Indeed, etc.)
- Implemented a dedicated scraper that handles both internship and job listings from Internshala.com
- Added an `internshala_search_term` parameter to allow Internshala-specific search queries
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| `jobspy/model.py` | Added `INTERNSHALA` to the `Site` enum and an `internshala_search_term` field to the `ScraperInput` model |
| `jobspy/internshala/util.py` | Utility functions for parsing job cards, locations, posting dates, and compensation (stipend/salary) in INR |
| `jobspy/internshala/constant.py` | HTTP headers configuration for Internshala requests |
| `jobspy/internshala/__init__.py` | Main scraper implementation with pagination, dual listing-type support (internships and jobs), and optional description fetching |
| `jobspy/exception.py` | Added `InternshalaException` for Internshala-specific error handling |
| `jobspy/__init__.py` | Integrated the Internshala scraper into the main scraping interface and parameter handling |
| `README.md` | Updated documentation with Internshala support, usage examples, parameter descriptions, and India-specific notes |
```python
if rupee_idx != -1:
    prefix = text[:rupee_idx]

    matches = re.findall(r"\b([A-Z][a-zA-Z]+(?:,\s*[A-Z][a-zA-Z]+)*)\b", prefix)
```
Copilot AI (Dec 26, 2025):
The regex pattern r"\b([A-Z][a-zA-Z]+(?:,\s*[A-Z][a-zA-Z]+)*)\b" only matches comma-separated single words, so multi-word locations (like "New Delhi") are split into separate matches, and names containing special characters (periods, ampersands, hyphens) are not captured in full. Consider using a more flexible pattern or additional parsing logic to handle various location name formats.
```python
# before
matches = re.findall(r"\b([A-Z][a-zA-Z]+(?:,\s*[A-Z][a-zA-Z]+)*)\b", prefix)
# after
matches = re.findall(
    r"(?i)\b([a-z][a-z.&'-]*(?:\s+[a-z][a-z.&'-]*)*(?:,\s*[a-z][a-z.&'-]*(?:\s+[a-z][a-z.&'-]*)*)*)\b",
    prefix,
)
```
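To make the difference concrete, here is a small check of the two patterns side by side (a sketch; the merged code may use a different pattern). The original splits multi-word names, while the suggested case-insensitive pattern keeps them together:

```python
import re

# Original pattern from the PR: single capitalized words only.
ORIGINAL = re.compile(r"\b([A-Z][a-zA-Z]+(?:,\s*[A-Z][a-zA-Z]+)*)\b")

# Suggested flexible pattern: case-insensitive, allows multi-word names
# and a few common punctuation characters inside words.
FLEXIBLE = re.compile(
    r"(?i)\b([a-z][a-z.&'-]*(?:\s+[a-z][a-z.&'-]*)*"
    r"(?:,\s*[a-z][a-z.&'-]*(?:\s+[a-z][a-z.&'-]*)*)*)\b"
)

print(ORIGINAL.findall("New Delhi"))  # → ['New', 'Delhi'] (split apart)
print(FLEXIBLE.findall("New Delhi"))  # → ['New Delhi'] (kept whole)
```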
```python
about_header = soup.find(
    lambda tag: tag.name in ["h2", "h3"]
    and "about the internship" in tag.get_text(strip=True).lower()
)
if about_header:
    desc_parts: list[str] = []
    for sib in about_header.find_all_next():
```
Copilot AI (Dec 26, 2025):
Using find_all_next() without any arguments will traverse all following elements in the entire document, which can be inefficient for large HTML pages. Consider using find_next_siblings() or limiting the search with a limit parameter to improve performance.
```python
# before
for sib in about_header.find_all_next():
# after
for sib in about_header.find_next_siblings():
```
```python
listing_type = "internship"

job_post = JobPost(
    id=f"internshala-{hash(job_url)}",
```
Copilot AI (Dec 26, 2025):
Using hash(job_url) for ID generation can produce negative values and is not guaranteed to be stable across Python runs (hash randomization). Other scrapers in the codebase use explicit job IDs from the site (e.g., "li-{job_id}", "nk-{job_id}"). Consider extracting a stable ID from the job URL path (e.g., the job detail ID) or using a URL-safe hash function like hashlib.md5 or just using the URL path segment as the identifier.
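A minimal sketch of the stable-ID approach the review suggests, preferring the URL path segment and falling back to `hashlib.md5`. The helper name `stable_internshala_id` is hypothetical, not part of the PR:

```python
import hashlib
from urllib.parse import urlparse

def stable_internshala_id(job_url: str) -> str:
    # Prefer the last path segment (Internshala detail URLs end in a slug),
    # which is stable across runs, unlike the built-in hash().
    segment = urlparse(job_url).path.rstrip("/").rsplit("/", 1)[-1]
    if segment:
        return f"internshala-{segment}"
    # Fallback: a short md5 digest of the full URL (deterministic, non-negative).
    return f"internshala-{hashlib.md5(job_url.encode()).hexdigest()[:12]}"
```

Unlike `hash()`, which is randomized per interpreter run for strings, both branches yield the same ID every time the scraper runs.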
```python
# before
site_name=["indeed", "linkedin", "zip_recruiter", "google"],  # "glassdoor", "bayt", "naukri", "bdjobs"
# after
site_name=["indeed", "linkedin", "zip_recruiter", "google", "internshala"],  # "glassdoor", "bayt", "naukri", "bdjobs"
search_term="software engineer",
internshala_search_term="software engineer",
```
Copilot AI (Dec 26, 2025):
The example shows setting internshala_search_term="software engineer" for a location-based search in San Francisco. However, as noted in the PR description, Internshala is India-focused and ignores the location parameter. This example may confuse users as it suggests Internshala will respect the San Francisco location. Consider clarifying that Internshala will search in India regardless of the location parameter, or provide a separate example for Internshala usage.
```python
# before
internshala_search_term="software engineer",
# after
internshala_search_term="software engineer",  # Internshala is India-focused and ignores `location`; results will be for India
```
```python
query = (scraper_input.internshala_search_term or scraper_input.search_term or "").strip()
paths: list[tuple[str, str]]
if query:
    encoded_query = quote(query.lower(), safe="")
```
Copilot AI (Dec 26, 2025):
The URL encoding uses quote(query.lower(), safe="") which will percent-encode spaces and special characters. However, Internshala URLs typically use hyphens to separate words (e.g., "/keywords-software-engineer/"). The current implementation would create URLs like "/keywords-software%20engineer/" which may not match Internshala's expected format. Consider replacing spaces with hyphens before encoding, or verify that percent-encoded URLs work correctly with Internshala's search.
```python
# before
encoded_query = quote(query.lower(), safe="")
# after
slug = "-".join(query.lower().split())
encoded_query = quote(slug, safe="-")
```
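The suggested slug construction can be wrapped as a small helper to show the behavior (the function name is illustrative, not from the PR):

```python
from urllib.parse import quote

def internshala_keywords_slug(query: str) -> str:
    # Lowercase, collapse whitespace runs into single hyphens, then
    # percent-encode anything else; hyphens are kept literal (safe="-").
    slug = "-".join(query.lower().split())
    return quote(slug, safe="-")

print(internshala_keywords_slug("Software  Engineer"))  # → software-engineer
```

This yields URLs like `/internships/keywords-software-engineer/`, matching the hyphenated format Internshala uses, instead of `%20`-encoded spaces.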
```python
def __init__(
    self, proxies: list[str] | str | None = None, ca_cert: str | None = None, user_agent: str | None = None
):
    super().__init__(Site.INTERNSHALA, proxies=proxies, ca_cert=ca_cert, user_agent=user_agent)
```
Copilot AI (Dec 26, 2025):
The user_agent parameter is not passed to the parent class Scraper.__init__(). For consistency with other scrapers like Indeed and Naukri, consider passing user_agent=user_agent to the super().__init__() call.
```python
stipend_comp = parse_stipend(card)

description: str | None = None
```
Copilot AI (Dec 26, 2025):
The parameter name linkedin_fetch_description is misleading when used for Internshala scraping. While reusing this existing parameter may be intentional to avoid adding another parameter, the name suggests it's LinkedIn-specific. Consider documenting this behavior in the PR or README, or consider using a more generic parameter name in future refactoring.
```python
# before
description: str | None = None
# after
description: str | None = None
# NOTE: `linkedin_fetch_description` is used as a generic "fetch full description"
# flag across sites (including Internshala), despite its LinkedIn-specific name.
```
```python
soup = BeautifulSoup(resp.text, "html.parser")

about_header = soup.find(
    lambda tag: tag.name in ["h2", "h3"]
    and "about the internship" in tag.get_text(strip=True).lower()
)
```
Copilot AI (Dec 26, 2025):
The description fetching logic on line 279 searches for headers containing "about the internship", but this won't work for job postings which may have different header text like "about the job" or "job description". Consider using a more generic pattern or checking for multiple header variations to properly handle both internships and jobs.
```python
# before
about_header = soup.find(lambda tag: tag.name in ["h2", "h3"] and "about the internship" in tag.get_text(strip=True).lower())
# after
patterns = ["about the internship", "about the job", "job description", "about the opportunity"]
about_header = soup.find(
    lambda tag: tag.name in ("h2", "h3")
    and any(pattern in tag.get_text(strip=True).lower() for pattern in patterns)
)
```
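The matching logic in the suggestion can be extracted as a plain predicate, which makes it easy to unit-test without BeautifulSoup (a sketch; the merged code keeps the logic inline in the `soup.find` lambda):

```python
# Header texts that indicate the start of a description section, covering
# both internship and job detail pages.
PATTERNS = ["about the internship", "about the job", "job description", "about the opportunity"]

def is_description_header(tag_name: str, text: str) -> bool:
    # Mirrors the suggested lambda: restrict to h2/h3 and match any
    # known header phrase, case-insensitively.
    return tag_name in ("h2", "h3") and any(p in text.strip().lower() for p in PATTERNS)
```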
```python
hours_limit = scraper_input.hours_old
posted_cutoff: Optional[datetime] = None
if hours_limit is not None:
    posted_cutoff = datetime.utcnow() - timedelta(hours=hours_limit)
```
Copilot AI (Dec 26, 2025):
The use of datetime.utcnow() is deprecated as of Python 3.12 in favor of datetime.now(timezone.utc). While this still works, consider updating to the recommended approach for future compatibility.
```python
posted_ago = parse_posted_ago(card)
date_posted: Optional[date] = None
if posted_ago is not None:
    dt_posted = datetime.utcnow() - posted_ago
```
Copilot AI (Dec 26, 2025):
The use of datetime.utcnow() is deprecated as of Python 3.12 in favor of datetime.now(timezone.utc). While this still works, consider updating to the recommended approach for future compatibility.
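The recommended replacement for both `datetime.utcnow()` call sites looks like this:

```python
from datetime import datetime, timedelta, timezone

# datetime.now(timezone.utc) returns a timezone-aware datetime, unlike
# the deprecated datetime.utcnow(), which returns a naive one.
hours_limit = 24
posted_cutoff = datetime.now(timezone.utc) - timedelta(hours=hours_limit)
```

Note that this makes `posted_cutoff` timezone-aware, so any datetimes compared against it (e.g. the computed posting times) must also be aware, otherwise the comparison raises `TypeError`.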
Summary
I added support for scraping Internshala alongside the existing job boards.
The scraper collects both internships and full-time jobs from Internshala and returns them in the same JobSpy dataframe.
What changed
- New scraper that fetches `https://internshala.com/internships/keywords-<query>/` and `https://internshala.com/jobs/keywords-<query>/`.
- Posts carry `job_type` and `listing_type` so internships and jobs can be told apart.
- Added an `internshala_search_term` argument to `scrape_jobs`, falling back to `search_term`.
- Added `internshala` to the `site_name` options.
- Documented the `internshala_search_term` parameter and the updated `site_name` list.
- Added `listing_type` in the JobPost schema.

How it works
- `scrape_jobs` builds a `ScraperInput` as before and passes it to the Internshala scraper.
- Each results page is parsed for `individual_internship` containers.
- The scraper reads the `employment_type` attribute on the card to decide whether it is a job or an internship.
- It honors the same `hours_old` value as the other sites to filter older posts.
- Results come back as `JobPost` objects that are merged into the main dataframe.

Notes and limits
- `country_indeed` is not used for this site.
- The `location` passed to `scrape_jobs` is not applied on Internshala. The location comes from the text in each card.
- Stipend/salary text is parsed into the `Compensation` model.

Testing
- Ran with `site_name=["internshala"]` and with Internshala included together with other sites.