Add Internshala scraper for internships and jobs #311
base: main
Conversation
Feat: Implement Internshala scraping
Pull request overview
This PR adds support for scraping job postings and internships from Internshala, an India-focused job board platform. The implementation follows the existing scraper pattern established in the codebase and includes comprehensive documentation updates.
Key changes:
- Added Internshala as a new site option alongside existing job boards (LinkedIn, Indeed, etc.)
- Implemented a dedicated scraper that handles both internship and job listings from Internshala.com
- Added an `internshala_search_term` parameter to allow Internshala-specific search queries
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| `jobspy/model.py` | Added `INTERNSHALA` to the `Site` enum and an `internshala_search_term` field to the `ScraperInput` model |
| `jobspy/internshala/util.py` | Utility functions for parsing job cards, locations, posting dates, and compensation (stipend/salary) in INR |
| `jobspy/internshala/constant.py` | HTTP headers configuration for Internshala requests |
| `jobspy/internshala/__init__.py` | Main scraper implementation with pagination, dual listing-type support (internships and jobs), and optional description fetching |
| `jobspy/exception.py` | Added `InternshalaException` for Internshala-specific error handling |
| `jobspy/__init__.py` | Integrated the Internshala scraper into the main scraping interface and parameter handling |
| `README.md` | Updated documentation with Internshala support, usage examples, parameter descriptions, and India-specific notes |
```python
if rupee_idx != -1:
    prefix = text[:rupee_idx]

    matches = re.findall(r"\b([A-Z][a-zA-Z]+(?:,\s*[A-Z][a-zA-Z]+)*)\b", prefix)
```
Copilot AI (Dec 26, 2025):
The regex pattern r"\b([A-Z][a-zA-Z]+(?:,\s*[A-Z][a-zA-Z]+)*)\b" only matches comma-separated single words, so multi-word locations (like "New Delhi") are split into separate matches, and names containing special characters (periods, ampersands, hyphens) are not captured in full. Consider using a more flexible pattern or additional parsing logic to handle various location name formats.
```python
# before
matches = re.findall(r"\b([A-Z][a-zA-Z]+(?:,\s*[A-Z][a-zA-Z]+)*)\b", prefix)
# after
matches = re.findall(
    r"(?i)\b([a-z][a-z.&'-]*(?:\s+[a-z][a-z.&'-]*)*(?:,\s*[a-z][a-z.&'-]*(?:\s+[a-z][a-z.&'-]*)*)*)\b",
    prefix,
)
```
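To make the difference concrete, here is a small check of the two patterns side by side (a sketch; the merged code may use a different pattern). The original splits multi-word names, while the suggested case-insensitive pattern keeps them together:

```python
import re

# Original pattern from the PR: single capitalized words only.
ORIGINAL = re.compile(r"\b([A-Z][a-zA-Z]+(?:,\s*[A-Z][a-zA-Z]+)*)\b")

# Suggested flexible pattern: case-insensitive, allows multi-word names
# and a few common punctuation characters inside words.
FLEXIBLE = re.compile(
    r"(?i)\b([a-z][a-z.&'-]*(?:\s+[a-z][a-z.&'-]*)*"
    r"(?:,\s*[a-z][a-z.&'-]*(?:\s+[a-z][a-z.&'-]*)*)*)\b"
)

print(ORIGINAL.findall("New Delhi"))  # → ['New', 'Delhi'] (split apart)
print(FLEXIBLE.findall("New Delhi"))  # → ['New Delhi'] (kept whole)
```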
```python
about_header = soup.find(
    lambda tag: tag.name in ["h2", "h3"]
    and "about the internship" in tag.get_text(strip=True).lower()
)
if about_header:
    desc_parts: list[str] = []
    for sib in about_header.find_all_next():
```
Copilot AI (Dec 26, 2025):
Using find_all_next() without any arguments will traverse all following elements in the entire document, which can be inefficient for large HTML pages. Consider using find_next_siblings() or limiting the search with a limit parameter to improve performance.
```python
# before
for sib in about_header.find_all_next():
# after
for sib in about_header.find_next_siblings():
```
```python
listing_type = "internship"

job_post = JobPost(
    id=f"internshala-{hash(job_url)}",
```
Copilot AI (Dec 26, 2025):
Using hash(job_url) for ID generation can produce negative values and is not guaranteed to be stable across Python runs (hash randomization). Other scrapers in the codebase use explicit job IDs from the site (e.g., "li-{job_id}", "nk-{job_id}"). Consider extracting a stable ID from the job URL path (e.g., the job detail ID) or using a URL-safe hash function like hashlib.md5 or just using the URL path segment as the identifier.
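A minimal sketch of the stable-ID approach the review suggests, preferring the URL path segment and falling back to `hashlib.md5`. The helper name `stable_internshala_id` is hypothetical, not part of the PR:

```python
import hashlib
from urllib.parse import urlparse

def stable_internshala_id(job_url: str) -> str:
    # Prefer the last path segment (Internshala detail URLs end in a slug),
    # which is stable across runs, unlike the built-in hash().
    segment = urlparse(job_url).path.rstrip("/").rsplit("/", 1)[-1]
    if segment:
        return f"internshala-{segment}"
    # Fallback: a short md5 digest of the full URL (deterministic, non-negative).
    return f"internshala-{hashlib.md5(job_url.encode()).hexdigest()[:12]}"
```

Unlike `hash()`, which is randomized per interpreter run for strings, both branches yield the same ID every time the scraper runs.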
```python
# before
site_name=["indeed", "linkedin", "zip_recruiter", "google"],  # "glassdoor", "bayt", "naukri", "bdjobs"
# after
site_name=["indeed", "linkedin", "zip_recruiter", "google", "internshala"],  # "glassdoor", "bayt", "naukri", "bdjobs"
search_term="software engineer",
internshala_search_term="software engineer",
```
Copilot AI (Dec 26, 2025):
The example shows setting internshala_search_term="software engineer" for a location-based search in San Francisco. However, as noted in the PR description, Internshala is India-focused and ignores the location parameter. This example may confuse users as it suggests Internshala will respect the San Francisco location. Consider clarifying that Internshala will search in India regardless of the location parameter, or provide a separate example for Internshala usage.
```python
# before
internshala_search_term="software engineer",
# after
internshala_search_term="software engineer",  # Internshala is India-focused and ignores `location`; results will be for India
```
```python
query = (scraper_input.internshala_search_term or scraper_input.search_term or "").strip()
paths: list[tuple[str, str]]
if query:
    encoded_query = quote(query.lower(), safe="")
```
Copilot AI (Dec 26, 2025):
The URL encoding uses quote(query.lower(), safe="") which will percent-encode spaces and special characters. However, Internshala URLs typically use hyphens to separate words (e.g., "/keywords-software-engineer/"). The current implementation would create URLs like "/keywords-software%20engineer/" which may not match Internshala's expected format. Consider replacing spaces with hyphens before encoding, or verify that percent-encoded URLs work correctly with Internshala's search.
```python
# before
encoded_query = quote(query.lower(), safe="")
# after
slug = "-".join(query.lower().split())
encoded_query = quote(slug, safe="-")
```
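The suggested slug construction can be wrapped as a small helper to show the behavior (the function name is illustrative, not from the PR):

```python
from urllib.parse import quote

def internshala_keywords_slug(query: str) -> str:
    # Lowercase, collapse whitespace runs into single hyphens, then
    # percent-encode anything else; hyphens are kept literal (safe="-").
    slug = "-".join(query.lower().split())
    return quote(slug, safe="-")

print(internshala_keywords_slug("Software  Engineer"))  # → software-engineer
```

This yields URLs like `/internships/keywords-software-engineer/`, matching the hyphenated format Internshala uses, instead of `%20`-encoded spaces.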
```python
def __init__(
    self, proxies: list[str] | str | None = None, ca_cert: str | None = None, user_agent: str | None = None
):
    super().__init__(Site.INTERNSHALA, proxies=proxies, ca_cert=ca_cert, user_agent=user_agent)
```
Copilot AI (Dec 26, 2025):
The user_agent parameter is not passed to the parent class Scraper.__init__(). For consistency with other scrapers like Indeed and Naukri, consider passing user_agent=user_agent to the super().__init__() call.
```python
stipend_comp = parse_stipend(card)

description: str | None = None
```
Copilot AI (Dec 26, 2025):
The parameter name linkedin_fetch_description is misleading when used for Internshala scraping. While reusing this existing parameter may be intentional to avoid adding another parameter, the name suggests it's LinkedIn-specific. Consider documenting this behavior in the PR or README, or consider using a more generic parameter name in future refactoring.
```python
# before
description: str | None = None
# after
description: str | None = None
# NOTE: `linkedin_fetch_description` is used as a generic "fetch full description"
# flag across sites (including Internshala), despite its LinkedIn-specific name.
```
```python
soup = BeautifulSoup(resp.text, "html.parser")

about_header = soup.find(
    lambda tag: tag.name in ["h2", "h3"]
    and "about the internship" in tag.get_text(strip=True).lower()
)
```
Copilot AI (Dec 26, 2025):
The description fetching logic on line 279 searches for headers containing "about the internship", but this won't work for job postings which may have different header text like "about the job" or "job description". Consider using a more generic pattern or checking for multiple header variations to properly handle both internships and jobs.
```python
# before
about_header = soup.find(lambda tag: tag.name in ["h2", "h3"] and "about the internship" in tag.get_text(strip=True).lower())
# after
patterns = ["about the internship", "about the job", "job description", "about the opportunity"]
about_header = soup.find(
    lambda tag: tag.name in ("h2", "h3")
    and any(pattern in tag.get_text(strip=True).lower() for pattern in patterns)
)
```
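The matching logic in the suggestion can be extracted as a plain predicate, which makes it easy to unit-test without BeautifulSoup (a sketch; the merged code keeps the logic inline in the `soup.find` lambda):

```python
# Header texts that indicate the start of a description section, covering
# both internship and job detail pages.
PATTERNS = ["about the internship", "about the job", "job description", "about the opportunity"]

def is_description_header(tag_name: str, text: str) -> bool:
    # Mirrors the suggested lambda: restrict to h2/h3 and match any
    # known header phrase, case-insensitively.
    return tag_name in ("h2", "h3") and any(p in text.strip().lower() for p in PATTERNS)
```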
```python
hours_limit = scraper_input.hours_old
posted_cutoff: Optional[datetime] = None
if hours_limit is not None:
    posted_cutoff = datetime.utcnow() - timedelta(hours=hours_limit)
```
Copilot AI (Dec 26, 2025):
The use of datetime.utcnow() is deprecated as of Python 3.12 in favor of datetime.now(timezone.utc). While this still works, consider updating to the recommended approach for future compatibility.
```python
posted_ago = parse_posted_ago(card)
date_posted: Optional[date] = None
if posted_ago is not None:
    dt_posted = datetime.utcnow() - posted_ago
```
Copilot AI (Dec 26, 2025):
The use of datetime.utcnow() is deprecated as of Python 3.12 in favor of datetime.now(timezone.utc). While this still works, consider updating to the recommended approach for future compatibility.
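The recommended replacement for both `datetime.utcnow()` call sites looks like this:

```python
from datetime import datetime, timedelta, timezone

# datetime.now(timezone.utc) returns a timezone-aware datetime, unlike
# the deprecated datetime.utcnow(), which returns a naive one.
hours_limit = 24
posted_cutoff = datetime.now(timezone.utc) - timedelta(hours=hours_limit)
```

Note that this makes `posted_cutoff` timezone-aware, so any datetimes compared against it (e.g. the computed posting times) must also be aware, otherwise the comparison raises `TypeError`.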
Summary
I added support for scraping Internshala alongside the existing job boards.
The scraper collects both internships and full-time jobs from Internshala and returns them in the same JobSpy dataframe.
What changed
- New scraper that fetches `https://internshala.com/internships/keywords-<query>/` and `https://internshala.com/jobs/keywords-<query>/`.
- Posts carry `job_type` and `listing_type` so internships and jobs can be told apart.
- Added an `internshala_search_term` argument to `scrape_jobs`, falling back to `search_term`.
- Added `internshala` to the `site_name` options.
- Documented the `internshala_search_term` parameter and the updated `site_name` list.
- Added `listing_type` in the JobPost schema.

How it works
- `scrape_jobs` builds a `ScraperInput` as before and passes it to the Internshala scraper.
- Each results page is parsed for `individual_internship` containers.
- The scraper reads the `employment_type` attribute on the card to decide whether it is a job or an internship.
- It honors the same `hours_old` value as the other sites to filter older posts.
- Results come back as `JobPost` objects that are merged into the main dataframe.

Notes and limits
- `country_indeed` is not used for this site.
- The `location` passed to `scrape_jobs` is not applied on Internshala. The location comes from the text in each card.
- Stipend/salary text is parsed into the `Compensation` model.

Testing
- Ran with `site_name=["internshala"]` and with Internshala included together with other sites.