@vitesh-reddy

Summary

I added support for scraping Internshala along with the existing job boards.
The scraper can collect both internships and full-time jobs from Internshala and returns them in the same JobSpy DataFrame.

What changed

  • Added Internshala as a new site in the Site enum and in the scraper mapping.
  • Implemented a new Internshala scraper under internshala that:
    • Uses the keyword URLs:
      • https://internshala.com/internships/keywords-<query>/
      • https://internshala.com/jobs/keywords-<query>/
    • Scrapes both internships and jobs from the listing pages.
    • Parses title, company name, location, posted date, and stipend or salary.
    • Sets country to India for all Internshala results.
    • Fills job_type and listing_type so internships and jobs can be told apart.
  • Added an optional internshala_search_term argument to scrape_jobs.
    • This is only used by Internshala.
    • If it is not set, Internshala falls back to search_term.
  • Updated README:
    • Included Internshala in the list of supported sites and in the site_name options.
    • Documented the internshala_search_term parameter.
    • Updated the example usage to show Internshala in the site_name list.
    • Added a short Internshala note in the “Supported Countries” section.
    • Mentioned listing_type in the JobPost schema.

How it works

  • scrape_jobs builds a ScraperInput as before and passes it to the Internshala scraper.
  • The Internshala scraper:
    • Builds internship and job URLs from the given query.
    • Paginates through the result pages.
    • Finds each card inside individual_internship containers.
    • Uses the employment_type attribute on the card to decide whether it is a job or an internship.
    • Uses the same hours_old value as the other sites to filter older posts.
    • Returns a list of JobPost objects that are merged into the main dataframe.

Notes and limits

  • Internshala is treated as India only. The country flag from country_indeed is not used for this site.
  • The location filter from scrape_jobs is not applied on Internshala. The location comes from the text in each card.
  • Stipend and salary are parsed as monthly INR in the shared Compensation model.

Testing

  • Ran index.py with site_name=["internshala"] and with Internshala included together with other sites.
  • Confirmed that:
    • Results are returned without errors.
    • Fields like title, company, location, date_posted, job_type, listing_type, and compensation are filled as expected.

Copilot AI review requested due to automatic review settings December 26, 2025 07:32

Copilot AI left a comment

Pull request overview

This PR adds support for scraping job postings and internships from Internshala, an India-focused job board platform. The implementation follows the existing scraper pattern established in the codebase and includes comprehensive documentation updates.

Key changes:

  • Added Internshala as a new site option alongside existing job boards (LinkedIn, Indeed, etc.)
  • Implemented a dedicated scraper that handles both internship and job listings from Internshala.com
  • Added internshala_search_term parameter to allow Internshala-specific search queries

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.

  • jobspy/model.py: Added INTERNSHALA to Site enum and internshala_search_term field to ScraperInput model
  • jobspy/internshala/util.py: Utility functions for parsing job cards, locations, posting dates, and compensation (stipend/salary) in INR
  • jobspy/internshala/constant.py: HTTP headers configuration for Internshala requests
  • jobspy/internshala/__init__.py: Main scraper implementation with pagination, dual listing type support (internships and jobs), and optional description fetching
  • jobspy/exception.py: Added InternshalaException for Internshala-specific error handling
  • jobspy/__init__.py: Integrated Internshala scraper into the main scraping interface and parameter handling
  • README.md: Updated documentation with Internshala support, usage examples, parameter descriptions, and India-specific notes

if rupee_idx != -1:
    prefix = text[:rupee_idx]

    matches = re.findall(r"\b([A-Z][a-zA-Z]+(?:,\s*[A-Z][a-zA-Z]+)*)\b", prefix)

Copilot AI Dec 26, 2025

The regex pattern r"\b([A-Z][a-zA-Z]+(?:,\s*[A-Z][a-zA-Z]+)*)\b" only captures single capitalized words and comma-separated runs of them. A multi-word location such as "New Delhi" is split into two separate matches, and names containing periods, hyphens, or apostrophes are matched only partially. Consider a more flexible pattern or additional parsing logic to handle these location name formats.

Suggested change
    matches = re.findall(r"\b([A-Z][a-zA-Z]+(?:,\s*[A-Z][a-zA-Z]+)*)\b", prefix)
    matches = re.findall(
        r"(?i)\b([a-z][a-z.&'-]*(?:\s+[a-z][a-z.&'-]*)*(?:,\s*[a-z][a-z.&'-]*(?:\s+[a-z][a-z.&'-]*)*)*)\b",
        prefix,
    )
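A quick check (the `prefix` value here is invented for illustration) shows how the original pattern splits a multi-word city while the case-insensitive replacement keeps it intact:

```python
import re

prefix = "New Delhi"  # hypothetical card text before the rupee symbol

original = r"\b([A-Z][a-zA-Z]+(?:,\s*[A-Z][a-zA-Z]+)*)\b"
suggested = r"(?i)\b([a-z][a-z.&'-]*(?:\s+[a-z][a-z.&'-]*)*(?:,\s*[a-z][a-z.&'-]*(?:\s+[a-z][a-z.&'-]*)*)*)\b"

print(re.findall(original, prefix))   # ['New', 'Delhi'] - split into two matches
print(re.findall(suggested, prefix))  # ['New Delhi'] - kept as one location
```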

about_header = soup.find(lambda tag: tag.name in ["h2", "h3"] and "about the internship" in tag.get_text(strip=True).lower())
if about_header:
    desc_parts: list[str] = []
    for sib in about_header.find_all_next():

Copilot AI Dec 26, 2025

Using find_all_next() without any arguments will traverse all following elements in the entire document, which can be inefficient for large HTML pages. Consider using find_next_siblings() or limiting the search with a limit parameter to improve performance.

Suggested change
for sib in about_header.find_all_next():
for sib in about_header.find_next_siblings():

listing_type = "internship"

job_post = JobPost(
    id=f"internshala-{hash(job_url)}",

Copilot AI Dec 26, 2025

Using hash(job_url) for ID generation can produce negative values and is not guaranteed to be stable across Python runs (hash randomization). Other scrapers in the codebase use explicit job IDs from the site (e.g., "li-{job_id}", "nk-{job_id}"). Consider extracting a stable ID from the job URL path (e.g., the job detail ID) or using a URL-safe hash function like hashlib.md5 or just using the URL path segment as the identifier.
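One deterministic alternative in the spirit of this comment (the URL shape and the helper name are illustrative, not the codebase's actual API):

```python
import hashlib
from urllib.parse import urlparse

def stable_internshala_id(job_url: str) -> str:
    # Prefer the trailing path segment of the detail URL when present;
    # otherwise fall back to a stable md5 digest (never builtin hash(),
    # which is randomized across Python runs).
    path = urlparse(job_url).path.rstrip("/")
    tail = path.rsplit("/", 1)[-1]
    if tail:
        return f"internshala-{tail}"
    digest = hashlib.md5(job_url.encode("utf-8")).hexdigest()[:12]
    return f"internshala-{digest}"
```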

-    site_name=["indeed", "linkedin", "zip_recruiter", "google"], # "glassdoor", "bayt", "naukri", "bdjobs"
+    site_name=["indeed", "linkedin", "zip_recruiter", "google", "internshala"], # "glassdoor", "bayt", "naukri", "bdjobs"
     search_term="software engineer",
+    internshala_search_term="software engineer",

Copilot AI Dec 26, 2025

The example shows setting internshala_search_term="software engineer" for a location-based search in San Francisco. However, as noted in the PR description, Internshala is India-focused and ignores the location parameter. This example may confuse users as it suggests Internshala will respect the San Francisco location. Consider clarifying that Internshala will search in India regardless of the location parameter, or provide a separate example for Internshala usage.

Suggested change
internshala_search_term="software engineer",
internshala_search_term="software engineer", # Internshala is India-focused and ignores `location`; results will be for India

query = (scraper_input.internshala_search_term or scraper_input.search_term or "").strip()
paths: list[tuple[str, str]]
if query:
    encoded_query = quote(query.lower(), safe="")

Copilot AI Dec 26, 2025

The URL encoding uses quote(query.lower(), safe="") which will percent-encode spaces and special characters. However, Internshala URLs typically use hyphens to separate words (e.g., "/keywords-software-engineer/"). The current implementation would create URLs like "/keywords-software%20engineer/" which may not match Internshala's expected format. Consider replacing spaces with hyphens before encoding, or verify that percent-encoded URLs work correctly with Internshala's search.

Suggested change
    encoded_query = quote(query.lower(), safe="")
    slug = "-".join(query.lower().split())
    encoded_query = quote(slug, safe="-")
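For comparison, here is what the two encodings produce for a hypothetical query (illustrative only):

```python
from urllib.parse import quote

query = "software engineer"  # hypothetical search term

# Current approach: spaces become %20, which may not match Internshala's URL scheme
print(quote(query.lower(), safe=""))                      # software%20engineer
# Suggested approach: hyphen slug, matching the "/keywords-<query>/" pattern
print(quote("-".join(query.lower().split()), safe="-"))   # software-engineer
```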

def __init__(
    self, proxies: list[str] | str | None = None, ca_cert: str | None = None, user_agent: str | None = None
):
    super().__init__(Site.INTERNSHALA, proxies=proxies, ca_cert=ca_cert, user_agent=user_agent)

Copilot AI Dec 26, 2025

The user_agent parameter is not passed to the parent class Scraper.__init__(). For consistency with other scrapers like Indeed and Naukri, consider passing user_agent=user_agent in the super().__init__() call.


stipend_comp = parse_stipend(card)

description: str | None = None

Copilot AI Dec 26, 2025

The parameter name linkedin_fetch_description is misleading when used for Internshala scraping. While reusing this existing parameter may be intentional to avoid adding another parameter, the name suggests it's LinkedIn-specific. Consider documenting this behavior in the PR or README, or consider using a more generic parameter name in future refactoring.

Suggested change
description: str | None = None
description: str | None = None
# NOTE: `linkedin_fetch_description` is used as a generic "fetch full description"
# flag across sites (including Internshala), despite its LinkedIn-specific name.


soup = BeautifulSoup(resp.text, "html.parser")

about_header = soup.find(lambda tag: tag.name in ["h2", "h3"] and "about the internship" in tag.get_text(strip=True).lower())

Copilot AI Dec 26, 2025

The description fetching logic on line 279 searches for headers containing "about the internship", but this won't work for job postings which may have different header text like "about the job" or "job description". Consider using a more generic pattern or checking for multiple header variations to properly handle both internships and jobs.

Suggested change
about_header = soup.find(lambda tag: tag.name in ["h2", "h3"] and "about the internship" in tag.get_text(strip=True).lower())
patterns = ["about the internship", "about the job", "job description", "about the opportunity"]
about_header = soup.find(
    lambda tag: tag.name in ("h2", "h3")
    and any(pattern in tag.get_text(strip=True).lower() for pattern in patterns)
)

hours_limit = scraper_input.hours_old
posted_cutoff: Optional[datetime] = None
if hours_limit is not None:
    posted_cutoff = datetime.utcnow() - timedelta(hours=hours_limit)

Copilot AI Dec 26, 2025

The use of datetime.utcnow() is deprecated as of Python 3.12 in favor of datetime.now(timezone.utc). While this still works, consider updating to the recommended approach for future compatibility.
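A drop-in, timezone-aware version of the cutoff computation (variable names follow the snippet above; the example value is arbitrary):

```python
from datetime import datetime, timedelta, timezone

hours_limit = 24  # example value; in the scraper this comes from scraper_input.hours_old

# datetime.now(timezone.utc) replaces the deprecated datetime.utcnow()
# and yields an aware datetime, avoiding naive/aware comparison bugs.
posted_cutoff = datetime.now(timezone.utc) - timedelta(hours=hours_limit)
```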

posted_ago = parse_posted_ago(card)
date_posted: Optional[date] = None
if posted_ago is not None:
    dt_posted = datetime.utcnow() - posted_ago

Copilot AI Dec 26, 2025

The use of datetime.utcnow() is deprecated as of Python 3.12 in favor of datetime.now(timezone.utc). While this still works, consider updating to the recommended approach for future compatibility.
