Description
Hi,
I have noticed an issue where the ScrapeService (specifically in classes like Nam and Olbs) fails to retrieve reports when the source returns a variant header, such as SPECI, TAF AMD, or TAF COR.
The Problem
The current implementation of _extract in several StationScrape subclasses relies on strict string matching based on self.report_type.
For example, when fetching a METAR, the code constructs a search tag using self.report_type.upper() (e.g., searching for >METAR <). However, if the station has issued a SPECI, the HTML response typically contains >SPECI < instead of >METAR <.
Because the code only looks for METAR, the extraction fails, and the service returns an error (or claims the station doesn't exist), even though valid data is present.
The same logic applies to TAFs. If a TAF is amended (TAF AMD) or corrected (TAF COR), the scraper looks for >TAF < and fails to match the longer headers used by some providers (like NorthAviMet).
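The mismatch can be reproduced without any network access. This is only a minimal sketch: the HTML fragment and the has_strict_tag helper are hypothetical stand-ins for the strict check described above, not the provider's real markup or the library's actual code.

```python
# Hypothetical HTML fragment; the real provider markup may differ.
SAMPLE_HTML = "<td class='top'>x</td><b>SPECI </b><b>KJFK</b> 121651Z 18010KT"


def has_strict_tag(raw: str, report_type: str) -> bool:
    """Mimic the strict check: look only for the requested report type."""
    return f">{report_type.upper()} <" in raw


# The station issued a SPECI, so a search for the METAR header finds nothing.
print(has_strict_tag(SAMPLE_HTML, "metar"))
# Searching for the variant header that is actually present succeeds.
print(has_strict_tag(SAMPLE_HTML, "speci"))
```

With this sample, the first check fails and the second succeeds, which mirrors the behaviour reported above: the data is in the response but the header search misses it.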
Affected Classes
- avwx.service.scrape.Nam (NorthAviMet)
- avwx.service.scrape.Olbs (India)
- Potentially others relying on _simple_extract with strict headers.
Root Cause Analysis
In avwx/service/scrape.py, lines like this cause the failure:
# In class Nam
starts = [f">{self.report_type.upper()} <", f">{station.upper()}<", "top'>"]
When self.report_type is "metar", it strictly looks for METAR. It does not account for SPECI.
Suggested Fix
The extraction logic should prioritize searching for specific variants before falling back to the generic type.
I recommend adding a property to ScrapeService to define these variants (Longest Match First):
@property
def search_tags(self) -> list[str]:
    rt = self.report_type.upper()
    if rt == "TAF":
        return ["TAF AMD", "TAF COR", "TAF"]
    if rt == "METAR":
        return ["SPECI", "METAR"]
    return [rt]
I also recommend updating the _extract methods to iterate through these tags and find which one is present in the raw HTML.
For example, in Nam._extract:
# Detect which tag is actually present in the raw HTML
tag = self.report_type.upper()
for candidate in self.search_tags:
    if f">{candidate} <" in raw:
        tag = candidate
        break
# Use the detected tag for extraction
starts = [f">{tag} <", f">{station.upper()}<", "top'>"]
This change ensures that SPECI and TAF AMD are correctly identified and parsed.
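For testing outside the service classes, the same detection logic could be pulled into plain functions. This is a rough sketch under stated assumptions: SEARCH_TAGS, detect_tag, and build_starts are hypothetical names that do not exist in avwx, and the input strings in the usage lines are made up for illustration.

```python
# Hypothetical lookup table mirroring the proposed search_tags property.
SEARCH_TAGS: dict[str, list[str]] = {
    "TAF": ["TAF AMD", "TAF COR", "TAF"],
    "METAR": ["SPECI", "METAR"],
}


def detect_tag(raw: str, report_type: str) -> str:
    """Return the first header variant found in raw, else the generic type."""
    rt = report_type.upper()
    for candidate in SEARCH_TAGS.get(rt, [rt]):
        if f">{candidate} <" in raw:
            return candidate
    return rt


def build_starts(raw: str, report_type: str, station: str) -> list[str]:
    """Build the start markers using whichever header variant was detected."""
    tag = detect_tag(raw, report_type)
    return [f">{tag} <", f">{station.upper()}<", "top'>"]


# Made-up inputs: an amended TAF and a page with no recognised header.
print(detect_tag("<b>TAF AMD </b>", "taf"))
print(build_starts("no header here", "metar", "kjfk"))
```

Falling back to the generic type when no variant is found keeps the existing behaviour for pages that already match, so the change should only affect responses that currently fail.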
Thanks for your work on this library!