Add SmartNet data #21

Conversation
@javier-jimenez-shaw-pix4d GGRS87 does not appear to have a Geodetic 3D CRS. The best I could find was …
@bscholer I have to take a deeper look at it, but I am finding strange things. The webpage https://www.smartnetna.com/resources_configuration.cfm says "SmartNet North America", so I do not see why there are many CRSs from Europe (including, but not only, GGRS87, the old system in Greece). That does not make any sense to me. For instance, "SmartNet - DE.SMARTNETNA.COM - Port 9101" uses the url http://de.smartnetna.com:9101 and filters by mountpoints "jdmrtk3" and "jdmrtk4", the first with CRS "NAD83(2011)" and the next with "ETRS89/DREF91/2016", which is the CRS for Germany, Europe. Even if that were correct (which I seriously doubt), the second condition will never be reached. How can I get to that information in the webpage?
Regarding "SmartNet North America", it seems like they added global coverage but just haven't updated the logo or URL from that page. I doubt that data would be there if it truly didn't exist though. Good catch on I'm not sure where you'd get more info, besides the link above. I'm also not sure how to navigate to the page we're scraping, @hernando provided it in a comment on the original PR, so I just ran with it. |
First pass on the scraping script. I have only reviewed the data collection so far. I haven't yet spotted why many entries have stream definitions for both ETRS89 and NAD83 (or others), which doesn't make sense.
```
# They are not strictly HTTP requests, so we need an alternative. -> https://github.com/pycurl/pycurl
pycurl

# Libraries for scrape_smartnet.py
```
To avoid slowing down CI in the general case, I would put these packages in a separate requirements file.
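For illustration, the split might look like this; the file name and the exact package list are assumptions, inferred from the imports visible in this diff:

```
# requirements-scrape.txt (hypothetical file name)
# Dependencies needed only to run scrape_smartnet.py, kept out of the
# main requirements file so regular CI jobs don't install them:
#     pip install -r requirements-scrape.txt
pycurl
beautifulsoup4
tqdm
pandas
```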
"Content-Type": "application/x-www-form-urlencoded", | ||
"Origin": base_url, | ||
"Referer": base_url + "/resources_configuration.cfm", | ||
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36", |
Should we use the name of the script, or something like that, instead of impersonating a browser? I would even consider omitting "User-Agent", even if sending one is recommended, as long as the server accepts requests without it.
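A minimal sketch of the first option; the exact string is an assumption, not something from the PR:

```python
# Identify the scraper honestly instead of impersonating a browser.
SCRIPT_USER_AGENT = "scrape_smartnet.py (SmartNet endpoint scraper)"
```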
```python
{
    "Accept": "*/*",
    "Content-Type": "application/x-www-form-urlencoded",
    "Origin": "https://www.smartnetna.com",
    "Referer": "https://www.smartnetna.com/resources_configuration.cfm",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36",
    "DNT": "1",
}
```
I would move this dict to a constant declared at the top and re-use it in fetch_manufacturer_devices.
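A sketch of that refactor, assuming the header fields shown above; the function bodies are elided:

```python
BASE_URL = "https://www.smartnetna.com"

# Declared once at module level and shared by every request the script makes.
REQUEST_HEADERS = {
    "Accept": "*/*",
    "Content-Type": "application/x-www-form-urlencoded",
    "Origin": BASE_URL,
    "Referer": BASE_URL + "/resources_configuration.cfm",
}


def fetch_manufacturer_devices(manufacturer_id):
    ...  # POST with headers=REQUEST_HEADERS


def fetch_connection_info(manufacturer_id, rover_id, region):
    ...  # re-uses the same REQUEST_HEADERS, no duplicated dict
```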
```python
    return df


def fetch_connection_info(m_id, r_id, region):
```
The parameter names are too cryptic.
```diff
- def fetch_connection_info(m_id, r_id, region):
+ def fetch_connection_info(manufacturer_id, rover_id, region):
```
```python
    if resp is None:
        return []

    soup = BeautifulSoup(resp.text, "html.parser")
```
```diff
- soup = BeautifulSoup(resp.text, "html.parser")
+ html = BeautifulSoup(resp.text, "html.parser")
```
```python
all_records = []
columns = None

for manufacturer_id in tqdm(range(1, 26), desc="Fetching manufacturer data"):
```
I don't like the idea of having the magic constant 26. What about:
```diff
- for manufacturer_id in tqdm(range(1, 26), desc="Fetching manufacturer data"):
+ # Collect the manufacturer ids from the combo box.
+ html = BeautifulSoup(r.text, "html.parser")
+ # The first element is skipped because it corresponds
+ # to the placeholder text "select".
+ manufacturer_ids = [x['value'] for x in html.find("select", id="ManufacturerID").find_all("option")[1:]]
+ for manufacturer_id in tqdm(manufacturer_ids, desc="Fetching manufacturer data"):
```
```python
    for _, row in df_devices.iterrows()
    for region in range(
        1, 120
    )  # This is just the min/max of the region IDs; some may be missing, but this script will handle that.
```
I would collect the region IDs using code similar to the suggestion above for the manufacturer IDs.
However, I see that there are region IDs outside the US that the web page doesn't list, but for which the script can collect data. Can we find the list of valid IDs somewhere?
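Following the manufacturer-id suggestion above, a sketch of the same idea for regions; note that the "RegionID" element id is a guess I haven't verified against the page:

```python
from bs4 import BeautifulSoup


def collect_region_ids(page_html):
    html = BeautifulSoup(page_html, "html.parser")
    # "RegionID" is an assumed element id, mirroring "ManufacturerID" above.
    options = html.find("select", id="RegionID").find_all("option")
    # Skip the placeholder "select" entry, as with the manufacturer ids.
    return [option["value"] for option in options[1:]]
```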
```python
    executor.submit(fetch_connection_info, m, r, reg): (m, r, reg)
    for (m, r, reg) in tasks
```
```diff
- executor.submit(fetch_connection_info, m, r, reg): (m, r, reg)
- for (m, r, reg) in tasks
+ executor.submit(fetch_connection_info, brand, model, region): (brand, model, region)
+ for (brand, model, region) in tasks
```
```python
    if recs:
        results.extend(recs)
except Exception as e:
    m, r, reg = futures[future]
```
Although I can understand the code, I think it's not very readable to use a dict to recover the parameters used to create each future. Couldn't you create a list of tuples (brand, model, region), then use a list comprehension to create a list of futures, and use a zip iterator in the for at line 316? A sketch follows below.
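A minimal sketch of that shape, re-using the names from the suggestion above (tasks, results, and fetch_connection_info are assumed to come from the surrounding script). One trade-off: iterating zip(tasks, futures) processes results in submission order, giving up the as-they-finish behavior of as_completed:

```python
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    futures = [
        executor.submit(fetch_connection_info, brand, model, region)
        for (brand, model, region) in tasks
    ]
    # zip pairs each future with the parameters that created it,
    # so no lookup dict is needed.
    for (brand, model, region), future in zip(tasks, futures):
        try:
            recs = future.result()
            if recs:
                results.extend(recs)
        except Exception as e:
            print(f"Failed for {brand}/{model}/{region}: {e}")
```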
Hi @bscholer, I have been looking at the output, not the script yet, but I see many strange things. I think we should understand them before going forward. As an intro, I would like to ask the provider (Hexagon); hopefully they will tell us useful information. Do you know anybody there to contact? After looking at the json, these are some of my findings:
... now I see that DE can be Germany or Delaware (USA), and NL can be the Netherlands or Newfoundland (Canada). There is something fishy with those country/state identifiers. It could also be problematic for the German states (Bundesländer), like SL for Saarland (Germany) or Slovenia, or Slovakia, or Sierra Leone.
@javier-jimenez-shaw-pix4d All great points, and thank you for the detailed review. As you've pointed out, there are a plethora of issues with the data scraped from this site, and it's anybody's guess whether it's actively maintained. I think the best approach is likely to reach out to them and ask about this, but I don't have a contact there. I'll go ahead and reach out to their customer support about it, though. Not sure about Pix4D, but at DroneDeploy we only see a couple of customers using this network to begin with, and I've just manually added the ones we do see to our fork. Most customers use our Instant RTK via our flight app, which automatically connects to PointOne/Geodnet and handles all this complexity under the hood. If you agree that this is more trouble than it's worth, feel free to close out this PR; I don't think I'll have the time to dive very deeply into it.
Work Done
- Added `scrape_smartnet.py`, which scrapes the SmartNet webpage for all URLs and mountpoints available for all the different devices and regions that they support.
- Saved the scraped data to `data/World/smartnet.json`.
- Adjusted `black` and `flake8` so that they don't contradict each other and make committing impossible.
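For reference, the usual way to reconcile the two tools looks something like the snippet below; whether the PR used these exact settings is an assumption:

```ini
# .flake8 (or a [flake8] section in setup.cfg)
[flake8]
max-line-length = 88        # match black's default line length
extend-ignore = E203, W503  # the rules that conflict with black's output
```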