Curation of records, data, processes from original CARS #20

Open
BecCowley opened this issue Nov 14, 2023 · 40 comments
@BecCowley
Contributor

BecCowley commented Nov 14, 2023

Collect the notes and data (and perhaps some of the code) used by Jeff to create the CARS2009 product.
It would be good to understand how Jeff did:

  • the data curation and collation,
  • QC,
  • duplicate checking,
  • declustering of data where many replicates were taken

We also want to rescue the data he already collected and reuse it in the new product.
Where to put this information? Locally it is available in the datalib location, in Jeff's folders.
Maybe we need to replicate it somewhere useful for the new CARS, or just make sure we can identify where the important parts are.

Also, thinking about the final format for the new product, it should match the original so users can easily slot the new product into their existing applications.

@BecCowley BecCowley self-assigned this Nov 14, 2023
@BecCowley
Contributor Author

See here for data locations and code information from Jeff:
https://github.com/CARSv2/cars-v2/wiki/CARS2009-helpful-information

@Thomas-Moore-Creative
Contributor

Good chat today @BecCowley & @ChrisC28

Here is the whiteboard image: [image]

@ChrisC28
Contributor

ChrisC28 commented Dec 4, 2023

I've made a first pass at producing the "CODA" form of the output from the WOD.

The data is contained in yearly directories. Each directory includes daily files with the naming conventions:

CODA_WOD_<platform_type>.nc

platform_type = ctd, pfl, xbt, etc....
variable = temperature, pressure, salinity, oxygen,....

So for each platform type and variable, there are 365 or 366 files.

The files themselves are two-dimensional (cast, depth_index). Each variable includes the data, the depth levels (WOD data is on depth and NOT pressure, it seems) and the WOD flags.

There are a few quirks that I'm working through that could make the data a little easier to read and deal with. For example, the length of the depth dimension varies from file to file, which isn't optimal for reading the data. Exactly what data/metadata to carry through is also something we should all discuss.
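
One way to deal with the varying depth dimension is to pad every cast to a common number of levels. The following is a minimal sketch only, assuming NumPy is available; the function name `pad_profiles` and the sample values are hypothetical and not part of the CODA code.

```python
import numpy as np


def pad_profiles(profiles, n_levels=None, fill=np.nan):
    """Stack variable-length 1-D profiles into a (cast, depth_index) array.

    Casts shorter than n_levels are padded with `fill`; casts longer than
    n_levels are truncated. If n_levels is None, the longest cast sets it.
    """
    if n_levels is None:
        n_levels = max(len(p) for p in profiles)
    out = np.full((len(profiles), n_levels), fill, dtype=float)
    for i, p in enumerate(profiles):
        n = min(len(p), n_levels)
        out[i, :n] = np.asarray(p[:n], dtype=float)
    return out


# Made-up temperature casts with 3, 2 and 4 levels.
casts = [[20.1, 19.8, 15.2], [18.4, 17.9], [21.0, 20.5, 19.9, 12.3]]
temperature = pad_profiles(casts, n_levels=5)
```

Using the same `n_levels` for every file would give all files an identical depth dimension, at the cost of storing more fill values.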

@ChrisC28
Contributor

ChrisC28 commented Dec 4, 2023

The first pass of the data is here: /oa-decadal-climate/work/observations/CARSv2_ancillary/CODA/WOD

@BecCowley
Contributor Author

BecCowley commented Dec 5, 2023

@ChrisC28, looking at the pressure files - is the Pressure_depth variable meant to be the pressure converted to depth? Looks erroneous (looking at one of the CTD files).
Will discuss with you!

@ChrisC28
Contributor

ChrisC28 commented Dec 5, 2023

I've pushed the example notebook to the main branch

@ChrisC28
Contributor

ChrisC28 commented Dec 5, 2023

@BecCowley The "Pressure_depth" variable is simply the pressure as read in the WOD data on the depth levels. I treat pressure as any other variable. I did notice some strangeness myself. Could you let me know which profile you looked at?

@ChrisC28
Contributor

Let's try this again.

Using the wodpync module, I've created some test CODA files. Not all the metadata is there, as I had some boring problems processing strings that I still haven't worked out. Additionally, it seems that not all metadata is carried through in WOD (for example, Salinity doesn't have units).

You can find the test dataset on tube: /oa-decadal-climate/work/observations/CARSv2_ancillary/CODA/CODA_test

It's currently CTD only, although I've run some tests on XBT and profiling floats without issues.
Let me know if you get a chance to have a play.

@BecCowley
Contributor Author

@ChrisC28 I had a quick look at the files. Certainly there needs to be some transfer of variable attributes, fixes to fill values etc.
Salinity shouldn't have units.

I did some very basic plotting using the WOD flags and temperature and salinity. There are some strange out of range numbers in the WOD_flag variable (-127) for the one file I looked at.

The data itself looks reasonable.
I would like to tidy up the files with the correct fill values etc so that they load without issues.
Also, I wonder if mixing the names 'depth' and 'z' could be reconciled?

Happy to work on tidying up when back next year!

@BecCowley
Contributor Author

@ChrisC28
I am reviewing the requirements for the duplicate checking code.
Here is a list of metadata that needs to be included in the CODA files if it is available:

accession_number
dataset_id
lat
lon
year
month
day
probe_type
recorder
hour
minute
country_id
GMT_time
dbase_orig
Project_name
platform
vehicle
Institute
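
A quick way to check a CODA file against this list is to compare its variable names with the required names. This is a minimal sketch; the helper `missing_metadata` is hypothetical, and the names are copied from the list above exactly as written (the real files may use different spelling or case).

```python
REQUIRED_METADATA = [
    "accession_number", "dataset_id", "lat", "lon", "year", "month", "day",
    "probe_type", "recorder", "hour", "minute", "country_id", "GMT_time",
    "dbase_orig", "Project_name", "platform", "vehicle", "Institute",
]


def missing_metadata(variable_names):
    """Return the required metadata names that are absent from a file."""
    present = set(variable_names)
    return [name for name in REQUIRED_METADATA if name not in present]


# Example with a made-up set of variable names from one file.
example_names = ["lat", "lon", "year", "month", "day", "Temperature"]
missing = missing_metadata(example_names)
```

Since the metadata only needs to be included "if it is available", an empty or short result is a prompt to look at the source file rather than a hard failure.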

@ChrisC28
Contributor

NOTE: wodpy uses masked arrays, which are extremely slow.

I've found places where I think masked arrays can be replaced by regular arrays. I'm running some tests now. This should hopefully speed things up.

@ChrisC28
Contributor

I've now modified WODpy to make use of standard numpy arrays rather than masked arrays. I'm still checking things to make sure that what I've done is sensible, but it speeds things up by an order of magnitude.
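
For illustration, the general idea of swapping a masked array for a regular array can be sketched as below. This is not the code from the modified wodpy; the helper `to_plain_array` is hypothetical, and using NaN as the missing-value marker is an assumption that only works for floating-point data.

```python
import numpy as np


def to_plain_array(values, fill=np.nan):
    """Return a regular float ndarray, with masked entries replaced by `fill`."""
    if np.ma.isMaskedArray(values):
        return np.ma.filled(values.astype(float), fill)
    return np.asarray(values, dtype=float)


# Made-up profile with one masked (missing) level.
masked = np.ma.masked_array([10.0, 11.5, 12.0], mask=[False, True, False])
plain = to_plain_array(masked)
```

Any test that previously relied on the mask (for example "is this value missing?") then has to be rewritten in terms of `np.isnan`, which is easy to overlook.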

@ChrisC28
Contributor

Hi @BecCowley
I've put a couple of CODA test outputs here: /oa-decadal-climate/work/observations/CARSv2_ancillary/CODA/CODA_test/2010

The files have the metadata above that should hopefully help the duplicate checker work.

The -127 WODFlag values still appear and I've traced these back to the original data files.

Could you run your sceptical eye over these files and let me know if they are fit for purpose? I can now regenerate the files quickly, so fixes should be pretty easy to implement.

@BecCowley
Contributor Author

BecCowley commented Feb 20, 2024

@ChrisC28 Some comments:

  • The temperature, salinity, oxygen etc have a 'grid_mapping' attribute that references a crs variable. Need to carry this from the WOD files:
    int crs ;
    crs:grid_mapping_name = "latitude_longitude" ;
    crs:epsg_code = "EPSG:4326" ;
    crs:longitude_of_prime_meridian = 0.f ;
    crs:semi_major_axis = 6378137.f ;
    crs:inverse_flattening = 298.2572f ;

  • Need to carry the origflagset(casts, strnlen) variables from WOD for *_origflag variables
    char origflagset(casts, strnlen) ;
    origflagset:comment = "set of originators flag codes to use" ;

  • Include a long_name attribute for Access_no variable. Maybe 'WOD_accession_number'

  • Can you include the 'needs_z_fix' variable for XBTs please?

  • Can you include the 'Ocean_Vehicle' variable if present (probably only in the APB data, you might already have it)?

  • The ctd files contain the *_Instrument variables for every parameter (Temperature, Salinity, Oxygen). Perhaps only need to include the one for that file (eg, Temperature_Instrument in the ctd_Temperature file).

  • We can include some more global attributes to describe the dataset, project, references etc. Need to create a list of these.

I will do some testing on the data itself and let you know what I find.

@BecCowley
Contributor Author

BecCowley commented Feb 21, 2024

@ChrisC28 some comments/queries on the data:

  • there are some missing time data. Is this representative of what is in the original file, or an issue with the conversion? Eg, the WOD_CODA_2010_xbt_Temperature_test_.nc file is missing date/time info in 44 profiles, including locations:
    688, 715, 895, 1046, 1517 ( WOD unique ids: 13047310, 13047318, 13047370, 12321440)

  • time in the pfl files is int64. Should be double as per the other files

@ChrisC28
Contributor

@BecCowley

I'm moving the WOD_2018 over to:
/oa-decadal-climate/work/observations/CARSv2_ancillary/WOD2018
I've got ctd data from 1970 onwards, and argo, xbt, glider, etc... from 2005 onwards

@ChrisC28
Contributor

I've pushed my changes to wodpy to my fork here: https://github.com/ChrisC28/wodpy

Being lazy, the original code is left in the file but commented out

@ChrisC28
Contributor

Fixed issue:

  • included 'crs' variable and full attributes.

Working through the remainder of @BecCowley's list.

@ChrisC28
Contributor

@BecCowley

there are some missing time data. Is this representative of what is in the original file, or an issue with the conversion? Eg, the WOD_CODA_2010_xbt_Temperature_test_.nc file is missing date/time info in 44 profiles, including locations:688, 715, 895, 1046, 1517 ( WOD unique ids: 13047310, 13047318, 13047370, 12321440)

Found the issue - wodpync creates a datetime output based on the "date" variable and the "GMT_time" variable. The latter is occasionally missing. Wodpy tests for this, but when changing from masked to regular NumPy arrays, the test failed. I've reverse engineered the test with a bit of a hack, but it seems to catch those cases: when GMT time is missing, it takes the time as midnight (as in the original wodpy).
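
The fallback described above can be sketched as follows. This is an illustration only, not the actual hack in the fork: the function `cast_datetime` is hypothetical, and it assumes that GMT_time is stored as decimal hours and that a missing value shows up as `None` or NaN once masked arrays are no longer used.

```python
import math
from datetime import datetime, timedelta


def cast_datetime(year, month, day, gmt_time=None):
    """Combine a cast date with a decimal-hour GMT time.

    A missing time (None or NaN) is treated as midnight.
    """
    missing = gmt_time is None or (
        isinstance(gmt_time, float) and math.isnan(gmt_time)
    )
    hours = 0.0 if missing else float(gmt_time)
    return datetime(year, month, day) + timedelta(hours=hours)


with_time = cast_datetime(2010, 3, 5, 13.5)
without_time = cast_datetime(2010, 3, 5, float("nan"))
```

Because midnight is also a legitimate cast time, it may be worth recording somewhere which casts had their time filled in this way.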

@ChrisC28
Contributor

Hi @BecCowley

A new batch of CODA files to check. I've placed them here:
/oa-decadal-climate/work/observations/CARSv2_ancillary/CODA/CODA_test/2010
The files are:

  • WOD_CODA_2010_pfl_test_all_vars.nc
  • WOD_CODA_2010_xbt_test_all_vars.nc
  • WOD_CODA_2010_ctd_test_all_vars.nc
    They include the new CODA_identifier, that has the format: <obs_platform>

where the first field is a 3-character code identifying the original data source (WOD, MNF, etc.), obs_platform is the 3-character code identifying the observation type (CTD, pfl, ...), YYYYMMDD is the date of the observation, and the final field is a 4-character string that counts the profiles on that date (first observation, second observation, etc.).
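
As a rough sketch of how such identifiers could be generated: the function `assign_coda_ids` below is hypothetical, and the underscore separator and zero-padded counter are assumptions, since the exact layout is not spelled out here.

```python
from collections import defaultdict
from datetime import date


def assign_coda_ids(source, platform, cast_dates, sep="_"):
    """Build one identifier per cast from source, platform, date and a counter.

    The counter restarts at 1 for each date and is zero-padded to 4 characters
    (assumed formatting).
    """
    counts = defaultdict(int)
    ids = []
    for d in cast_dates:
        counts[d] += 1
        ids.append(
            sep.join([source, platform, d.strftime("%Y%m%d"), f"{counts[d]:04d}"])
        )
    return ids


# Made-up example: two casts on one day, one on the next.
ids = assign_coda_ids(
    "WOD", "ctd", [date(2010, 1, 1), date(2010, 1, 1), date(2010, 1, 2)]
)
```

A 4-character counter allows at most 9999 profiles per source, platform and date, which is worth checking against the densest days in the data.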

@BecCowley
Contributor Author

@ChrisC28 the WOD_unique id contains latitude, needs correcting as discussed.

@ChrisC28
Contributor

Fixed... regenerating the WOD derived CODA files

@ChrisC28
Contributor

Added the first batch of MNF -> CODA files:

  • /oa-decadal-climate/work/observations/CARSv2_ancillary/CODA/CODA_test/2005/MNF_CODA_2005_ctd_test_all_vars.nc
  • /oa-decadal-climate/work/observations/CARSv2_ancillary/CODA/CODA_test/2006/MNF_CODA_2006_ctd_test_all_vars.nc
  • /oa-decadal-climate/work/observations/CARSv2_ancillary/CODA/CODA_test/2007/MNF_CODA_2007_ctd_test_all_vars.nc
  • /oa-decadal-climate/work/observations/CARSv2_ancillary/CODA/CODA_test/2008/MNF_CODA_2008_ctd_test_all_vars.nc

Note that, as discussed, I haven't included a lot of the meta-data (things like COUNTRY, etc....). Probably worth discussing what we need for the QC/duplicate checking and making sure that we include what's required.

Next step: repeat with the AIMS data.

@BecCowley
Contributor Author

BecCowley commented Mar 19, 2024

@ChrisC28
I think we need to add these to all the files at the time of conversion to CODA format:

  • country
  • dbase_orig
  • Project
  • Platform
  • Institute
  • Temperature_Instrument

We will absolutely need the following information for XBT files:

  • Recorder

For profiling floats and glider files (when we get there):

  • Ocean_vehicle

The WOD code tables are available here and we should use these values if possible https://www.ncei.noaa.gov/access/world-ocean-database/wod-codes.html

Can you finagle that? Happy to help.

Also still need the CODA WOD files updated to fix the wod_unique_id issue.
And, we need to make sure the longitudes of any datasets are in -180:180 degree format, not 360 degrees.
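
For the longitude convention, a minimal sketch of the conversion is below. The function name `wrap_longitude` is hypothetical; the same expression also works element-wise on NumPy arrays.

```python
def wrap_longitude(lon):
    """Map a longitude in degrees onto the half-open interval [-180, 180)."""
    return (lon + 180.0) % 360.0 - 180.0


east = wrap_longitude(190.0)      # a 0-360 style value west of the date line
unchanged = wrap_longitude(-170.0)  # already in the -180:180 convention
```

Note that with this formula a longitude of exactly 180 maps to -180, so the two ends of the range are not both represented.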

@ChrisC28
Contributor

@BecCowley

I've fixed a few bugs and placed the newly created files in a new directory:
/oa-decadal-climate/work/observations/CARSv2_ancillary/CODA/CODA_test_v2/
I've had a quick look and most of the supporting variables seem to be included. I haven't yet included the WOD Codes, and I think we need a brief discussion as to how to carry these over to the MNF/CSIRO and AIMS (+ other) data.

@ChrisC28
Contributor

@BecCowley

Another one for your to-do list:
I've put the first pass test of AIMS->CODA files. They are together with the files I've produced from WOD and MNF in the directory:
/oa-decadal-climate/work/observations/CARSv2_ancillary/CODA/CODA_test_v2/
divided up by year.

Couple of things to note:

  • The available metadata was even more limited than with the data from the MNF. I've guessed here and there, but it's a bit of a fool's errand;
  • I haven't carried through the QAQC values or flags from the original dataset, as they are all (at least from my inspection) missing. Didn't really think it was worth carrying forward a bunch of NaNs.
  • Some of the Oxygen data has different units (% saturation or umol/kg). I'm not sure how to convert?

Please have a quick squizz when you get a moment.
Could we catch up briefly about what exactly is required from the IQuOD duplicate checker/QC-er.... ?

@BecCowley
Contributor Author

@BecCowley

I've fixed a few bugs and placed the newly created files in a new directory: /oa-decadal-climate/work/observations/CARSv2_ancillary/CODA/CODA_test_v2/ I've had a quick look and most of the supporting variables seem to be included. I haven't yet included the WOD Codes, and I think we need a brief discussion as to how to carry these over to the MNF/CSIRO and AIMS (+ other) data.

@ChrisC28 the attributes in the variables for the AIMS files have an issue (there is a long string in there - ncdump it to see).
Will come see you to discuss how to put in the variables.

@ChrisC28
Contributor

Hi @BecCowley
WOD, MNF, AIMS and RAN data have now been CODA-fied!

Path for the new test dataset is: /oa-decadal-climate/work/observations/CARSv2_ancillary/CODA/WOD_CODA_test_v2/

I'll push the converters next week. It would be good to try to refactor the code into a common set of functions (the MNF and RAN code is very similar).

Note: I found that nutrient etc.... profiles are actually in the ocean station files and not, as I suspected, in the ctd files in WOD (there are also a few in the profiling float data). I had been ignoring ocean stations, but turns out that they are important.

@BecCowley
Contributor Author

@ChrisC28, @Thomas-Moore-Creative, I see the WOD files are only there from 2000 to 2017. I think Thomas was going to download the latest WOD from 2000 to now - has this been done yet and can we then complete the conversion?

@Thomas-Moore-Creative
Contributor

@Thomas-Moore-Creative - has this been done yet and can we then complete the conversion?

It has not, apologies. I'll start this now.

@Thomas-Moore-Creative
Contributor

I'll do this over in #19

@ChrisC28
Contributor

Hi all,

In my haste to get this out on a Friday evening, I neglected to mention that I've downloaded WOD2018 from the OpenDAP server.

It turned out to be very easy (took me less than 30 minutes):
/oa-decadal-climate/work/observations/WOD2018

@Thomas-Moore-Creative
Contributor

Thomas-Moore-Creative commented May 15, 2024

@BecCowley - given the above diligence from @ChrisC28 I assume that is as much as we can grab for now from WOD?
[screenshot: CleanShot 2024-05-15 at 11 54 47@2x]

.... I note it goes up to 2022
drwxrwsr-x 2 cha674 1109763 4.0K May 11 07:18 2022

@ChrisC28
Contributor

Have a look at the notebook here:
https://github.com/CARSv2/cars-v2/blob/main/notebooks/Download_WOD_from_OpenDap.ipynb

I've only downloaded a subset of WOD2018. However, it might be worth having nearly the whole thing? Not sure how useful data from Captain Cook might be, but you never know....

@BecCowley
Contributor Author

Yes, looks like it's downloaded. Thanks for doing this @ChrisC28
However, not all is translated to CODA, we can do that now the tools are there!

@ChrisC28
Contributor

@BecCowley I'm running the script now! Converting a bunch of other variables (nutrients, CO2, etc...).

@BecCowley
Contributor Author

@ChrisC28 here is a list of format issues in the CODAv1 files:

  1. There are negative depths in the MNF and RAN files. Maybe in others? WOD is ok.
  2. Can we remove the '2018' in the WOD filenames and the CODA id? I'd like the following naming format: NNN_CODA_yyyy_ttt.nc where NNN = dataset, yyyy = year, ttt = lowercase datatype.
  3. In the MNF* files, the CODA id has lower-case 'mnf' while in the WOD files, it is upper case. Can we be consistent with case in the CODA id format?
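
The requested filename format could be produced by a small helper like the one below. This is a sketch only: `coda_filename` is a hypothetical name, and upper-casing the dataset code is an assumption (the request only says the datatype should be lowercase and that case should be consistent).

```python
def coda_filename(dataset, year, datatype):
    """Build a filename of the form NNN_CODA_yyyy_ttt.nc.

    The dataset code is upper-cased (assumed convention) and the
    datatype is lower-cased, regardless of how they are passed in.
    """
    return f"{dataset.upper()}_CODA_{int(year):04d}_{datatype.lower()}.nc"


name = coda_filename("mnf", 2005, "CTD")
```

Routing every converter through one function like this would keep the case of the dataset code from drifting between the WOD and MNF outputs.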

@BecCowley
Contributor Author

@ChrisC28 another issue to fix - the WOD and originator flag values are float type in the WOD CODA files but double in the MNF versions. I think they should be byte types.

Also, the originator flags in the WOD files are dependent on the origflagset variable, which isn't carried through to the CODA files. I would suggest converting the *origflag values to a consistent flag set and adding flag_values and flag_meanings attributes to the *origflag variables. Then the data type can be made byte.
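
The change of data type could look something like the sketch below, assuming NumPy. The helper `flags_to_byte` and the fill value of -1 are hypothetical; the real fill value should follow whatever the CODA format settles on.

```python
import numpy as np


def flags_to_byte(flags, fill_value=-1):
    """Convert float or double flag values to int8, using fill_value for NaN."""
    flags = np.asarray(flags, dtype=float)
    out = np.full(flags.shape, fill_value, dtype=np.int8)
    valid = np.isfinite(flags)
    out[valid] = np.rint(flags[valid]).astype(np.int8)
    return out


# Made-up flags as they might be read from a float variable.
byte_flags = flags_to_byte([0.0, 3.0, float("nan")])
```

Under the CF conventions, a flag_values attribute must have the same data type as the variable it describes, so it would need to be written as byte as well.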

@ChrisC28
Contributor

@BecCowley: I've updated the MNF and RAN CODA files to fix the negative z issue. This came from using the TEOS10 package to convert from pressure to depth - TEOS10 defines z as negative below the surface.
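
The sign fix amounts to the following. This is a sketch, not the converter code: `height_to_depth` is a hypothetical helper, and the commented call assumes the conversion was done with `gsw.z_from_p` from the GSW-Python package.

```python
def height_to_depth(z):
    """Convert TEOS-10 height z (negative below the surface) to positive-down depth."""
    return [-value for value in z]


# z = gsw.z_from_p(pressure, latitude)  # assumed source of the negative values
z = [-5.0, -10.0, -250.5]  # made-up heights in metres
depth = height_to_depth(z)
```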

Now looking into the WOD files.

@ChrisC28
Contributor

@BecCowley
I've modified the WOD files as mentioned and am currently regenerating the CODA database from 2000 to 2023. It should run overnight.
I'm also about to generate the CODA database for CTDs only from 1975 onwards for Ariaan Purich and her student Helen, who have volunteered to act as guinea-pigs.
