[MRG] Austin workflow refactor part 2#134
[MRG] Austin workflow refactor part 2#134asoplata wants to merge 48 commits intojonescompneurolab:mainfrom
Conversation
|
I've marked it as WIP, but almost all the remaining changes I plan on making for this PR are documentation changes. There may be some further code changes, but they will not be very large. |
6b9f88e to
6ba91ac
Compare
|
This is now at a good stopping point and is ready for review. I did a bunch of the things we talked about in addition to the Issues mentioned previously. Much of this was adding and updating documentation, including fixing a newly found (but preexisting) bug or two |
This is the first part of a major refactoring overhaul of how we handle indexing, input-markdown and output-HTML pathing, and sidebar generation. This is still a work in progress, and most functionality has been commented out for debugging/testing. Major changes: - `update_page_index.py` has been renamed to `create_indices.py`. This new file now creates two indices: 1. "hier_index", which is the same as the old "index.json". Its content has not changed. Instead of being saved to file every time and then loaded later during sidebar creation, it creates and returns the dictionary instead, and that is passed to sidebar for usage. I'm not sure if we need to save it as a file at all, since its value can always be easily debugged live, and we were not doing any kind of checking to skip its loading before (i.e. if we have to reconstruct it every time, then there's no gain from saving a copy to disk). 2. "flat_index". This will eventually replace "ordered_links" and is greatly expanded from what "ordered_links" does. This is treated like "hier_index" as an explicit "generate_page_html" function, but will replace much functionality/data of the rest of the code. This index includes the ordering of legitimate "pages" (i.e. not "sections"), and for every page contains a dictionary containing: A. title of the page B. absolute path of the "input" markdown file C. absolute path of the "output" HTML file (including processing dev_build, and including creation of any parent directories to this not-yet-created file) D. "website-root"-relative path of the "output" HTML file (including processing dev_build) The next major step is using "flat_index" to replace ordered_links and output path/filename lookups throughout the rest of generate_page_html. Minor changes: - Location of the saved index files has been moved to inside "scripts" from the root - Many functions and some script files have been renamed
This continues the use of "flat_index", including in every case of the main loop, except for the "ordered_links" which are next. This includes many minor changes as well for formatting (including a ruff run), documentation, and other small changes.
Among other minor changes, including adding a new optional "save_indices" CLI arg that does what it sounds like, this finishes applying the usage of the new "flat_index" to the remaining code, which was mostly the "ordered" links used in the per-page footers. Everything appears to be working correctly, including that the sidebar and footer navigation uses "dev" or "content" properly. Next up is general cleaning and documenting.
This begins a multi-step transition from the CLI "--build-on-dev" argument to the combination of "--code-version" (formerly the unused "--build-type" argument) and "--custom-owner-commit". The CLI args have already been created, and this commit focuses on applying their use inside `get_commit_hash.py`. This also includes a great deal of minor changes `get_commit_hash`, including better error messages, more explicit variable names, etc. There is also incomplete support for a new "--code-version=no-check" option that skips the version checking in case one is debugging or if one wants to run 'build.py' a lot, but doesn't want to spam HTTP requests. This use-case is useful in itself, but also related to a weak hypothesis I had: that the fact that version mismatches in the previous "default" case of 'build.py' usage provided only *warnings*, instead of RuntimeErrors, like in all other mismatch cases (see https://github.com/dylansdaniels/textbook/blob/workflow_refactor/build.py#L610 )
`dev_build` has finished being replaced by a combination of `code_version` and (either `custom_owner_commit` or `commit_hash`) depending on the function. Documentation has not yet been updated, but it was incomplete to begin with.
This comments out what I think is something that may have been added by accident. Compare this line to https://github.com/dylansdaniels/textbook/blob/31dbaf09d6d854a4e8527024fa99f2b415820192/scripts/convert_notebooks.py#L923 . What was originally going on at the time of the above commit is the following: `notebook_was_run` was used as whether or not the notebook was attempted to be run at all, regardless of if it succeeded (I have changed this variable to `execution_initiated`). `notebook_executed` was used as whether or not the notebook successfully finished its execution (I have changed this variable to `execution_successful`). In the control flow, which has not changed at all, in this particular case, we are in the case where `notebook_was_run`/`execution_initiated` is False, meaning no execution was attempted at all. Also in the original version, `dev_build` would have been activated, meaning we're in a dev state. However, even if we were in a dev state, but if the notebook execution was never initiated/attempted, then in this case, I do not think it makes sense to record the commit hash of the current HNN version into the notebook. If we did, it would imply that the last dev execution was using this commit, but that is not the case.
This adds a large description of the building program as a whole to the CLI for 'build.py' such that it is output when calling python build.py --help This includes description of the assumed file structure (which has many, many assumptions), the overall code path, some examples, and the important options along the way. Minor: this also adds a docstring for 'get_hnn_commit_hash', renames the 'write_standalone_html' variable to 'save_standalone_nb_html', and makes a new CLI argument of a similar name. This documentation was first generated with Claude AI before undergoing *heavy* modification of every word.
This partially reverts our post-hnn_commit_hash-validation code's checks of the "code-version" to use a new var like in the original, called "is_dev_build". The distinction between a "dev" and non dev build is now fully documented, apparent, and set in 'build.py'. "is_dev_build" is also a proper boolean flag (true/false only) and the actual commit hash to use (when necessary) is handled separately. More refactoring: this also renames the hnn-core commit hash module and splits its functions, so that the 'get_hnn_commit_hash' function is ONLY getting the installed version, and the rest of the code is validation of the installed version against various other versions.
There are also some minor variable name changes
Previously, we were loading the same notebook from disk twice, once to calculate the hash, then once to actually use it. It's more efficient to load from disk once.
This includes a special comment about how `_read_nb_json_output_metadata` looks in different places for the JSON output files depending on if you're doing a stable or dev build
SHOULD be no code changes...
You can probably guess where this is going... Note: this does not affect the Workshop page, since it's raw HTML.
My Claude's filesystem scan was out-of-date, and lightly "hallulcinated" a `templates/navbar.html` that no longer exists. This removes that.
Appears to work just as well.
9400138 to
86eb4da
Compare
|
@dylansdaniels this has now been rebased off of |
|
This is complicated enough that we may want to NOT squash these into one big mega commit before merge. Instead, may want to way until we're happy with the current state, then I do a rebase to resolve the merge conflicts, then we merge |
|
For some reason, even if I re-add then re-delete the (no longer necessary) files for Everything else should be up to date and work with |
|
@asoplata reviewed everything and looks good in theory. doing some local testing that the builds are working as expected. I pushed a commit to add "no-check" to the "--code-version" argument, as it was missing . |
|
Thanks for catching that, when going over this again I forgot that I'll add a new commit that forces |
|
Another refactoring idea is, even though we're already at pretty high CLI complexity, we may want to separate how whether you're doing a |
|
Done with the addition, now |
build.py
Outdated
| be skipped) using the current master (development) branch that you have | ||
| installed (WITHOUT checking against the latest master branch on Github), and | ||
| build the website in your local `dev` directory (creating it if necessary): |
There was a problem hiding this comment.
I think this is confusing, because you techically don't need master installed since you're passing --no-version-validation, so it's actually not guaranteed that you're using the current master (dev) branch. I think we should remove example 4, reindex, and add explanation to the current example 8 stating that the --no-version-validation flag supercedes any use of --code-version. That, or we make it such that you can't actually pass --code-version with --no-version-validation
There was a problem hiding this comment.
this is related to your comment on control flow. But I agree that we don't necessarily have to rethink that entirely in this PR, but I do think we should change the examples at the very least
There was a problem hiding this comment.
I def agree it's super confusing. It's possible that I meant to strip out --code-version=no-check in its ENTIRETY, and only left it in (e.g. at the end of process_hnn_commit_hashes.py) by accident. It's just been so long that I don't remember.
What about removing no-check entirely? no-check and --no-version-validation are essentially doing the same thing in process_hnn_commit_hashes.py already (except for --code-version=stable where only --no-version-validation is doing the correct behavior and returning None for the commit instead of the current installed commit).
If we remove it entirely, it's much simpler, and the docs will be so too. --code-version determines a content vs dev build based on either stable or {master, custom}. Then, --no-version-validation overrides the version validation by using the appropriate commit in the metadata (the installed one if dev, or None if content) and executing only using the installed commit.
I'll add that even if we drop no-check, it will be an improvement, but still has a lot of room left for further improvement after the PR
There was a problem hiding this comment.
Hmm my gut is that it should be the opposite actually - we keep --code-version=no-check and remove --no-version-validation.
Then we entirely avoid the confusion of commands like: python build.py --execution-type=execute-updated-unskipped-notebooks --code-version=master --no-version-validation -> this implies it should be master, but we don't validate, so you can actually run it with stable and it will build anyway. users may thing their build is running on master but it's not
There was a problem hiding this comment.
There would never be a case where you need a code version specified when using --no-version-validation in its current form anyway, it's really an either or situation. so it might as well be an option for --code-version. and if there's no version validation, it should probably always be a dev build
There was a problem hiding this comment.
I'm okay with removing --no-version-validation and replacing it exclusively with --code-version=no-check, since I agree that it way easier to understand, EXCEPT for one caveat: I think we should simultaneously split dev build logic into its own CLI argument. Reducing the cognitive load of "no check" into a single item is definitely the way to go.
However, we need to be able to control whether we're doing a build into content or dev, independent of that. For example, when I was coding this, I was actively using that situation -- I was trying to test and code the situation where --code-version=stable (which is currently the only way to use the code paths that build into content instead of dev), but I didn't want to be pinging Github everytime I tried to build (since I didn't want my IP to get on the bad list). Therefore, we (moreso you and I, not necessarily the expected user) need to be able to use --code-version=no-check but control whether we're outputting to and building from content or dev.
This has the added benefit of making it very explicit at the CLI level where you're building to. We could still retain the current control flow (i.e. --code-version=stable defaults to building to content, but {master, custom, no-check} default to building to dev) BUT this gives us (and users) a way to force override that preference.
There was a problem hiding this comment.
Makes sense! I think that's a good plan and I'm on board 👍
There was a problem hiding this comment.
Implemented the change in the latest commit
This removes the CLI arg `--no-version-validation` (so that only `--code-version=no-check` remains), and adds a new CLI arg named `--build-directory`. This new arg accepts values of `auto` (the default, which follows our `is_dev_build` "algorithm"), `content` (user forces website to be built into content dir), and `dev` (user forces website to be built into `dev`). This also upgrades all the documentation to reflect these changes, along with some minor other docstring improvements.
This is based on, and continuing from, this PR here dylansdaniels#2 . This includes many large refactors, including changing
ordered_page_linksto a greatly expandedflat_index, along with many other changes.If merged, this resolves #125 and #126 .