A script to remove node_modules
directories to free up disk space.
We find that web development eats up a lot of disk space in
our labs, and that most of that is due to large
node_modules
directories that end up lying around long after
the course they were generated for is over. In Software Design,
for example, we might end up with 7 or 8 labs and iterations
for each student, each of which has its own large
node_modules
directory. If a class has 20 students, that
could easily be 150 or more node_modules
directories from
a single semester.
We could take a really simplistic approach and just search for
everything called node_modules
and delete it. There are a
few concerns there, however:
- Someone might have a folder called node_modules that wasn't created by npm and be really unhappy that it got deleted. This isn't very likely, to be honest, but it is possible.
- It's also remotely possible (but again unlikely) that someone wants their node_modules directory for reasons we haven't imagined.
- We could delete what might be considered an "active" node_modules directory, i.e., one in a project that the user is currently working on. That can be reconstructed, but it's slow to redownload all those files, and it would certainly be quite annoying if every week or so all your node_modules directories disappeared on you.
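For reference, the simplistic search (without the deletion) could be sketched roughly as follows. This is only an illustration in Python, not the actual script; the function name find_node_modules is made up for this example.

```python
import os
from typing import Iterator


def find_node_modules(root: str) -> Iterator[str]:
    """Yield every directory named node_modules found under root.

    Once a node_modules directory is found we do not descend into it, so
    node_modules directories nested inside packages are not reported
    separately (they would go away with their parent anyway).
    """
    for dirpath, dirnames, _filenames in os.walk(root):
        if "node_modules" in dirnames:
            yield os.path.join(dirpath, "node_modules")
            # Prune in place so os.walk skips the (often huge) subtree.
            dirnames.remove("node_modules")
```

Something like `for path in find_node_modules("/some/lab/dir"): print(path)` would then list the candidates without touching them.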
A possible approach for dealing with the first two concerns would be
using .gitignore
as a proxy for "it's OK to delete this".
If a user has a file or directory (like node_modules) in
a .gitignore file, then it would have to be regenerated
by anyone (including this user) who clones the project in the
future, so at some level they probably won't be too upset if
it got deleted. Happily, it seems that git check-ignore
may do exactly the checking we want, without us having to scan
up through the directory tree looking for .gitignore files ourselves.
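A minimal sketch of that check, assuming git is on the PATH. It relies on the exit statuses documented for git check-ignore (0 when the path is ignored, 1 when it is not, 128 on a fatal error); the function name ignore_status and the three labels it returns are invented for this example.

```python
import os
import subprocess


def ignore_status(node_modules_path: str) -> str:
    """Return "ignored", "not-ignored", or "error" for the given directory."""
    parent, name = os.path.split(os.path.abspath(node_modules_path))
    result = subprocess.run(
        # Run git from the parent directory so it finds the right repository.
        ["git", "-C", parent, "check-ignore", "--quiet", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    if result.returncode == 0:
        return "ignored"
    if result.returncode == 1:
        return "not-ignored"
    # Anything else (typically 128) means git could not answer, e.g. because
    # the directory is not inside a repository at all.
    return "error"
```

Treating "error" the same as "not-ignored" (i.e., never deleting) would be the conservative choice.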
The third concern might be addressed through one or
more of the mtime, ctime, or atime attributes of the
node_modules directory. We could, for example, decide to
only delete node_modules
directories that are at least six
months old by some measure. Six months would probably be long
enough to "protect" directories across breaks (including
summers). We could probably get away with three months,
however, and it might make sense to start with the shorter
timespan and see if anyone complains. (People will definitely
fuss if we set it too short, but no one will tell us if we
set it too long.)
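The age test itself is simple. Here is a rough sketch in Python, assuming we compare one timestamp of the directory against a threshold in days; the names is_old_enough, min_age_days, and attr are hypothetical, and the choice of atime as the default is just one option.

```python
import os
import time
from typing import Optional

SECONDS_PER_DAY = 24 * 60 * 60


def is_old_enough(path: str, min_age_days: float, attr: str = "st_atime",
                  now: Optional[float] = None) -> bool:
    """True if the chosen timestamp of path is at least min_age_days old.

    attr is one of "st_atime", "st_mtime", or "st_ctime"; now defaults to
    the current time and is a parameter mainly to make testing easy.
    """
    if now is None:
        now = time.time()
    timestamp = getattr(os.stat(path), attr)
    return (now - timestamp) >= min_age_days * SECONDS_PER_DAY
```

With this, "three months" would be roughly `is_old_enough(path, 90)` and "six months" roughly `is_old_enough(path, 180)`.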
It's not obvious which of mtime, ctime, or atime to use.
I think that atime may be our best bet. Users often won't
modify (i.e., change mtime or ctime) node_modules for
long periods of time, especially once a project's structure
has stabilized. I think they'll "access" (atime) the
node_modules folder fairly often (e.g., whenever they build
the project), so hopefully that will be informative.
Adding a new dependency with ng add
changes all three times.
Running ng serve
after ng add
changes atime
but leaves
the other two alone. Running ng serve
a little later without
having touched or changed any other files also updates atime
without changing anything else.
This supports the idea that atime
is the
one that would be the most useful.
Unfortunately it looks like running tools like du
can sometimes update
the atime
, so we might have cases where things look like they've
been accessed by the user a lot more recently than we would expect.
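To repeat this kind of experiment, a small helper that reports all three timestamps can be handy. This is just a sketch; the name timestamps is made up, and it assumes a Unix-like system, where st_ctime is the time of the last metadata change (on Windows it means something different).

```python
import os
from datetime import datetime
from typing import Dict


def timestamps(path: str) -> Dict[str, datetime]:
    """Return the atime, mtime, and ctime of path as local datetimes."""
    st = os.stat(path)
    return {
        "atime": datetime.fromtimestamp(st.st_atime),
        "mtime": datetime.fromtimestamp(st.st_mtime),
        "ctime": datetime.fromtimestamp(st.st_ctime),
    }
```

Calling it on a node_modules directory before and after running a command (ng serve, du, and so on) shows which of the three values that command changes.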
I'm envisioning a script that takes one or more arguments
which are the directories that should be checked for
node_modules
directories. If no arguments are given, it might
be reasonable to use the current directory as the directory to
explore, but given the potentially destructive nature of the
command it might make sense to be more conservative and require
an explicit directory.
We might allow the minimum age of directories to delete to be configurable through command line arguments; 3 months might be the default, but we could allow the user to set different time bounds.
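As a sketch of what that command line could look like, here is a Python argparse version that takes the conservative route and requires at least one directory. The option name --min-age-days and its default of 90 days (roughly three months) are assumptions for illustration, not settled decisions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that requires at least one directory to explore."""
    parser = argparse.ArgumentParser(
        description="Remove old, git-ignored node_modules directories.")
    parser.add_argument(
        "directories", nargs="+", metavar="DIR",
        help="directory to search for node_modules directories")
    parser.add_argument(
        "--min-age-days", type=float, default=90.0,
        help="only delete directories at least this many days old "
             "(default: 90, roughly three months)")
    return parser
```

Because of `nargs="+"`, running the command with no directory is an error instead of silently falling back to the current directory.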
The current use of git check-ignore --quiet node_modules
in the script leads to lots of errors like the following
whenever a node_modules directory is not inside a git repository:
fatal: Not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
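One way to avoid that noise might be to first ask git whether the directory is inside a work tree at all, with stderr suppressed, and skip the ignore check when it is not. A sketch, assuming git is on the PATH; the function name in_git_work_tree is invented here.

```python
import subprocess


def in_git_work_tree(directory: str) -> bool:
    """True if directory is inside the work tree of some git repository."""
    result = subprocess.run(
        ["git", "-C", directory, "rev-parse", "--is-inside-work-tree"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # hide the "fatal: Not a git repository" message
        text=True,
        check=False,
    )
    # git prints "true" and exits 0 inside a work tree; outside any
    # repository it exits with a non-zero status.
    return result.returncode == 0 and result.stdout.strip() == "true"
```

A node_modules directory whose parent fails this test could simply be reported as "not in a git repository" and skipped.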
We should have a --dry-run option that lists
- All the node_modules directories that will be deleted
- All the node_modules directories it's skipping, with info on why
(This could be fancied up with additional flags, but just reporting everything would do for starters.)
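A rough sketch of what that report could look like, assuming the script has already decided, for each directory, whether to delete it or why to skip it. The Decision class, the reason strings, and the output layout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Decision:
    path: str
    skip_reason: Optional[str] = None  # None means "would be deleted"


def dry_run_report(decisions: Iterable[Decision]) -> List[str]:
    """Return the lines of a dry-run report: deletions first, then skips."""
    decisions = list(decisions)
    to_delete = [d for d in decisions if d.skip_reason is None]
    skipped = [d for d in decisions if d.skip_reason is not None]
    lines = ["Would delete:"]
    lines.extend(f"  {d.path}" for d in to_delete)
    lines.append("Skipping:")
    lines.extend(f"  {d.path} ({d.skip_reason})" for d in skipped)
    return lines
```

Printing the result is then just `print("\n".join(dry_run_report(decisions)))`.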
Maybe flags such as --not-ignored and --too-new to list just those
categories of skipped directories, and maybe a --summary flag.
Maybe an --interactive flag that shows you how many directories it will delete
and asks for confirmation. Or maybe that should be the default, and we
instead have a --force flag (modelled after rm) that skips that step?
Allow people to set the time as an argument, e.g., --min-age="now"
so they
can clean up everything at the end of a semester.
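Here is one possible way to interpret such a value, sketched in Python. Only "now" comes from the idea above; the other accepted forms (a count followed by d, w, or m) and the 30-day month are assumptions made up for this example.

```python
import re

SECONDS_PER_DAY = 24 * 60 * 60
# Hypothetical unit suffixes: days, weeks, and (30-day) months.
_UNIT_DAYS = {"d": 1, "w": 7, "m": 30}


def parse_min_age(text: str) -> int:
    """Convert a --min-age value such as "now", "45d", or "3m" to seconds."""
    text = text.strip().lower()
    if text == "now":
        return 0  # everything qualifies, regardless of age
    match = re.fullmatch(r"(\d+)([dwm])", text)
    if match is None:
        raise ValueError(f"unrecognized age: {text!r}")
    count, unit = match.groups()
    return int(count) * _UNIT_DAYS[unit] * SECONDS_PER_DAY
```

With this, --min-age="now" maps to a threshold of zero seconds, so every ignored node_modules directory would be eligible.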
Should this be generalized to finding and deleting all things that are old and
ignored by git? That would include things like .class and .o files,
generated executables, etc. That might be a nice generalization that
would make the tool more useful/interesting to a broader audience without completely
rewriting an existing command. We probably wouldn't want to look for all "old"
files and then check whether each one is ignored by git. Maybe a better approach would
be to find all .git directories, and then search for old, ignored files in those
repositories? It sounds like it would be pretty slow, though,
on a large directory structure.
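A sketch of that second approach, assuming git is on the PATH: first locate the repositories, then let git itself list what it ignores in each one (via git ls-files with --others --ignored --exclude-standard). The function names are made up, and the age filtering is left out.

```python
import os
import subprocess
from typing import Iterator, List


def find_git_repos(root: str) -> Iterator[str]:
    """Yield every directory under root that contains a .git entry."""
    for dirpath, dirnames, filenames in os.walk(root):
        # .git is usually a directory, but can be a file (e.g. in submodules).
        if ".git" in dirnames or ".git" in filenames:
            yield dirpath
        # Never descend into git's own metadata.
        dirnames[:] = [d for d in dirnames if d != ".git"]


def list_ignored(repo: str) -> List[str]:
    """Return the untracked, ignored paths in repo, relative to repo."""
    result = subprocess.run(
        ["git", "-C", repo, "ls-files", "--others", "--ignored",
         "--exclude-standard", "--directory", "-z"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    # -z separates paths with NUL bytes, so odd file names survive intact.
    return [path for path in result.stdout.split("\0") if path]
```

This lets git do the ignore matching once per repository instead of once per file, which might help with the speed concern, though we haven't measured it.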