To some extent this library will trade a small amount of extra query latency…

- **DataFrame caching**: Intelligent caching system using DBFS (Databricks File System).
- **Query complexity estimation**: Tools to analyze and estimate Spark query complexity and trigger caching if above a set threshold.
- **Hybrid Spark/DBFS caching**: On classic clusters, you can now prefer Spark's in-memory cache (`.cache()`) for fast iterative work, and only persist to DBFS when needed (see below).

## Installation

Requires Python 3.10 or higher. Install using pip:
Note: serverless clusters do not support monkey patching, i.e. extending objects with new methods, so `df.cacheToDbfs()` needs to be replaced with `cacheToDbfs(df)`, and similarly for `createCachedDataFrame(spark, ...)`. However, `cacheToDbfs` (imported from `dbfs_spark_cache.caching`) can be used with `df.transform(cacheToDbfs)`. Unfortunately, write (and read) performance is very poor on serverless clusters, so they cannot be recommended for general use with this library.
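
For example, a minimal sketch of the serverless call pattern (assuming an existing DataFrame `df`):

```python
from dbfs_spark_cache.caching import cacheToDbfs

# Serverless: call the function directly instead of the patched DataFrame method.
df_cached = cacheToDbfs(df)

# Equivalent chained form that keeps a fluent pipeline style:
df_cached = df.transform(cacheToDbfs)
```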

Default cache trigger thresholds can be set per notebook. Set either threshold to `None` to disable that specific check. Caching occurs only if BOTH conditions are met (or the corresponding threshold is `None`).
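
As an illustration of that trigger logic (a hypothetical helper, not the library's API):

```python
from typing import Optional

def should_cache(estimated_complexity: float,
                 input_size_gb: float,
                 complexity_threshold: Optional[float],
                 size_threshold: Optional[float]) -> bool:
    # Caching fires only when BOTH checks pass; a threshold of None
    # disables that specific check.
    complexity_ok = complexity_threshold is None or estimated_complexity >= complexity_threshold
    size_ok = size_threshold is None or input_size_gb >= size_threshold
    return complexity_ok and size_ok

print(should_cache(120.0, 0.5, complexity_threshold=100.0, size_threshold=None))  # True
```
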
## Hybrid Spark/DBFS Caching and Backup

Because Spark cache is faster than DBFS cache on clusters with enough memory or disk space (provided fast SSD disks are used), we can use it for fast iterative work and only persist to DBFS when needed, i.e. when shutting down the cluster.

- **Backup of Spark-cached DataFrames**: Use `backup_spark_cached_to_dbfs()` to persist all Spark-cached DataFrames to DBFS before cluster termination, although the performance win of keeping data in Spark cache is not that big compared to rerunning everything with DBFS caching directly.
- **Configurable caching mode**: The config `PREFER_SPARK_CACHE` (default: `True`) controls whether Spark in-memory cache is preferred on classic clusters. On serverless clusters, DBFS caching is always used.
- **Automatic registry of Spark-cached DataFrames**: DataFrames cached via `.cacheToDbfs()` in Spark-cache mode are tracked and can be listed or backed up.

By default (on classic clusters), calling `.cacheToDbfs()` will:
- Use Spark's in-memory cache (`.cache()`) if no DBFS cache exists. `.wcd()` will cache with Spark or not based on the estimated compute complexity of the query.
- If a DBFS cache exists, it will be read instead.
- You can persist all Spark-cached DataFrames to DBFS at any time (e.g. before cluster shutdown) with `backup_spark_cached_to_dbfs()`, as sketched below.
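
A minimal sketch of this default flow on a classic cluster, assuming an existing DataFrame `df` and that `.wcd()` is the `withCachedDisplay()` shorthand mentioned further below:

```python
from dbfs_spark_cache.caching import backup_spark_cached_to_dbfs

# No DBFS cache exists yet, so this Spark-caches the DataFrame and
# registers it in the Spark-cache registry.
df_cached = df.cacheToDbfs()

# Spark-caches only if the estimated compute complexity of the query
# is high enough, then displays the result.
df.wcd()

# Before cluster shutdown: persist all registered Spark-cached DataFrames.
backup_spark_cached_to_dbfs()
```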

To always force caching to DBFS, set:

```python
from dbfs_spark_cache.config import config
config.PREFER_SPARK_CACHE = False
```

On serverless clusters, DBFS caching is always used regardless of this setting (Spark cache is not available). You can also disable all calls to the extensions entirely, in which case they keep the DataFrame unchanged. When using Spark cache, you can persist to DBFS like this:

```python
from dbfs_spark_cache.caching import backup_spark_cached_to_dbfs
# Back up one or more specific DataFrames, e.g. the final result of your work and
# the DataFrames used with withCachedDisplay(), so you can pick up faster next time.
# Called with no arguments it persists all Spark-cached DataFrames, as noted above.
backup_spark_cached_to_dbfs()
```

Here `"THE_HASH"` is the hash of the DataFrame you saved earlier; you can obtain it via:

```python
from dbfs_spark_cache.caching import get_table_hash
print(get_table_hash(df))
```

### Which changes trigger DataFrame cache invalidation?

DataFrame storage type|Query plan changes|Data changes
---|---|---
In-Memory|No, not directly, but via conversion to a DBFS table through `createCachedDataFrame`|
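
As a hedged sketch of that conversion route (only the call form `createCachedDataFrame(spark, ...)` appears above, so the data argument and exact signature here are assumptions):

```python
from dbfs_spark_cache.caching import createCachedDataFrame

# Assumed to mirror spark.createDataFrame: materializing the data through a
# DBFS-backed table lets data changes invalidate the cache, which a purely
# in-memory .cache() cannot detect.
source_rows = [(1, "a"), (2, "b")]  # hypothetical example data
df_cached = createCachedDataFrame(spark, source_rows)
```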

This library has been primarily tested under the following Databricks environment configuration, but anything else supported by Databricks and the PySpark DataFrame API may work too:

Note that serverless performance when writing to DBFS is currently abysmal, so serverless can only be used for limited testing on small datasets. You can use the file `serverless_env.yml` to automatically install the library on a serverless cluster.