Releases: facebook/rocksdb
RocksDB 8.7.3
8.7.3 (2023-10-30)
Behavior Changes
- Deletion of stale files upon recovery is delegated to SstFileManager if available, so that it can be rate limited.
8.7.2 (2023-10-25)
Public API Changes
- Add new Cache APIs GetSecondaryCacheCapacity() and GetSecondaryCachePinnedUsage() to return the configured capacity, and cache reservation charged to the secondary cache.
Bug Fixes
- Fixed a possible underflow when computing the compressed secondary cache share of memory reservations while updating the compressed secondary to total block cache ratio.
- Fix an assertion failure when UpdateTieredCache() is called in an idempotent manner.
8.7.1 (2023-10-20)
Bug Fixes
- Fix a bug in auto_readahead_size where first_internal_key of index blocks wasn't copied properly resulting in corruption error when first_internal_key was used for comparison.
- Add bounds check in WBWIIteratorImpl and make BaseDeltaIterator, WriteUnpreparedTxn and WritePreparedTxn respect the upper bound and lower bound in ReadOptions. See #11680.
8.7.0 (2023-09-22)
New Features
- Added an experimental new "automatic" variant of HyperClockCache that does not require a prior estimate of the average size of cache entries. This variant is activated when HyperClockCacheOptions::estimated_entry_charge = 0 and has essentially the same concurrency benefits as the existing HyperClockCache.
- Add a new statistic COMPACTION_CPU_TOTAL_TIME that records cumulative compaction CPU time. This ticker is updated regularly while a compaction is running.
- Add GetEntity() API for ReadOnly DB and Secondary DB.
- Add a new iterator API Iterator::Refresh(const Snapshot *) that allows iterator to be refreshed while using the input snapshot to read.
- Added a new read option merge_operand_count_threshold. When the number of merge operands applied during a successful point lookup exceeds this threshold, the query will return a special OK status with a new subcode kMergeOperandThresholdExceeded. Applications might use this signal to take action to reduce the number of merge operands for the affected key(s), for example by running a compaction.
- For NewRibbonFilterPolicy(), made the bloom_before_level option mutable through the Configurable interface and the SetOptions API, allowing dynamic switching between all-Bloom and all-Ribbon configurations, and configurations in between. See comments on NewRibbonFilterPolicy().
- RocksDB now allows the block cache to be stacked on top of a compressed secondary cache and a non-volatile secondary cache, thus creating a three-tier cache. To set it up, use the NewTieredCache() API in rocksdb/cache.h.
- Added a new wide-column aware full merge API called FullMergeV3 to MergeOperator. FullMergeV3 supports wide columns both as base value and merge result, which enables the application to perform more general transformations during merges. For backward compatibility, the default implementation implements the earlier logic of applying the merge operation to the default column of any wide-column entities. Specifically, if there is no base value or the base value is a plain key-value, the default implementation falls back to FullMergeV2. If the base value is a wide-column entity, the default implementation invokes FullMergeV2 to perform the merge on the default column, and leaves any other columns unchanged.
- Add wide column support to ldb commands (scan, dump, idump, dump_wal) and sst_dump tool's scan command.
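The backward-compatible default described for FullMergeV3 can be pictured with a small stand-alone model. This is only a sketch: the types BaseValue and WideColumns, the helpers MergePlain and MergeWideAware, the comma-appending merge rule, and the use of an empty string as the default column name are all assumptions made for illustration and are not RocksDB's API.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT RocksDB types.
using PlainValue = std::string;
using WideColumns = std::map<std::string, std::string>;  // column name -> value
using BaseValue = std::variant<std::monostate, PlainValue, WideColumns>;

// Name of the default column in this sketch (assumed to be the empty string).
const std::string kDefaultColumn = "";

// Toy plain-value merge standing in for a V2-style merge:
// appends the operands to the base value, comma separated.
std::string MergePlain(const std::optional<std::string>& base,
                       const std::vector<std::string>& operands) {
  std::string result = base.value_or("");
  for (const auto& op : operands) {
    if (!result.empty()) result += ",";
    result += op;
  }
  return result;
}

// Wide-column-aware default:
// - no base value or a plain base value: fall back to the plain merge;
// - wide-column base value: merge only the default column, keep the rest.
BaseValue MergeWideAware(const BaseValue& base,
                         const std::vector<std::string>& operands) {
  assert(!operands.empty());  // this sketch expects at least one operand
  if (std::holds_alternative<std::monostate>(base)) {
    return MergePlain(std::nullopt, operands);
  }
  if (const auto* plain = std::get_if<PlainValue>(&base)) {
    return MergePlain(*plain, operands);
  }
  WideColumns columns = std::get<WideColumns>(base);
  std::optional<std::string> default_value;
  if (auto it = columns.find(kDefaultColumn); it != columns.end()) {
    default_value = it->second;
  }
  columns[kDefaultColumn] = MergePlain(default_value, operands);
  return columns;
}
```

In this model, merging into an entity with columns {"": "d", "attr": "keep"} changes only the default column and leaves "attr" untouched. An application overriding the real FullMergeV3 could instead transform any column.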
Public API Changes
- Expose more information about input files used in table creation (if any) in CompactionFilter::Context. See CompactionFilter::Context::input_start_level, CompactionFilter::Context::input_table_properties for more.
- Options::compaction_readahead_size's default value is changed from 0 to 2MB.
- When using LZ4 compression, the acceleration parameter is configurable by setting the negated value in CompressionOptions::level. For example, CompressionOptions::level=-10 will set acceleration=10.
- The NewTieredCache API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in TieredCacheOptions. Any capacity specified in LRUCacheOptions, HyperClockCacheOptions and CompressedSecondaryCacheOptions is ignored. A new API, UpdateTieredCache, is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy.
- The NewTieredVolatileCache() API in rocksdb/cache.h has been renamed to NewTieredCache().
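The LZ4 note above amounts to a simple mapping from a negative level to an acceleration value. The helper below is a hypothetical illustration, not a RocksDB function. Its treatment of non-negative levels (acceleration stays at 1) is an assumption, since the note only specifies the negative case.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helper, not part of RocksDB: derives the LZ4 acceleration
// implied by a CompressionOptions::level-style value.
// Negative levels map to acceleration = -level, as in the release note.
// Non-negative levels are ASSUMED here to leave acceleration at 1.
int Lz4AccelerationFromLevel(int level) {
  assert(level != INT_MIN);  // negating INT_MIN would overflow
  return level < 0 ? -level : 1;
}
```

With this mapping, a level of -10 corresponds to acceleration 10, matching the example in the note.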
Behavior Changes
- Compaction read performance will regress when Options::compaction_readahead_size is explicitly set to 0.
- Universal size amp compaction will conditionally exclude some of the newest L0 files when selecting input with a small negative impact to size amp. This is to prevent a large number of L0 files from being locked by a size amp compaction, potentially leading to write stop with a few more flushes.
- Change ldb scan command delimiter from ':' to '==>'.
- For non direct IO, eliminate the file system prefetching attempt for compaction read when Options::compaction_readahead_size is 0.
Bug Fixes
- Fix a bug where if there is an error reading from offset 0 of a file from L1+ and that the file is not the first file in the sorted run, data can be lost in compaction and read/scan can return incorrect results.
- Fix a bug where iterator may return incorrect result for DeleteRange() users if there was an error reading from a file.
- Fix a bug with atomic_flush=true that can cause the DB to get stuck after a flush fails (#11872).
- Fix a bug where RocksDB (with atomic_flush=false) can delete output SST files of pending flushes when a previous concurrent flush fails (#11865). This can result in DB entering read-only state with error message like IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_87732_4230653031040984171/000013.sst.
- Fix an assertion fault during seek with async_io when readahead trimming is enabled.
- When the compressed secondary cache capacity is reduced to 0, it should be completely disabled. Before this fix, inserts and lookups would still go to the backing LRUCache before returning, thus incurring locking overhead. With this fix, inserts and lookups are no-ops and do not add any overhead.
- Updating the tiered cache (cache allocated using NewTieredCache()) by calling SetCapacity() on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But SetCapacity() would just set the primary cache capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, SetCapacity() will distribute the new capacity across the two caches by the same ratio.
- Fixed a bug in MultiGet for cleaning up SuperVersion acquired with locking db mutex.
- Fix a bug where row cache can falsely return kNotFound even though row cache entry is hit.
- Fixed a race condition in GenericRateLimiter that could cause it to stop granting requests.
- Fix a bug (Issue #10257) where DB can hang after write stall since no compaction is scheduled (#11764).
- Add a fix for async_io where during seek, when reading a block for seeking a target key in a file without any readahead, the iterator aligned the read on a page boundary and read more than necessary. This increased the storage read bandwidth usage.
- Fix an issue in sst dump tool to handle bounds specified for data with user-defined timestamps.
- When auto_readahead_size is enabled, update readahead upper bound during readahead trimming when reseek changes iterate_upper_bound dynamically.
- Fixed a bug where rocksdb.file.read.verify.file.checksums.micros is not populated.
- Fixed a bug where compaction read under non direct IO still falls back to RocksDB internal prefetching after file system's prefetching returns non-OK status other than Status::NotSupported().
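The tiered cache fix above says that a total budget and a compressed-cache ratio are given at creation, and that later capacity changes are distributed by the same ratio. The arithmetic can be sketched as below. TieredBudget and SplitCapacity are hypothetical names for illustration, and the sketch models only the logical split of the budget, not how RocksDB implements it internally.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a two-tier cache budget; not RocksDB's classes.
struct TieredBudget {
  std::size_t primary_bytes;
  std::size_t compressed_bytes;
};

// Splits a total budget so that `compressed_ratio` of it goes to the
// compressed secondary tier and the remainder to the primary tier.
TieredBudget SplitCapacity(std::size_t total_bytes, double compressed_ratio) {
  assert(compressed_ratio >= 0.0 && compressed_ratio <= 1.0);
  const auto compressed = static_cast<std::size_t>(
      static_cast<double>(total_bytes) * compressed_ratio);
  return TieredBudget{total_bytes - compressed, compressed};
}
```

Calling SplitCapacity again with a new total and the same ratio mirrors what the note describes for SetCapacity(): both tiers grow or shrink together instead of only the primary tier changing.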
Performance Improvements
- Added additional improvements in tuning readahead_size during Scans when auto_readahead_size is enabled. However it's not recommended for backward scans and might impact the performance. More details in options.h.
- During async_io, the Seek happens in 2 phases. Phase 1 starts an asynchronous read on a block cache miss, and phase 2 waits for it to complete and finishes the seek. Previously, both phases looked up the data block in the block cache before looking in the prefetch buffer. The block cache lookup is now done only in the first phase, which saves some CPU.
RocksDB 8.6.7
8.6.7 (2023-09-26)
Bug Fixes
- Fixed a bug where compaction read under non direct IO still falls back to RocksDB internal prefetching after file system's prefetching returns non-OK status other than Status::NotSupported().
Behavior Changes
- For non direct IO, eliminate the file system prefetching attempt for compaction read when Options::compaction_readahead_size is 0.
8.6.6 (2023-09-25)
Bug Fixes
- Fix a bug with atomic_flush=true that can cause the DB to get stuck after a flush fails (#11872).
- Fix a bug where RocksDB (with atomic_flush=false) can delete output SST files of pending flushes when a previous concurrent flush fails (#11865). This can result in DB entering read-only state with error message like IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_87732_4230653031040984171/000013.sst.
- When the compressed secondary cache capacity is reduced to 0, it should be completely disabled. Before this fix, inserts and lookups would still go to the backing LRUCache before returning, thus incurring locking overhead. With this fix, inserts and lookups are no-ops and do not add any overhead.
8.6.5 (2023-09-15)
Bug Fixes
- Fixed a bug where rocksdb.file.read.verify.file.checksums.micros is not populated.
8.6.4 (2023-09-13)
Public API changes
- Add a column family option default_temperature that is used for file reading accounting purpose, such as io statistics, for files that don't have an explicitly set temperature.
8.6.3 (2023-09-12)
Bug Fixes
- Fix a bug where if there is an error reading from offset 0 of a file from L1+ and that the file is not the first file in the sorted run, data can be lost in compaction and read/scan can return incorrect results.
- Fix a bug where iterator may return incorrect result for DeleteRange() users if there was an error reading from a file.
8.6.2 (2023-09-11)
Bug Fixes
- Add a fix for async_io where during seek, when reading a block for seeking a target key in a file without any readahead, the iterator aligned the read on a page boundary and read more than necessary. This increased the storage read bandwidth usage.
8.6.1 (2023-08-30)
Public API Changes
- Options::compaction_readahead_size's default value is changed from 0 to 2MB.
Behavior Changes
- Compaction read performance will regress when Options::compaction_readahead_size is explicitly set to 0.
8.6.0 (2023-08-18)
New Features
- Added enhanced data integrity checking on SST files with new format_version=6. Performance impact is very small or negligible. Previously if SST data was misplaced or re-arranged by the storage layer, it could pass block checksum with higher than 1 in 4 billion probability. With format_version=6, block checksums depend on what file they are in and location within the file. This way, misplaced SST data is no more likely to pass checksum verification than randomly corrupted data. Also in format_version=6, SST footers are checksum-protected.
- Add a new feature to trim readahead_size during scans up to upper_bound when iterate_upper_bound is specified. It's enabled through ReadOptions.auto_readahead_size. Users must also specify ReadOptions.iterate_upper_bound.
- RocksDB will compare the number of input keys to the number of keys processed after each compaction. Compaction will fail and report Corruption status if the verification fails. Option compaction_verify_record_count is introduced for this purpose and is enabled by default.
- Add a CF option bottommost_file_compaction_delay to allow specifying the delay of bottommost level single-file compactions.
- Add support to allow enabling / disabling user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature.
- Implement a new admission policy for the compressed secondary cache that admits blocks evicted from the primary cache with the hit bit set. This policy can be specified in TieredVolatileCacheOptions by setting the newly added adm_policy option.
- Add a column family option memtable_max_range_deletions that limits the number of range deletions in a memtable. RocksDB will try to do an automatic flush after the limit is reached. (#11358)
- Add PutEntity API in sst_file_writer.
- Add timeout in microsecond option to WaitForCompactOptions to allow timely termination of prolonged waiting in scenarios like recurring recoverable errors, such as out-of-space situations and continuous write streams that sustain ongoing flush and compactions.
- New statistics rocksdb.file.read.{get|multiget|db.iterator|verify.checksum|verify.file.checksums}.micros measure read time of block-based SST tables or blob files during db open, Get(), MultiGet(), using db iterator, VerifyFileChecksums() and VerifyChecksum(). They require stats level greater than StatsLevel::kExceptDetailedTimers.
- Add close_db option to WaitForCompactOptions to call Close() after waiting is done.
- Add a new compression option CompressionOptions::checksum for enabling ZSTD's checksum feature to detect corruption during decompression.
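The idea behind the format_version=6 item above, block checksums that depend on the file and on the location within the file, can be illustrated with a toy scheme. Everything below is an assumption made for illustration: the FNV-1a hash, the mixing constants, and the names ContextChecksum and Verify are not RocksDB's actual algorithm or API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 32-bit FNV-1a hash, used here only as a stand-in for a real block checksum.
std::uint32_t Fnv1a(const std::string& data) {
  std::uint32_t h = 2166136261u;
  for (unsigned char c : data) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// Hypothetical context-aware checksum: mixes a per-file identifier and the
// block's offset into the plain checksum, so the same bytes stored in another
// file or at another offset typically produce a different value.
std::uint32_t ContextChecksum(const std::string& block, std::uint32_t file_id,
                              std::uint64_t offset) {
  const std::uint32_t offset_mix = static_cast<std::uint32_t>(offset) ^
                                   static_cast<std::uint32_t>(offset >> 32);
  return Fnv1a(block) ^ (file_id * 0x9E3779B1u) ^ (offset_mix * 0x85EBCA6Bu);
}

// Verification recomputes the checksum from where the block was actually read.
bool Verify(const std::string& block, std::uint32_t file_id,
            std::uint64_t offset, std::uint32_t stored) {
  return ContextChecksum(block, file_id, offset) == stored;
}
```

In this toy scheme, a block that the storage layer returns from the wrong offset or the wrong file fails verification even though its bytes are intact. That is the property the release note describes.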
Public API Changes
- Mark Options::access_hint_on_compaction_start related APIs as deprecated. See #11631 for alternative behavior.
Behavior Changes
- Statistics rocksdb.sst.read.micros now includes time spent on multi read and async read into the file.
- For Universal Compaction users, periodic compaction (option periodic_compaction_seconds) will be set to 30 days by default if block based table is used.
Bug Fixes
- Fix a bug in FileTTLBooster that can cause users with a large number of levels (more than 65) to see errors like "runtime error: shift exponent .. is too large.." (#11673).
RocksDB 8.5.4
8.5.4 (2023-09-26)
Bug Fixes
- Fixed a bug where compaction read under non direct IO still falls back to RocksDB internal prefetching after file system's prefetching returns non-OK status other than Status::NotSupported().
Behavior Changes
- For non direct IO, eliminate the file system prefetching attempt for compaction read when Options::compaction_readahead_size is 0.
RocksDB 8.5.3
Please note 8.5.1 includes a fix for a persisted database corruption in an unlikely edge case. Upgrading to a version including this fix, like this one, is highly recommended!
8.5.3 (2023-09-01)
Bug Fixes
- Fixed a race condition in GenericRateLimiter that could cause it to stop granting requests.
8.5.2 (2023-08-31)
Bug fixes
- Fix a bug where iterator may return incorrect result for DeleteRange() users if there was an error reading from a file.
8.5.1 (2023-08-31)
Bug fixes
- Fix a bug where if there is an error reading from offset 0 of a file from L1+ and that the file is not the first file in the sorted run, data can be lost in compaction and read/scan can return incorrect results.
8.5.0 (2023-07-21)
Public API Changes
- Removed recently added APIs GeneralCache and MakeSharedGeneralCache() as our plan changed to stop exposing a general-purpose cache interface. The old forms of these APIs, Cache and NewLRUCache(), are still available, although general-purpose caching support will be dropped eventually.
Behavior Changes
- Option periodic_compaction_seconds no longer supports FIFO compaction: setting it has no effect on FIFO compactions. FIFO compaction users should only set option ttl instead.
- Move prefetching responsibility to page cache for compaction read for non directIO use case.
Performance Improvements
- In case of direct_io, if the buffer passed by the caller is already aligned, RandomAccessFileRead::Read avoids reallocating a new buffer and uses the already aligned buffer that was passed in, reducing memcpy.
- Small efficiency improvement to HyperClockCache by reducing chance of compiler-generated heap allocations
Bug Fixes
- Fix use_after_free bug in async_io MultiReads when the underlying FS has kFSBuffer enabled. With kFSBuffer, the underlying FS passes its own buffer instead of using the RocksDB scratch in FSReadRequest. Right now it's an experimental feature.
- Fix a bug in FileTTLBooster that can cause users with a large number of levels (more than 65) to see errors like "runtime error: shift exponent .. is too large.." (#11673).
RocksDB 8.4.4
8.4.4 (2023-09-01)
Bug Fixes
- Fix a bug where if there is an error reading from offset 0 of a file from L1+ and that the file is not the first file in the sorted run, data can be lost in compaction and read/scan can return incorrect results.
- Fix a bug where iterator may return incorrect result for DeleteRange() users if there was an error reading from a file.
- Fixed a race condition in GenericRateLimiter that could cause it to stop granting requests.
8.4.3 (2023-07-27)
Bug Fixes
- Fix use_after_free bug in async_io MultiReads when the underlying FS has kFSBuffer enabled. With kFSBuffer, the underlying FS passes its own buffer instead of using the RocksDB scratch in FSReadRequest.
8.4.0 (2023-06-26)
New Features
- Add FSReadRequest::fs_scratch, which is a data buffer allocated and provided by the underlying FileSystem to RocksDB during reads, when the FS wants to provide its own buffer with data instead of using the RocksDB provided FSReadRequest::scratch. This can help CPU usage by avoiding a copy from the file system's buffer to the RocksDB buffer. More details on how to use/enable it are in file_system.h. Right now it's supported only for MultiReads (async + sync) with non direct io.
- Start logging non-zero user-defined timestamp sizes in WAL to signal user key format in subsequent records and use it during recovery. This change will break recovery from WAL files written by early versions that contain user-defined timestamps. The workaround is to ensure there are no WAL files to recover (i.e. by flushing before close) before upgrade.
- Added new property "rocksdb.obsolete-sst-files-size-property" that reports the size of SST files that have become obsolete but have not yet been deleted or scheduled for deletion
- Start to record the value of the flag AdvancedColumnFamilyOptions.persist_user_defined_timestamps in the Manifest and table properties for a SST file when it is created, and use the recorded flag when creating a table reader for the SST file. This flag is only explicitly recorded if it's false.
- Add a new option OptimisticTransactionDBOptions::shared_lock_buckets that enables sharing mutexes for validating transactions between DB instances, for better balancing memory efficiency and validation contention across DB instances. Different column families and DBs also now use different hash seeds in this validation, so that the same set of key names will not contend across DBs or column families.
- Add a new ticker rocksdb.files.marked.trash.deleted to track the number of trash files deleted by background thread from the trash queue.
- Add an API NewTieredVolatileCache() in include/rocksdb/cache.h to allocate an instance of a block cache with a primary block cache tier and a compressed secondary cache tier. A cache of this type distributes memory reservations against the block cache, such as WriteBufferManager, table reader memory etc., proportionally across both the primary and compressed secondary cache.
- Add WaitForCompact() to wait for all flush and compaction jobs to finish. Jobs to wait for include the unscheduled ones (queued, but not scheduled yet).
- Add WriteBatch::Release() that releases the batch's serialized data to the caller.
Public API Changes
- Add C API rocksdb_options_add_compact_on_deletion_collector_factory_del_ratio.
- Change the FileSystem::use_async_io() API to SupportedOps API in order to extend it to various operations supported by underlying FileSystem. Right now it contains FSSupportedOps::kAsyncIO and FSSupportedOps::kFSBuffer. More details about FSSupportedOps in file_system.h.
- Add new tickers: rocksdb.error.handler.bg.error.count, rocksdb.error.handler.bg.io.error.count, rocksdb.error.handler.bg.retryable.io.error.count to replace the misspelled ones: rocksdb.error.handler.bg.errro.count, rocksdb.error.handler.bg.io.errro.count, rocksdb.error.handler.bg.retryable.io.errro.count ('error' instead of 'errro'). Users should switch to use the new tickers before 9.0 release as the misspelled old tickers will be completely removed then.
- Overload the API CreateColumnFamilyWithImport() to support creating ColumnFamily by importing multiple ColumnFamilies. It requires that CFs should not overlap in user key range.
Behavior Changes
- Change the default value for option level_compaction_dynamic_level_bytes to true. This affects users who use leveled compaction and do not set this option explicitly. These users may see additional background compactions following DB open. These compactions help to shape the LSM according to level_compaction_dynamic_level_bytes such that the size of each level Ln is approximately size of Ln-1 * max_bytes_for_level_multiplier. Turning on this option has other benefits too: see more detail in wiki: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size and in option comment in advanced_options.h (#11525).
- For Leveled Compaction users, CompactRange() will now always try to compact to the last non-empty level. (#11468)
- For Leveled Compaction users, CompactRange() with bottommost_level_compaction = BottommostLevelCompaction::kIfHaveCompactionFilter will behave similar to kForceOptimized in that it will skip files created during this manual compaction when compacting files in the bottommost level. (#11468)
- RocksDB will try to drop range tombstones during non-bottommost compaction when it is safe to do so. (#11459)
- When a DB is opened with allow_ingest_behind=true (currently only Universal compaction is supported), files in the last level, i.e. the ingested files, will not be included in any compaction. (#11489)
- Statistics rocksdb.sst.read.micros scope is expanded to all SST reads except for file ingestion and column family import (some compaction reads were previously excluded).
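The level shape mentioned for level_compaction_dynamic_level_bytes, where each level Ln is approximately Ln-1 * max_bytes_for_level_multiplier, can be worked out by starting from the last level and dividing upward. The function below is a simplified sketch with a hypothetical name. It ignores details such as a minimum base level size, which the wiki page linked above covers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns hypothetical target sizes for `num_levels` consecutive levels,
// ordered from the smallest (upper) level to the last level. Targets are
// derived from the last level's size by repeatedly dividing by the
// multiplier, so that size(Ln) is roughly size(Ln-1) * multiplier.
std::vector<std::uint64_t> LevelTargets(std::uint64_t last_level_bytes,
                                        int num_levels, double multiplier) {
  assert(num_levels >= 1 && multiplier > 1.0);
  std::vector<std::uint64_t> targets(static_cast<std::size_t>(num_levels));
  double size = static_cast<double>(last_level_bytes);
  for (int i = num_levels - 1; i >= 0; --i) {
    targets[static_cast<std::size_t>(i)] = static_cast<std::uint64_t>(size);
    size /= multiplier;
  }
  return targets;
}
```

For example, with a last level of 1,000,000 bytes, three levels and a multiplier of 10, the targets come out as 10,000, 100,000 and 1,000,000 bytes.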
Bug Fixes
- Reduced cases of illegally using Env::Default() during static destruction by never destroying the internal PosixEnv itself (except for builds checking for memory leaks). (#11538)
- Fix extra prefetching during seek in async_io when BlockBasedTableOptions.num_file_reads_for_auto_readahead is 1, which led to more reads than required.
- Fix a bug where compactions that are qualified to be run as 2 subcompactions were only run as one subcompaction.
- Fix a use-after-move bug in block.cc.
RocksDB 8.3.3
8.3.3 (2023-09-01)
Bug Fixes
- Fix a bug where if there is an error reading from offset 0 of a file from L1+ and that the file is not the first file in the sorted run, data can be lost in compaction and read/scan can return incorrect results.
- Fix a bug where iterator may return incorrect result for DeleteRange() users if there was an error reading from a file.
- Fixed a race condition in GenericRateLimiter that could cause it to stop granting requests.
RocksDB 8.3.2
8.3.2 (2023-06-14)
Bug Fixes
- Reduced cases of illegally using Env::Default() during static destruction by never destroying the internal PosixEnv itself (except for builds checking for memory leaks). (#11538)
8.3.1 (2023-06-07)
Performance Improvements
- Fixed higher read QPS during DB::Open() reading files created prior to #11406, especially when reading many small files (size < 52 MB) during DB::Open() and partitioned filter or index is used.
8.3.0 (2023-05-19)
New Features
- Introduced a new option block_protection_bytes_per_key, which can be used to enable per key-value integrity protection for in-memory blocks in block cache (#11287).
- Added JemallocAllocatorOptions::num_arenas. Setting num_arenas > 1 may mitigate mutex contention in the allocator, particularly in scenarios where block allocations commonly bypass jemalloc tcache.
- Improve the operational safety of publishing a DB or SST files to many hosts by using different block cache hash seeds on different hosts. The exact behavior is controlled by new option ShardedCacheOptions::hash_seed, which also documents the solved problem in more detail.
- Introduced a new option CompactionOptionsFIFO::file_temperature_age_thresholds that allows FIFO compaction to compact files to different temperatures based on key age (#11428).
- Added a new ticker stat to count how many times RocksDB detected a corruption while verifying a block checksum: BLOCK_CHECKSUM_MISMATCH_COUNT.
- New statistics rocksdb.file.read.db.open.micros that measures read time of block-based SST tables or blob files during db open.
- New statistics tickers for various iterator seek behaviors and relevant filtering, as *_LEVEL_SEEK_*. (#11460)
Public API Changes
- EXPERIMENTAL: Add new API DB::ClipColumnFamily to clip the keys in a CF to a certain range. It will physically delete all keys outside the range, including tombstones.
- Add MakeSharedCache() construction functions to various cache Options objects, and deprecated the NewWhateverCache() functions with long parameter lists.
- Changed the meaning of various Bloom filter stats (prefix vs. whole key), with iterator-related filtering only being tracked in the new *_LEVEL_SEEK_* stats. (#11460)
Behavior changes
- For x86, CPU features are no longer detected at runtime nor in build scripts, but in source code using common preprocessor defines. This will likely unlock some small performance improvements on some newer hardware, but could hurt performance of the kCRC32c checksum, which is no longer the default, on some "portable" builds. See PR #11419 for details.
Bug Fixes
- Delete an empty WAL file on DB open if the log number is less than the min log number to keep
- Delete temp OPTIONS file on DB open if there is a failure to write it out or rename it
Performance Improvements
- Improved the I/O efficiency of prefetching SST metadata by recording more information in the DB manifest. Opening files written with previous versions will still rely on heuristics for how much to prefetch (#11406).
RocksDB 8.1.1
8.1.1 (2023-04-06)
Bug Fixes
- In the DB::VerifyFileChecksums API, ensure that file system reads of SST files are equal to the readahead_size in ReadOptions, if specified. Previously, each read was 2x the readahead_size.
8.1.0 (2023-03-18)
Behavior changes
- Compaction output file cutting logic now considers range tombstone start keys. For example, SST partitioner now may receive PartitionerRequest for range tombstone start keys.
- If the async_io ReadOption is specified for MultiGet or NewIterator on a platform that doesn't support IO uring, the option is ignored and synchronous IO is used.
Bug Fixes
- Fixed an issue for backward iteration when user defined timestamp is enabled in combination with BlobDB.
- Fixed a couple of cases where a Merge operand encountered during iteration wasn't reflected in the internal_merge_count PerfContext counter.
- Fixed a bug in CreateColumnFamilyWithImport()/ExportColumnFamily() which did not support range tombstones (#11252).
- Fixed a bug where an excluded column family from an atomic flush contains unflushed data that should've been included in this atomic flush (i.e, data of seqno less than the max seqno of this atomic flush), leading to potential data loss in this excluded column family when WriteOptions::disableWAL == true (#11148).
New Features
- Add statistics rocksdb.secondary.cache.filter.hits, rocksdb.secondary.cache.index.hits, and rocksdb.secondary.cache.data.hits
- Added a new PerfContext counter internal_merge_point_lookup_count which tracks the number of Merge operands applied while serving point lookup queries.
- Add new statistics rocksdb.table.open.prefetch.tail.read.bytes, rocksdb.table.open.prefetch.tail.{miss|hit}
- Add support for SecondaryCache with HyperClockCache (HyperClockCacheOptions inherits secondary_cache option from ShardedCacheOptions)
- Add new db properties rocksdb.cf-write-stall-stats, rocksdb.db-write-stall-stats and APIs to examine them in a structured way. In particular, users of GetMapProperty() with property kCFWriteStallStats/kDBWriteStallStats can now use the functions in WriteStallStatsMapKeys to find stats in the map.
Public API Changes
- Changed various functions and features in Cache that are mostly relevant to custom implementations or wrappers. Especially, asynchronous lookup functionality is moved from Lookup() to a new StartAsyncLookup() function.
RocksDB 7.10.2
7.10.2 (2023-02-10)
Bug Fixes
- Fixed a bug in DB open/recovery from a compressed WAL that was caused due to incorrect handling of certain record fragments with the same offset within a WAL block.
7.10.1 (2023-02-01)
Bug Fixes
- Fixed a data race on ColumnFamilyData::flush_reason caused by concurrent flushes.
- Fixed DisableManualCompaction() and CompactRangeOptions::canceled to cancel compactions even when they are waiting on conflicting compactions to finish.
- Fixed a bug in which a successful GetMergeOperands() could transiently return Status::MergeInProgress().
- Return the correct error (Status::NotSupported()) to MultiGet caller when ReadOptions::async_io flag is true and IO uring is not enabled. Previously, Status::Corruption() was being returned when the actual failure was lack of async IO support.
7.10.0 (2023-01-23)
Behavior changes
- Make best-efforts recovery verify SST unique ID before Version construction (#10962)
- Introduce epoch_number and sort L0 files by epoch_number instead of largest_seqno. epoch_number represents the order of a file being flushed or ingested/imported. Compaction output file will be assigned with the minimum epoch_number among input files'. For L0, larger epoch_number indicates newer L0 file.
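The two epoch_number rules above (larger means newer in L0, and a compaction output takes the minimum of its inputs) can be sketched in a few lines. FileMeta, SortL0NewestFirst and OutputEpoch are hypothetical names for illustration only and do not correspond to RocksDB's internal structures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical file metadata for this sketch; not RocksDB's FileMetaData.
struct FileMeta {
  std::uint64_t file_number;
  std::uint64_t epoch_number;
};

// Orders L0 files newest first: a larger epoch_number means a newer file.
void SortL0NewestFirst(std::vector<FileMeta>& files) {
  std::sort(files.begin(), files.end(),
            [](const FileMeta& a, const FileMeta& b) {
              return a.epoch_number > b.epoch_number;
            });
}

// A compaction output inherits the minimum epoch_number among its inputs.
std::uint64_t OutputEpoch(const std::vector<FileMeta>& inputs) {
  assert(!inputs.empty());
  return std::min_element(inputs.begin(), inputs.end(),
                          [](const FileMeta& a, const FileMeta& b) {
                            return a.epoch_number < b.epoch_number;
                          })
      ->epoch_number;
}
```

Taking the minimum means an output file never appears newer than the oldest data that went into it, which keeps the L0 ordering consistent after compaction.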
Bug Fixes
- Fixed a regression in iterator where range tombstones after iterate_upper_bound are processed.
- Fixed a memory leak in MultiGet with async_io read option, caused by IO errors during table file open.
- Fixed a bug that multi-level FIFO compaction deletes one file in non-L0 even when CompactionOptionsFIFO::max_table_files_size is not exceeded since #10348 or 7.8.0.
- Fixed a bug caused by DB::SyncWAL() affecting track_and_verify_wals_in_manifest. Without the fix, application may see "open error: Corruption: Missing WAL with log number" while trying to open the db. The corruption is a false alarm but prevents DB open (#10892).
- Fixed a BackupEngine bug in which RestoreDBFromLatestBackup would fail if the latest backup was deleted and there is another valid backup available.
- Fix L0 file misorder corruption caused by ingesting files of overlapping seqnos with memtable entries' through introducing epoch_number. Before the fix, force_consistency_checks=true may catch the corruption before it's exposed to readers, in which case writes returning Status::Corruption would be expected. Also replace the previous incomplete fix (#5958) to the same corruption with this new and more complete fix.
- Fixed a bug in LockWAL() leading to re-locking mutex (#11020).
- Fixed a heap use after free bug in async scan prefetching when the scan thread and another thread try to read and load the same seek block into cache.
- Fixed a heap use after free in async scan prefetching if dictionary compression is enabled, in which case sync read of the compression dictionary gets mixed with async prefetching
- Fixed a data race bug where CompactRange() with change_level=true acts on a range overlapping with an ongoing file ingestion for level compaction. This will either result in overlapping file ranges corruption at a certain level caught by force_consistency_checks=true or potentially two same keys both with seqno 0 in two different levels (i.e, new data ends up in lower/older level). The latter will be caught by assertion in debug build but go silently and result in read returning wrong result in release build. This fix is general so it also replaced previous fixes to a similar problem for CompactFiles() (#4665), general CompactRange() and auto compaction (commit 5c64fb6 and 87dfc1d).
- Fixed a bug in compaction output cutting where small output files were produced because TTL file cutting states were not being updated (#11075).
New Features
- When an SstPartitionerFactory is configured, CompactRange() now automatically selects for compaction any files overlapping a partition boundary that is in the compaction range, even if no actual entries are in the requested compaction range. With this feature, manual compaction can be used to (re-)establish SST partition points when SstPartitioner changes, without a full compaction.
- Add BackupEngine feature to exclude files from backup that are known to be backed up elsewhere, using CreateBackupOptions::exclude_files_callback. To restore the DB, the excluded files must be provided in alternative backup directories using RestoreOptions::alternate_dirs.
Public API Changes
- Substantial changes have been made to the Cache class to support internal development goals. Direct use of Cache class members is discouraged and further breaking modifications are expected in the future. SecondaryCache has some related changes and implementations will need to be updated. (Unlike Cache, SecondaryCache is still intended to support user implementations, and disruptive changes will be avoided.) (#10975)
- Add MergeOperationOutput::op_failure_scope for merge operator users to control the blast radius of merge operator failures. Existing merge operator users do not need to make any change to preserve the old behavior.
Performance Improvements
RocksDB 8.0.0
8.0.0 (02/19/2023)
Behavior Changes
- ReadOptions::verify_checksums=false disables checksum verification for more reads of non-CacheEntryRole::kDataBlock blocks.
- In case of scan with async_io enabled, if posix doesn't support IOUring, Status::NotSupported error will be returned to the users. Previously that error was swallowed and reads were switched to synchronous reads.
Bug Fixes
- Fixed a data race on ColumnFamilyData::flush_reason caused by concurrent flushes.
- Fixed an issue in Get and MultiGet when user-defined timestamps is enabled in combination with BlobDB.
- Fixed some atypical behaviors for LockWAL() such as allowing concurrent/recursive use and not expecting UnlockWAL() after non-OK result. See API comments.
- Fixed a feature interaction bug where for blobs GetEntity would expose the blob reference instead of the blob value.
- Fixed DisableManualCompaction() and CompactRangeOptions::canceled to cancel compactions even when they are waiting on conflicting compactions to finish.
- Fixed a bug in which a successful GetMergeOperands() could transiently return Status::MergeInProgress().
- Return the correct error (Status::NotSupported()) to MultiGet caller when ReadOptions::async_io flag is true and IO uring is not enabled. Previously, Status::Corruption() was being returned when the actual failure was lack of async IO support.
- Fixed a bug in DB open/recovery from a compressed WAL that was caused by incorrect handling of certain record fragments with the same offset within a WAL block.
Feature Removal
- Remove RocksDB Lite.
- The feature block_cache_compressed is removed. Statistics related to it are removed too.
- Remove deprecated Env::LoadEnv(). Use Env::CreateFromString() instead.
- Remove deprecated FileSystem::Load(). Use FileSystem::CreateFromString() instead.
- Removed the deprecated version of these utility functions and the corresponding Java bindings: LoadOptionsFromFile, LoadLatestOptions, CheckOptionsCompatibility.
- Remove the FactoryFunc from the LoadObject method from the Customizable helper methods.
Public API Changes
- Moved rarely-needed Cache class definition to new advanced_cache.h, and added a CacheWrapper class to advanced_cache.h. Minor changes to SimCache API definitions.
- Completely removed the following deprecated/obsolete statistics: the tickers BLOCK_CACHE_INDEX_BYTES_EVICT, BLOCK_CACHE_FILTER_BYTES_EVICT, BLOOM_FILTER_MICROS, NO_FILE_CLOSES, STALL_L0_SLOWDOWN_MICROS, STALL_MEMTABLE_COMPACTION_MICROS, STALL_L0_NUM_FILES_MICROS, RATE_LIMIT_DELAY_MILLIS, NO_ITERATORS, NUMBER_FILTERED_DELETES, WRITE_TIMEDOUT, BLOB_DB_GC_NUM_KEYS_OVERWRITTEN, BLOB_DB_GC_NUM_KEYS_EXPIRED, BLOB_DB_GC_BYTES_OVERWRITTEN, BLOB_DB_GC_BYTES_EXPIRED, BLOCK_CACHE_COMPRESSION_DICT_BYTES_EVICT as well as the histograms STALL_L0_SLOWDOWN_COUNT, STALL_MEMTABLE_COMPACTION_COUNT, STALL_L0_NUM_FILES_COUNT, HARD_RATE_LIMIT_DELAY_COUNT, SOFT_RATE_LIMIT_DELAY_COUNT, BLOB_DB_GC_MICROS, and NUM_DATA_BLOCKS_READ_PER_LEVEL. Note that as a result, the C++ enum values of the still supported statistics have changed. Developers are advised to not rely on the actual numeric values.
- Deprecated IngestExternalFileOptions::write_global_seqno and changed its default to false. This option only needs to be set to true to generate a DB compatible with RocksDB versions before 5.16.0.
- Remove deprecated APIs GetColumnFamilyOptionsFrom{Map|String}(const ColumnFamilyOptions&, ..), GetDBOptionsFrom{Map|String}(const DBOptions&, ..), GetBlockBasedTableOptionsFrom{Map|String}(const BlockBasedTableOptions& table_options, ..) and GetPlainTableOptionsFrom{Map|String}(const PlainTableOptions& table_options,..).
- Added a subcode of Status::Corruption, Status::SubCode::kMergeOperatorFailed, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions.
Build Changes
- The make build now builds a shared library by default instead of a static library. Use LIB_MODE=static to override.
New Features
- Compaction filters are now supported for wide-column entities by means of the FilterV3 API. See the comment of the API for more details.
- Added do_not_compress_roles to CompressedSecondaryCacheOptions to disable compression on certain kinds of block. Filter blocks are now not compressed by CompressedSecondaryCache by default.
- Added a new MultiGetEntity API that enables batched wide-column point lookups. See the API comments for more details.