issue=1258, t-cache support block-level cache evict #1266

Open: wants to merge 19 commits into base: master


caijieming-ng
Collaborator

No description provided.

@baidubot
Collaborator

Reviewers: @00k @bluebore @yvxiang

@caijieming-ng force-pushed the show_bug branch 3 times, most recently from 72a31b8 to 140eabf (May 18, 2017 18:51)
@caijieming-ng
Collaborator Author

build

for (uint32_t i = 0; i < c_valid.size(); ++i) {
MutexLock lockgard(&cache_->mu_);
CacheBlock* block = c_valid[i];
block->cv.Wait();
Collaborator

This wait should be inside a while loop too, right?
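
The point of this comment is that a condition-variable wait should re-check its condition in a loop, since a waiter can wake spuriously or find the state changed again before it reacquires the mutex. A minimal sketch of that pattern, using standard C++ primitives rather than the project's `port::CondVar`; `Block`, `locked`, `WaitUntilUnlocked` and `Unlock` are hypothetical names, not code from this PR:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a cache block guarded by a mutex.
struct Block {
    std::mutex mu;
    std::condition_variable cv;
    bool locked = true;  // true while another thread owns the block
};

// Wait in a loop: the condition is re-checked after every wakeup,
// so a spurious wakeup simply goes back to sleep.
void WaitUntilUnlocked(Block* b) {
    std::unique_lock<std::mutex> lk(b->mu);
    while (b->locked) {
        b->cv.wait(lk);
    }
}

// Clear the flag under the mutex, then wake every waiter.
void Unlock(Block* b) {
    {
        std::lock_guard<std::mutex> lk(b->mu);
        b->locked = false;
    }
    b->cv.notify_all();
}
```

With a plain `if` (or a bare `Wait()`), a waiter that wakes early would proceed while the block is still locked.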

@@ -18,6 +18,7 @@
#include "leveldb/comparator.h"
#include "leveldb/env_dfs.h"
#include "leveldb/env_flash.h"
#include "leveldb/block_cache.h"
Collaborator

The include order looks wrong at a glance.

Collaborator Author

Fixed.

@@ -31,6 +32,7 @@ DECLARE_string(tera_leveldb_env_hdfs2_nameservice_list);
DECLARE_string(tera_tabletnode_path_prefix);
DECLARE_string(tera_dfs_so_path);
DECLARE_string(tera_dfs_conf);
DECLARE_int32(tera_leveldb_block_cache_env_num_thread);
Collaborator

thread_num?

Collaborator Author

Fixed.

@@ -66,6 +68,21 @@ leveldb::Env* LeveldbBaseEnv() {
}
}

// Tcache: default env
static pthread_once_t block_cache_once = PTHREAD_ONCE_INIT;
static leveldb::Env* default_block_cache_env;
Collaborator

Why not make this a function-local static variable, as is done for mem and flash?

Collaborator Author

It is static.

// found in the LICENSE file.

#ifndef STOREAGE_LEVELDB_UTIL_BLOCK_CACHE_H
#define STOREAGE_LEVELDB_UTIL_BLOCK_CACHE_H
Collaborator

STOREAGE_LEVELDB_UTIL_BLOCK_CACHE_H_?

Collaborator Author

Fixed.

// compitable with legacy FlashEnv
leveldb::FlashEnv* flash_env = (leveldb::FlashEnv*)io::LeveldbFlashEnv();
flash_env->SetFlashPath(FLAGS_tera_tabletnode_cache_paths,
FLAGS_tera_io_cache_path_vanish_allowed);
Collaborator

The alignment looks off?

Collaborator Author

Fixed.

return reinterpret_cast<LRUHandle*>(handle)->value;
}

uint64_t NewId() {
Collaborator

What is this used for?

Collaborator Author

This interface is not used in the block cache.

@@ -29,9 +29,36 @@ namespace leveldb {

class Cache;

// An entry is a variable length heap-allocated structure. Entries
// are kept in a circular doubly linked list ordered by access time.
struct LRUHandle {
Collaborator

What is the purpose of moving this out?

Collaborator Author

So that cache_id can be filled in.

Env* NewBlockCacheEnv(Env* base);

} // leveldb
#endif
Collaborator

STOREAGE_LEVELDB_UTIL_BLOCK_CACHE_H_

Collaborator Author

Fixed.

// Dummy head of LRU list.
// lru.prev is newest entry, lru.next is oldest entry.
//LRUHandle hot_lru_;
//LRUHandle cold_lru_;
Collaborator

This feels a lot like the kernel's page cache :)

Collaborator Author

Yes.

@@ -286,6 +261,170 @@ size_t LRUCache::TotalCharge() {
return usage_;
}

class LRU2QCache: public Cache {
Collaborator

I don't quite see how this differs from the native LRUCache. Apart from how it finds a free slot, is there any other difference?

Collaborator Author

The native LRU is allowed to exceed its capacity. This LRU tracks blocks persisted on SSD, so it must never exceed its capacity.
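
To illustrate the distinction the author draws, here is a minimal sketch of an LRU whose entries each own one fixed slot, so the entry count can never exceed the slot count. It is not the PR's `LRU2QCache`; `FixedSlotLru`, `Acquire` and `Size` are hypothetical names, and pinning or reference counting is left out:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical fixed-capacity LRU: every entry owns one on-disk slot.
class FixedSlotLru {
public:
    explicit FixedSlotLru(uint64_t num_slots) : num_slots_(num_slots) {
        assert(num_slots_ > 0);
    }

    // Returns the slot for `key`. On a miss, takes a never-used slot if one
    // remains; otherwise evicts the least recently used key and reuses its slot.
    uint64_t Acquire(const std::string& key) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            lru_.splice(lru_.end(), lru_, it->second);  // mark as newest
            return it->second->slot;
        }
        uint64_t slot;
        if (lru_.size() < num_slots_) {
            slot = lru_.size();  // slots are handed out in order 0, 1, 2, ...
        } else {
            slot = lru_.front().slot;
            index_.erase(lru_.front().key);
            lru_.pop_front();
        }
        lru_.push_back(Entry{key, slot});
        index_[key] = std::prev(lru_.end());
        return slot;
    }

    size_t Size() const { return lru_.size(); }

private:
    struct Entry {
        std::string key;
        uint64_t slot;
    };
    uint64_t num_slots_;
    std::list<Entry> lru_;  // front = oldest, back = newest
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

An in-memory LRU can insert first and trim later; here the slot must be reclaimed before the new entry exists, because the slot is the storage.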

// Tcache
/////////////////////////////////////////////
uint64_t kBlockSize = 4096UL;
uint64_t kDataSetSize = 134217728UL;
Collaborator

Write this as 128 * 1024 * 1024 or 128 << 20 instead?
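
A small sketch of what this suggestion could look like. Declaring the values `constexpr` is an assumption made here for illustration; the diff declares them as plain `uint64_t` variables:

```cpp
#include <cstdint>

// Shift form makes the magnitude obvious at a glance.
constexpr uint64_t kBlockSize = 4ULL << 10;      // 4 KiB
constexpr uint64_t kDataSetSize = 128ULL << 20;  // 128 MiB

// Compile-time checks that the values match the literals in the diff.
static_assert(kBlockSize == 4096ULL, "same value as 4096UL");
static_assert(kDataSetSize == 134217728ULL, "same value as 134217728UL");
static_assert(128ULL * 1024 * 1024 == kDataSetSize, "multiplication form matches");
```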


Status NewRandomAccessFile(const std::string& fname,
RandomAccessFile** result); // cache Pread
static void BlockDeleter(const Slice& key, void* v);
Collaborator

One blank line fewer than in the declarations above?

Collaborator Author

Fixed.

BlockCacheImpl* cache = new BlockCacheImpl(options);
Status s = cache->LoadCache();
assert(s.ok());
cache_vec_.push_back(cache); // no need lock
Collaborator

Was there no plan to make the Load call above multi-threaded?

Collaborator Author

This loads the cache when the ts starts (no overhead). It is not the loading done when a tablet is loaded.

MutexLock l(&mu_);
if (tmp_storage_ == NULL) {
tmp_storage_ = new std::string();
tmp_storage_->resize(0);
Collaborator

A new string already has size 0 on its own, doesn't it?

Collaborator Author

Fixed.

tmp_storage_->resize(0);
block_list_.push_back(tmp_storage_);
}
uint32_t begin = offset_ / block_size_;
Collaborator

Is it worth moving the rest of this outside the lock?

Slice buf(data.data() + tmp_size, data.size() - tmp_size);
for (uint32_t i = begin + 1; i <= end; ++i) {
tmp_storage_ = new std::string();
tmp_storage_->resize(0);
Collaborator

Same as above.

tmp_storage_ = new std::string();
tmp_storage_->resize(0);
block_list_.push_back(tmp_storage_);
if (i < end) { // last block
Collaborator

The comment appears to be wrong.

for (uint32_t i = begin + 1; i <= end; ++i) {
tmp_storage_ = new std::string();
tmp_storage_->resize(0);
block_list_.push_back(tmp_storage_);
Collaborator

Is it worth releasing the lock and handing the whole batch to a background thread to flush in one go?

Log("[%s] begin release %s\n", cache_->WorkPath().c_str(), fname_.c_str());
MutexLock lockgard(&cache_->mu_);
uint64_t block_idx;
std::string* block_data = write_buffer_.PopBackBlock(&block_idx);
Collaborator

Why take the block from the back?

Collaborator Author

Because the earlier blocks are flushed asynchronously by BGFlush, at block-size granularity. BGFlush does not handle data that falls short of a full block. Only at close do we know the file has really ended, so the last partial block is flushed at that point.
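
The flow described in this reply can be sketched roughly as follows. This is not the PR's write buffer: `WriteBuffer`, `PopFullBlock` and `PopLastBlock` are hypothetical names standing in for the background-flush path and the close path, and locking is omitted:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical write buffer that splits appended bytes into fixed-size blocks.
class WriteBuffer {
public:
    explicit WriteBuffer(size_t block_size) : block_size_(block_size) {
        assert(block_size_ > 0);
    }

    // Fill the tail block first, then open new blocks as needed.
    void Append(const std::string& data) {
        size_t pos = 0;
        while (pos < data.size()) {
            if (blocks_.empty() || blocks_.back().size() == block_size_) {
                blocks_.emplace_back();
            }
            std::string& tail = blocks_.back();
            size_t n = std::min(block_size_ - tail.size(), data.size() - pos);
            tail.append(data, pos, n);
            pos += n;
        }
    }

    // Background flush path: hands out the oldest block only if it is full.
    bool PopFullBlock(std::string* out) {
        if (blocks_.empty() || blocks_.front().size() < block_size_) {
            return false;
        }
        *out = std::move(blocks_.front());
        blocks_.pop_front();
        return true;
    }

    // Close path: the file is known to be complete, so the trailing
    // block can be flushed even if it is only partially filled.
    bool PopLastBlock(std::string* out) {
        if (blocks_.empty()) {
            return false;
        }
        *out = std::move(blocks_.back());
        blocks_.pop_back();
        return true;
    }

private:
    size_t block_size_;
    std::deque<std::string> blocks_;
};
```

With a block size of 4, appending 10 bytes leaves two full blocks for the background path and a 2-byte tail that only the close path will take.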

dfs_file_ = NULL;
}

Log("[%s] begin release %s\n", cache_->WorkPath().c_str(), fname_.c_str());
Collaborator

Could this just call close directly? This looks duplicated.

Collaborator Author

Fixed.

public:
BlockCacheWritableFile(BlockCacheImpl* c, const std::string& fname, Status* s)
: cache_(c),
bg_cv_(&c->mu_),
Collaborator

Could this lock be kept unexposed? BlockCacheImpl could provide a blocking interface instead.

cache_->options_.block_size,
s->ToString().c_str());

MutexLock lockgard(&cache_->mu_);
Collaborator

Same as above; this feels a bit odd.

mu_.AssertHeld();
uint64_t fid = 0;
std::string key = "FNAME#" + fname;
mu_.Unlock();
Collaborator

This feels slightly odd: why is the lock suddenly released here, when the caller deliberately takes it before the call?


Waiter* w = NULL;
LockKeyMap::iterator it = lock_key_.find(key);
if (it != lock_key_.end()){
Collaborator

Looks like a space is missing.

Collaborator Author

Fixed.

}
write_buffer_.Append(data);

MutexLock lockgard(&cache_->mu_);
Collaborator

Why does this need to take the lock?

Log("[%s] Begin BGFlush: %s\n", cache_->WorkPath().c_str(), fname_.c_str());
MutexLock lockgard(&cache_->mu_);
uint64_t block_idx;
std::string* block_data = write_buffer_.PopFrontBlock(&block_idx);
Collaborator

Many tasks were queued earlier. Is the reason for not draining them all here in one while loop mainly concurrency?

Collaborator Author

Yes.

port::CondVar cv(&cache_->mu_);
cv.Wait(10); // timewait 10ms retry
}
block->state = 0;
Collaborator

Make 0 one of the enumerated states defined above as well?

Collaborator Author (@caijieming-ng, Aug 21, 2017)

This can be changed to an assertion; at this point the block must be an invalid block.

s->ToString().c_str());

MutexLock lockgard(&cache_->mu_);
fid_ = cache_->FileId(fname_);
Collaborator

As before, the locking feels slightly odd.

s = block->s; // degrade read
}
block->Clear(kCacheBlockLocked);
block->cv.SignalAll();
Collaborator

Who needs to be woken up here?

Collaborator Author

It wakes other threads that are waiting on this block in their c_lock list.


if (!block->Test(kCacheBlockLocked) &&
block->Test(kCacheBlockValid)) {
block->Set(kCacheBlockLocked | kCacheBlockCacheRead);
Collaborator

Haha, this feels like kernel-style code :)

block->WaitOnClear(kCacheBlockDfsRead);
block->Set(kCacheBlockCacheFill);
if (!block->s.ok() && s.ok()) {
s = block->s; // degrade read
Collaborator

If this fails, can the code skip straight to dfs->Read afterwards?

Collaborator Author (@caijieming-ng, Aug 21, 2017)

Yes, it falls back to degraded read mode.

new_fid_ = prev_fid_ + options_.fid_batch_num;
Log("[block_cache %s]: reuse block cache: prev_fid: %lu, new_fid: %lu\n",
dbname.c_str(), prev_fid_, new_fid_);
s = Status::OK();
Collaborator

Forcing the status to OK?

uint64_t BlockCacheImpl::AllocFileId() { // no more than fid_batch_num
mu_.AssertHeld();
uint64_t fid = ++new_fid_;
while (new_fid_ - prev_fid_ >= options_.fid_batch_num) {
Collaborator

What is the purpose of the while here?

Collaborator Author

It guards against a failed leveldb write.

PutFixed64(&hkey, block->block_idx);
block->sid = lc.sid;
block->cache_block_idx = DecodeFixed64(lkey.data());
block->state = (block->Test(kCacheBlockValid)) ? kCacheBlockValid : 0;
Collaborator

Express 0 some other way?

lc.KeyToString().c_str(),
lc.ValToString().c_str(),
s.ToString().c_str());
} else if (lc.type == kDataSetKey) {
Collaborator

This branch is really a load operation. Would it be better placed elsewhere?

Collaborator Author

This function mainly maintains a lock table that is used for concurrency control over all objects inside BlockCacheImpl. For example, leveldb read-modify-write operations and concurrent loads of a data set file all go through the same concurrency control logic.
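
The idea of one lock table serving every kind of key can be sketched as below. This is a simplification, not the PR's implementation: the diff shows a `lock_key_` map with a `Waiter` per key, whereas this sketch uses one shared condition variable, and `KeyLockTable`, `Lock`, `Unlock` and `IsHeld` are hypothetical names:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <set>
#include <string>

// Hypothetical per-key lock table: one mutex protects the table, and a
// caller blocks until the key it wants is no longer held.
class KeyLockTable {
public:
    void Lock(const std::string& key) {
        std::unique_lock<std::mutex> lk(mu_);
        while (held_.count(key) > 0) {
            cv_.wait(lk);  // re-check after every wakeup
        }
        held_.insert(key);
    }

    void Unlock(const std::string& key) {
        {
            std::lock_guard<std::mutex> lk(mu_);
            held_.erase(key);
        }
        cv_.notify_all();  // waiters for other keys just go back to sleep
    }

    bool IsHeld(const std::string& key) {
        std::lock_guard<std::mutex> lk(mu_);
        return held_.count(key) > 0;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::set<std::string> held_;
};
```

Because the table is keyed by string, file-id lookups (keys such as `"FNAME#" + fname` in the diff) and data set loads can share one mechanism without blocking each other unless they touch the same key.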

bugfix:
1. cache reload core
2. support aio engine
3. cache fill TEST PASS
@caijieming-ng changed the title from "issue=1258, Tcache support block-level cache evict" to "issue=1258, t-cache support block-level cache evict" (Aug 21, 2017)
@@ -0,0 +1,101 @@
// Copyright (c) 2015, Baidu.com, Inc. All Rights Reserved
Collaborator

This looks awkward: since it is a new file, why does it say 2015?

Collaborator Author

Fixed.

namespace leveldb {

/**
* Keep adding ticker's here.
Collaborator

Was this code copied from somewhere else? Is its quality assured, and could it cause copyright problems?

Collaborator

Haha, looks like we've been found out.

Collaborator Author (@caijieming-ng, Aug 22, 2017)

I removed the statistics items that are not needed. It is modeled on rocksdb's approach to performance statistics :-(

Collaborator Author

This approach is more effective than logging for tuning performance problems. Writing one log line with glog or leveldb takes about 10 microseconds, which is expensive.
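
A minimal sketch of the counter-based approach the author describes, in which hot paths bump an atomic counter instead of writing a log line. The ticker names and the `Statistics` class are hypothetical; the PR's actual ticker list is not shown in this review:

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical ticker names, for illustration only.
enum Ticker : uint32_t {
    kCacheHit = 0,
    kCacheMiss,
    kTickerMax  // number of tickers, keep last
};

class Statistics {
public:
    Statistics() {
        for (auto& t : tickers_) {
            t.store(0, std::memory_order_relaxed);
        }
    }

    // A relaxed atomic add; no formatting or I/O on the hot path.
    void Record(Ticker t, uint64_t count = 1) {
        tickers_[t].fetch_add(count, std::memory_order_relaxed);
    }

    uint64_t Get(Ticker t) const {
        return tickers_[t].load(std::memory_order_relaxed);
    }

private:
    std::array<std::atomic<uint64_t>, kTickerMax> tickers_;
};
```

The counters can then be dumped periodically or on demand, so the per-operation cost stays at a single atomic add.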

@@ -25,31 +25,6 @@ namespace {

// LRU cache implementation

// An entry is a variable length heap-allocated structure. Entries
Collaborator

Why was this deleted? Was it moved elsewhere?

@taocp (Collaborator) commented Sep 7, 2017

Your dedication and persistence are moving! Keep it up~

@fxsjy (Collaborator) commented Sep 7, 2017

Agreed.

7 participants