From 7b41b51a705ec0eb5f88060c9f724c8bc0e79eab Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Wed, 27 Mar 2013 12:24:17 -0700 Subject: bcache: Documentation updates Signed-off-by: Kent Overstreet --- Documentation/bcache.txt | 88 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 88 insertions(+) (limited to 'Documentation/bcache.txt') diff --git a/Documentation/bcache.txt b/Documentation/bcache.txt index 533307d52c87..77db8809bd96 100644 --- a/Documentation/bcache.txt +++ b/Documentation/bcache.txt @@ -101,6 +101,94 @@ but all the cached data will be invalidated. If there was dirty data in the cache, don't expect the filesystem to be recoverable - you will have massive filesystem corruption, though ext4's fsck does work miracles. +ERROR HANDLING: + +Bcache tries to transparently handle IO errors to/from the cache device without +affecting normal operation; if it sees too many errors (the threshold is +configurable, and defaults to 0) it shuts down the cache device and switches all +the backing devices to passthrough mode. + + - For reads from the cache, if they error we just retry the read from the + backing device. + + - For writethrough writes, if the write to the cache errors we just switch to + invalidating the data at that lba in the cache (i.e. the same thing we do for + a write that bypasses the cache) + + - For writeback writes, we currently pass that error back up to the + filesystem/userspace. This could be improved - we could retry it as a write + that skips the cache so we don't have to error the write. + + - When we detach, we first try to flush any dirty data (if we were running in + writeback mode). It currently doesn't do anything intelligent if it fails to + read some of the dirty data, though. + +TROUBLESHOOTING PERFORMANCE: + +Bcache has a bunch of config options and tunables. The defaults are intended to +be reasonable for typical desktop and server workloads, but they're not what you +want for getting the best possible numbers when benchmarking. + + - Bad write performance + + If write performance is not what you expected, you probably wanted to be + running in writeback mode, which isn't the default (not due to a lack of + maturity, but simply because in writeback mode you'll lose data if something + happens to your SSD) + + # echo writeback > /sys/block/bcache0/cache_mode + + - Bad performance, or traffic not going to the SSD that you'd expect + + By default, bcache doesn't cache everything. It tries to skip sequential IO - + because you really want to be caching the random IO, and if you copy a 10 + gigabyte file you probably don't want that pushing 10 gigabytes of randomly + accessed data out of your cache. + + But if you want to benchmark reads from cache, and you start out with fio + writing an 8 gigabyte test file - so you want to disable that. + + # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff + + To set it back to the default (4 mb), do + + # echo 4M > /sys/block/bcache0/bcache/sequential_cutoff + + - Traffic's still going to the spindle/still getting cache misses + + In the real world, SSDs don't always keep up with disks - particularly with + slower SSDs, many disks being cached by one SSD, or mostly sequential IO. So + you want to avoid being bottlenecked by the SSD and having it slow everything + down. + + To avoid that bcache tracks latency to the cache device, and gradually + throttles traffic if the latency exceeds a threshold (it does this by + cranking down the sequential bypass). + + You can disable this if you need to by setting the thresholds to 0: + + # echo 0 > /sys/fs/bcache//congested_read_threshold_us + # echo 0 > /sys/fs/bcache//congested_write_threshold_us + + The default is 2000 us (2 milliseconds) for reads, and 20000 for writes. + + - Still getting cache misses, of the same data + + One last issue that sometimes trips people up is actually an old bug, due to + the way cache coherency is handled for cache misses. If a btree node is full, + a cache miss won't be able to insert a key for the new data and the data + won't be written to the cache. + + In practice this isn't an issue because as soon as a write comes along it'll + cause the btree node to be split, and you need almost no write traffic for + this to not show up enough to be noticable (especially since bcache's btree + nodes are huge and index large regions of the device). But when you're + benchmarking, if you're trying to warm the cache by reading a bunch of data + and there's no other traffic - that can be a problem. + + Solution: warm the cache by doing writes, or use the testing branch (there's + a fix for the issue there). + SYSFS - BACKING DEVICE: attach -- cgit v1.2.3