[[{linux,storage]]
[[{storage.block_layer,]]
# LINUX STORAGE BLOCK LAYER
## Monitoring Disk I/O performance [[{storage.profiling,monitoring.storage.i/o,]]
```
| # iotop --only # <· show only disk I/O
|
| # iostat -dxm # <· stats for devices and partitions.
| Monitors time devices and partitions are
| active in relation to their average transfer rates.
| x : Show (more) detailed stats
| d : output only device report
| m : Use MB in stat. output
|
| # vmstat -d 1 5 # <· Virt. Memory stats.
| d : filter by disk statistics
| 1 : 1 second interval
| 5 : repeat 5 times and exit
|
| # atop | grep DSK     # <· report every 10 secs process
|                            activity (including finished ones)
|
| # dstat --disk --io \
| -D sda # <· Extra counters when compared to others
|
| # ioping /dev/sda1 -c4# <· monitor I/O speed and latency in real time
| "how long it takes a disk to respond to requests"
```
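The raw counters behind all of these tools live in /proc/diskstats. A minimal sketch of the throughput arithmetic (the sector counts below are illustrative, not real readings):

```shell
# Throughput = delta of the sector counters over a sampling interval.
# In /proc/diskstats, field 6 = sectors read, field 10 = sectors written,
# and one sector = 512 bytes.
rate_kbs() {   # args: sectors_before sectors_after interval_seconds
  echo $(( ($2 - $1) * 512 / 1024 / $3 ))
}
# Two hypothetical readings of the "sectors read" counter, one second apart:
rate_kbs 1000 3048 1    # prints 1024  (KB/s)
```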
[[}]]
## Block Storage [[{storage.block_management,linux.101]]
* REF: <https://opensource.com/article/18/11/partition-format-drive-linux>
Linux (UNIX) sees storage devices as block devices:
- reading and writing data is done in fixed-size blocks (usually 4096 bytes or more).
- RAM is used to cache disk data automatically to avoid slow but
  frequent accesses.
- block read/write is done at random places (vs serialized/ordered access).
  Moving to random places is still slower (except for SSD disks).
```
$ lsblk # <·· list attached block devices:
| Example Output:
| NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
| sda 8:0 0 447,2G 0 disk
| └·sda1 8:1 0 223,6G 0 part /
  └·sda2    8:2    0 223,6G  0 part  /var/backups
| sdb 8:16 0 931,5G 0 disk
| └·sdb1 8:17 0 900G 0 part
| └·md1 9:1 0 899,9G 0 raid1 /home
| sdc 8:32 0 931,5G 0 disk
| └·md1 9:1 0 899,9G 0 raid1 /home
| ...
| ^^^^^^^ ^^^^^^^^^^^^^^
| Device name assigned Visible File-system path
| by Linux kernel. (mount─point) to apps
| Only system.admin will and users.
| care about it, usually
| through a virtual path on
| the file-system like /dev/sda1,...
|
| Line 02-04: Disks usually are "big". Partitions are used to make better
| use of them. In this case disk "sda" is split into partitions sda1 and sda2
```
* Access to Block devices is done indirectly through the File-system.
```
| │Process│N<·>1│Userspace │ 1<·····>1│Linux │1<··>N│Linux │N<·> M │Block │
| │ │ │Filesystem│ │Kernel│ │File System│ │Device │
| ^ ^ ^ │implementa.│ ^
| | │ · ^ Disk, RAID
| - Shell, - Isolate Apps · · SCSI, Device
| - explorer from the internals · · Mapper, IP...
| - DDBB of blocks device. · ext4 (standard)
| - ... - Apps will see files · xfs (high load)
|                     distributed in a tree  ·          f2fs (flash memory)
|                     of parent/children     ·          nfs  (remote network fs)
| directories.
| - i-nodes Takes care of all complexities
| - symbolic links of FS implementation and concurrent
| (if supported by access to a physical disk by different
| implemen.) apps.
| - FILE BLOCK
| CACHE BUFFERS
| └───┴··················································─┴─────┘
| - vmstat will show realtime cache/buffers
| - Kernel tries to cache as much user-space data as possible
| and just enough buffers for predicted "next-reads" to
| block devices [[{doc_has.keypoint}]]
```
* NOTE: Some advanced applications like databases can directly claim access to the block device,
  skipping kernel control and taking ownership of the device. This block device
  will not be visible to the file-system or accessible to any other
  application. System admins can also skip the standard filesystem and access the
  block device directly through the special /dev/ paths (but this is discouraged 99% of the time).
## Setup Disk
```
$ sudo parted \           ← *STEP 1) Partitioning a disk (optional but recommended)*
    /dev/sdc \            ← Physical disk
    --align opt \         ← let 'parted' find optimal start/end point
    mklabel msdos \       ← creates partition table (==disk label).
                            'msdos' or 'gpt' are very compatible/popular labels
    mkpart primary \      ← create the first partition
    0% 4GB                ← start/end of partition. Can be slightly shifted/tuned
                            to adjust to the underlying disk technology
                            due to the '--align opt'
$ sudo mkfs.ext4 \        ← *STEP 2) Create a filesystem*
    -L PicturesAndVideos \  - ext4 is a popular filesystem. Recommended for desktops and small/medium servers.
    /dev/sdc1               - xfs  is preferred for "big" systems with many apps running concurrently
                            - f2fs is preferred for Flash-based systems.
$ sudo mount \            ← *STEP 3) Mount it. Add also to /etc/fstab to persist on reboots*
    /dev/sdc1 \           ← Partition
    -t auto \             ← auto-detect type
    /opt/Media            ← Visible path to apps/users in file-system
For RAID systems system admins first create a virtual RAID device /dev/md0 composed
of many disks (/dev/sda, /dev/sdb, ...). STEP 1 is then done on the virtual RAID device
```
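The three steps above can be collected into a dry-run sketch that only prints the commands instead of running them (the device, label, and mount point are placeholders; review the output carefully before executing anything against a real disk):

```shell
# Dry-run: print the partition/format/mount commands instead of running them.
DEV=/dev/sdc          # placeholder physical disk
PART=${DEV}1          # first partition that parted will create
MNT=/opt/Media        # placeholder mount point
cat <<EOF
sudo parted --align opt $DEV mklabel gpt mkpart primary ext4 0% 4GB
sudo mkfs.ext4 -L Media $PART
sudo mkdir -p $MNT
sudo mount -t auto $PART $MNT
EOF
```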
## Annotated /etc/fstab
```
| $ cat /etc/fstab | nl -
|→ 1 /dev/mapper/fedora-root / ext4 defaults 1 1
|→ 2 UUID=735acb4c-29bc-4ce7-81d9-83b778f6fc81 /boot ext4 defaults 1 2
|→ 3 LABEL=backups /mnt/backups xfs defaults 1 2
|→ 4 /dev/mapper/fedora-home /home ext4 defaults 1 2
|→ 5 /dev/mapper/fedora-swap swap swap defaults 0 0
|→ 6 UUID=8C208C30-4E8F-4096-ACF9-858959BABBAA /var/.../ xfs defaults,nofail*3 1 2
| └──────────────┬───────────────────┘ └─────┬────┘ └──┬──┘ └───┬───┘ │ │
| ↓ ↓ ↓ ↓ │ │
| As returned by $ lsblk -o +uuid mount point in FS CSV mount options │ │
| partition Universal Unique ID (UUID) or  FS tree hierarchy  type   for the FS type.  │ │
| partition LABEL or PARTUUID (GPT) Different FSs can │ │
| identifies the partition. have different │ │
| UUID *2 are preferred to /dev/sd...          mount opts. plus   │ │
| like /dev/sdb3 since the device name common ones to │ │
| can change for USB or plugable devices all FS *1 │ │
| │ │
| this flag determines the │ │
| FS check order during boot: used by dump(8) to determine ←─────┘ │
| (0 disables the check) which ext2/3 FS need backup │
| - root (/) should be 1 (default to 0) │
| - Other FSs should be 2 │
| ↑ │
| └────────────────────────────────────────────────────────────────────┘
|*1 defaults mount options common to all File System types:
| rw : Read/write vs 'ro' (read only)
| suid : Allow set-user-ID | set-group-ID bits (nosuid: Ignore them)
| dev : Interpret character or block special devices
| exec : Permit execution of binaries
| auto :
| nouser : Do NOT allow ordinary user to mount the filesystem.
| async : Do not force synchronous I/O to file system.
| (WARN: sync may cause life-cycle shortening in flash drives)
|*2 Next command list the UUIDs of partitions:
| $ blkid
| → /dev/sda1: LABEL="storage" UUID="60e97193-e9b2-495f-8db1-651f3a87d455" TYPE="ext4"
| → /dev/sda2: LABEL="oldhome" UUID="e6494a9b-5fb6-4c35-ad4c-86e223040a70" TYPE="ext4"
| → /dev/sdb1: UUID="db691ba8-bb5e-403f-afa3-2d758e06587a" TYPE="xfs" ...
| ^^^^
| tags the filesystem actually (vs the partition itself)
|*3 TODO: Differences between nobootwait and nofail:
| nofail: allows the boot sequence to continue even if the drive fails to mount.
|     On cloud systems it usually allows for ssh access in case of failure
```
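A quick sanity check of these entries before rebooting (a sketch; `findmnt --verify` ships with util-linux, and the sample line is taken from the annotated fstab above):

```shell
# Verify /etc/fstab syntax if findmnt is available (util-linux).
if command -v findmnt >/dev/null; then findmnt --verify || true; fi

# The six whitespace-separated fields can also be picked apart with awk:
line='UUID=735acb4c-29bc-4ce7-81d9-83b778f6fc81 /boot ext4 defaults 1 2'
echo "$line" | awk '{print "source=" $1; print "mountpoint=" $2; print "fstype=" $3}'
```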
[[}]]
[[{storage.device_mapper]]
## Device Mapper
* <https://en.wikipedia.org/wiki/Device_mapper>
- kernel framework mapping virtual block devices
to one (or more) physical block device
- Optionally can process and filter in/out data
### DMSETUP — LOW LEVEL LOGICAL VOLUME MANAGEMENT
* <https://linux.die.net/man/8/dmsetup>
CY&P from <https://wiki.gentoo.org/wiki/Device-mapper>
> """
> Normally, users rarely use dmsetup directly. dmsetup is very low level.
> LVM, mdtool or cryptsetup is generally the preferred way to do it,
> as it takes care of saving the metadata and issuing the dmsetup commands.
> However, sometimes it is desirable to deal with dmsetup directly:
> sometimes for recovery purposes, or to use a target that hasn't yet been ported
> to LVM.
> """
* FEATURES:
- The device mapper touches various layers of the Linux kernel's storage stack.
- Functions provided by the device mapper include linear, striped and error mappings,
as well as crypt and multipath targets.
* EXAMPLES:
- Two disks may be concatenated into one logical volume with a pair of linear
mappings, one for each disk.
- crypt target encrypts the data passing through the specified
device, by using the Linux kernel's Crypto API.
### KERNEL FEATURES AND PROJECTS BUILT ON TOP
Note: user-space apps talk to the device mapper via *libdevmapper.so*
      which in turn issues ioctls to the /dev/mapper/control device node.
```
| - cryptsetup : utility to setup disk encryption based on dm-crypt
| - dm-crypt/LUKS : mapping target providing volume encryption
| - dm-cache : mapping target providing creation of hybrid volumes
| - dm-integrity : mapping target providing data integrity, either
| using checksumming or cryptographic verification,
| also used with LUKS
| - dm-log-writes : mapping target that uses two devices, passing through
| the first device and logging the write operations performed
| to it on the second device
| - dm-verity : validates the data blocks contained in a file system
| against a list of cryptographic hash values, developed as
| part of the Chromium OS project
| - dmraid(8) : provides access to "fake" RAID configurations via the
| device mapper
| - DM Multipath : provides I/O failover and load-balancing of block devices
| within the Linux kernel
|
| - allows to configure multiple I/O paths between server nodes
| and storage arrays(separate cables|switches|controllers)
| into a single mapped/logical device.
|
| - Multipathing aggregates the I/O paths, creating a new device
| that consists of the aggregated paths.
|
| - Docker : uses device mapper to create copy-on-write storage for
| software containers
|
| - DRBD : Distributed Replicated Block Device
|
| - kpartx(8) : utility called from hotplug upon device maps creation and
| deletion
| - LVM2 : logical volume manager for the Linux kernel
|
| - Linux version of TrueCrypt
```
```
| DEVICE-MAPPER LOGICAL-TO-TARGET:
|---------------------------------
| *MAPPED DEVICE* │ *MAPPING TABLE* │*TARGET DEVICE*
| (LOGICAL DRIVE) │ │ PLUGIN (INSTANCE/s)
| │ │
| logical device provided by │ entry1: │ - filters
| device-mapper driver. │ mapped-device1 ←→ target-device1 │ - access physical
| It provides an interface to │ └─ start address └┬─ start address │ devices
| operate on. │ └─ sector-length │
| │ entry2: │ Example plugins:
| Ex: │ mapped-device2 ←→ target-device2 │ - mirror for RAID
| - LVM2 logical volumes │ └─ start address └┬─ start address │ - linear for LVM2
| - dm-multipath pseudo-disks │                  └─ sector-length  │ - striped  for LVM2
| - "docker images" │ entry3: ^^^^^^^^^^^^^ │ - snapshot for LVM2
| │ .... 1sector = 512 │ - dm-multipath
| │ bytes│
| │ NOTE: 1 sector = 512 bytes │
| ─────────────────────────────┴──────────────────────────────────────┴────────────────────
```
```
| DATA FLOW:
|───────────
| App → (Data) → MAPPED DEVICE → DEVICE MAPPER → TARGET-DEVICE → Physical
| Route to target PLUGIN instance Block Device
| based on:
| - MAPPED-DEVICE
| - MAPPING-TABLE
|
| Data can be also modified in transition, which is performed, for example,
| in the case of device mapper providing disk encryption or simulation of
| unreliable hardware behavior.
```
```
| AVAILABLE MAPPING TARGETS
|
|- cache : allows creation of hybrid volumes, by using solid-state drives
| (SSDs) as caches for hard disk drives (HDDs)
|- crypt : provides data encryption, by using kernel Crypto API
|- delay : delays reads and/or writes to different devices (testing)
|- era : behaves in a way similar to the linear target, while it keeps
| track of blocks that were written to within a user-defined
| period of time
|- error : simulates I/O errors for all mapped blocks (testing)
|- flakey : simulates periodic unreliable behaviour (testing)
|- linear : maps a continuous range of blocks onto another block device
|- mirror : maps a mirrored logical device, while providing data redundancy
|- multipath: supports the mapping of multipathed devices, through usage of
| their path groups
|- raid : offers an interface to Linux kernel's software RAID driver (md)
|- snapshot : (and snapshot-origin) used for creation of LVM snapshots,
| as part of the underlying copy-on-write scheme
|- striped : stripes the data across physical devices, with the number of
| stripes and the striping chunk size as parameters
|- thin : allows creation of devices larger than the underlying
| physical device, physical space is allocated only when
| written to
|- zero : equivalent of /dev/zero, all reads return blocks of zeros,
| and writes are discarded
```
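As a concrete sketch of the mapping-table format described above, the following prints a two-entry table that would concatenate two partitions with linear targets (device names and sector counts are illustrative assumptions; offsets and lengths are in 512-byte sectors):

```shell
# Build a device-mapper table concatenating two partitions (linear targets).
# The printed table would be fed to 'sudo dmsetup create concat' as root.
SIZE1=2097152    # sectors in /dev/sda1 (1 GiB)  -- illustrative size
SIZE2=4194304    # sectors in /dev/sdb1 (2 GiB)  -- illustrative size
printf '%s\n' \
  "0 $SIZE1 linear /dev/sda1 0" \
  "$SIZE1 $SIZE2 linear /dev/sdb1 0"
# Each line: <start-sector> <sector-length> <target> <device> <device-offset>
```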
[[storage.device_mapper}]]
[[{storage.block_management.RAID]]
## Software RAID 0/1/2/...
* NOTE: LVM can also be used to create RAID, but that approach appears
  to be less mature/supported.
SETUP STEPS:
0. STEP 0
```
| D1=/dev/sda ; D2=/dev/sdb ;
| NAME="/dev/md0" # ← (DESIRED NAME for the array)
| MDADM_CREATE="sudo mdadm --create --verbose"
| RAID 0 │ RAID 1 │ RAID 5: │ RAID 6/10
| ------------------+------------------+-------------------+----------
| $MDADM_CREATE \ │ $MDADM_CREATE \ │ $MDADM_CREATE \ │ $MDADM_CREATE \
| $NAME \ │ $NAME \ │ $NAME \ │ $NAME \
|  --level=0     \ │  --level=1     \ │  --level=5      \ │  --level=6 \ (=10)
| --raid-devices=2 │ --raid-devices=2 │ --raid-devices=3 │ --raid-devices=4 \
|  $D1 $D2         │  $D1 $D2         │  $D1 $D2 $D3      │
| │ │ $D1 $D2 $D3 $D4
| │ │
| │ *WARN:*Low perf. │ RAID 10 admits also
| │ in degraded mode │ an extra layout arg.
| │ │ near, far, offset
|
| NOTE: For RAID 1,5.. creation will take some time.
| To monitor the progress:(*man 4 md, section "RAID10"*)
| $ cat /proc/mdstat
| → Output
| → Personalities : [linear] ... [raid1] ....
| → md0 : active raid1 sdb[1] sda[0]
| → 104792064 blocks super 1.2 [2/2] [UU]
| → *[==>..........] resync = 20.2% *
| → *(212332/1047920) finish=... speed=.../sec*
| → ...
```
1. STEP 1. Ensure RAID was created properly
```
|
| $ cat /proc/mdstat
| Personalities : [linear] [multipath] ...
| md0 : active raid0 sdb[1] sda[0]
| 209584128 blocks super 1.2 512k chunks
| ...
```
2. STEP 2. create ext4|xfs|... filesystem
```
$ sudo mkfs.ext4 -F /dev/md0
```
3. STEP 3 Mount at will somewhere. Optionally add to /etc/fstab
```
| $ sudo mount /dev/md0 /var/backups
```
4. STEP 4. Keep layout at reboot
```
| $ sudo mdadm --detail --scan | \
| sudo tee -a /etc/mdadm/mdadm.conf
|
| RAID 5 WARNING: check again to make sure the array has finished assembling. Because of
| the way that mdadm builds RAID 5, if the array is still building, the number of spares
| in the array will be inaccurately reported:
```
5. STEP 5, update initramfs, to make RAID available early at boot process (OPT.)
```
| $ sudo update-initramfs -u
```
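The resync progress figure from /proc/mdstat (STEP 0 above) can be extracted for scripting; a sketch working on a sample progress line (on a live system, read the real file instead):

```shell
# Pull the resync percentage out of a /proc/mdstat progress line.
# The sample line mimics the format shown earlier in this section.
line='      [==>..................]  resync = 20.2% (212332/1047920) finish=4.1min speed=3342K/sec'
echo "$line" | awk '{for (i = 1; i < NF; i++) if ($i == "resync") print $(i+2)}'
# prints 20.2%
```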
## Mirror existing disk with data
This section covers the typical case:
```
INITIAL SCENARIO ··> DESIRED FINAL SCENARIO
───────────────────┼────────────────────────────────────────────────
Disk0 with data    │  Disk0, part of new RAID with mirrored data
(/dev/sda1)        │                               ────────
Disk1 New Disk     │  Disk1, part of new RAID with mirrored data
(/dev/sdb) │ ─────────────
```
Some care must be taken to avoid losing the data during the RAID creation procedure.
To add a mirror to an existing disk without deleting its data:
1. STEP 1,Create **incomplete RAID** with missing disks:
```
| $ mdadm --create --verbose \
|     /dev/md0 \
|     --level=1 \
|     --raid-devices=2 missing /dev/sdb # NOTICE: 'missing' reserves the slot for
|                                       # /dev/sda1 (the disk with important data)
```
2. STEP 2, Format partition:
```
| $ mkfs.ext4 /dev/md0
```
3. STEP 3, Copy Important data from **existing disk** to new array:
```
| $ sudo mount /dev/md0 /mnt/newarray
| $ tar -C /mnt/disk0WithData -cf - . | tar -C /mnt/newarray/ -xf -
| *WARN:* - Check that there are no errors in the execution. Maybe sudo is needed.
|         - Visually inspect the content of /mnt/newarray and make sure it contains
|           all our *Important data* before continuing. Any other tests are welcome.
```
4. STEP 4, add original disk to disk array
```
| $ mdadm /dev/md0 --add /dev/sda1 # <·· *WARN:* if STEP 3 fails or is skipped
| *Important data* will be LOST!!!
```
5. STEP 5, Keep layout at reboot
```
| $ sudo mdadm --detail --scan | \
| sudo tee -a /etc/mdadm/mdadm.conf
```
## Releasing/freeing RAID Resources
0. PRE-SETUP (OPTIONAL) RESETTING EXISTING RAID DEVICES
```
| *WARN!* any data stored will be lost
| *WARN!* Backup your RAID data first
|
| $ cat /proc/mdstat # <·· Find any active array
| ( Output like )
| Personalities : [raid0] [linear]
| [multipath] [raid1] [raid6]
| [raid5] [raid4] [raid10]
| md0 : active raid0 sdc[1] sdd[0]
| 209.. blocks super 1.2 512k chunks
| ...
| $ sudo umount /dev/md0 # <·· Unmount the array
| $ sudo mdadm --stop /dev/md0 # <·· STOP the array
| $ sudo mdadm --remove /dev/md0 # <·· REMOVE the array
|
| $ lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT # <·· Find devices used to build the array
| (Output like)
| NAME SIZE FSTYPE TYPE MOUNTPOINT
| sda 100G disk
| sdb 100G disk
| sdc 100G*linux_raid_member*disk
| sdd 100G*linux_raid_member*disk
| vda 20G disk
| ├─vda1 20G ext4 part /
| └─vda15 1M part
| ...
| ^^^
| WARN: /dev/sd* name can change at reboot!
|
| $ sudo mdadm --zero-superblock /dev/sdc # <· zero the superblock to reset to normal
| $ sudo mdadm --zero-superblock /dev/sdd # <· zero the superblock to reset to normal
|
| $ vim /etc/fstab
| ...
| # /dev/md0 /var/backups ext4 defaults,nofail,discard 0 0 # <·· remove/comment line
|
| $ vim /etc/mdadm/mdadm.conf
| ...
| # ARRAY /dev/md0 metadata=1.2 name=mdadmwrite:0 UUID=7...# <·· remove/comment line
|
| $ sudo update-initramfs -u ← Update initramfs
```
[[storage.block_management.RAID}]]
[[{storage.block_management.DRDB,storage.distributed]]
## DRBD (Distributed Replicated Block Device)
* <https://www.tecmint.com/setup-drbd-storage-replication-on-centos-7/>
A flexible and versatile replicated storage solution for Linux. It mirrors the
content of block devices such as hard disks, partitions, logical volumes etc.
between servers. It involves a copy of data on two storage devices, such that
if one fails, the data on the other can be used.
- It's a high-performance + low-latency low-level building block for block
replication.
NOTE:
* Probably, higher level replication strategies ("Ceph", "GlusterFS") are preferred.
CEPH offers integration with OpenStack.<br/>
"""...However, Ceph's performance characteristics prohibit its
deployments in certain low-latency use cases, e.g., as backend for
MySQL ddbbs"
(So it looks like DRBD is preferred for Databases that basically "ignore" the
File System tree structure).
[[{doc_has.comparative}]]
* See also the DRBD4Cloud research project, aiming at increasing the applicability
  and functionality for cloud markets.
* <https://www.ait.ac.at/en/research-topics/cyber-security/projects/extending-drbd-for-large-scale-cloud-deployments/>
> """RBD is currently storing up to 32full data replicas on remote storage
> nodes. DRBD4Cloudwill allow for the usage of erasure coding, which allows
> one to split data into a number of fragments (e.g., nine), such that
> only a subset (e.g., three) is needed to read the data. This will
> significantly reduce the required storage and upstream band-width
> (e.g., by 67 %), which is important, for instance, forgeo-replication with
> high network latency."""
[[storage.block_management.DRDB}]]
# FLASH BLOCK STORAGE [[{storage.flash]]
## Detecting false Flash [[{storage.flash.f3,security.storage,security.fake_flash,]]
* <https://www.linuxlinks.com/essential-system-tools-f3-detect-fix-counterfeit-flash-storage/>
2019-02-08 Steve Emms
- f3 (Fight Flash Fraud or Fight Fake Flash) detects and fixes counterfeit flash storage.
- flash memory storage is particularly susceptible to fraud.
- The most commonly affected devices are USB flash drives,
  but SD/CF and even SSD are affected.
- It's not sufficient to simply trust what df reports, since df
  merely relays what the drive claims (which can be fake);
  nor is writing data with dd a good test.
- f3 is a set of 5 open source utilities that detect and
repair counterfeit flash storage.
- test media capacity and performance.
- test real size and compares it to what the drive says.
- open source implementation of the algorithm used by H2testw.
- Installation:
```
| $ git clone https://github.com/AltraMayor/f3.git
| $ make # compile f3write,f3read
| $ make install # /usr/local/bin by default
| $ make extra # compile and install f3probe, f3fix, and f3brew
| $ sudo make install-extra
```
- Usage:
  - f3write fills a drive with 1GB .h2w files to test its real capacity.
    -w flag lets you set the maximum write rate.
    -p shows the progress made
- f3read: After you’ve written the .h2w files to the flash media,
you then need to check the flash disk contains exactly the written
files. f3read performs that checking function.
- f3probe is a faster alternative to f3write/f3read.
particularly if you are testing high capacity slow writing media.
It works directly over the block device that controls the drive.
So the tool needs to be run with elevated privileges.
It only writes enough data to test the drive.
It destroys any data on the tested drive.
- f3fix
Obviously if your flash drive doesn’t have the claimed specifications,
there’s no way of ‘fixing’ that. But you can at least have the flash
correctly report its capacity to df and other tools.
- f3fix creates a partition that fits the actual size of the fake drive.
- f3brew
f3brew is designed to help developers determine how fake drives work.
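The f3read summary counters can be reduced to a simple verdict. This helper is a sketch: the argument names assume f3read's ok/corrupted/changed/overwritten sector counters, which are not shown in this document:

```shell
# Sketch: a drive is genuine only when no sectors came back corrupted,
# changed, or overwritten (counter meanings assumed from f3read's summary).
f3_verdict() {   # args: ok_sectors corrupted changed overwritten
  if [ "$2" -eq 0 ] && [ "$3" -eq 0 ] && [ "$4" -eq 0 ]; then
    echo "genuine"
  else
    echo "fake or damaged"
  fi
}
f3_verdict 2097152 0 0 0      # prints: genuine
f3_verdict 2097152 0 0 81920  # prints: fake or damaged
```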
[[storage.flash.f3}]]
## bmaptool ("dd enhanced") [[{]]
* <https://github.com/tldr-pages/tldr/blob/master/pages/common/bmaptool.md>
* Create or copy block maps intelligently (designed to be faster than cp or dd).
* <https://source.tizen.org/documentation/reference/bmaptool>
[[}]]
[[{storage.flash,storage.ssd,performance.storage]]
## Optimizing SSD
* REF: <https://searchdatacenter.techtarget.com/tip/Optimizing-Linux-for-SSD-usage>
* Setting disk partitions:
```
| SSD disks use 4 KB blocks for reading but
|          512 KB erase blocks for deleting!!!
```
To make sure partitions are aligned to SSD-friendly settings:
```
| $ sudo fdisk -H 32 -C 32 -c ....
| · · └───┴─ cylinder size
| └───┴─······ head size
```
### SETTING UP EXT4 FOR SSD)
- Optimize for the 512 KB erase blocks by ensuring that files smaller
  than 512 KB are spread across different erase blocks:
  - specify the stride and stripe-width to use, measured in 4 KB
    filesystem blocks (512 KB / 4 KB = 128).
- alt.1: FS creation:
```
| $ sudo mkfs.ext4 -E stride=128,stripe-width=128 /dev/sda1
```
- alt.2: existing FS:
```
| $ tune2fs -E stride=128,stripe-width=128 /dev/sda1
```
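Where the 128 in both commands comes from, as a one-line sketch (the device name printed is a placeholder):

```shell
# stride = erase-block size / filesystem block size (both in KB).
ERASE_KB=512     # SSD erase-block size
FSBLOCK_KB=4     # ext4 block size
STRIDE=$(( ERASE_KB / FSBLOCK_KB ))
echo "mkfs.ext4 -E stride=$STRIDE,stripe-width=$STRIDE /dev/sdX1"  # /dev/sdX1 is a placeholder
```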
- SETTING I/O SCHEDULER FOR SSD)
- Default value is CFQ (Complete Fair Queueing).
  SSDs benefit from the deadline scheduler:
- Include a line like next one in `/etc/rc.local`:
```
$ echo deadline | sudo tee /sys/block/sda/queue/scheduler
  (plain 'sudo echo deadline > ...' fails: the redirection is
   performed by the unprivileged shell, not by sudo)
```
- TRIMMING THE DATA BLOCKS FROM SSD)
- Trimming tells the SSD controller which data blocks are no longer in
  use when a file is removed, so they can be erased ahead of time.
- Without trimming, SSD write performance degrades as data blocks get
  filled up.
```
enable trimming ·─┬─────┐
v v
/dev/sda1 / ext4 discard,errors=remount-ro,noatime 0 1   ← /etc/fstab example
^ ^
do not update file access time ···┴─────┘
EVERY TIME FILE IS READ, minimizing
writes to FS.
```
[[storage.flash}]]
[[{storage.flash,security.encryption]]
## Flash Friendly FS "F2FS"
* <https://www.usenix.org/conference/fast15/technical-sessions/presentation/lee>
* <https://mjmwired.net/kernel/Documentation/filesystems/f2fs.rst>
* <man mkfs.f2fs>
## Much better than EXT4 for Flash Storage: [[performance.storage]]
- ~3.1x faster with iozone
- ~2.0x faster with SQLite
- ~2.5x faster on SATA SSD and ~1.9x on PCIe SSD.
`man mkfs.f2fs` summary
1. STEP 1: Formats to f2fs supporting encrypt.
```
| # mkfs.f2fs \
|   -O encrypt[,options] \   ← encrypt: enable encryption
|   [ -d debugging-level ] \ ← 0: basic debug
| [ -e extension-list ] \ ← treat files with extension as cold files to be stored in
| *cold log*. Default list includes most multimedia extensions
| (jpg, gif, mpeg, mkv, ...)
| [ -f ] \ ← Force overwrite if existing FS is detected.
| [ -l volume-label ] \
| [ -m ] \ ← Enable block-zoned-feature support
| Useful in NVMe disks according to <https://zonedstorage.io/>
| [ -q ] \ ← Quiet mode.
| [ other options ¹ ]\ \
| /dev/... [sectors]
|
| ¹
| [ -a heap-based-allocation ] [ -t nodiscard/discard ]
| [ -c device ] [ -w specific sector_size for target sectors ]
| [ -o overprovision-ratio-percentage ] [ -z #-of-sections-per-zone ]
| [ -s #-of-segments-per-section ]
```
2. STEP 2: Mount device partition "somewhere"
```
| # mount /dev/... ${MNT}
| # mkdir ${MNT}/dir1
| # mkdir ${MNT}/dir2
```
3. STEP 3: Use f2fscrypt to encrypt (f2fs-tools v1.9+).
create key in session keyring to be used to set the
policy for encrypted directories.
```
| # f2fscrypt add_key -S 0x1234 <· -k $keyringX to use keyringX.
| ... (default to Session keyring)
| Added key with descriptor -S 0x1234: use simple salt
| 28e21cc0c4393da1
| └──────┬────────┘
| Kernel will create new key with a key-descriptor.
| Users apps will later on inform kernel about key
| to use by passing the matching descriptor
```
4. Use f2fscrypt set_policy to bind the key to the encrypted directories
   (optionally run f2fscrypt new_session first to isolate a temporary session keyring)
```
| # f2fscrypt set_policy \ * ← Set enc.policy (8bytes/16-hex sym.key) for dir.
| 28e21cc0c4393da1 \ * ← (use by kernel to search for key)
| ${MNT}/dir1 ${MNT}/dir2 .. * ← dir.list to apply policy (sym.key encryp.key)
|
| # edit ${MNT}/dir1/test.txt * ← Create encrypted file
|
| (.......... REBOOT MACHINE .................)
```
5. Post reboot checks:
```
| # ls -l ${MNT}/dir1
| -rw-r--r-- ... *zbx7tsUEMLzh+... <· Output after reboot.
|
| # f2fscrypt get_policy $MNT/dir1 <· Retrieve enc.policy
| /.../dir1/: *28e21cc0c4393da1* <· This provide a hint about
| key (Salt?) to use.
| # f2fscrypt add_key -S 0x1234 ← Recreate same key using same salt
| ...
| Added key with descriptor
| *[28e21cc0c4393da1]* ← Key descriptor must match
|
| # ls -l ${MNT}/dir1/
| -rw-r--r--. ... *21:41 test.txt*
|
| # keyctl show ← Show process keyring/s [Troubleshooting]
| *Session*Keyring
| 84022412 --alswrv 0 0 keyring: _ses
| 204615789 --alswrv 0 65534 \_ keyring: _uid.0
| 529474961 --alsw-v 0 0 \_ logon: f2fs:28e21cc0c4393da1
| └─┬─┘
| ex key-type used by f2fs file-system.
```
[[storage.flash}]]
[[storage.flash}]]
[[storage.block_layer}]]
# FILE SYSTEM LAYER
[[{storage.101.fdupes,troubleshooting.storage.duplicates]]
## Find duplicate files
* <https://github.com/tldr-pages/tldr/blob/master/pages/common/fdupes.md>
* <https://github.com/tldr-pages/tldr/blob/master/pages/common/jdupes.md>
  jdupes is an enhanced fork of fdupes. More information: <https://github.com/jbruchon/jdupes>
[[troubleshooting.storage.duplicates}]]
[[{storage.file_system,linux.101]]
## File System Management Basics
### MOVING AROUND FS
```
| BASICS ---------------------------------------------------
| $ pwd # (P)rint (W)orking (D)irectory
| $ cd /my/new/dir # (C)hange (d)irectory
| $ cd # move to $HOME directory
| $ cd ~ # '~' is alias for $HOME
| $ pushd .         # Remember current dir.
|                     (push onto "LIFO" stack)
| $ cd ..           # Change to parent directory
| $ popd            # change to latest dir. on stack
|                     (saved with pushd)
|
| MANAGING DIRECTORIES -------------------------------------
| $ mkdir -p ~/projects/project01 # Make directory
| ↑
| Create any intermediate dirs if needed.
| $ rm -rf ~/projects/project01/ # Remove recursively (dangerous)
|
| COPYING FILES --------------------------------------------
| $ cp fileToCopy         /destination/directory/
| $ cp -R directoryToCopy /destination/directory/
|      ^^
|      -R: Recursive copy of dir. content
|
|
| *RENAME FILES*
| $ mv myFile1 finalName # ← move myFile1 to new name
| ^ (rename) finalName
```
### MOVING FILES
```
| $ mv myFile1 /my/path02/   # ← move myFile1 to
|                  ^              /my/path02 directory
|            ┌─────┘
| *WARN*: The final '/' indicates that path02
|         is a directory.
|         Otherwise, if "path02" does NOT exist,
|         myFile1 will be moved to the '/my/' directory
|         and wrongly renamed to the file 'path02'.
|
| *COPYING FILES - Cool and safe way* ----------------------
| $ tar cf - dirToCopy | \        ← Compress to STDOUT
|   tar -C newBaseDir -xf -       ← Decompress from STDIN
|
| *REMOVE FILES/DIRECTORIES*
| $ rm -r -f dirOrFile * ← Alternative 1: Unsafe way:
| -r recursive deletion of directories
| -f force deletion (do not confirm)
| *Never ever "-r -f" as root*
|
| $ find ./dir -mtime +10 \ * ← Alternative 2: Find files
| | xargs rm -f * modified more than 10 days
| ago and print them to STDOUT.
| xargs runs 'rm -f' on each
| line read from STDIN
|
| *REMOVE FILES AND CONTENT BY OVERWRITING FIRST*
|
| $ shred -n 2 -z -v /tmp/myImportantSecret
| ^ ^ ^
| · │ └·· show progress
| · └····· Finally write over with zeroes
| └········ Overwrite twice with random data
|
| <https://linux.die.net/man/1/shred>
| prevent data from being recovered by
| hackers using software (and most
| probably hardware)
```
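A runnable sketch of the tar-pipe copy shown above (the directory names under /tmp/tardemo are demo assumptions):

```shell
# Copy a tree by archiving to STDOUT and extracting from STDIN in newBaseDir.
mkdir -p /tmp/tardemo/src/sub /tmp/tardemo/dst
echo "hello" > /tmp/tardemo/src/sub/file.txt

tar -C /tmp/tardemo -cf - src \
  | tar -C /tmp/tardemo/dst -xf -   # preserves layout, permissions, links

ls /tmp/tardemo/dst/src/sub         # → file.txt
```

Compared to `cp -R`, the tar pipe copies relative to `-C` base directories, which avoids the trailing-slash ambiguities described above.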
### LISTING FILES
```
| $ ls -optionalFlags * ← list files in current
| ^^^^^^^^^^^^^^ directory
| -l: ("long"), show permissions, size,
| modification date, ownership
| -a: ("all" ), shows also hidden (.*) files
| -d: Show only directory entry(vs directory contents)
| -F: ("format") append helper symbol
| -S: sort by size in descending order
| -R: ("recursive") list recursively children dirs
```
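For instance, -F marks each entry's type with a trailing symbol ('/' for directories, '*' for executables); the file names below are invented for the demo:

```shell
mkdir -p /tmp/lsdemo/subdir
touch /tmp/lsdemo/plain /tmp/lsdemo/script.sh
chmod +x /tmp/lsdemo/script.sh

ls -F /tmp/lsdemo > /tmp/lsdemo.out   # -F appends /, *, @ ... per entry type
cat /tmp/lsdemo.out                   # plain, script.sh*, subdir/
```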
### CHECK DISK FREE/USED SPACE
```
| <https://linux.die.net/man/0/df>
| $ df -h -x devtmpfs -x tmpfs * ← df: (D)isk (F)ree
| ↑ ↑ ↑
| │ Skip pseudo filesystems (devtmpfs, tmpfs,...)
| └── show sizes in human-readable units
| (-k would scale by 1K blocks instead)
| <https://linux.die.net/man/1/du>
| $ du -sch dir1 file2 # ← du: (D)isk (U)sage
| ↑↑↑
| ││└── show in human readable units
| │└─── produce a grand total
| └──── display only total for each arg
```
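A quick check of how `du -sch` combines per-argument totals with a grand total (demo paths under /tmp/dudemo):

```shell
mkdir -p /tmp/dudemo/d1 /tmp/dudemo/d2
dd if=/dev/zero of=/tmp/dudemo/d1/f bs=1024 count=100 2>/dev/null

LC_ALL=C du -sch /tmp/dudemo/d1 /tmp/dudemo/d2 > /tmp/dudemo.out
cat /tmp/dudemo.out   # one line per argument plus a final "total" line
```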
### FILE/DIRECTORIES PERMISSIONS
- Change who can read,write or execute to a file or directory
```
| $ ls -ld mySecretDir # Before changing permissions.
| ┌─┬······· owner can read, write and execute (enter) dir.
| · ·┌─┬···· (g)roup can read and execute (enter) the directory
| · ·· ·┌─┬─ (o)thers can read and execute (enter) the directory
| -rwxr-xr-x. 1 userA groupA ... mySecretDir/
|
| $ chmod go-rwx mySecretDir # Change permissions for (g)roup and (o)thers
| └┘^^^
| · ·└·· (r)ead, (w)rite, e(x)ecute
| · · (numeric equivalents: 4 + 2 + 1, e.g. chmod 700)
| · └··· - => remove permission
| · + => add permission
| └····· (u)ser (owner of file/dir)
| (g)roup (group of file/dir)
| (o)thers
|
| $ ls -ld mySecretDir # After changing permissions.
| ┌─┬······· owner can read, write and execute (enter) dir.
| · ·┌─┬···· (g)roup can not read/write/exec(enter)
| · ·· ·┌─┬─ (o)thers can not read/write/exec(enter)
| -rwx------. 1 userA groupA ... mySecretDir/
```
- Change someFile owner and group
```
| $ chown newOwner:newGroup someFile
```
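The effect of symbolic modes can be verified with stat (GNU coreutils' `-c` format assumed; /tmp/permdemo is a demo placeholder, and chown itself is skipped since it normally requires root):

```shell
mkdir -p /tmp/permdemo
chmod u=rwx,go=rx /tmp/permdemo   # rwxr-xr-x -> octal 755 (4+2+1 / 4+1 / 4+1)
stat -c '%a' /tmp/permdemo        # → 755

chmod go-rwx /tmp/permdemo        # strip all group/other bits
stat -c '%a' /tmp/permdemo        # → 700
```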
### SEARCHING FILES/DATA
```
| REF: <https://linux.die.net/man/1/find>
|
| $ find /var/lib \
| -type f \ ← f: only files
| d: only directories
| l: only sym.links
| -iname "*html" \ ← AND whose name matches *html
| name:do NOT ignore case
| iname:do ignore case
| -mmin -30 \ ← AND whose modification time
| is '30 or less'(-30)
| minutes (mmin)
| -size +20k \ ← AND whose size (-size) is
| more than 20 KiB (+20k)
| -not \( \ ← Skip
| -path "./node_modules/*" \ ← node_modules dir.
| -o \ ← OR
| -path "./build/*" ← build dir.
| \)
| -exec grep "Variable01" \ ← execute a command
| {} \; for each found file
```
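A minimal runnable version of the pattern above, combining a case-insensitive name filter with a path exclusion (the tree under /tmp/finddemo is fabricated for the demo):

```shell
mkdir -p /tmp/finddemo/node_modules /tmp/finddemo/src
touch /tmp/finddemo/src/index.HTML /tmp/finddemo/node_modules/dep.html

find /tmp/finddemo -type f -iname '*.html' \
     -not -path '*/node_modules/*' > /tmp/finddemo.out
cat /tmp/finddemo.out   # only src/index.HTML (-iname ignores case)
```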
### GRAPHICAL TOOLS:
- Easier to use, but not scriptable.
- Discouraged for system administration.
```
| $ mc # launch Midnight Commander. UI running on terminals
| $ ranger # Light and nice console file manager with VI key bindings
| $ nautilus . # GNOME File explorer.
| $ dolphin # KDE File explorer.
```
### HARD/SOFT LINKS
- In UNIX an **inode** is the low-level structure that stores the physical location
in disk of a given file. This "inode" is not visible to the user.
- The user visible entities are the file system paths, that point to the real inodes:
```
| /my/file/path ·>(points to)·> |inode|·>(points to)·> physical block_on_disk
| ^^^^^^^^^^^^^ ^^^^^ ^^^^^^^^^^^^^^^^^^^^^^
| visible in invisible to managed by the storage
| user shells, users, managed device itself (HD, SSD,
| GUI explorers,.. by the OS kernel NAS,...)
```
```
| $ ln -s /my/file/path /my/symbolic/link <· create symbolic (-s) link
| (shortcut to filepath).
| If the original /my/file/path is
| deleted or moved the link is broken.
| $ ln /my/file/path /my/hard/link <· Hard link (no -s)
| /my/hard/link will point to the
| same *inode* of /my/file/path
|
| /my/symbolic/Link <· Sym link creates a new entry with a link to file-path
| ↓
| /my/file/path ──────┐
| ↓ physical
| *inode*──→ block
| ↑ ↑ on disk
| /my/hard/link ──────┘ |
| |
| - the inode will increase its number of references after a hard-link.
| - the inode and disk content will exist until all hard links (references)
| are deleted.
| - Once /my/hard/link is created it is a sibling of the original
| /my/file/path. There is no way to tell which was the original
| path and which the hard link.
```
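The inode-sharing behavior is easy to confirm with stat (GNU `-c` format assumed; /tmp/linkdemo is a throwaway path):

```shell
rm -rf /tmp/linkdemo && mkdir /tmp/linkdemo && cd /tmp/linkdemo
echo "data" > original
ln    original hard    # hard link: second path to the SAME inode
ln -s original soft    # symlink: new entry pointing at the path string

stat -c '%h' original  # → 2 (reference count grew with the hard link)
test "$(stat -c '%i' original)" = "$(stat -c '%i' hard)" && echo "same inode"

rm original
cat hard               # still readable: inode lives while references remain
cat soft 2>/dev/null || echo "dangling symlink"   # symlink is now broken
```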
[[}]]
[[{storage.file_system,performance.storage,troubleshooting.storage]]
## Tuning Filesystems
### (/etc/fstab) mount options:
```
| noatime: Do not update inode access times on this filesystem.
| It implies nodiratime.
| Recommended for file-systems containing (PostgreSQL,...) databases since their
| engines do not care about this data.
|
| nodiratime: Do not update directory inode access times on this filesystem.
|
| lazytime : Only update times (atime, mtime, ctime) on the in-memory version of the file inode.
| Significantly reduces writes to the inode table for workloads with frequent random
| writes to preallocated files.
| The on-disk timestamps are updated only when:
| - the inode needs to be updated for SOME change unrelated to file timestamps
| - the application employs fsync(2), syncfs(2), or sync(2)
| - an undeleted inode is evicted from memory
| - more than 24 hours have passed since the i-node was written to disk.
```
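For illustration, a possible /etc/fstab line applying these options (the UUID and mount point are hypothetical placeholders, not taken from this document), plus how to check which options are actually in effect:

```
| # /etc/fstab fragment (UUID and mount point are made-up examples)
| UUID=0a1b2c3d-ffff-4444-aaaa-123456789abc /var/lib/pgsql ext4 defaults,noatime,lazytime 0 2
|
| # After 'mount -o remount /var/lib/pgsql', verify the active options with:
| $ findmnt -no OPTIONS /var/lib/pgsql
```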
[[}]]
[[{storage.file_system,monitoring.storage.ussage,profiling.file_system]]