A Linux Tutorial for HPC
========================
by Harry Mangalam <[email protected]>
v1.30 - Oct 11, 2015
:icons:
//killing jobs on same node
//killing jobs in different bash sessions
//chmod +x your script
//the shebang line - why its important.
// PATH - to execute a script in the cwd, have to reference it
// as './thescript', not 'thescript'
// fileroot="/home/hjm/nacs/Linux_Tutorial_12"; asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.html ${fileroot}.txt moo:~/public_html/biolinux; ssh -t moo 'scp ~/public_html/biolinux/Linux_Tutorial_12.[ht]* [email protected]:/data/hpc/www/biolinux/'
== Introduction
This is an introduction to Linux specifically written for the
http://hpc.oit.uci.edu[HPC Compute Cluster at UCI]. Replace the UCI-specific
names and addresses and it should work reasonably well at your institution.
Feel free to steal the
http://moo.nac.uci.edu/~hjm/biolinux/Linux_Tutorial_12.txt[AsciiDoc src]
and modify it for your own use.
This is presented as a continuous document rather than slides since you're going
to be going thru it serially and it's often easier to find things and move around
with an HTML doc. About half of this tutorial is about Linux basics and bash
commands. The remainder covers a little Perl and more R.
.Mouseable commands
[NOTE]
====================================================================
The commands presented in the lightly shaded boxes are meant to be 'moused'
into the bash shell to be executed verbatim. If they don't work, it's most
likely that you've skipped a bit and/or haven't cd'ed to the right place.
http://moo.nac.uci.edu/~hjm/FixITYourselfWithGoogle.html[Try to figure out
why the error occurred], but don't spend more than a couple minutes on it.
Wave one of us down to help.
====================================================================
== Logging In
=== ssh
We have to connect to HPC with ssh or some version of it so let's try the basic
ssh. If you're using a Mac laptop, open the 'Terminal App' (or the better, free
http://iterm.sourceforge.net/[iTerm]) and type:
------------------------------------------------------------------------------
ssh -Y [email protected]
# enter YourUCINETID password in response to the prompt.
------------------------------------------------------------------------------
If you're using Windows and the excellent, and free
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html[putty],
(you'll need only 'putty.exe'),
type 'hpc.oit.uci.edu' into the *Host Name
(or IP Address)* pane, and your (lowercase) UCINETID when the login prompt is
presented (NOT the whole 'username@host' string). Once you connect, you can
save the configuration and click the saved configuration the next time.
==== Passwordless ssh
See http://hpc.oit.uci.edu/HPC_USER_HOWTO.html#HowtoPasswordlessSsh[the HPC HOWTO].
=== x2go
==== Macintosh installation
The Mac install requires a specific set of versions and steps.
* Recent OSX releases do not include X11 compatibility software, now called
http://xquartz.macosforge.org/landing/[XQuartz] (still free). If you have not
done so already, please download and install the latest version (2.7.7 at this writing).
The 'x2go' software will not work without it. After you install it and start it,
please configure it to these specifications:
- 'XQuartz' must be configured to accept remote sessions in its 'Preferences'.
- After installation of 'XQuartz', the user has to log out of the current *Mac* session
and log back in again.
- 'XQuartz' must be running before starting 'x2go'.
- You may have to change your Mac's Security Preferences to allow remote sessions.
- If you're running an additional local or personal firewall, you may have to
specifically allow 'x2go' to work.
* Please install the http://code.x2go.org/releases/binary-macosx/x2goclient/releases/[Mac
OSX client from here]. The latest version (4.0.3.1, as of this writing)
works on the Mavericks MacOSX release.
*If your x2go DOESN'T work, please check the following:*
* Open 'Terminal.app', run *ssh -X -Y [email protected]*, and, once you're
logged in to HPC, type 'xload' and see if it opens a window showing the
login node name.
* If it says "Error: Can't open display: ", please download the latest version
of XQuartz and reinstall it (even if you already have the latest version installed). You will need to log out and then log back in after the installation. Please make sure to uncheck the "Reopen windows when logging back in" option when logging out. You can download XQuartz from here: http://xquartz.macosforge.org/
* Make sure you have the
http://code.x2go.org/releases/X2GoClient_latest_macosx.dmg[latest version of
x2go]
* Reset all the x2go settings 'on your Mac' by removing the '.x2go' and
'.x2goclient' dirs in your home directory with the command:+
*rm -rf \~/.x2go \~/.x2goclient*
* If you have any firewall on, please make sure x2go is in the whitelist.
(OS X's built-in firewall is off by default).
See below for 'x2go' configuration to connect to HPC.
('x2go' version compatibility changes fairly frequently, so if the above versions don't
work, please mailto:[email protected]?Subject=x2go%20configuration[send me email].)
==== Windows installation
The Windows installation is straightforward and
http://wiki.x2go.org/doku.php/doc:installation:x2goclient[follows the instructions listed here].
==== x2go configuration for HPC
Configure your x2go client to connect to HPC using this screenshot as a guide,
replacing 'hmangala' with your UCINETID.
image:x2goclient.png[x2go client]
*ONLY* if you have added your public ssh key to your HPC account
http://hpc.oit.uci.edu/HPC_USER_HOWTO.html#HowtoPasswordlessSsh[as described here],
you can CHECK the option:
------------------------------------------------------------------------------
[x] Try auto login (ssh-agent or default ssh key)
------------------------------------------------------------------------------
If you haven't set up passwordless ssh, *UNCHECK* it and use your UCINETID
password in the password challenge box that comes up when you click 'OK'.
Change the 'Session type' to that shown: 'Single Application' with the terminal
application *gnome-terminal* ('/usr/bin/gnome-terminal'). When you configure it
like this, only a terminal will pop up and then you can use it as a terminal as
well as to launch graphical applications from it.
NB: If your arrow, [Home], [End], [Del] keys don't work properly, try pasting
the following command into the terminal that just opened:
------------------------------------------------------------------------------
setxkbmap -model evdev -layout us
#(it has to be entered at each terminal startup, not as part of your ~/.bashrc)
------------------------------------------------------------------------------
You can start other specific applications (such as SAS or SPSS) by entering the
startup commands in the newly opened terminal, eg: 'module load rstudio; rstudio'
NB: We no longer allow full Desktops (such as http://en.wikipedia.org/wiki/KDE[KDE],
http://en.wikipedia.org/wiki/Gnome_desktop[Gnome], http://en.wikipedia.org/wiki/Unity_desktop[Unity])
on the HPC login nodes since they quickly take up too many resources.
They are great Desktops tho, and I'd recommend any of them if you are thinking
of using Linux on your own PC.
== Make your prompt useful
The bash shell prompt is, as almost everything in Linux, endlessly customizable.
At the very least, it should tell you what time it is, what host you've logged
into, and which dir you're in.
Just paste this into a shell window. (This should work via highlighting the
text and using the usual 'Copy/Paste' commands, but depending on which platform
you're using, it may take some effort to get the right key combinations.)
------------------------------------------------------------------------------
PS1="\n\t \u@\h:\w\n\! \$ "
# if you want to get really fancy, you can try this for a multicolored
# one that also shows the load on the system:
PS1="\n\[\033[01;34m\]\d \t \[\033[00;33m\][\$(cat /proc/loadavg | cut -f1,2,3 -d' ')] \
\[\033[01;32m\]\u@\[\033[01;31m\]\h:\[\033[01;35m\]\w\n\! \$ \[\033[00m\]"
# that one will show up OK on most backgrounds, but might wash out on some
# light colored ones.
------------------------------------------------------------------------------
There is a reason for this. When you report a bug or problem to us, it's helpful
to know when you submitted the command, how busy the system was, and where you
were when you submitted it. Including the prompt lets us know this info (most
of the time).
OK - let's do something.
== Simple Commands
=== Commandline Editing
*Remember:*
- '↑' and '↓' arrows scroll thru your bash history
- '←' and '→' cursor thru the current command
- the 'Home, End, Insert', and 'Delete' keys should work as expected.
- 'PgUp' and 'PgDn' often _don't_ work in the shell.
- once they get comfortable with the commandline, some people like to keep their fingers on the main keys, so:
- '^' means 'Ctrl'
- '^a' = Home (start of line)
- '^e' = End (end of line)
- '^u' = deletes from cursor to the start
- '^k' = deletes from cursor to end of line
- '^w' = deletes left from cursor one word '←'
- 'Alt+d' = deletes right from cursor one word '→'
- the '\^' key 'amplifies' some editing functions in the bash shell, so that '\^ ←' and '^ →' will move the cursor by a 'word' instead of by a 'char'.
- as noted above, sometimes your terminal keymaps are set wrong. Entering
-----------------------------------------------------------------
setxkbmap -model evdev -layout us
-----------------------------------------------------------------
into the terminal will often fix the mappings.
Also, it's not an editing command, but you might accidentally type it when you're
editing:
- '^s' = stops output from scrolling to the terminal (locks the output)
- '^q' = restarts output scrolling to the terminal (unlocks the output)
=== Copy and Paste
The 'copy' and 'paste' functions are different on different platforms (and even
in different applications) but they have some commonalities. When you're working
in a terminal, generally the native platform's copy & paste functions work as expected.
That is, in an 'editing context', after you've hilited a text selection,
*Ctrl+C* copies and *Ctrl+V* pastes. However, in the 'shell context', *Ctrl+C*
can kill the program that's running, so be careful.
*Linux promo*: In the XWindow system, merely hiliting a selection automatically
copies it into the X selection buffer and a middle click pastes it. All platforms
have available http://en.wikipedia.org/wiki/Clipboard_manager[clipboard managers]
to keep track of multiple buffer copies; if you don't have one, you might want
to install one.
=== Where am I? and what's here?
--------------------------------------------------------------------------------
pwd # where am I?
# REMEMBER to set a useful prompt:
# (it makes your prompt useful and tells you where you are)
echo "PS1='\n\t \u@\h:\w\n\! \$ '" >> ~/.bashrc
. ~/.bashrc
ls # what files are here? (tab completion)
ls -l
ls -lt
ls -lthS
alias nu="ls -lt |head -20" # this is a good alias to have
cd # cd to $HOME
cd - # cd to dir you were in last (flip/flop cd)
cd .. # cd up 1 level
cd ../.. # cd up 2 levels
cd dir # also with tab completion
tree # view the dir structure pseudo graphically - try it
tree | less # from your $HOME dir
tree /usr/local |less
mc # Midnight Commander - pseudo graphical file browser, w/ mouse control
du # disk usage
du -shc *
df -h # free disk space per filesystem (how soon will I run out of disk?)
--------------------------------------------------------------------------------
[[DirB]]
==== DirB and bookmarks.
'DirB' is a way to bookmark directories around the filesystem so you can 'cd' to
them without all the typing.
It's http://moo.nac.uci.edu/~hjm/DirB.pdf[described here] in more detail and
requires minimal setup:
--------------------------------------------------------------------------------
# paste this line into your HPC shell
# (appends the quoted line to your ~/.bashrc)
echo '. /data/hpc/share/bashDirB' >> ~/.bashrc
# make sure that it got set correctly:
tail ~/.bashrc
# and re-source your ~/.bashrc
. ~/.bashrc
--------------------------------------------------------------------------------
After that's done you can do this:
--------------------------------------------------------------------------------
hmangala@hpc:~ # makes this horrible dir tree
512 $ mkdir -p obnoxiously/long/path/deep/in/the/guts/of/the/file/system
hmangala@hpc:~
513 $ cd !$ # cd's to the last string in the previous command
cd obnoxiously/long/path/deep/in/the/guts/of/the/file/system
hmangala@hpc:~/obnoxiously/long/path/deep/in/the/guts/of/the/file/system
514 $ s jj # sets the bookmark to this dir as 'jj'
hmangala@hpc:~/obnoxiously/long/path/deep/in/the/guts/of/the/file/system
515 $ cd # takes me home
hmangala@hpc:~
516 $ g jj # go to the bookmark
hmangala@hpc:~/obnoxiously/long/path/deep/in/the/guts/of/the/file/system
517 $ # ta daaaaa!
--------------------------------------------------------------------------------
.Don't forget about setting aliases.
[NOTE]
===========================================================================
Once you find yourself typing a longish command for the 20th time, you might
want a shorter version of it. Remember 'aliases'?
alias nu="ls -lt | head -22" # 'nu' lists the 21 newest files in this dir (plus the 'total' line)
===========================================================================
=== Making and deleting & moving around directories
[source,bash]
-----------------------------------------------------------------
mkdir newdir
cd newdir
touch instafile
ls -l
# how big is that instafile?
cd # go back to your $HOME dir
# get & unpack the nco archive
curl http://hpc.oit.uci.edu/biolinux/nco/nco-4.2.5.tar.gz | tar -xzvf -
ls nco-4.2.5 # you can list files by pointing at their parent
cd nco-4.2.5
ls # see? no difference
file * # what are all these files?
du -sh * # how big are all these files and directories?
ls -lh * # what different information do 'ls -lh' and 'du -sh' give you?
less I<tab> # read the INSTALL file ('q' to quit, spacebar scrolls down, 'b' scrolls up, '/' searches)
-----------------------------------------------------------------
=== Permissions: chmod & chown
Linux has a Unix heritage so everything has an owner and a set
of permissions. When you ask for an 'ls -l' listing, the 1st
column of data lists the following:
--------------------------------------------------------------------------------
$ ls -l |head
total 14112
-rw-r--r-- 1 hjm hjm 59381 Jun 9 2010 a64-001-5-167.06-08.all.subset
-rw-r--r-- 1 hjm hjm 73054 Jun 9 2010 a64-001-5-167.06-08.np.out
-rw-r--r-- 1 hjm hjm 647 Apr 3 2009 add_bduc_user.sh
-rw-r--r-- 1 hjm hjm 1342 Oct 18 2011 add_new_claw_node
drwxr-xr-x 2 hjm hjm 4096 Jun 11 2010 afterfix/
|-+--+--+-
| |  |  |
| |  |  +-- other permissions
| |  +----- group permissions
| +-------- user permissions
+---------- directory bit
drwxr-xr-x 2 hjm hjm 4096 Jun 11 2010 afterfix/
| |  |  +-- other can r,x
| |  +----- group can r,x
| +-------- user can r,w,x the dir
+---------- it's a directory
# now change the 'mode' of that dir using 'chmod':
chmod -R o-rwx afterfix
         ||-+-
         || |
         || +-- change all attributes
         |+---- (minus) remove the attribute characteristic
         |      can also add (+) attributes, or set them (=)
         +----- other (everyone other than user and explicit group)
$ ls -ld afterfix
drwxr-x--- 2 hjm hjm 4096 Jun 11 2010 afterfix/
# Play around with the chmod command on a test dir until you understand how it works
--------------------------------------------------------------------------------
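One low-risk way to follow that advice is to practice in a scratch directory.
The sketch below is only an illustration: the directory name 'chmod_play' is
made up, and 'stat -c %A' assumes the GNU version of 'stat' found on most Linux
systems.

[source,bash]
--------------------------------------------------------------------------------
# make a scratch dir in a temp location so nothing important is touched
playdir=$(mktemp -d)/chmod_play
mkdir -p "$playdir"
# set a known starting mode: user rwx, group r-x, other r-x
chmod u=rwx,g=rx,o=rx "$playdir"
before=$(stat -c %A "$playdir")   # drwxr-xr-x
# remove all permissions for 'other'
chmod o-rwx "$playdir"
after=$(stat -c %A "$playdir")    # drwxr-x---
# add write permission for the group
chmod g+w "$playdir"
final=$(stat -c %A "$playdir")    # drwxrwx---
echo "$before -> $after -> $final"
--------------------------------------------------------------------------------

Setting the mode explicitly with '=' first means the result does not depend on
your 'umask'.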
You also have to chmod a script to allow it to execute at all.
AND if the script is NOT on your PATH ('printenv PATH'), then you
have to reference it directly:
--------------------------------------------------------------------------------
# get myscript.sh
$ wget http://hpc.oit.uci.edu/biolinux/bigdata/myscript.sh
# or, from on HPC:
$ wget http://nas-7-1/biolinux/bigdata/myscript.sh
# take a look at it
$ less myscript.sh
# what are the permissions?
$ ls -l myscript.sh
-rw-r--r-- 1 hmangala staff 96 Nov 17 15:32 myscript.sh
$ chmod u+x myscript.sh
$ ls -l myscript.sh
-rwxr--r-- 1 hmangala staff 96 Nov 17 15:32 myscript.sh*
$ myscript.sh # why doesn't this work?
$ printenv PATH
/data/users/hmangala/bin:/usr/local/sbin:/usr/local/bin:
/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11R6/bin:/opt/gridengine/bin:
/opt/gridengine/bin/lx-amd64:/usr/lib64/qt-3.3/bin:
/usr/local/bin:/bin:/usr/bin:/usr/local/sbin
$ pwd
/data/users/hmangala
$ ./myscript.sh # note the leading '.'; should work now.
====================
Hi there, [hmangala]
====================
--------------------------------------------------------------------------------
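The same steps can be tried end to end with a throwaway script. This is a
minimal sketch: the script name 'hello.sh' and its contents are invented for
illustration and are not the 'myscript.sh' downloaded above.

[source,bash]
--------------------------------------------------------------------------------
cd "$(mktemp -d)"            # work in an empty scratch dir
# write a 2-line script; the 1st line (the 'shebang') names the interpreter
printf '#!/bin/sh\necho "Hi there from $0"\n' > hello.sh
ls -l hello.sh               # no 'x' bits yet, so it can't be executed
chmod u+x hello.sh           # let the owner execute it
greeting=$(./hello.sh)       # the leading './' means "the one in this dir"
echo "$greeting"
--------------------------------------------------------------------------------

Without the shebang line, the kernel has no way to know which interpreter
should run the file, so it's worth making it the first line of every script.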
'chown' (change ownership) is more direct; you specifically set the ownership to what
you want, altho on HPC, you'll have limited ability to do this since 'you can only change
your group to another group of which you're a member'. You can't change ownership of
a file to someone else, unless you're root.
--------------------------------------------------------------------------------
$ ls -l gromacs_4.5.5.tar.gz
-rw-r--r-- 1 hmangala staff 58449920 Mar 19 15:09 gromacs_4.5.5.tar.gz
                      ^^^^^
$ chown hmangala.stata gromacs_4.5.5.tar.gz
$ ls -l gromacs_4.5.5.tar.gz
-rw-r--r-- 1 hmangala stata 58449920 Mar 19 15:09 gromacs_4.5.5.tar.gz
                      ^^^^^
--------------------------------------------------------------------------------
=== Moving, Editing, Deleting files
These are utilities that create and destroy files and dirs. *Deletion on Linux
is not warm and fuzzy*. It is quick, destructive, and irreversible. It can
also be recursive.
.Warning: Don't joke with a Spartan
[WARNING]
==================================================================================
Remember the movie '300' about Spartan warriors? Think of Linux utilities like
Spartans. Don't joke around. They don't have a great sense of humor and they're
trained to obey without question. A Linux system will commit suicide if you ask it to.
==================================================================================
--------------------------------------------------------------------------------
rm my/thesis # instantly deletes my/thesis
alias rm="rm -i" # Please God, don't let me delete my thesis.
# you can alias the name of an existing utility to anything, eg: alias logout="echo 'fooled ya'"
# unalias is the anti-alias.
mkdir dirname # for creating dirname
rmdir dirname # for destroying dirname if empty
cp from/here to/there # COPIES from/here to/there
mv from/here to/there # MOVES from/here to/there (from/here is deleted!)
file this/file # what kind of file is this/file?
nano/joe/vi/vim/emacs # terminal text editors
gedit/nedit/jedit/xemacs # GUI editors
--------------------------------------------------------------------------------
=== The File Cache
When you open a file to read it, the Linux kernel not only directs the data to
the analytical application, it also copies it to otherwise unused RAM, called
the filecache. This ensures that the second time that file is read, the data
is already in RAM and almost instantly available. The practical result of this
caching is that the SECOND operation (within a short time) that requests that
file will start MUCH faster than the first. A benefit of this is that
when you're debugging an analysis by repeating various commands, doing it multiple
times will be very fast.
[[ioredirection]]
== STDOUT, STDIN, STDERR, and Pipes
These are the input/output channels that Linux provides for passing data among
you, the terminal, and your programs:
- *STDIN*, usually attached to the keyboard. You type, it goes thru STDIN and
shows up on STDOUT
- *STDOUT*, usually attached to the terminal screen. Shows both your STDIN stream
and the program's STDOUT stream as well as ...
- *STDERR*, also usually connected to the terminal screen, which as you might
guess, sometimes causes problems when STDOUT and STDERR are both writing to the screen.
BUT these input & output channels can be changed to make data dance in useful ways.
There are several IO redirection operators (plus the 'tee' command):
- *<* reads STDIN from file
- *>* writes STDOUT to a file
- *>>* appends STDOUT to a file
- *|* pipes the STDOUT of one program to the STDIN of another program
- *tee* splits the STDOUT and sends one of the outputs to a file. The other
output continues as STDOUT.
- *2>* redirects STDERR to file
- *2>>* appends STDERR to file
- *&>* redirects BOTH STDERR and STDOUT to a file
- *2>&1* merges STDERR with STDOUT
- *2>&1 |* merges STDERR with STDOUT and sends both to a pipe
- *|&* same as '2>&1 |' above
For example:
'ls' prints its output on STDOUT. 'less' can read either a file or STDIN. So..
--------------------------------------------------------------------
# '|' is an anonymous pipe; connects the STDOUT of 'ls' to the STDIN of 'less'
ls -lt *.txt | less
# if we wanted to capture that output to a file as well..
ls -lt *.txt | tee alltxtfiles |less
--------------------------------------------------------------------
While the above deals only with STDOUT and STDIN, you can also deal with STDERR
http://www.tldp.org/LDP/abs/html/io-redirection.html[in many confusing ways].
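A few of the more common STDERR cases can be sketched as follows. The function
'talk' and the file names ('out.txt', 'err.txt', 'both.txt') are made up for
the example; 'talk' is just a stand-in for any program that writes to both
channels.

[source,bash]
--------------------------------------------------------------------
cd "$(mktemp -d)"
# a stand-in program that writes one line to STDOUT and one to STDERR
talk() { echo "this is normal output"; echo "this is an error message" >&2; }
# send each stream to its own file
talk > out.txt 2> err.txt
# send both streams to the same file ('&> both.txt' is the bash shorthand)
talk > both.txt 2>&1
# merge STDERR into STDOUT, then pipe both into another program
nlines=$(talk 2>&1 | wc -l)
echo "$nlines lines went thru the pipe"
--------------------------------------------------------------------

Note that the order matters in the second case: '2>&1' has to come 'after'
the '> both.txt', since it means "send STDERR to wherever STDOUT points right now".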
=== How to use pipes with programs
Here's a simple example:
--------------------------------------------------------------------
# What is the average size of the files in this directory?
# remember
# ls -lR will recursively list the long file listing, which contains the size in bytes
# so
ls -lR |scut -F=4 | stats # will tell you.
--------------------------------------------------------------------
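'scut' and 'stats' are local utilities that may not be installed elsewhere. As
a rough equivalent, and only as a sketch, the same question can be answered
with 'awk', which ships with nearly every Linux system. It assumes that the
size is the 5th whitespace-separated field of the 'ls -l' output, and it counts
only regular files (lines starting with '-'). The three test files are made up
so the answer is known in advance.

[source,bash]
--------------------------------------------------------------------
cd "$(mktemp -d)"
# make 3 files of known size: 10, 20, and 60 bytes
head -c 10 /dev/zero > a; head -c 20 /dev/zero > b; head -c 60 /dev/zero > c
# keep only regular files (lines starting with '-'), sum the 5th field
# (size in bytes), and divide by the number of files seen
avg=$(ls -lR | awk '/^-/ { sum += $5; n++ } END { if (n) print sum / n }')
echo "average file size: $avg bytes"
--------------------------------------------------------------------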
Here's another. Break it up into individual commands and pipe each one into
'less' to see what it produces, then insert the next command to see what it does:
--------------------------------------------------------------------
w |cut -f1 -d ' ' | sort | egrep -v "(^$|USER)" | uniq -c | wc
w | less
w |cut -f1 -d ' ' | less
w |cut -f1 -d ' ' | sort | less
w |cut -f1 -d ' ' | sort | egrep -v "(^$|USER)" | less
w |cut -f1 -d ' ' | sort | egrep -v "(^$|USER)" | uniq -c | less
--------------------------------------------------------------------
Pipes allow you to mix and match output and input in various useful ways.
Remember STDOUT/STDIN when you're designing your own programs so you can
format the output and read the input in useful ways down the road.
=== tee, subshells, and pipes.
As alluded to above, 'tee' taps the STDOUT and sends one copy to a file (or set of files)
and allows the other copy to continue to STDOUT. This allows you to duplicate the
STDOUT to do all kinds of useful things to keep your data 'in flight'.
'tee' is especially useful in conjunction with subshells - starting a new shell to process
one branch of the 'tee' while allowing the STDOUT to continue to other analyses.
The use of subshells is one way to allow arbitrary duplication of output as shown
below:
------------------------------------------------------------------------------
tar -czf - nco | pv -trb | tee >(tee >(shasum > sha.file) | wc -c > wc.file) > /dev/null
# what does this do? What is /dev/null? How would you figure it out?
------------------------------------------------------------------------------
So the format is *| tee >(some chain of operations)*, repeated as needed, including another
'tee'. Whatever comes out to the right of the last ')' is the STDOUT, and you can process it
in any way you'd normally process it.
== Text files
Most of the files you will be dealing with are text files.
Remember the output of the 'file' command:
--------------------------------------------------------------------
Sat Mar 09 11:09:15 [1.13 1.43 1.53] hmangala@hpc:~/nco-4.2.5
566 $ file *
acinclude.m4: ASCII M4 macro language pre-processor text
aclocal.m4: Ruby module source text
autobld: directory
autogen.sh: POSIX shell script text executable
bin: directory
bld: directory
bm: directory
config.h.in: ASCII C program text
configure: POSIX shell script text executable
configure.eg: ASCII English text, with very long lines
configure.in: ASCII English text, with very long lines
COPYING: ASCII English text
data: directory
doc: directory
files2go.txt: ASCII English text, with CRLF line terminators <<<<
INSTALL: ASCII English text
m4: directory
Makefile.am: ASCII English text
Makefile.in: ASCII English text
man: directory
obj: directory
qt: directory
src: directory
--------------------------------------------------------------------
Anything in that listing above that has 'ASCII' in it is text, and
'POSIX shell script text executable' is also a text file.
Actually everything in it that isn't 'directory' is a text file of
some kind, so you can read them with 'less' and they will all look like text.
.DOS EOLs
[NOTE]
===========================================================================
If the file description includes the term 'with CRLF line terminators' (see <<<< above), it has DOS http://en.wikipedia.org/wiki/Newline[newline] characters. You should convert these to Linux newlines with http://linuxcommand.org/man_pages/dos2unix1.html[dos2unix] before using them in analysis. Otherwise the analysis program will often be unable to recognize the end of a line. Sometimes even filenames can be tagged with DOS newlines, leading to very bizarre error messages.
===========================================================================
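To see what this looks like in practice, the sketch below makes a small file
with DOS line endings, counts the carriage returns, and converts it. The file
names are made up, and 'tr -d' is used here only as a stand-in in case
'dos2unix' isn't installed; it strips every carriage return, which is fine for
ordinary text files.

[source,bash]
--------------------------------------------------------------------
cd "$(mktemp -d)"
# create a 2-line file with DOS (CRLF) line endings
printf 'gene1\t10\r\ngene2\t20\r\n' > dos.txt
# 'file dos.txt' would describe this as having CRLF line terminators
# count the carriage return characters in the file
ncr_before=$(tr -cd '\r' < dos.txt | wc -c)
# strip the carriage returns; 'dos2unix dos.txt' would do this in place
tr -d '\r' < dos.txt > unix.txt
ncr_after=$(tr -cd '\r' < unix.txt | wc -c)
echo "carriage returns: before=$ncr_before after=$ncr_after"
--------------------------------------------------------------------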
Text files are the default way of dealing with information on Linux. There are also binary files, such as '.bam' files, anything compressed (which a bam file is), many database files, and specialty data files such as netCDF or HDF5.
You can create a text file easily by capturing the STDOUT of a command.
In the example above, you could have captured the STDOUT at any stage by redirecting it to a file:
--------------------------------------------------------------------
# We use '/usr/local/bin' as a target for ls because if you do it in
# your own dir, the 1st command will change the number and size of the files
# and give a slightly different result for the second.
# '/usr/local/bin' is a stable dir that will not change due to this command.
ls -lR /usr/local/bin |scut -F=4 | stats
# could have been structured (less efficiently) like this:
ls -lR /usr/local/bin > ls.out
scut -F=4 < ls.out > only.numbers
cat only.numbers | stats
# note that '<' feeds the contents of the file on the right to
# the STDIN of the program on the left.
# '>' redirects the STDOUT of the app on the left to the file on the right
# while '|' pipes the STDOUT of the program on the left to the program on the right.
# what's the diff between this line?
cat only.numbers > stats
# and this line:
cat only.numbers | stats
# Hmmmmm?
--------------------------------------------------------------------
.Files vs Pipes
[NOTE]
===========================================================================
When you create a 'file', a great many operations have to be done to support creating that file. When you use a 'pipe', you use fewer operations and take up no intermediate disk space. 'Pipe' operations take place in memory, so they are usually much faster than writing a file to disk and reading it back. A 'pipe' does not leave any trace of an intermediate step tho, so if you need that intermediate data, you'll have to write to a file or 'tap the pipe' with a http://linuxcommand.org/man_pages/tee1.html[tee].
===========================================================================
=== Viewing Files
==== Pagers, head & tail
'less' & 'more' are pagers, used to view text files. In my opinion, 'less' is better than 'more', but both will do the trick.
--------------------------------------------------------------------------------
less somefile # try it
alias less='less -NS' # is a good setup (number lines, scroll for wide lines)
head -### file2view # head views the top ### lines of a file
tail -### file2view # tail views the bottom ### lines of a file
tail -f file2view # keeps dumping the end of the file if it's being written to.
---------------------------------------------------------------------------------
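A small sketch of how 'head' and 'tail' can be combined to view a range of lines from the middle of a file. The 100-line file is generated with 'seq' purely for illustration, and the explicit '-n' form of the line count is used.

```bash
workdir=$(mktemp -d)
seq 1 100 > "$workdir/file2view"        # a 100-line file: 1, 2, ... 100

head -n 3 "$workdir/file2view"          # the first 3 lines
tail -n 3 "$workdir/file2view"          # the last 3 lines

# Lines 16-20: take the top 20 lines, then the bottom 5 of those.
middle=$(head -n 20 "$workdir/file2view" | tail -n 5)
echo "$middle"
```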
==== Concatenating files
Sometimes you need to concatenate / aggregate files; for this, 'cat' is the cat's meow.
--------------------------------------------------------------------------------
cat file2view # dumps it to STDOUT
cat file1 file2 file3 > file123 # or concatenates multiple files to STDOUT, captured by '>' into file123
--------------------------------------------------------------------------------
=== Slicing Data
'cut' and http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html[scut] allow you to
slice out columns of data by acting on the 'tokens' by which they're separated.
A 'token' is just the delimiter between the columns, typically a space or <tab>,
but it could be anything, even a regex. 'cut' only allows single characters as
tokens, 'scut' allows any regex as a token.
--------------------------------------------------------------------------------
# lets play with a gene expression dataset:
wget http://moo.nac.uci.edu/~hjm/red+blue_all.txt.gz
# how big is it?
ls -l red+blue_all.txt.gz
# Now lets decompress it
gunzip red+blue_all.txt.gz
# how big is the decompressed file (and what is it called)?
# how compressed was the file originally?
# take a look at the file with 'head'
head red+blue_all.txt
# hmm - can you tell why we got such a high compression ratio with this file?
# OK, suppose we just wanted the fields 'ID' and 'Blue' and 'Red'
# how do we do that?
# cut allows us to break on single characters (defaults to the TAB char)
# or exact field widths.
# Let's try doing that with 'cut'
cut -f '1,4,5' < red+blue_all.txt | less # cuts out fields 1, 4 & 5 (fields count from 1)
# can also do this with 'scut', which also allows you to re-order the columns
# and break on regex tokens if necessary..
scut -f='4 5 1' < red+blue_all.txt | less # cuts out whatever fields you want;
--------------------------------------------------------------------------------
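If you want to try 'cut' without downloading anything, here is a minimal sketch on a tiny tab-delimited file. The column names 'Chr' and 'Pos' and all the values are invented; only 'ID', 'Blue' and 'Red' in positions 1, 4 and 5 follow the example above.

```bash
workdir=$(mktemp -d)
# Column names 'Chr' and 'Pos' and all values are invented for this demo.
printf 'ID\tChr\tPos\tBlue\tRed\n'  > "$workdir/mini.txt"
printf 'g1\t2L\t100\t5.5\t7.1\n'   >> "$workdir/mini.txt"
printf 'g2\t3R\t250\t0.4\t9.9\n'   >> "$workdir/mini.txt"

# Fields 1, 4 and 5, split on the default TAB delimiter.
picked=$(cut -f 1,4,5 "$workdir/mini.txt")
echo "$picked"

# 'cut' always emits fields in file order, even if you ask for '4,5,1'.
same_order=$(cut -f 4,5,1 "$workdir/mini.txt")

# A different single-character delimiter is set with -d (here a comma).
second=$(printf 'a,b,c\n' | cut -d ',' -f 2)
echo "$second"
```

Because 'cut' cannot re-order columns, a tool that can (such as 'scut') is needed when the output order matters.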
If you have ragged columns and need to view them in aligned columns,
use http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html#_the_cols_utility[cols]
or http://linux.die.net/man/1/column[column] to view the data.
'cols' can use any http://www.pcre.org/[Perl-compatible Regular Expression] to
break the data. 'column' can use only single characters.
--------------------------------------------------------------------------------
# let's get a small data file that has ragged columns:
wget http://moo.nac.uci.edu/~hjm/MS21_Native.txt
less MS21_Native.txt # the native file in 'less'
# vs 'column'
column < MS21_Native.txt | less # sliced by column
# and by cols (aligns the top 44 lines of a file to view in columns)
# shows '-' for missing values.
cols --ml=44 < MS21_Native.txt | less #
--------------------------------------------------------------------------------
=== Rectangular selections
Many editors allow columnar selections, and for small selections this may be the best approach.
.Linux editors that support rectangular selection
[options="header"]
|========================================================================================
|Editor |Rectangular Select Activation
|nedit |Ctrl+Lmouse = column select
|jedit |Ctrl+Lmouse = column select
|kate |Shift+Ctrl+B = block mode, have to repeat to leave block mode.
|emacs |dunno - emacs is more a lifestyle than an editor but it can be done.
|vim |Ctrl+v puts you into visual selection mode.
|========================================================================================
=== Finding file differences and verifying identity
Quite often you're interested in the differences between 2 related files, or in verifying
that the file you sent is the same one that arrived. 'diff' and especially the
GUI wrappers (diffuse, kompare) can tell you instantly.
--------------------------------------------------------------------------------
diff file1 file1a # shows differences between file1 and file1a
diff hlef.seq hlefa.seq # on hpc
# can also do entire directories
diff -r path/to/this/dir path/to/that/dir > diff.out &
# 'comm' takes SORTED files and can produce output that says which line
# is in file 1, which is in file 2 & which is in both.
# ie:
comm file1.sorted file2.sorted
# md5sum generates md5-based checksums for file corruption checking.
md5sum files # lists MD5 hashes for the files
# md5sum is generally used to verify that files are identical after a transfer.
# md5 on MacOSX, <http://goo.gl/yCIzR> for Windows.
md5deep -r somedir # can recursively calculate all the md5 checksums in a directory
--------------------------------------------------------------------------------
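A self-contained sketch of 'comm' and of a checksum comparison, using tiny invented files. The 'transfer' is simulated with a plain 'cp', and 'md5sum' is assumed to be installed (as it is on most Linux systems).

```bash
workdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$workdir/file1.sorted"
printf 'banana\ncherry\ndate\n'  > "$workdir/file2.sorted"

# comm prints 3 columns; -1, -2, -3 suppress the column of that number.
only_in_1=$(comm -23 "$workdir/file1.sorted" "$workdir/file2.sorted")  # lines unique to file 1
only_in_2=$(comm -13 "$workdir/file1.sorted" "$workdir/file2.sorted")  # lines unique to file 2
in_both=$(comm -12 "$workdir/file1.sorted" "$workdir/file2.sorted")    # lines in both files
echo "only in file 1: $only_in_1"
echo "only in file 2: $only_in_2"
echo "in both:"
echo "$in_both"

# Simulate a transfer with a plain copy, then compare the MD5 checksums.
cp "$workdir/file1.sorted" "$workdir/file1.copy"
sum_orig=$(md5sum < "$workdir/file1.sorted")
sum_copy=$(md5sum < "$workdir/file1.copy")
if [ "$sum_orig" = "$sum_copy" ]; then
    echo "copy is identical to the original"
fi
```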
=== The grep family
Sounds like something blobby and unpleasant and sort of is, but it's VERY powerful.
http://en.wikipedia.org/wiki/Regex[Regular Expressions] are formalized patterns.
As such they are not exactly easy to read at first, but it gets easier with time.
The simplest kind of pattern matching (not strictly a regex, but closely related) is called http://en.wikipedia.org/wiki/Glob_(programming)[globbing] and is used within bash to select files that match a particular pattern.
--------------------------------------------------------------------------------
ls -l *.pl # all files that end in '.pl'
ls -l b*.pl # all files that start with 'b' & end in '.pl'
ls -l b*p*.*l # all files that start with 'b', have a 'p' and later a '.', & end in 'l'
--------------------------------------------------------------------------------
Looking at nucleic acids, can we encode this into a regex?:
gyrttnnnnnnngctww = g[ct][ag]tt[acgt]{7}gct[at][at]
--------------------------------------------------------------------------------
grep regex files # look for a regular expression in these files.
grep -rin regex * # recursively look for this case-INsensitive regex in all files and
# dirs from here down to the end and number the lines.
grep -v regex files # invert search (everything EXCEPT this regex)
egrep "thisregex|thatregex" files # search for 'thisregex' OR 'thatregex' in these files
egrep "AGGCATCG|GGTTTGTA" hlef.seq
# gnome-terminal allows searching in output, but not as well as 'konsole'
--------------------------------------------------------------------------------
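To see the nucleic acid regex above in action, here is a small sketch that runs it through 'grep -E' on two invented sequences (neither is taken from 'hlef.seq'). The '-o' option, which prints only the matching text, is assumed to be available, as it is in GNU and BSD grep.

```bash
workdir=$(mktemp -d)
# Two made-up sequences: the first contains the motif, the second does not.
printf 'aaagcattacgtacggctatccc\n' >  "$workdir/demo.seq"
printf 'gggggggggggggggggggggg\n'  >> "$workdir/demo.seq"

# gyrttnnnnnnngctww written as a regex
motif='g[ct][ag]tt[acgt]{7}gct[at][at]'

# -E: extended regex (needed for {7}); -i: ignore case; -c: count matching lines
hits=$(grep -Eic "$motif" "$workdir/demo.seq")
echo "lines with the motif: $hits"

# -o prints only the part of the line that matched
matched=$(grep -Eio "$motif" "$workdir/demo.seq")
echo "matched text: $matched"
```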
http://www.regular-expressions.info/quickstart.html[This is a pretty good quickstart resource for learning more about regexes].
== Info About (& Controlling) your jobs
Once you have multiple jobs running, you'll need to know which are doing what. Here are some tools that allow you to see how much CPU and RAM they're consuming.
--------------------------------------------------------------------------------
jobs # lists all your current jobs on this machine
qstat -u [you] # lists all your jobs in the SGE Q
[ah]top # lists the top CPU-consuming jobs on the node
ps # lists all the jobs which match the options
ps aux # all jobs
ps aux | grep hmangala # all jobs owned by hmangala
ps axjf # all jobs nested into a process tree
pstree # as above
alias psg="ps aux | grep" # allows you to search processes by user, program, etc
kill JobPID # ask your job to exit by PID (sends SIGTERM)
kill -9 JobPID # force-kill your job by PID if it ignores the polite request
--------------------------------------------------------------------------------
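A minimal sketch of tracking a job by its PID and killing it, using 'sleep' as a harmless stand-in for a real long-running job. The variable names are invented for the demo.

```bash
sleep 300 &                  # a harmless stand-in for a long-running job
sleep_pid=$!                 # $! holds the PID of the most recent background job

ps -p "$sleep_pid" -o pid,user,comm   # show just that one process

kill "$sleep_pid"            # polite request to exit (SIGTERM, signal 15)
wait "$sleep_pid" 2>/dev/null   # collect its exit status
exit_status=$?
echo "sleep exited with status $exit_status"   # 128 + signal number when killed by a signal
```

Capturing '$!' right after starting the job saves you from having to grep for it in the 'ps' output later.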
=== Background and Foreground
Your jobs can run in the 'foreground' attached to your terminal, or detached in the 'background', or simply 'stopped'.
Deep breath.....
- a job runs in the 'foreground' unless sent to the 'background' with '&' when started.
- a 'foreground' job can be 'stopped' with 'Ctrl+z' (think zap or zombie)
- a 'stopped' job can be started again with 'fg'
- a 'stopped' job can be sent to the 'background' with 'bg'
- a 'background' job can be brought to the foreground with 'fg'
If you were going to run a job that takes a long time to run, you could run it in the background with this command.
--------------------------------------------------------------------------------
tar -czf gluster-sw.tar.gz gluster-sw & # This would run the job in the background immediately
...
[1]+ Done tar -czf gluster-sw.tar.gz gluster-sw
tar -czvf gluster-sw.tar.gz gluster-sw & # Why would this command be sub-optimal?
^ .. hint
--------------------------------------------------------------------------------
HOWEVER, for most long-running jobs, you will be submitting the jobs to the scheduler to run in 'batch mode'. See link:#qsub[here for how to set up a qsub run].
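For a job you do run in the background yourself, the following sketch shows one way to start it with '&', keep its messages out of your terminal, and wait for it to finish. The directory 'demo-dir' and the log file name are invented; this is an illustration, not a replacement for the scheduler.

```bash
workdir=$(mktemp -d)
mkdir "$workdir/demo-dir"
echo "some data" > "$workdir/demo-dir/a.txt"

# Start tar in the background; send anything it prints to a log file
# so it does not scribble over whatever you type next.
tar -czf "$workdir/demo.tar.gz" -C "$workdir" demo-dir > "$workdir/tar.log" 2>&1 &
tar_pid=$!

jobs                   # lists the background job while it is still running
wait "$tar_pid"        # block until that job is done
tar_status=$?          # 0 means tar finished without error

echo "tar exit status: $tar_status"
tar -tzf "$workdir/demo.tar.gz"   # list the archive contents to confirm
```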
=== Your terminal sessions
You will be spending a lot of time in a terminal session and sometimes the terminal just screws up. If so, you can try typing 'clear' or 'reset' which should reset it.
You will often find yourself wanting multiple terminals to hpc. You can usually open multiple tabs on your terminal but you can also use the 'byobu' app to multiplex your terminal 'inside of one terminal window'. https://help.ubuntu.com/community/Byobu[Good help page on byobu here.]
The added advantage of using 'byobu' is that the terminal sessions that you open will stay active after you 'detach' from them (usually by hitting 'F6'). This allows you to maintain sessions across logins, such as when you have to sleep your laptop to go home. When you start 'byobu' again at HPC, your sessions will be exactly as you left them.
.A 'byobu' shell is not quite the same as using a direct terminal connection
[NOTE]
==========================================================================================
Because 'byobu' invokes some deep magic to set up the multiple screens, X11 graphics invoked from a
'byobu'-mediated window will 'sometimes' not work, depending on how many levels of shell you're in. Similarly, 'byobu' traps mouse actions so things that might work in a direct connection (mouse control of 'mc') will not work in a 'byobu' shell. Also some line characters will not format properly. Always tradeoffs...
==========================================================================================
== Finding files with 'find' and 'locate'
Even the most organized among you will occasionally lose track of where your files are.
You can generally find them on HPC by using the 'find' command. 'find' is a very fast and
flexible tool, but in complex use, it has some odd but common problems with the bash shell.
--------------------------------------------------------------------------------
# choose the nearest dir you remember the file might be and then direct find to use that starting point
find [startingpoint] -name filename_pattern
# ie: (you can use globs but they have to be 'escaped' with a '\' or quoted)
find gluster-sw/src -name config\*
gluster-sw/src/glusterfs-3.3.0/argp-standalone/config.h
gluster-sw/src/glusterfs-3.3.0/argp-standalone/config.h.in
gluster-sw/src/glusterfs-3.3.0/argp-standalone/config.log
gluster-sw/src/glusterfs-3.3.0/argp-standalone/config.status
gluster-sw/src/glusterfs-3.3.0/argp-standalone/configure
gluster-sw/src/glusterfs-3.3.0/argp-standalone/configure.ac
gluster-sw/src/glusterfs-3.3.0/xlators/features/marker/utils/syncdaemon/configinterface.py
--------------------------------------------------------------------------------
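The reason for the escaping is that bash would otherwise try to expand the glob against the current directory before 'find' ever sees it. A small sketch on an invented directory tree shows that escaping with a backslash and wrapping the pattern in single quotes give 'find' the same pattern:

```bash
workdir=$(mktemp -d)
mkdir -p "$workdir/src/sub"
touch "$workdir/src/config.h" "$workdir/src/sub/configure" "$workdir/src/README"

# Backslash-escaped glob: bash passes the literal string 'config*' to find.
escaped=$(find "$workdir/src" -name config\* | sort)

# Single-quoted glob: same effect, often easier to read.
quoted=$(find "$workdir/src" -name 'config*' | sort)

echo "$quoted"
```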
You can also use 'find' to do complex searches based on file names, ages, sizes, etc.
Below is the command that finds
- zero sized files (any name, any age)
- files that have the suffix '.fast[aq]', '.f[aq]', '.txt', '.sam', 'pileup', '.vcf'
- but only if those named files are older than 90 days (using a bash variable to
pass in the '90')
The '-o' acts as the 'OR' logic and the '-a' acts as the 'AND' logic.
Note how the parens and brackets have to be *escaped* and the command is split over
multiple lines with the same backslash, but note that at the end of a line, it
acts as a 'continuation' character, not an 'escape'. Yes, this is confusing.
[source,bash]
------------------------------------------------------------
DAYSOLD=90
find . -size 0c \
-o \( \
\( -name \*.fast\[aq\] \
-o -name \*.f\[aq\] \
-o -name \*.txt \
-o -name \*.sam \
-o -name \*pileup \
-o -name \*.vcf \) \
-a -mtime +${DAYSOLD} \)
------------------------------------------------------------
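To convince yourself that the grouping works, here is a cut-down sketch of the same logic run against a throwaway directory. Several things are assumptions of this demo and not part of the command above: the file names are invented, only two suffixes are tested, '-type f' is added so that directories are never reported, the patterns are quoted instead of escaped, and the files are aged with 'touch -d', which is a GNU coreutils option.

```bash
workdir=$(mktemp -d)
: > "$workdir/empty.dat"                    # zero-sized, brand new
echo data > "$workdir/old.sam"              # matching suffix, will be made old
echo data > "$workdir/new.sam"              # matching suffix, but new
echo data > "$workdir/old.bam"              # old, but suffix not in the list
touch -d '100 days ago' "$workdir/old.sam" "$workdir/old.bam"   # GNU touch assumed

DAYSOLD=90
found=$(find "$workdir" -type f \
    \( -size 0c \
       -o \( \
            \( -name '*.sam' -o -name '*.txt' \) \
            -a -mtime +"${DAYSOLD}" \) \
    \) | sort)
echo "$found"
```

Only 'empty.dat' (zero-sized) and 'old.sam' (named suffix AND older than 90 days) should be listed.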
'locate' is another very useful tool, but it requires a full indexing of the
filesystem (usually done automatically every night) and will only return
information based on the permission of the files it
has indexed. So you will not be able to use it to locate files you can't read.
In addition, 'locate' will have limited utility on HPC because there are so many
user files that it takes a lot of time and IO to do it. It is probably most useful
on your own Linux machines.
--------------------------------------------------------------------------------
# 'locate' will work on most system files, but not on user files. Useful for looking for libraries,
# but probably not in the module files
locate libxml2 |head # try this
# Also useful for searching for libs is 'ldconfig -p', which prints the libraries in the
# system linker cache (built from /etc/ld.so.conf, not from LD_LIBRARY_PATH)
ldconfig -p |grep libxml2
--------------------------------------------------------------------------------
== Modules
'Modules' are how we maintain lots of different applications with multiple versions without (much) confusion. In order to load a particular module, you have to call it up with the specific version if you don't want the latest one.
Note that the latest one may not be the numerically largest one. Many packages (including Linux) number their packages such that '2.6.16' is newer than '2.6.3' (but older than '2.6.30').
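The version-numbering point can be checked with 'sort'. This sketch assumes a 'sort' that supports the '-V' (version sort) option, as GNU coreutils does; the three version strings are the ones from the paragraph above.

```bash
versions=$(printf '2.6.30\n2.6.3\n2.6.16\n')

# Plain sort compares character by character, so '2.6.16' lands before '2.6.3'.
plain=$(echo "$versions" | sort)
echo "plain sort:"
echo "$plain"

# Version sort compares each dot-separated number as a number.
natural=$(echo "$versions" | sort -V)
echo "version sort:"
echo "$natural"

newest=$(echo "$natural" | tail -n 1)
echo "newest: $newest"
```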
--------------------------------------------------------------------------------
module load app # load the module
module load app/version # load the module with that specific version
module whatis app # what does it do?
module avail # what modules are available
module list # list all currently loaded modules
module rm app # remove this module (doesn't delete the module, just removes the paths to it)
module purge # removes ALL modules loaded (provides you with a pristine environment)
# hint to be able to page thru the modules and search the names
alias modav='module avail 2>&1 >/dev/null | less'
--------------------------------------------------------------------------------
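The odd-looking redirection in the 'modav' alias works because the order of redirections matters. The sketch below uses an invented shell function, 'noisy', as a stand-in for a program that writes its useful output to STDERR (which is what the alias implies 'module avail' does here).

```bash
# Stand-in for a program that writes to both streams
# (this function is invented for the demo).
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

# '2>&1' first points STDERR at the pipe, then '>/dev/null' sends
# STDOUT (only) to the bit bucket. Only the STDERR text reaches the pipe.
piped=$(noisy 2>&1 >/dev/null | cat)
echo "reached the pipe: $piped"

# Reversed order: STDOUT goes to /dev/null first, then STDERR is pointed
# at the same place, so nothing reaches the pipe.
nothing=$(noisy >/dev/null 2>&1 | cat)
echo "reversed order gives: '$nothing'"
```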
== Getting files from the web
=== wget
'wget' will retrieve 'ftp' or 'http' URLs with a minimum of fuss, continuing a failed retrieval, creating a new name if a file already exists, and supporting a huge number of other options.
---------------------------------------------------------------------------
wget http://hpc.oit.uci.edu/biolinux/nco/nco-4.2.5.tar.gz # when outside HPC
# or
wget http://nas-7-1/biolinux/nco/nco-4.2.5.tar.gz # when on the HPC cluster
# now get it again.
wget http://hpc.oit.uci.edu/biolinux/nco/nco-4.2.5.tar.gz # when outside HPC
# or
wget http://nas-7-1/biolinux/nco/nco-4.2.5.tar.gz # when on the HPC cluster
# what happened?
---------------------------------------------------------------------------
Then uncompress it with gunzip.