### Changes, moved from top of pfp code
03-07-24
minor edits suggested by github comments (thanks Lucas Nussbaum)
- rsync option '-asl' is redundant as '-a' includes '-l'
- random 'cp' inserted into the README at top (thanks Jakub Wilk)
- used 'die' inappropriately at '--help' termination
2.24.23
[FIXED, finally - use kompare to track down weird paren deletions] using 2.249, with 3 send hosts, 2 of them aren't starting correctly bc of .. something.
If they're started after fpart has produced some chunks they work, so it's the initial startup that's failing.
[FIXED] the creation of the remote ssh SEND host cmd is duplicating the '--commondir' flag. Doesn't hurt, but cosmetically offensive.
[ssh c07n091 "export PATH=/mnt/HPC_T0/PKTEST/.pfp2:~/bin:/bin:/usr/sbin:/sbin:/usr/bin:$PATH; \
/mnt/HPC_T0/PKTEST/.pfp2/pfp2.249 --date=18.25.30_2023-02-24 \
--mstr_md5=52ef8cc4eefbe90f8905e47a43bc5ad5 \
--nowait --verbose=3 --maxload=20 --slowdown=0.5 \
--startdir=/mnt/HPC_T2/ITER/HPC/home/ITER --skipfpart --fpstart=2 --fpstride=3 \
--verbose=3 --NP=10 \
--commondir=/mnt/HPC_T0/PKTEST --commondir=/mnt/HPC_T0/PKTEST \
/mnt/HPC_T2/ITER/HPC/home/ITER \
10.110.5.15:/mnt/b4csz850/CODAC-BACKUP/DAN 2> /dev/null \
|& tee -a /mnt/HPC_T0/PKTEST/.pfp2/c07n091/pfp-log-18.25.30_2023-02-24 "]
2.22.23
[ ] missed a file named '2.1.1'$'\n' literally, including the $'\n'. (used --ro='-a'); try using NO rsync options (so use '-asl')
[x] change $n to $NBR_RSYNCS_RNG
[fixed] REALLY need to be able to track the remaining rsyncs in the master, or have the SLAVES keep emitting data after the pfp process ends, or not end pfp until all the rsync PIDs have ended.
( $rPIDs, $crr ) = get_rPIDs( $RSYNC_PIDFILE, $sPIDs );
[x] from above, maybe keep the scrolling bandwidth chart going until all the PIDs are gone?
better format?
[x?] at iter, sometimes get this error, which must be coming from the remote SEND host:
bash: /mnt/HPC_T0/PKTEST/.pfp2/pfp2.248: /usr/bin/env: bad interpreter: Not a directory
Does this need a delay or sync to make sure the copy is complete?
[x] $MYFPRNG is used to signal that FPART is still running - provides the same functionality as FPART_IS_DONE, simplify...
FPART_IS_DONE = 1 when it ends
$MYFPRNG = 0 when it ends; change to use FPART_IS_DONE
also the test:
if (-e "$FP_ROOT_DIR/FPART_DONE") { $MYFPRNG = 0; }
seems to be much overused. Once it gets set to zero, nothing will change it back, so more tests aren't going to change anything.
[x] # the PID does not cross nodes. THAT'S why I use the FPART_DONE file. Idiot. But I still don't know why it now gets set immediately after the command gets issued. Use the 'ps -h -p' test to tell it when to write out the file, which will communicate it to the slaves.
'if command; then command; else command; fi' is not significantly longer and will always work as expected.
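Something like this for the 'write FPART_DONE only when fpart really exits' part (sketch, hypothetical names; uses Perl's kill 0 on the master instead of shelling out to 'ps -h -p', and assumes $FP_ROOT_DIR is on the shared --commondir):
# poll the local fpart PID; only when it's really gone, write the FPART_DONE
# file to the shared dir so the slaves can see that chunking has finished.
sub watch_fpart_done {
    my ($fpart_pid, $fp_root_dir) = @_;
    while (kill(0, $fpart_pid)) {   # true as long as the fpart process still exists on THIS node
        sleep 1;
    }
    open(my $fh, '>', "$fp_root_dir/FPART_DONE") or die "can't write FPART_DONE: $!";
    close $fh;
}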
[x] when using POD, check which interface is active on a per-node basis.
1-23-23
[fixed] Looks like you've never asked for the checkhost option - I think the checkfile is deleted when everything else is.
[x] would be nice if, at least in SH mode, when an rsync PID was still active, it would continue reporting until it ended. (when doing unsplit huge files, xfer of those files can take much longer than pfp2 itself.) Loop thru get_rPIDs() until the PIDs go to 0.
[ ] Looks like the --bigfile option is not splitting the files inline, but waiting for all the splits to be done beforehand ...?
check with watch
[ ] About to send the reassemble script:
the reassemble script doesn't get the remote user name right; it references '@bigben' and doesn't prep rem_user correctly. If hjm@bigben is used, it works OK.
[ ] getting some rsync errors on the transfer of very large files, but they did transfer correctly in the end.
took 4.37m to xfer Downloads. pfp2 ended at 1m45s, but the xfer went on for a total of 4m37s.
Will prob take the same amount of time on a LAN, but it'll keep pfp2 running for the duration. yes, exactly that
[x] when --bigfiles, add INFO to say that it takes extra time to prep files before xfer starts
[ ] also reduce the bandwidth warning after 1st print to 1 line overwrite
stunted WARN: Bandwidth has been < 0.1MB/s for [10] checkperiods.
[x] fixed: the user@POD::/ format resulted in an error. Should be OK now.
[x] OK - long pause after:
stunted INFO: [102] files from the previous run have been cleared .. continuing.
This is due to deleting and re-initializing the dirs/files, and starting fpart. So a few seconds is
expected
[x] fixed skipping column header, prob due to FPSTART being changed.
[ ] About line 477 in pfpzot - check this out and make a sub if needed.
##: Send pfp2 utils to SEND hosts
if ($CHECKHOST) { # add to checkhost() when thrashed out.
# checking utilities required and providing them if not.
# this has to be DISallowed when communicating to an rsyncd server.
if ( $VERBOSE > 2 ) { INFO("Sending utilities [$allutils] to [$a[0]:${parsync_dir}]\n"); }
[x] fixed in 2.45 and on, get this error on MH xfers:
stunted INFO: [57] files from the previous run have been cleared .. continuing.
Error: any valid prefix is expected rather than "POD".
Command line is not complete. Try option "help"
where is this coming from? not pfp2
This is from the ssh command sent to bigben; I think it's parsing differently now.
ssh bigben "export PATH=/home/pfp/.pfp2:~/bin:/bin:/usr/sbin:/sbin:/usr/bin:$PATH; \
/home/pfp/.pfp2/parsyncfp2.245 --date=12.29.53_2023-02-14 \
--mstr_md5=0996e2b8e0b2404d16d995754366b853 \
--nowait --verbose=3 --maxload=16 --slowdown=0.5 \
--startdir=/home/pfp/ --skipfpart --fpstart=0 --fpstride=1 \
--NP=8 --commondir=/home/pfp /home/pfp/ \
>>> POD::/ 2> /dev/null \
|& tee -a /home/pfp/.pfp2/bigben/pfp-log-12.29.53_2023-02-14 "
in 2.44, it reads:
[ssh bigben "export PATH=/home/pfp/.pfp2:~/bin:/bin:/usr/sbin:/sbin:/usr/bin:$PATH; \
/home/pfp/.pfp2/parsyncfp2.244 --date=13.37.36_2023-02-14 \
--mstr_md5=eba8ec56449ef795c88fecf88fe98764 \
--nowait --verbose=3 --maxload=16 --slowdown=0.5 \
--startdir=/home/pfp/ --skipfpart --fpstart=0 --fpstride=1 \
--NP=8 --commondir=/home/pfp /home/pfp/ \
>>> tux@minilini-w:/home/tux 2> /dev/null \
|& tee -a /home/pfp/.pfp2/bigben/pfp-log-13.37.36_2023-02-14 "
so in 2.44 it's a valid host/path, while in 2.45 it's POD::/ - so the changes in the remote host path sub need to be fixed.
[ ] fixed where is the double // coming from?
stunted DEBUG: [771]: Looping to add the fpart target: ['/home/pfp//zots' ]
[x] using POD::/ as a target with parsyncfp2.245 results in this error:
stunted ** FATAL ERROR **: The MultiHost default REMOTE path [POD::] isn't formatted correctly.
It needs something on the right side of the '::'. Minimally a '/' to define a remote path
(POD::/path/to/dir), or an rsync server module name without a leading '/' (POD::dumpster).
This string is appended to any pathless hosts in the '--hosts' option. Please check.
I think this is due to the changes in the remote host path checking that I did recently cuz it's stripping the final '/' - I think I even remember doing that...
[ ] fixed --checkhost assumes MASTER user at all the hosts, even if the user is specified in the --hosts string as different. FIX
[ ] start on '--zotfiles|zf=(max avg file per chunk)' option. 'max avg file per chunk' can be an integer or string entered as 10k, 24M, etc, by being fed thru the conversion sub.
- the process is that, when requested, each f.# file is checked for # of lines with 'wc -l', and then the chunk size is divided by the # of lines. So a chunk of 10g divided by 464533 lines is
21527 bytes average per file; if zf=100K, that means the zf processing is invoked. If not, the f.# file is just fed to rsync as normal (altho it would be interesting to see if, on average, this is a good way of transmitting the data ALL the time.)
- if the F.# needs tgz (proc'g to a compressed tarball, regardless of the compression mech - lz4 is probably the best overall for speed and compression), then:
- append the decompression/untar command to the remote script array (@rem_untgz), then send the command out as a bg system command (system("tar -c --files-from=f.# -f - | lz4 > tar.lz4.# &"))
Note oddity below - what seems to work is:
$rem_cmd = "tar -c --files-from=f.# -f - | lz4 > tar.lz4.# ";
system("$rem_cmd &");
- this approach leaves the tgzs at the top level of the remFS, so the tgzs have to be untgzed from there. Demo this to make sure: The cmd below works, < 1s for 10M
looks like the --files-from option has to be a fqpn, not a ~ path name.
vvvvvvvvvvvvvvvv
tar --files-from=/home/hjm/.pfp2/fpcache/f.6 -czf - | lz4 > ./tgz-lz4.6
reminder: wherever the tgz is placed, tar strips the leading '/' so it recursively places dirs relative to the dir it finds itself in.
so if I tar up a f.# that has
dir7/util/jre/lib/zi/Asia/Aden
dir7/util/jre/lib/zi/Asia/Almaty
dir7/util/jre/lib/zi/Asia/Amman
dir7/util/jre/lib/zi/Asia/Anadyr
and transfer it to rem_host:/foo/bar/nu
and untgz it from there, it'll get placed as
/foo/bar/nu/dir7/util/jre/lib/zi/Asia/Aden
etc.
- this seems straightforward. ~ same as with the bigfile approach.
- have to wait for f.1 to start, but after that, check each f.# and do the calc for each one.
- median would be better than mean, but fpart doesn't give sizes of the individual files.
- bc file size stats are extremely skewed to a few large ones, the mean size is not going to be representative of the pop'n. median would be better. Could mod fpart to do it..? OK but for now, get the basics running using mean.
- want the zotfile proc'g happening on EVERY host or just the MASTER? since zotfile processing can be going on while fpart is still running, each SEND host could process it locally. That offloads the processing from the MASTER, but does it complicate things on the SEND? Since the f.# files are 'local' to all, the SEND hosts could do this, keeping the tar/compression load off the MASTER. So the SHs would have to eval the f.($CUR_FPI + $FPSTRIDE) file that they're given. This is probably the better way to do it.
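The per-chunk decision would look something like this (sketch, hypothetical sub; assumes $CHUNKSIZE and the --zotfiles limit are already converted to bytes):
# decide whether an fpart chunkfile is 'zotty' enough to be tarchived:
# avg bytes per file = chunksize / number of filenames in the f.# file
sub chunk_is_zotty {
    my ($fp_file, $chunksize, $zf_max) = @_;
    my $nlines = 0;
    open(my $fh, '<', $fp_file) or die "can't read [$fp_file]: $!";
    $nlines++ while <$fh>;
    close $fh;
    return 0 if $nlines == 0;
    my $avg = $chunksize / $nlines;      # eg 10g / 464533 lines ~= 21527 bytes
    return ($avg < $zf_max) ? 1 : 0;     # 1 => tar/lz4 the chunk, 0 => feed the f.# to rsync as usual
}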
12-12-22
[x] Start prefixing dev 'Notes To Self' with ##NTS so they can be stripped out of the release versions.
[x] start prefixing debug lines with #dd ditto
[x] TODO $rem_fqpath from parse_rsync_target() doesn't return a FULLY QUALIFIED remote path, just the remote path specified by the command - ie ~/junk -> ~/junk, not /home/tux/junk, which is what it should be.
not a complete fix, but needs better explanation - the local pfp exe can't know the FQP to a HOME without
an ssh (ssh tux@minilini "printenv HOME") to figure it out, so until I decide to do that, just munge as below.
[x] munge ~/dir on the receiver to delete '~/' so that rsync will digest it correctly.
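ie something like (sketch, hypothetical sub; works on the remote path string that parse_rsync_target() hands back):
# rsync treats a remote path without a leading '/' as relative to the login HOME,
# so stripping '~/' (or a bare '~') gives it something it can digest.
sub munge_remote_home {
    my ($rem_path) = @_;
    $rem_path =~ s{^~/?}{};      # '~/junk' -> 'junk', '~' -> ''
    return $rem_path;
}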
11-24-22
[x] separate infiniband (and wifi?) checks by option so we don't generate pauses and crashes checking for utilities that aren't needed. Assume that network is ethernet unless user specs wifi or IB. If so, then check. Wifi prob won't need anything but IB will need to distinguish between IPoIB & RDMA.
so need --ib
11-15-22
[] - add --bigfiles as an option to split files larger than the chunk size, rsync separately,
then reassemble at the remote end and delete the split chunks at both the local and remote ends when done. Otherwise, files bigger than the chunk size will just be rsynced intact, which might cause long delays in finishing the xfer, especially if they're started at the end. This is a bit complicated, but should work as a perfect // load.
[] - --bigfiles works as:
fpart now provides filesizes, so after fpart produces a complete file, if '--bigfiles'
(implies a file larger than the chunksize) then check if that file is > chunk size.
fpart Live mode:
-L live mode: generate partitions during filesystem crawling
-S do not pack files bigger than specified maximum partition size
but print them to stdout instead (needs -L and -s)
-s limit partitions to <size> bytes
the above fpart behavior skips trying to pack the larger-than-chunk-size files and prints their names to STDOUT
so std fpart output remains unchanged, just naming the files to be chunked:
(line numbers shown below are not part of output)
1 /home/hjm/nacs/circos/Originals/circos_cri-grants.conf
2 /home/hjm/nacs/circos/Originals/circos_cri-grants.png
3 /home/hjm/nacs/circos/Originals/circos_cri-grants.svg
4 /home/hjm/nacs/circos/Originals/circos_cri-interests-marmod.conf
5 /home/hjm/nacs/circos/Originals/circos_cri-interests-only.png
6 /home/hjm/nacs/circos/Originals/circos_cri-interests-only.svg
and the exceptions are listed:
$ ./fpart -L -S -s 10000000 -o chunks /home/hjm/nacs
S 23128222 /home/hjm/nacs/gpfs/gpfs-for-dummies.pdf
S 10356828 /home/hjm/nacs/nsf-nie/nsf-nie-campus-storage-plan.pdf
S 15016313 /home/hjm/nacs/LinuxJournal/LinuxJournal_195_Full_Enterprise.pdf
S 16935798 /home/hjm/nacs/LinuxJournal/LinuxJournal_198_full_DirB.pdf
S 19042107 /home/hjm/nacs/LinuxJournal/LinuxJournal_TF39.pdf
where all the files listed are larger than the chunk size specified with '-s'
(and the whitespace is tabs, not spaces).
so those should be APPENDED to the 'bigfile' as fpart goes along and then processed when
fpart exits successfully.
If NOT using the bigfiles processing, the larger-than-chunk-size file just terminates
the chunkfile
If so, fork a process to:
- read the bigfile; foreach line, if the size is < 1.2x chunksize, just copy that line to the next chunkfile, else split on chunksize to original name.xx and copy the filename to the next chunkfile in the series.
-
- split it in half (if close to the original size), or split into chunksize chunks to
be transferred later if it's >> chunksize.
- these chunked files live in the same dir as the original; they will be deleted as
soon as they're rsynced over.
- possibly compress it
- ex: cat /home/hjm/nacs/hpc-support/chr1.fa | gzip | split -b 1000000 - chr1.
(the gzip yields: 10 files of 10MB instead of 38
418K Nov 17 18:23 chr1.aj
977K Nov 17 18:23 chr1.ai
977K Nov 17 18:23 chr1.ah
...
977K Nov 17 18:23 chr1.aa) altho it does take the time to read, gzip, and write the results
for TB sized files, this may not be useful.
- create an f.big.#aa, f.big.#ab, etc temp file, log it to big.log and wait until fpart
has finished.
- then mv all the f.big.#aa serially to the appro f.# name to await processing as the
xfer continues.
- on completion of xfer, iterate thru big.log (if naming is good, do we need a big.log? yes,
if only to reconstitute the original files.) called 'remote_reassemble.sh' in the test script.
and cat the multiple parts of each file together, decompressing if nec.
if ($COMPRESS) {cat /path/to/dir/f.big.#* | $DECOMPRESS_PROG > {name from f.#}}
else {cat /path/to/dir/f.big.#* > {name from f.#}}
so the end of pfp2 is
scp the remote_reassemble.sh file over to the remote:/tmp/pfp2, chmod it to 'x'
ssh user@remote 'chmod +x /tmp/pfp2/remote_reassemble.sh ; /tmp/pfp2/remote_reassemble.sh', capturing stderr and stdout and dumping that to the output as INFO.
BUT, this delays the remote-join until the whole pfp2 is done. It would be more efficient
to cat each fileset when the file set has been finished. In order to do that,
each cat command has to be issued after the last of the fileset has been sent. Hmmm.
And in order to do THAT, have to track each fileset and when done, init an ssh command
to do the cat. Hmmm. Certainly doable, but wait til the original bigfiles approach is working.
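For the simple end-of-run version, something like this would generate the script (sketch, hypothetical names; @bigfile_roots would come from big.log):
# write a shell script that cats the f.big.#* pieces of each bigfile back together
# on the receiver, decompressing first if the pieces were compressed.
sub write_reassemble_script {
    my ($script, $remdir, $decompress_prog, @bigfile_roots) = @_;
    open(my $sh, '>', $script) or die "can't write [$script]: $!";
    print $sh "#!/bin/bash\n";
    foreach my $root (@bigfile_roots) {
        if ($decompress_prog) {
            print $sh "cat ${remdir}/${root}.* | $decompress_prog > ${remdir}/${root}\n";
        } else {
            print $sh "cat ${remdir}/${root}.* > ${remdir}/${root}\n";
        }
        print $sh "rm -f ${remdir}/${root}.*\n";   # drop the split chunks once rejoined
    }
    close $sh;
    chmod 0755, $script;
}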
08-17-22
[X] Correct help about FS to FS transfers
[x] detect when trying to do FS to FS transfers and decline and/or supply an alternative
[x] change handling of message that says:
stunted INFO:
The number of chunk files generated by fpart [0] are currently
fewer than the # of rsync processes you specified [4]. However, this may be
due to a large chunkfile size and a slow filesystem crawl by fpart.
If so, it will work itself out as both progress.
Did you check the dir tree / file list to make sure you're setting the chunk
size appropriately (--chunksize) ? It's currently set to [10485760].
Will sleep for 2s to let fpart catch up.
This is normal for the parsyncfp2 startup process.
or put it behind a --verbose=3 test
06-26-22
[ ] should write up a brief description of how pfp2 works in both SH and MH mode
[x] reduce default verbosity
[ ] start socket control; initiating machine is server; create, bind socket, listen, accept
sendhosts are clients, create socket at port given by server, connect to port
commands to create:
- kill fpart, pfp2 itself (which should kill all child rsyncs)
- report interim values of .. bandwidth, load, use this to make up a new graphical dashboard?
- set new values of period, maxload, etc
-
the server sends out the commandline with a specified port number, incremented +1 per sendhost
the clients create a socket at that port, listen for commands, and execute them internally based on an
internal loop.
see: https://www.javatpoint.com/perl-socket-programming
https://perldoc.perl.org/perlipc#TCP-Servers-with-IO::Socket
we'll need bidirectional comms, since the client will need to send collected data and the server will need to be able to send commands to control the slaves. Sockets are bidirectional so one connection should be enough.
actual usage: once the master has finished launching all the sendhosts, it goes into a loop where it checks all sendhosts for data, and can send them commands from the vocab:
- k (kill pfp2 on sendhosts)
- s (suspend pfp2 on sendhosts)
- c (change some parameter) checkperiod, maxload, verbosity, maxbw
- r (report) everything that can be reported NOW (if long checkperiod, useful to see current values)
might also report back on 'df -h' of the target filesystem -
the sendhost clients send back data in a stream of bytes formatted as
[hostname],[1m loadavg],[BW],[running PIDs],[susp PIDs],[chunks done]
(chunks total is from master running fpart)
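A minimal sketch of the master side with IO::Socket::INET + IO::Select (the port number and the command handling are placeholders, not a final design):
use IO::Socket::INET;
use IO::Select;

# master: listen on one port, accept a connection from each sendhost, then
# multiplex reads (status lines) and writes (k/s/c/r commands) with select().
my $server = IO::Socket::INET->new(
    LocalPort => 7777, Proto => 'tcp', Listen => 10, ReuseAddr => 1
) or die "can't listen: $!";
my $sel = IO::Select->new($server);

while (1) {
    foreach my $fh ($sel->can_read(1)) {
        if ($fh == $server) {                    # a new sendhost connecting
            $sel->add($server->accept());
        } else {                                 # status line: host,load,BW,runPIDs,suspPIDs,chunks
            my $line = <$fh>;
            if (!defined $line) { $sel->remove($fh); close $fh; next; }
            print "STATUS: $line";
        }
    }
    # eg, broadcast a 'r'eport command to every connected sendhost:
    # print {$_} "r\n" for grep { $_ != $server } $sel->handles;
}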
06-19-22
After convo with Peter:
[x] be sure to omit killing rsync daemons in the autogenerated kill script by omitting any lines that contain
'daemon'
[x] check the sequence that happens that creates 2 subdirs when a host is named via IP# -
it causes 2 subdirs to be created, 1 by IP # and one by resolved name if it can be resolved.
So if use 128.200.182.222, get subdirs '128.200.182.222' as well as 'bridgit.mmg.uci.edu'
Aha! OK, solved. This behavior is caused by one of the dirs being created by the master and one of them being created by the slave. The master creates the dir according to the given hostname, so if the
hostname is given as '10.34.124.44', that's what gets created. But on the slave side, even if there's no DNS entry, it has a hostname, so it creates the dir as its 'hostname -s', ie: 10.153.7.104 -> c07n104.
And sometimes you can't do nslookup as the names are stored elsewhere (like /etc/hosts), as is often the case in clusters where pfp2 is going to be used a lot. Since you need ssh access to the server, just ask it:
ssh 10.153.7.104 ' hostname -s' 2> /dev/null
c07n104
and use that hostname instead of the IP #. (ping won't work since if given an IP #, it uses it)
So if we encounter an IP# as a sending host (not required for receiving hosts - yet), just resolve it to
the short hostname and use that, bc it's going to be the same on both master and slave.
created sub short_hostname() to address this.
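Roughly what that sub does (sketch only, not the actual body):
# given a sending-host spec that may be an IP#, ask the host itself for its short
# hostname so that master and slave agree on the per-host subdir name.
sub short_hostname {
    my ($host) = @_;
    my $sn = `ssh $host 'hostname -s' 2> /dev/null`;
    chomp $sn;
    return ($sn ne '') ? $sn : $host;   # fall back to whatever was given if ssh fails
}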
[ ] go back to --bigfiles and --losf (in singlehost mode) when I get back to Irvine. Should be fairly
straightforward since I've written it previously (pfp2-losf).
Note from main code:
TODO: (LOSF fork only) modify for allowing the executable to be renamed and stop.
the losf stuff in MH mode (in pfp2c-losf) will require a better approach.
The current pfp2c-losf works fine in SH mode,
but falls apart in MH mode bc the g2g_*** stuff won't transfer across hosts. It would require a
file-based approach. So instead of using the g2g_-based trackers, return to using f.# but processed as
the g.# and s.# files. They're distinguished not by file name but by wc of the file & the contents
(1 file, ends in tar.lz4.pfp). But, instead of being able to do the lookahead via filenames, going
to have to do them via actually reading, evaluating the f.# file.
03-30-22
- [ ] to make a --delete option that works, ganael wrote this up (?):
see: https://github.com/martymac/fpart/blob/master/docs/Solving_the_final_pass_challenge.txt
It sounds a bit like the thing I came up with but his works on the local side only.
1st pass: launch fpart simultaneously on the local and remote ends.
once fpart is finished locally, handle the --bigfiles option to extend the number of f.# files.
cat all the files together, sort, compress with lz4 into a tarchive and send to the remote.
- this tarchive contains the best estimate of all the files that will be on the local side.
- this tarchive file can be 'shared' by writing it to the shared dir)
- on the remote side, when fpart finishes (writes a REMOTE_FPART_DONE file on the shared dir), cat all the files together, sort, and then diff against the local tarchive. something like
diff files.local files.remote | grep '^>' | sed 's/^> //' | xargs rm -f
(or use the rsync -a --delete empty/)
that should delete all the files that are on the remote side that ARE NOT on the local side.
& should be a much smaller collection than the local list
& can be done asynch after pfp2 officially ends if nec. Could also just start the job in the background
& report back by email when it's done.
- this may not handle empty dirs correctly - when all the files from a dir have been deleted, the dir may still exist unless there's a way of detecting empty dirs. fpart has an option for this I think.
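Sketch of how the master might kick that off (hypothetical variable names; note xargs rather than exec, since diff's '>' lines are just pathnames):
# run the delete pass on the receiver: remove files present remotely but absent locally.
# files.local / files.remote are the sorted, concatenated fpart lists from each side.
my $rem_delete = "ssh $rec_host \"cd $rem_dir && "
               . "diff files.local files.remote | grep '^>' | sed 's/^> //' | xargs -r rm -f\"";
system($rem_delete) == 0 or warn "remote delete pass returned non-zero: $?";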
03-11-22
- [ ] add a bigfiles silent option that uses fpart's new ability to 'S'kip files > chunk size and write to STDOUT a stream of those files that can be captured to a file, so that once fpart is done, there will be a file to be read of the form:
S[tab]size[tab]filename
once fpart is finished, read this file line by line, splitting it up and decide what to do with it.
- if the filesize < 2x chunksize, just rsync it as a single file.
- if it's >2x chunksize, split it into chunksize parts and put each chunk in its own f.#
- AND add that file to the list that has to be cat'ed back together again after everything has finished.
- do need to see how much speed this adds to the process. If it's too slow, why bother.
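Reading that captured list once fpart is done would look something like this (sketch; the variable names and the list filename are placeholders):
# each line of the captured fpart stream is: S<tab>size<tab>filename
open(my $bf, '<', "$FP_ROOT_DIR/bigfile.list") or die "no bigfile list: $!";
while (my $line = <$bf>) {
    chomp $line;
    my (undef, $size, $fn) = split(/\t/, $line, 3);
    if ($size < 2 * $CHUNKSIZE) {
        push @single_rsyncs, $fn;   # small enough - just rsync it as a single file
    } else {
        push @to_split, $fn;        # split into chunksize parts, one per f.#, and
                                    # remember it for the remote re-cat at the end
    }
}
close $bf;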
02-17-22
- [x] when syncing, get repeated output of this kind of stuff. Should dig into this in more detail to
figure out how to tell the difference between failure and a perfect sync.
c07n091 INFO: rsync log [/mnt/HPC_T2/testzone/pfp2/.pfp/c07n091/rsync-log-05.09.15_2022-02-18_75]
has 0 lines of transfer data, indicating failure or all remote files in that chunk are identical.
c07n091 INFO: rsync log [/mnt/HPC_T2/testzone/pfp2/.pfp/c07n091/rsync-log-05.09.15_2022-02-18_84]
has 0 lines of transfer data, indicating failure or all remote files in that chunk are identical.
c07n091 INFO: rsync log [/mnt/HPC_T2/testzone/pfp2/.pfp/c07n091/rsync-log-05.09.15_2022-02-18_90]
has 0 lines of transfer data, indicating failure or all remote files in that chunk are identical.
c07n091 INFO: rsync log [/mnt/HPC_T2/testzone/pfp2/.pfp/c07n091/rsync-log-05.09.15_2022-02-18_96]
Solved - due to ssh refusal from the UCI VPN after a few logins. Ditto the issue below.
- [ ] when syncing, get runs of this:
[[
c07n104 INFO: next chunk [69] of [114].
c07n103 INFO: next chunk [71] of [114].
c07n091 INFO: next chunk [76] of [114].
c07n104 INFO: next chunk [72] of [114].
kex_exchange_identification: Connection closed by remote host
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
kex_exchange_identification: Connection closed by remote host
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.3]
c07n104 INFO: next chunk [75] of [114].
05.09.51 0.40 0.07 0.00 / 0.00 2 <> 0 [74] of [114] < c07n103
c07n104 INFO: next chunk [78] of [114].
]]
so it looks like the REC host can't keep up with the rsyncs being spawned .. or something. Check this out in more detail.
see this:
https://askubuntu.com/questions/554754/rsync-creates-error-message-unexplained-error-code-255-at-io-c837
That would make it a server timeout whereby the REC host is taking too much time to respond, so the ssh client times out. I've added the recommended solution:
add these settings to your local ~/.ssh/config:
Host *
ServerAliveInterval 30 (maybe go to 60)
ServerAliveCountMax 6 (maybe go to 10)
and on the remote server (if you've got the access), setup these in your /etc/ssh/sshd_config:
ClientAliveInterval 30
ClientAliveCountMax 6
Definitely a condition where too many rsyncs are started at the same time. Maybe add an option to add time between rsync starts. --slowdown=3s ?? test ping times in checkhost() and return them at the same time.
if ping times > 1ms, emit INFO, and set sleep params based on the below.
emit warning to set . Mention the ssh config changes as well.
After increasing the sleep from 0.5 to 1s it looked like there were fewer fails, but there were still some.
increase to 2s and decrease the number of NP?
YES!! that did it! With 2s sleeps between rsyncs, no failures.
So have to ID all the targets and then ping all of them.
maybe..
if the ping avg is < 1ms, 0.5 sleep
if the ping avg is >1 && < 10ms, 1s sleep
if the ping avg is >10ms && < 30ms, 2s sleep
if the ping avg is >30ms && < 50ms, 3s sleep
etc
and just use the select() to substitute in the different periods of pause.
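Sketch of that mapping (hypothetical sub; the avg ping would come back from checkhost()):
# pick the pause between rsync launches based on the avg ping time to the target
sub rsync_start_sleep {
    my ($avg_ms) = @_;
    return 0.5 if $avg_ms < 1;
    return 1   if $avg_ms < 10;
    return 2   if $avg_ms < 30;
    return 3   if $avg_ms < 50;
    return 4;                       # beyond 50ms, be even more conservative
}
# usage, substituting the pause into the select() mentioned above:
#   select(undef, undef, undef, rsync_start_sleep($avg_ms));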
02-16-22
- [ ] - look at parse_rsync_target() & see if there's a good reason for distinguishing betw. $recv_hoststring & $TARGET...? See line ~352 in pfp2c.
- [x] Check generation of MHplotfile.sh - it doesn't pick up the user@host designation and munges the 2 together so it drops out any data that has a user@host send or rec bc it's not parsing the data correctly (I think).
- [ ] forbid MH rsyncs on a mounted FS for now, but non-host paths are fine here.
02-11-22
- [x] looks like the tailoff of BW that results in the send host hanging is bc the rsync on the rec end has finished but the send host rsync is waiting for the termination signal. YES, that's what it is.
maybe if it gets to that point, query the receiver to see if the rsync is still going?
This is due to UCI's firewall killing off repeated ssh login attempts. Has nothing to do with pfp2 or rsync.
02-10-22
- [ ] reduce width of output by selecting either tcp or RDMA on the scroll ?
02-09-22
- [x] there appears to be a serious bug in the syncing part of the code. For fresh forward syncs, it works fine, in SH and MH, but in MH mode, if there are a lot of files already transferred, the rsyncs hang at a fairly early stage and fail with a 'can't find the correct rsync log' (which I think is only hit during sync parts)
This is a serious bug.
SOLVED: This is UCI's firewall arbitrarily closing ssh connections. Finally got an answer after they completely closed access. Will leave the text below as a reminder of what happened. Delete eventually.
{
The below is (mostly) due to bad logic in the main rsync startup loop: subtracting the stride instead of adding it, and some other stuff. Ugly. The failing rsyncs were mostly due to bad creation of the chunk file paths that in some cases prefixed the 'home/pfp' argument to the file path, so you ended up with /home/pfp/home/pfp/..(fqfn). Also corrected. I'll leave the below in place until I'm sure no more exceptions come up.
It now pushes fresh data correctly AND syncs both full and partial data correctly, ALTHO: it sometimes hits the hanging rsyncs error, but they're gone by the time I check them, so it may be the rsync server dying off a bit behind the client.
So when it completes a new file push, it ends like this:
cooper INFO: Checking rsync logs vs chunkfiles) log_seq = [2], CUR_FPI = [80]
1565: r_fle_wc from ['/home/pfp/.pfp/cooper/rsync-log-19.22.03_2022-02-09_2' (/home/pfp/.pfp/cooper/rsync-log-19.22.03_2022-02-09_2)] is [681]
1565: r_fle_wc from ['/home/pfp/.pfp/cooper/rsync-log-19.22.03_2022-02-09_5' (/home/pfp/.pfp/cooper/rsync-log-19.22.03_2022-02-09_5)] is [1746]
1565: r_fle_wc from ['/home/pfp/.pfp/cooper/rsync-log-19.22.03_2022-02-09_8' (/home/pfp/.pfp/cooper/rsync-log-19.22.03_2022-02-09_8)] is [53]
...
1565: r_fle_wc from ['/home/pfp/.pfp/cooper/rsync-log-19.22.03_2022-02-09_74' (/home/pfp/.pfp/cooper/rsync-log-19.22.03_2022-02-09_74)] is [108]
1565: r_fle_wc from ['/home/pfp/.pfp/cooper/rsync-log-19.22.03_2022-02-09_77' (/home/pfp/.pfp/cooper/rsync-log-19.22.03_2022-02-09_77)] is [118]
1565: r_fle_wc from ['/home/pfp/.pfp/cooper/rsync-log-19.22.03_2022-02-09_80' (/home/pfp/.pfp/cooper/rsync-log-19.22.03_2022-02-09_80)] is [110]
cooper INFO: All rsyncs appear to have completed.
cooper INFO: Done. Please check the target to make sure expected files are where they're supposed to be.
but when it is in sync mode (already a lot of files on the remote end), the rsync logs aren't generated after a few iterations.
bigben completed OK.
stunted ended with:
kex_exchange_identification: read: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.3]
20.19.37 17.03 1.02 0.00 / 0.00 0 <> 0 [44] of [45] < stunted
stunted INFO: 1557: Checking rsync logs vs chunkfiles) FPSTART = [1], CUR_FPI = [43]
1565: r_fle_wc from ['/home/pfp/.pfp/stunted/rsync-log-20.02.28_2022-02-09_1' (/home/pfp/.pfp/stunted/rsync-log-20.02.28_2022-02-09_1)] is [226]
stunted INFO: rsync log [/home/pfp/.pfp/stunted/rsync-log-20.02.28_2022-02-09_22]
has 0 lines, indicating failure or all remote files in that chunk are identical.
that file contains:
1 2022/02/09 20:19:37 [120753] rsync: connection unexpectedly closed (0 bytes received so far) [sender]
2 2022/02/09 20:19:37 [120753] rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.3]
so a failure.
so maybe the chunks that are noted should be sent again..? Isn't that already set up in the SH version?
YES!
If this is a SingleHost send and you would like to resend them, answer 'Y' or 'y'
to the following question.
Would you like to re-send the chunkfiles that were noted above? [Ny]
and cooper ended with:
1565: r_fle_wc from ['/home/pfp/.pfp/cooper/rsync-log-20.02.28_2022-02-09_20' (/home/pfp/.pfp/cooper/rsync-log-20.02.28_2022-02-09_20)] is [0]
cooper INFO: rsync log [/home/pfp/.pfp/cooper/rsync-log-20.02.28_2022-02-09_20]
has 0 lines, indicating failure or all remote files in that chunk are identical.
1565: r_fle_wc from ['/home/pfp/.pfp/cooper/rsync-log-20.02.28_2022-02-09_23' (/home/pfp/.pfp/cooper/rsync-log-20.02.28_2022
which also contains evidence of error:
1 2022/02/09 20:19:38 [228401] rsync: connection unexpectedly closed (0 bytes received so far) [sender]
2 2022/02/09 20:19:38 [228401] rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.3]
/home/pfp/.pfp/cooper/rsync-log-20.02.28_2022-02-09_20 lines 1-2/2 (END)
trying to reuse the fpcache and re-send:
$ pfp2c --reuse --ro='-slaz' --chunk=30M --NP=3 --commondir=/home/pfp --maxload=30 --checkhost --hosts="bigben=bridgit,stunted=bridgit,cooper=bridgit" dir[123456] POD::/home/hjm/test
and get the same problem:
stunted INFO: next chunk [14] of [45].
bigben INFO: next chunk [13] of [45].
stunted INFO: 1557: Checking rsync logs vs chunkfiles) FPSTART = [1], CUR_FPI = [16]
1565: r_fle_wc from ['/home/pfp/.pfp/stunted/rsync-log-20.30.23_2022-02-09_1' (/home/pfp/.pfp/stunted/rsync-log-20.30.23_2022-02-09_1)] is [0]
1565: r_fle_wc from ['/home/pfp/.pfp/stunted/rsync-log-20.30.23_2022-02-09_4' (/home/pfp/.pfp/stunted/rsync-log-20.30.23_2022-02-09_4)] is [0]
1565: r_fle_wc from ['/home/pfp/.pfp/stunted/rsync-log-20.30.23_2022-02-09_7' (/home/pfp/.pfp/stunted/rsync-log-20.30.23_2022-02-09_7)] is [0]
20.30.31 0.07 0.16 0.15 / 0.00 3 <> 0 [15] of [45] < cooper
1565: r_fle_wc from ['/home/pfp/.pfp/stunted/rsync-log-20.30.23_2022-02-09_10' (/home/pfp/.pfp/stunted/rsync-log-20.30.23_2022-02-09_10)] is [0]
1565: r_fle_wc from ['/home/pfp/.pfp/stunted/rsync-log-20.30.23_2022-02-09_13' (/home/pfp/.pfp/stunted/rsync-log-20.30.23_2022-02-09_13)] is [0]
1568: weird, no [/home/pfp/.pfp/stunted/rsync-log-20.30.23_2022-02-09_16] written..?
stunted INFO: rsync log [/home/pfp/.pfp/stunted/rsync-log-20.30.23_2022-02-09_16]
has 0 lines, indicating failure or all remote files in that chunk are identical.
and now bigben and cooper are just timing out:
20.31.17 0.83 1.49 0.00 / 0.00 3 <> 0 [28] of [45] < bigben
20.31.19 0.87 0.30 0.00 / 0.00 3 <> 0 [33] of [45] < cooper
20.31.21 0.90 1.49 0.00 / 0.00 3 <> 0 [28] of [45] < bigben
20.31.23 0.93 0.27 0.00 / 0.00 3 <> 0 [33] of [45] < cooper
20.31.25 0.97 1.54 0.00 / 0.00 3 <> 0 [28] of [45] < bigben
20.31.27 1.00 0.33 0.00 / 0.00 3 <> 0 [33] of [45] < cooper
^^^^
and bigben rsyncs are just hanging:
Wed Feb 09 20:32:24 [1.66 0.83 0.69] hjm@bigben:~
301 $ sudo strace -p 649116
strace: Process 649116 attached
select(8, [7], NULL, [7], {tv_sec=59, tv_usec=458354}^Cstrace: Process 649116 detached
<detached ...>
Wed Feb 09 20:32:48 [1.82 0.93 0.73] hjm@bigben:~
302 $ sudo strace -p 649176
strace: Process 649176 attached
select(8, [7], NULL, [7], {tv_sec=42, tv_usec=144149}^Cstrace: Process 649176 detached
<detached ...>
Wed Feb 09 20:33:02 [1.50 0.90 0.72] hjm@bigben:~
303 $ sudo strace -p 649192
strace: Process 649192 attached
select(8, [7], NULL, [7], {tv_sec=29, tv_usec=251051}^Cstrace: Process 649192 detached
<detached ...>
executing the bigben cmd independently after killing everything:
ssh bigben "export PATH=~/.pfp:~/bin:/bin:/usr/sbin:/sbin:/usr/bin:$PATH; \
> ~/.pfp/pfp2c --date=20.30.23_2022-02-09 \
> --mstr_md5=dc0d871108bd4e9e1f804ad4c7cb3baa \
> --nowait --verbose=2 --maxload=30 \
> --startdir=/home/pfp --skipfpart --fpstart=0 --fpstride=3 \
> --reuse --ro=-slaz --NP=3 --commondir=/home/pfp --maxload=30 /home/pfp \
> bridgit:/home/hjm/test 2> /dev/null \
> |& tee -a /home/pfp/.pfp/bigben/pfp-log-20.30.23_2022-02-09
ends in much the same way - ends too early (22 of 45) and
20.34.41 0.13 1.40 0.18 / 0.00 2 <> 0 [22] of [45] < bigben
bigben INFO: next chunk [22] of [45].
bigben INFO: 1557: Checking rsync logs vs chunkfiles) FPSTART = [0], CUR_FPI = [24]
1565: r_fle_wc from ['/home/pfp/.pfp/bigben/rsync-log-20.30.23_2022-02-09_0' (/home/pfp/.pfp/bigben/rsync-log-20.30.23_2022-02-09_0)] is [0]
1565: r_fle_wc from ['/home/pfp/.pfp/bigben/rsync-log-20.30.23_2022-02-09_3' (/home/pfp/.pfp/bigben/rsync-log-20.30.23_2022-02-09_3)] is [0]
1565: r_fle_wc from ['/home/pfp/.pfp/bigben/rsync-log-20.30.23_2022-02-09_6' (/home/pfp/.pfp/bigben/rsync-log-20.30.23_2022-
This seemed to be a problem with spaces in the file names and rsyncopts being passed as '-slaz'. When no rsync opts are being passed, it works better, but there's still what looks like a host-order dependent bug. Change the order of the hosts and it goes away.
Also, the last chunk (or a 'goes-to-zero' bandwidth) error is problematic. Maybe instead of killing all the processes, issue a WARN("The bandwidth on host [xxx] has been close to zero for xx cycles. You can let it continue or you can kill the entire pfp run with the killscript [$killscript]")
More info: it seems that when in sync mode, the rsyncs finish very quickly (good) but pfp is not handling the quick finish very well so that it thinks that bc there aren't any more rsyncs running, the job has finished.
So we have to fine-tune the job end requirements.
15.29.26 3.97 2.18 1.21 / 0.00 1 <> 0 [14] of [14] < stunted
15.29.30 4.03 2.00 1.43 / 0.00 1 <> 0 [14] of [14] < stunted
15.29.34 4.10 1.92 0.51 / 0.00 0 <> 0 [14] of [14] < stunted
15.29.38 4.17 1.77 0.01 / 0.00 0 <> 0 [14] of [14] < stunted
15.29.42 4.23 2.19 0.07 / 0.00 0 <> 0 [14] of [14] < stunted
| Elapsed | 1m | [ wlp3s0] MB/s | Running || Susp'd | Chunks [2022-02-10]
Time | time(m) | Load | TCP / RDMA out | PIDs || PIDs | [UpTo] of [ToDo]
15.29.46 4.30 2.19 0.07 / 0.00 0 <> 0 [14] of [14] < stunted
15.29.50 4.37 2.73 0.04 / 0.00 0 <> 0 [14] of [14] < stunted
15.29.54 4.43 2.59 0.00 / 0.00 0 <> 0 [14] of [14] < stunted
15.29.58 4.50 2.47 0.03 / 0.00 0 <> 0 [14] of [14] < stunted
15.30.02 4.57 2.83 0.05 / 0.00 0 <> 0 [14] of [14] < stunted
15.30.06 4.63 2.83 0.01 / 0.00 0 <> 0 [14] of [14] < stunted
now kinda working except that finish criteria is off for the SEND hosts.
cooper and bigben are not sending their last few chunks.
}
01-30-22
-[x] is there a bug at the very end of long transfers where the error message:
Waiting for fpart to get ahead of the transfer..
comes on - see it when there are maybe 2 of 8 processes waiting to complete, when in fact the transfer is complete. Check ending condition..
- [x] test if --hosts, has to be a POD:: as target (!)
- [x] at exit, should re-emit the complete calling commandline as a reminder.
- [x] if you specify a chunk size larger than a single chunk, no errors are emitted and it just runs ad infinitum.
so it needs to check: if the max number of chunks when FPART_DONE < the number of chunks, emit a WARNING.
- [x] get_nbr_chunk_files() is really expensive in terms of FS activity. Should look at this to reduce the number of calls.
- [ ] re-write the plotting sequence (~1030) in native gnuplot to remove the dependency on feedgnuplot.
- [ ] consider spec'ing ITER send hosts to REQUIRE x GB NVME to support --losf / --bigfiles
bc... reading a dir tree to tar/compress it, or reading a huge file to split it into chunks, is really no faster than using rsync to send it. UNLESS it's only a single read/tar/compress stream into RAM or a fast SSD and then the chunk is sent from there. writing it back to even a fast // FS is a net loss. For --losf, it is probably useful for chunks of zotfiles -> tar -> lz4 -> rsync.
But then the remote data chunk has to be extracted again (optionally?)
ITER is almost certainly not going to be sending zotfiles, much more likely there will be hugefiles which need to get (possibly) compressed, split, then sent, then reconstituted on the remote side.
01-29-22
- [x] add --reuse to the options list to reuse the fpart chunk files so don't have to do a full recursive descent of the FS again. Is much more efficient than sending fpart down the tree, which takes up a lot of FS IO. Basically skips the deletion of the chunk files, forking fpart, and jumps directly to launching the rsyncs again. On TB size transfers, will save lots of operations.
- [ ] consider using /dev/shm as a cache again for the losf option. use a file called f.losf to store names or indices of chunks that need to be tarchived and do that using /dev/shm as the cache. Do only 1 or 2 at a time to prevent IO saturation, and just keep doing losf chunks as we go. Requires a separate loop to iterate over the chunks that need this, but that can be done in a fork or system call.
01-25-22
- [x] for pfp2c, add check so that the --checkhost has been run once to make sure that the utilities have been copied into place.
01-19-22
Consolidated changes TODO for pfp2
1 - [delay until fpart settles] test new fpart and integrate into pfp2 (starts at 1, not 0, captures bigfiles
from its stdout to postprocess)
2 - [x] change warnings to be emitted only with headers, not every output line
3 - [x] finish & verify changes to handle different remote users/hosts, etc in the new sub.
4 - [ ] if this works, can add --bigfiles option to postprocess them into chunks
5 - [ ] see if can bring in --losf into the singlehost version
6 - [no - too complex and not efficient enough] possible to integrate --cachefs into both
the --losf and --bigfiles, so can pre-tar/compress stuff
to another FS to save overall time to transfer. Currently, looks like there isn't a big improvement
in tar/send vs direct send, altho there is for LOTS of small files. Maybe detect the size of /dev/shm
and stream tar'ed/LZ4'ed chunks to /dev/shm to be transferred from there. NB: both the --bigfiles and --losf
options only help one-way, one time transfers, not true rsync ops. So once the files are there, can't use
--losf / --bigfiles to actually rsync the files.
- so --losf and --bigfiles would only be for onetime, oneway transfers.
- have to detect other FSs, ask about them, or demand it be specified as --cachefs=/FS/dir/to/use
if going to present possibilities: ignore: (from df)
^udev
^devtmpfs
^tmpfs (altho this includes /dev/shm)
^cgmfs
^/dev/loop*
none
devtmpfs 94G 0 94G 0% /dev
tmpfs 94G 28K 94G 1% /dev/shm
tmpfs 94G 238M 94G 1% /run
tmpfs 94G 0 94G 0% /sys/fs/cgroup
-
- [leave as is for now] Is re-using $ALTCACHE for COMMONDIR a good idea? Doesn't this confuse things down the line? What's the rationale for doing this? Obviously, you can't use COMMONDIR / MH with --altcache. Which is probably a good thing, but it does confuse things. There are some overlaps in terms of where to place the fpart chunks tho.
01-12-22
- for fpart, as shipped (from git clone), requires:
git clone https://github.com/martymac/fpart.git
then: 'cd fpart; aclocal; automake --add-missing; autoconf; ./configure; make -j4' in dir to make configure.
01-09-22
- can add the inline process:
tar, lz4 compress, and split in one process
tar + lz4 compression + split to 200M chunks in a pipe.
$ tar -cvf - isos | lz4 -c | split -b 200M - bigfiles/isos.tar.lz4_
End up with bigfiles/isos.tar.lz4_aa, ab, ac ...
for it to be time-saving, needs to go to nvme storage, then asap, rsync that sequence.
since the files are already chunk sized, can do them in sequence with plain rsync in a tight loop
outside of pfp, but definitely want to be able to keep track of files INSIDE of pfp. if we wait for
FPART_IS_DONE = 1, can just keep appending f.chunks to the list (after setting a var to indicate
that there will be more files AFTER $FPART_IS_DONE is set. maybe unset F_IS_D, and then set it
again after all the bigfiles are processed.)
But that delays sending the bigfiles chunks until the end & further... takes up more space than
you'd want on the nvme unless it's semi-infinite.
Ideally, you want to trickle the .aa, .ab files at the same time as the rest of the chunk sends,
and then delete the .ax files asap to make room for more.
- so really want to write a bigfile chunk to nvme, then start an rsync to move it, then as soon
as it's moved, delete it, and start the next one. or ..
- process then as they are discovered into nvme, then after processing
in nvme, mv them all back to the same FS where they originated, and do them as chunks
(the appended f. files just refer to them as single files). And after the transfer, ssh a script
file (that lives in the same dir as the chunk files named 'RECONSTITUTE_<filename_root>.sh' in
the same dir) that cats them all together and untars them into place.
- the above process uses nvme for fast initial IO, then does slow IO back to the original FS,
then regular rsyncing to move the chunks, then on-demand processing to put everything back
together. Assumes that remote clients might not want to expand the files immediately. Hmmm.
12-28-21
- NOTE about bigfiles. fpart currently just adds the size of the incoming files to the sum and
if the next file is 100G and the chunk size has been set to 10G, the next file exceeds the chunk
size and that chunk is completed. That chunk may now be ~10G+100G if the last file is 100G.
ie, there's no guarantee that the bigfile is processed in any different way. In order to handle it,
I need to run a find on the dir:
find . -type f -size +10G (chunksize)
& then process it into chunks/reassemble it or
01-04-22
- possible to use ssh to do some kind of reverse connection to request the server to make a
connection to the client to prevent the necessity of having the server do a connection to the client?
12-27-21
- note about client-side.
- instead of major mods to pfp2 to handle pull, what about writing a small interactive script that's
sent to a server requesting that the server push data back?
- would require much less work.
- can request all the options that pfp2 might require (# streams, size of chunks, if not already defined)
- can set up a crude incoming bytecount
- can
- one thing is how to allow short-time ssh keys to allow the server to push data onto user storage??
Not really, altho ssh-agent seems to imply that you can with the -t option.
- hmm - this is more complex than I thought. What's the requirement to PULL data via pfp2?
- have to rearrange the fields and organization to allow local storage targets and remote dir sources:
pfp2 --yadda --yadda --localdir=/put/data/here hjm@bridgit:/where/we/store/public/data/climate/bydate/2003/12/27/humidity
which would init a // rsync PULL from [hjm@bridgit:/where/we/store/public/data/climate/bydate/2003/12/27/humidity]
would have to query the rsync server for the max number of parallel rsyncs possible (hmm.
rsync rsync://bridgit.mmg.uci.edu
testpfp module for pfp test data
goober dat store for goober project
[no max rsyncs possible], but could encode this into the comment section. But could just make
it a high # and then ask ppl to try later. Also have to agree on the chunk size to allow
)
- but would have to set a limit of the number of // rsyncs possible
-
10-11-21
- [ ] can pre-check hosts with:
$NOTFOUND = `nslookup $host | grep "server can't" | wc -l `; chomp $NOTFOUND;
returns 1 for can't find it, 0 for can find it.
so: if ($NOTFOUND) { FATAL(...); }
- [ ] what about symlinked exe names in checkhost? should rsync them as well!
- [ ] test that targets like: c07n104=10.110.5.15::pfptest (rsync daemon)
get returned correctly by parse_rsync_target(). Seems to be failing on ITER
- [ ] this:
[~/bin/parsyncfp2 --ro='-sla' --verbose=2 --NP=16 --maxload=44 --checkper=10 --commondir=${pfpd}/$USER --hosts='c07n091=10.110.5.15,c07n103=10.110.5.15,c07n104=10.110.5.15' --startdir=${pfpd} data.rnd POD::/mnt/b4csz850/parsyncfp2-test ]
is getting translated individually as :
[10.110.5.15/mnt/b4csz850/parsyncfp2-test]
- [d] should also allow lines like --hosts='send1,send2,send3' ... rechost:path
when all the senders are sending to one host..? Or does this complicate things
too much? Yeah, too much confusion right now. Make sure that it works as
advertised RIGHT NOW.
- [ ] rethink the --losf option for MH. Can each client use the g2g approach once the chunk files are in place?
so that the clients don't have to share data structures with each other..?
In the SH version, the master took care of eval'ing the chunks and pre-processing the losf chunks.
In the MH version, should be able to separate the fpart chunking and the pre-processing for transfer
using the same approach as the SH version.
- or the master remains alive in the background, monitoring the transfer and providing tarchived chunks
to all clients using the g2g approach...? Or define the amount of free space for buffer and allow the
tarchived chunks to approach that limit?
- once the chunk exists, wc it, if there are too many files in it (&& the total number of waiting
tarchived chunks < NP - so have to figure out how to indicate that..), tarchive it and then Q it.
But it still requires that coordination of how to not exceed the total # of tarchive chunks. That's still going to require a lot of stat's if there's no socket communication.
- maybe it just exists with the SH version as written. Would still be useful.
- [ ] --bigfiles option in MH mode? Master takes care of all the splitting and the clients just act as if
they're normal chunks? Then at end, init a remote job to cat them all together again.
bigdir/big -> split into big.aa big.ab big.ac etc then
ssh rechost 'cd bigdir; for ii in [list of split bigfiles]; do cat ${ii}.* >> $ii; done'
- for the above, assume that fpart is waiting until split finishes.
- if do regular fpart 1st and start split at the same time, may save some time, but makes the interleave
a bit difficult. remember, fork splits the exe and all vars up to that point are shared, but not after
system starts a new exe, with NO shared vars. So could fork after starting the split
- df | grep -v '/dev/loop\|/run\|/sys\|udev\|Filesystem\|devtmp' | scut -f='0 3 5'
will generate the list of usable filesystems to use for buffering:
on bridgit, it gives:
/dev/sda1 63177456 /
tmpfs 16450160 /dev/shm # where the spaces are tabs.
/dev/md0 1952396580 /data
(1K blocks)
on hpc3 & stunted it works ok too
process this into a hash as it's proc'ed, using the mount as the key and the avail space as the value.
ignore the device; don't even take the dev
- initial test: if ($MAXBUF > $AVAILSPACE) {FATAL("Available space on device < the BUF size you provided [MAXBUF]")}
- "split -b $CHUNKSIZE filename filename__" to split
- "cat filename.* >> filename" to reconstitute.
- emit a warning that the preproc will take some time.
- crudely: 'for ii in $BIGDIR/*; do split -b $CHUNKSIZE $ii ${ii}__ ; done'. Has to check that it hasn't
exceeded the $MAXBUF size at each iteration:
$Nfiles = @files = glob("$BIGDIR/*");
$filecnt = 0;
while ($cur_buf < $MAXBUF && $filecnt < $Nfiles) {
$cur_buf += -s $files[$filecnt];
if ($cur_buf < $MAXBUF && $cur_buf < $AVAILSPACE) { # split the file -> file__aa, etc into a temp --startdir subdir
} else { FATAL("Exceeded max space allocated to bigfiles"); }
$filecnt++;
} # so now at $MAXBUF or run out of files
if ($filecnt >= $Nfiles) { $stillbigfiles = 0; } # finished all the files
- this will duplicate the storage until the split files (or original) have been rm'ed, so maybe
require a --maxbuf=s #bytes, #[KMGT] to limit the size of the temp duplication??
- the 1st way to do this is to wait for the splits to finish, THEN kick off fpart. for TB of files, this will take a while, say @ 100M/s R/W a TB will take 167m(!) just to split.
- the 2nd way is to kick off fpart and rsyncs, and then back up to do another pfp2 on the bigfiles dir.
- but actually, don't need to wait for fpart to do anything since they're already in chunk size. so all we have to do is append those files AS chunk files. BUT, have to keep track of where they are and the original base filenames.
f.7467 f.7468 f.7469 f.7470 f.7471 [end of fpart chunks]
AHA! each fqpn splitfile goes into its own f.# file and gets processed as normal, and the splitfile should
be reconstituted into the correct dir as a split file, but in an optionally different place. By default
the splitfiles go in with their parent, and then get deleted when xferred (via a --bigfile-dependent loop)
ie: the file /home/hjm/bigfiles/bigA_ac -> f.7472 as '/home/hjm/bigfiles/bigA_ac', gets transferred,
then can be reconst at the end.
once fpart writes FPART_DONE, the split can start splitting and adding the chunks to the list.
--bigfiles loop also has to watch for 'BIGFILES_DONE' being written to assure that splitting is done as well.
so if there are only a few (or no) smaller files to xfer, fpart finishes v fast and the splitting starts.
could also watch for first few files to start xfering. ie don't wait for the whole split, just the
splits now added as aliases?
(in rsync loop)
if ($BIGFILES) {$CUR_FP_FLE = $CUR_SPLIT_FLE; $CUR_FPI++} # $CUR_FPI isn't a part of the $CUR_FP_FLE but is kept current with it.
rsync_cmd .. --files-from=$CUR_FP_FLE
- so the --bigfiles option requires:
--bfdir=/path/to/the bigfile/dir = $BF_DIR
--bfmaxbuf=xxx[KMGT] = max space to use while splitting the bigfiles (splitting each bigfile will double the space requirements) == $BF_MAXBUF
--bfsplitdir=/dir/to/store/splitfiles
[/path/to/bigfile -> split -> /otherfs/path/to/store/splitfiles] = $BF_SPLIT_DIR
(if !defined $BF_SPLIT_DIR) {write a routine to suggest possible options using df approach above.}
- if there are bigfiles in the fpart tree (much larger than the partition size), should emit a warning and note the --bigfile option
- if there's only a --bfdir given, bypass fpart and start up rsyncs as fast as the splitfiles appear,
then keep feeding rsyncs in the usual fpart file way (if possible)
- if there's a 'normal' fparty set of files to process, set up the fpart run, then
10-06-21
- [x] add a final exit test to see if there are any spare rsyncs running that may have
been left orphaned by rsyncs dying at the remote end.
- [x] the end conditions allow pfp2 to exit while its rsyncs are still finishing. Need a final loop to
wait until those rsyncs are finished.
- [ ] INFO & WARN should have non-host equivalents that do NOT show hosts unless they need to.
- [x] work on the checkperiod/launch new rsync problem
could use the system clock and only calc bandwidth every checkperiod sec, but check rsyncs every 0.1s
so .. YES! Done! Finally! But it cycles too fast - add a sleep of ~1s to prevent the loadavg from pointlessly rising.
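Roughly what that loop looks like (sketch; the sub names and vars are placeholders, Time::HiRes for the sub-second sleep):
use Time::HiRes qw(sleep time);

my $last_bw_check = time();
while ($transfer_running) {
    # launch a replacement rsync as soon as a PID disappears, checked every ~0.1s
    start_new_rsync() if (count_running_rsyncs() < $NP && more_chunks_left());
    if (time() - $last_bw_check >= $CHECKPERIOD) {
        calc_and_print_bandwidth();           # the scrolling status line
        $last_bw_check = time();
    }
    sleep 0.1;    # or ~1s, per the note above, so the loadavg doesn't rise pointlessly
}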
- [x] iron out where all the friggin config and log and cache files are going to live.
They should live in ONE PLACE!
- [x] there's a long pause after:
(hostname) INFO: The detritus from the previous run has been cleared .. continuing.'
why is that? (solved: 2 sleep 1; not needed.)
[x] stop double check of the REC host (bridgit) - should be done already - oops - mistakenly
commented out that line
[x] there's an off-by-one difference in calculating CUR_FPI between SH and MH - needs to be fixed.
[x] add a final check if the corresp rsync log is 0 to check for failed rsyncs and re-transmit if so..
THIS IS REALLY IMPORTANT ON MULTIHOSTS! Even when it ends correctly, some of the rsyncs can fail to
transmit correctly.
[x] perhaps one way of dealing with failed rsyncs is to check the bandwidth for X successive loops &&
if it's 0.00 for all of them, kill off the rsyncs with an error message, go directly to the retry.
[x] and finally kill all the 'rsync --server' procs still running that the user owns. Just run the pfpstop
script.
[x] verify filelists still work in SH and MH modes.
[x] verify loadlevelling still works in SH & MH modes
09-20-21
- added crude tags to src to allow better doc
- finished 1st complete edit of HTML manual.
- decrufted src considerably
- pfp2c now works in SH and MH mode, does crude hoststring munging (but doesn't use
parse_rsync_target() yet, tho it now works OK.)
- from here on, clean up code, and draw up the block diagram for ITER.
- now allows for executable to be named anything (parsyncfp2, pfp2, pfp2c, turtletooth, etc.)
and for that exe to be transferred to SEND hosts during checkhosts() and will set the path
to it correctly.
- verbosity now looks like it's working again (tho dunno why it stopped working, tho there
certainly was oppo for it during all the losf replumbing)
- modify for allowing the executable to be renamed and stop.
the losf stuff in MH mode (in pfp2c-losf) will require a better approach.
- the current pfp2c-losf works fine in SH mode,
but falls apart in MH mode bc (OF COURSE) the g2g_*** stuff won't transfer across hosts.
It would require a file-based approach or sockets. So instead of using the g2g_-based
trackers, return to using f.# but processed as the g.# and s.# files. They're distinguished
not by file name but by wc of the file & the contents
(1 file, ends in tar.lz4.pfp). But, instead of being able to do the lookahead via filenames, going
to have to do them via actually reading, evaluating the f.# file.
09-18-21
- the bit about matching md5 checksums works correctly, and should allow user to change program
name ie if user changes parsyncfp2 to pfp2 or 'stumblebum', it should continue to work.
So detect the name of the calling program and use THAT to update send hosts and to start
the program on the send hosts.
09-12-21
- all bits work in single host mode. now starting to test in MH mode.
- using bigben and stunted to bridgit 1st. There's still a problem with bridgit dropping rsync connections
- may need an update and reboot..?
09-03-21
- in re-factor, GLOBAL VARS ARE IN CAPS. local my vars are in lc
- distinction between local and global: declare GLOBALS at the top, locals as we go.
- why have a startup loop and a main loop? why not just have the main loop incr up to the NP
and then keep going. that way don't have to do everything 2x. Next major hacking after the
--losf option starts working all the way thru.
- the above proc would work like:
- launch f.0 regardless
- while (conditions to add rsyncs:
rsyncs_running <= NP
) {
check for new f.#'s
eval f.#, convert OK to p.#, add to tgz if nec (and leave f.# in place?) or convert to p.# and then
proc the p.# as needed..? that sounds like it would fit in better. so consume f.#s as they appear and convert
to p.# which would go into @tgz to be proc'ed later or @g2g to be proc'ed immediately.
- etc/
}
09-02-21
- check that remotes have lz4(!)
08-27-21
In the process of hashing out the tgz operations, I'm removing the fork() that contains sending off fpart.
fpart runs perfectly well in the bg with a simple system("$fpart_cmd").
That allows the tgz routines to stay in the main process and share variables via @tgz and @g2g as previously.
So the current pfp2 should now include the no-fork (or less forked) version. Migrated to the pfp2-losf version.
# 08.06.21
# Looks like threads are not a good fit for Perl. The default perl in most distros does not have
# threads enabled due to poor perf and compatibility with lots of older, popular modules.
# So try to do the remote control without threads. CAN use SELECT to bounce between control sockets.
# Output from the slave prcesses can still continue to print and that output will still be displayed on the
# screen in roughly the order it gets created (mod transmission times). So that doesn't require any
# socket stuff
# BUT comms to the slaves requires sockets. It could be 1 common socket that is then SELECTed to
# talk to all of them and 1 socket for each slave to control each one individually. ie ports defined as
# 0 1 2 3 4 5 .. #slaves
# 0 = broadcast (all slaves listening on this)
# 1 talks to slave 1
# 2 talks to slave 2 etc
# <tab> cycles thru all sockets, starts at 0
# or just 1 2 3 4 and 'all' just cycles thru all of them quickly
# Re: reducing time to start on additional rsyncs, use a tight loop to monitor rsync exit values and
# launch new ones right away. ie I have all the PIDs, so when they disappear, the rsync has finished.
# I use fork to separate fpart from parsyncfp, can I do the same with a monitor system that watches
# the PIDs or the exit values?
#
# 07.31.21
# post-rename, some errors have crept in. .. and solved.
# [ ] new rsyncs are started only during the cycle time (--checkperiod). That should be the time that the
# display updates. new rsyncs should start /immediately/ when the previous one stops. If not,
# a checkperiod of a minute will delay new rsyncs from starting until that period ends. This might be
# addressed by wrapping the startanewrsync code as sub() and then calling it immediately after the
# running PID ends. This can be handled in a thread that will immediately go off and start a new rsync
# while the parent stays in the checkperiod loop. (see p340 LStein book) line ~1711
# 06.22.21
# [x] time to rename back to parsyncfp/pfp. But with version 2 - parsyncfp2/pfp2 to keep separate
# from original pfp.
# So rename all instances in this file to parsyncfp2/pfp2 and name file parsyncfp2.
# [ ] work on the --losf (or --zotfiles?) to tar/compress zotfiles before sending. Define what
# a chunkfile of zotfiles looks like. > X files / GB? allow user to define?
# ie what defines losf? define losf as median size of the files in a chunk < 1MB
# [ ] also matching option: '--bigfiles=dir' where that dir has one or more enormous files, which have
# to be 'split' to the size of '--chunk', the chunks sent over, and then each file reconstituted
# with 'cat'.
# so 'split -b $CHUNK ${dir}/$filename ${dir}/${filename}.' will yield $CHUNK sized subfiles
# in the same ${dir} named ${filename}.aa
# ${filename}.ab ${filename}.ac etc
# and 'cat ${filename}.* > ${filename}' will reconstitute it on the remote end.
# [x] Do a checksum of parsyncfp2s being used to verify that we're using identical parsyncfp2s
# on different hosts. ie pass '--mstr_md5=$MD5SUM' to slave processes.
# [ ] Squash the underrun bug. ie when total number of chunks are close to hosts * NP, get premature exits on