assignment4

Transposable element (TE) annotation is a tedious, iterative process. Here is the general workflow:
1. Generate a library of putative TEs (RepeatModeler) --> generates a .fas file.
2. BLAST that library against your genome (BLAST) --> generates a .out file.
3. Use the BLAST output to extract hits from the genome (extractAlignTEs.pl) --> generates a lot of .fas files. Some of these are .muscle.fas files, i.e. alignments that include a CONSENSUS sequence generated by the person doing the work.
4. Collect the CONSENSUS sequence from each individual .muscle.fas file into a single file and degap them, so the combined file can be re-BLASTed by repeating step 2.
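
The last step above (collecting and degapping the consensus sequences) could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions that are not stated in this handout: the consensus record has the word CONSENSUS in its FASTA header, gaps are written as "-", and the alignment file name (minus .muscle.fas) is a reasonable prefix to keep the output names unique. The function names and the output file name are made up for illustration.

```python
from pathlib import Path

SUFFIX = ".muscle.fas"


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collect_consensus(align_dir, out_path, tag="CONSENSUS"):
    """Write every degapped consensus record from *.muscle.fas files to out_path.

    Returns the number of consensus sequences written.
    """
    count = 0
    with open(out_path, "w") as out:
        for fasta in sorted(Path(align_dir).glob("*" + SUFFIX)):
            prefix = fasta.name[: -len(SUFFIX)]
            for header, seq in read_fasta(fasta):
                # Assumption: the consensus header contains the tag text.
                if tag in header:
                    degapped = seq.replace("-", "")  # assumption: "-" marks a gap
                    out.write(f">{prefix}_{header}\n{degapped}\n")
                    count += 1
    return count
```

Calling collect_consensus("alignments", "new_library.fas") would then produce one file that can serve as the TE library for the next round of step 2.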

Automating as much as possible will save weeks of work for upcoming projects. 

This week's task:
Use Python to:
1. Run BLAST: blastn -query <TE library> -db <genome> -out <arbitrary name of outfile> -outfmt 6 (-outfmt 6 makes the output tab-delimited)
2. Run extractAlignTEs.pl: extractAlignTEs.pl --genome <genome> --blast <blast output file> --consTEs <TE library> --seqBuffer <# of bps flanking TE> --seqNum <# of closest blast hits> --align
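
One possible way to drive the two commands above from Python is the subprocess module. The sketch below is not a reference solution. It assumes that blastn and extractAlignTEs.pl are on your PATH (and that the script is executable), and that the name passed to -db already refers to a BLAST nucleotide database. The function names are made up, and the flanking-length and hit-count values are left as arguments because this handout does not fix them.

```python
import subprocess


def blast_cmd(library, genome_db, blast_out):
    """Build the blastn command from step 1 as an argument list."""
    return ["blastn", "-query", library, "-db", genome_db,
            "-out", blast_out, "-outfmt", "6"]


def extract_cmd(genome, blast_out, library, seq_buffer, seq_num,
                script="extractAlignTEs.pl"):
    """Build the extractAlignTEs.pl command from step 2 as an argument list."""
    return [script, "--genome", genome, "--blast", blast_out,
            "--consTEs", library, "--seqBuffer", str(seq_buffer),
            "--seqNum", str(seq_num), "--align"]


def run_round(library, genome, blast_out, seq_buffer, seq_num):
    """Run BLAST, then extractAlignTEs.pl; raise if either exits non-zero."""
    commands = (
        blast_cmd(library, genome, blast_out),
        extract_cmd(genome, blast_out, library, seq_buffer, seq_num),
    )
    for cmd in commands:
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit status,
        # so the second command never runs on a failed BLAST.
        subprocess.run(cmd, check=True)
```

Passing the command as a list (rather than one shell string) avoids quoting problems with file names, and keeping the command builders separate from run_round makes it easy to print and inspect a command before running it.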

Starting files are as follows:
cPer_rn.fa.gz
extractAlignTEs.pl
starting_library.fas
These files can be found at: /lustre/work/daray/extractAlign_assignment
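
The genome in the list above is gzip-compressed. Assuming the downstream tools expect a plain FASTA file (not verified here), it can be decompressed from within Python using only the standard library. The function name and the destination file name are arbitrary.

```python
import gzip
import shutil


def gunzip(src, dest):
    """Decompress src (.gz) to dest, streaming so the genome is never fully in memory."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

For example, gunzip("cPer_rn.fa.gz", "cPer_rn.fa") would write the uncompressed genome next to the original.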