MGMT process

This page includes information about the MGMT (more genes, more taxa) simulations for the paper by Erik Barry Erhardt and Laura Salter Kubatko, "The Effect of Incomplete Lineage Sorting on the More Genes or More Taxa Debate".

Systematic Biology - Manuscript ID USYB-2007-069

Submitted: 9 May 2007

Revision request: 30 Jul 2007

Resubmission planned date: end of October 2007?

Directory Structure

user root directory

 ~/
  • these two files together calculate the node load on specified machines in the cluster
 nodeload.bat
   #!/bin/bash
   # path to load average finding script
   # should simply print the number of CPUs that are free
   la_cmd=~/calc_free_cpus.pl
   # Loop 1 - 10
   for i in $(seq 1 10)
   do
       # construct node name lc1, lc2, ... lc10
       node="lc${i}"
       # run load average cmd on remote node and put results into local var
       fmla=$(ssh ${node} ${la_cmd})
       # print something like: lc1 3
       echo "$node $fmla"
   done
 calc_free_cpus.pl
   #!/usr/bin/perl -w
   # calc_free_cpus.pl
   use strict;
   # We have 4 CPUs in a computer
   my $cpus = 4.0;
   # find the five minute load average from the uptime command:
   my $five_min_la = `uptime`;
   # trim leading cruft
   $five_min_la =~ s/^.+load average: //;
   # take the second number (5 minute average):
   $five_min_la = (split(/,/, $five_min_la))[1];
   # remove any additional whitespace
   $five_min_la =~ s/\s//g;
   #print "$five_min_la\n";
   my $potential_processes = int($cpus - $five_min_la + .05);
   print "  * The five minute load average is $five_min_la, so \n";
   print "      there's room for $potential_processes more processes\n";
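
As a rough cross-check of the arithmetic in calc_free_cpus.pl, the same calculation can be sketched in shell. The function name free_cpus and the sample uptime line are made up for illustration and are not part of the original scripts; the default of 4 CPUs and the +.05 offset are copied from the Perl script.

```shell
#!/bin/bash
# Sketch only: mirrors the arithmetic of calc_free_cpus.pl.
# free_cpus is a hypothetical helper; it takes one line of `uptime` output
# and an optional CPU count (default 4, as in the Perl script).
free_cpus() {
    local line="$1" cpus="${2:-4}"
    # drop everything up to "load average: ", take the second field
    # (the 5-minute average), strip whitespace, then truncate
    # cpus - load + .05 to an integer
    printf '%s\n' "$line" |
        sed 's/^.*load average: //' |
        awk -F, -v cpus="$cpus" '{ gsub(/[ \t]/, "", $2); print int(cpus - $2 + 0.05) }'
}

# example with a made-up uptime line (5-minute load of 2.50 on 4 CPUs)
free_cpus " 10:02:11 up 12 days,  3:04,  2 users,  load average: 1.20, 2.50, 3.10"
```

Unlike the Perl script, this sketch prints only the number, which is the form that the comments in nodeload.bat describe.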

program development area

 ~/dissertation/summer2005/progs/devel
  • Used to generate script files to run simulations for more genes/more taxa study
  • This is the workhorse!
 makefile_scriptgen
 scriptgen_all
 scriptgen_all.c
 scriptgen_all.o
 scriptgen_all_bak20070227_len6.c
 scriptgen_all_bak20061224_len5.c
  • Used to interleave seq-gen and paup code so that different paup code can follow each seq-gen simulation
 makefile_riffle
 riffle-seqgen-paup
 riffle-seqgen-paup.c
 riffle-seqgen-paup.o
 riffle-seqgen-paup-linux
  • support code
 com.c
 linpack.c
 post_prob.h
 ranlib.c
 ranlib.h

executable directory

  • executable files are placed in the exe directory, with separate versions compiled under Linux and Unix
  • the OS switch (swUnixOrLinux) in scriptgen_all.c selects which version (Unix or Linux) to run
 ~/dissertation/summer2005/progs/exe
  • coal executables
 coal-linux
 coal-linux-20050823
 coal-unix-20050711
  • paup executables
 paup-linux
 paup-unix
  • post_prob executables
 post_prob_linux
 post_prob_unix
  • riffle executables
 riffle-seqgen-paup-linux
  • script_paup scripts (NOT USED; 2 lines of script to get the length of time of execution?)
 script_paup_linux
 script_paup_unix
  • seq-gen executables
 seq-gen-linux
 seq-gen-unix


simulation directory

  • Each simulation series gets an updated name and its own directory
  • len is short for length; the initial submission of the paper used results from len6 (March 2007)
 ~/dissertation/summer2005/progs/len6
  • creates the batch file that creates and populates all the simulation subdirectories L1L1G15 (...)
 mgmt_len6_splitup.m
 mgmt_len6_splitup.bat
  • lists all the "nohup *_run_*.bat" commands, to simplify submission on individual cluster nodes
 torun.txt
   cd /home/erike/dissertation/summer2005/progs/len6
   nohup ./mgmt_len6_run_L0.001L0.7G15.bat &
   nohup ./mgmt_len6_run_L0.001L0.7G15-2.bat &
   nohup ./mgmt_len6_run_L0.001L0.7G30.bat &
   nohup ./mgmt_len6_run_L0.001L0.7G60.bat &
  • where the nohup command dumps the screen output
 nohup.out
  • runs this simulation, called from commands in torun.txt
 mgmt_len6_run_L1L1G15.bat
  • simulation status
 outL1L1G15.out
  • a series of commands that provide overall process status; r is short for readme.txt
 r
   grep -c "lSam" out*15-2.out out*15.out out*30.out out*60.out
   wc *15-2/*temp*.phy *15/*temp*.phy *30/*temp*.phy *60/*temp*.phy
   ~/nodeload.bat
   top


simulation run

  • a series of directories, one for each individual simulation run, is created using mgmt_len6_splitup.m
 ~/dissertation/summer2005/progs/len6/L1L1G15 (...)

simulation results

  • the final result files are copied from the L1L1G15 directories to results for summarization
 ~/dissertation/summer2005/progs/len6/results


scriptgen_all.c

The commented header of this file describes the process flow (though it may be slightly out of date). The file has a number of switches throughout to control execution. Almost all the filenames are assigned to variables, so they are easy to customize.

Some variables or settings to be aware of:

 char swUnixOrLinux = 'L'; /* U or L : Unix/Linux switch controls which executable version to use,
                                       eg. paup-unix or paup-linux (each compiled under their platform) */
 char swEXEdirectory = 'E'; /* E or . : Use exe directory or the current directory for coal/paup/seq-gen programs */
 char    BatchName[] = "len6";                   /* name of batch run, used in filenames */
  • many of the simulation settings are in this function
 initialize_global_variables() { ... }
  • creates the paup format species tree
  • case 1 = symmetric, 2 = asymmetric, 3 = variable branch length on an asymmetric tree
 write_species_tree_files()


Process Flow

Preparation

scriptgen_all

  1. Backup scriptgen_all.c as scriptgen_all_bakyyyymmdd_len#.c
  2. Modify scriptgen_all.c for next simulation
  3. Compile and test scriptgen_all
    1. copy executable to a test directory
    2. execute for several sets of inputs
    3. check output files
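
The backup step can be sketched as follows. The helper names backup_name and backup_scriptgen are hypothetical; only the naming pattern scriptgen_all_bakyyyymmdd_len#.c comes from this page.

```shell
#!/bin/bash
# Sketch only: build the backup name scriptgen_all_bakyyyymmdd_len#.c.
# backup_name is a hypothetical helper; arguments: len number and an
# optional date stamp (yyyymmdd, defaults to today).
backup_name() {
    local len="$1" stamp="${2:-$(date +%Y%m%d)}"
    printf 'scriptgen_all_bak%s_len%s.c\n' "$stamp" "$len"
}

# Copy scriptgen_all.c in the current directory to its backup name,
# refusing to overwrite a backup that already exists.
backup_scriptgen() {
    local target
    target="$(backup_name "$@")"
    if [ -e "$target" ]; then
        echo "backup $target already exists" >&2
        return 1
    fi
    cp -p scriptgen_all.c "$target"
}

backup_name 6 20070227   # prints scriptgen_all_bak20070227_len6.c
```

Run from the program development area, backup_scriptgen 7 would back up the current source before it is modified for len7.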

directory and script files

  1. Create directory ~/dissertation/summer2005/progs/len#
    1. copy scriptgen_all executable to new len# directory
    2. copy helper files from previous len# directory and update (e.g., replace len6 with len7)
      1. mgmt_len6_splitup.m
      2. mgmt_len6_splitup.bat
      3. torun.txt
      4. r
  2. Results
    1. create the ~/dissertation/summer2005/progs/len#/results directory
    2. copy the previous matlab_len6_treedist_results_all.m and len6_regression.sas files to the results directory and update them to len7
      1. the sas file is not currently used
    3. also copy the previous results.tex
  3. Current directory is now ~/dissertation/summer2005/progs/len#
  4. Update mgmt_len6_splitup.m based on simulation settings
    1. creates initial batch files mgmt_len7_run_L#L#G#.bat for each simulation
    2. creates mgmt_len6_splitup.bat for creating directories, etc.
  5. Run mgmt_len6_splitup.bat
    1. this creates directories and runs scriptgen_all to populate with scripts
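
The directory setup and helper-file update above can be sketched in shell. The function new_len_dir is hypothetical, and it assumes that the batch name should be replaced in the file names as well as in the file contents. Copying the scriptgen_all executable and the results helper files is left out.

```shell
#!/bin/bash
# Sketch only: start a new len# directory from the previous one.
# new_len_dir is a hypothetical helper; arguments: parent directory,
# old batch name (e.g. len6), new batch name (e.g. len7).
# Assumption: the batch name is replaced in file names and in contents.
new_len_dir() {
    local base="$1" old="$2" new="$3" f src dest
    mkdir -p "$base/$new/results" || return 1
    for f in "mgmt_${old}_splitup.m" "mgmt_${old}_splitup.bat" torun.txt r; do
        src="$base/$old/$f"
        [ -f "$src" ] || continue
        dest="$base/$new/${f//$old/$new}"
        # cp -p first so the copy keeps the original's permissions,
        # then overwrite its contents with the updated text
        cp -p "$src" "$dest" && sed "s/$old/$new/g" "$src" > "$dest"
    done
}

# example call (not run here):
# new_len_dir ~/dissertation/summer2005/progs len6 len7
```

The updated files still need a manual review afterwards, since the simulation settings in the splitup script change from one series to the next.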

Execution

simulation submission

  1. Submit mgmt_len7_run_L#L#G#.bat files on selected cluster nodes
    1. make plan for which simulations to run on which nodes
    2. ssh to each node
    3. copy/paste the lines from torun.txt selected in the plan to submit the simulations
      1. this starts the process and creates or appends to len7/nohup.out

Note: on atar.wpi.edu this is lc1, lc2, ..., lc10 (lc = linux cluster, 4 cpus each, lc7 and lc10 not working)
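
The lines in torun.txt follow a fixed pattern, so they could be generated rather than edited by hand. A minimal sketch, where torun_lines is a hypothetical helper and the run labels are passed in by the caller:

```shell
#!/bin/bash
# Sketch only: print torun.txt-style lines for a set of run labels.
# torun_lines is a hypothetical helper; arguments: simulation directory,
# batch name (e.g. len7), then one or more run labels (e.g. L1L1G15).
torun_lines() {
    local dir="$1" batch="$2" label
    shift 2
    printf 'cd %s\n' "$dir"
    for label in "$@"; do
        printf 'nohup ./mgmt_%s_run_%s.bat &\n' "$batch" "$label"
    done
}

# reproduces the first lines of the len6 torun.txt shown earlier
torun_lines /home/erike/dissertation/summer2005/progs/len6 len6 L0.001L0.7G15 L0.001L0.7G15-2
```

The output is meant to be pasted on a node after logging in with ssh, as in the steps above.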

progress

  1. Check on simulation progress using commands in file "r"
    1. may need to go to individual nodes occasionally to check on abnormalities indicated by the returned information
  2. As simulations complete, check files for indication of correct completion
    1. outL1L1G15.out (...) indicates global status of each simulation
    2. detail is available in the output files under the ./L1L1G15 directories

Results

  1. Copy ./*/paup_treedist_results_*.out files to ./results/.
  2. Summarize results and create plots by running matlab_len6_treedist_results_all.m
  3. Write a great paper, get published, and live happily ever after
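
The copy in step 1 can be sketched as a small function. collect_results is a hypothetical helper; the file pattern paup_treedist_results_*.out and the results directory come from the steps above.

```shell
#!/bin/bash
# Sketch only: collect per-run result files into the results directory.
# collect_results is a hypothetical helper; argument: the len# directory
# (defaults to the current directory).
collect_results() {
    local dir="${1:-.}" f n=0
    mkdir -p "$dir/results" || return 1
    for f in "$dir"/*/paup_treedist_results_*.out; do
        [ -f "$f" ] || continue
        # skip files that are already inside results
        case "$f" in "$dir"/results/*) continue ;; esac
        cp -p "$f" "$dir/results/" && n=$((n + 1))
    done
    echo "copied $n result file(s)"
}

# example call (not run here):
# collect_results ~/dissertation/summer2005/progs/len6
```

The printed count gives a quick check that every simulation directory contributed a result file.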

Simulations

len7 simulation

Settings

  • 2 trees with 14 species:
    • one asymmetric
    • one symmetric
  • 4 branch length combinations for each tree:
    • 1 short branch near tips
    • 1 short branch near root (with better justification for why we'd do this)
    • 1 short at tips and 1 short at root (but may remove this)
    • no short branches
  • 3 levels of lineage sampling: (rationale - big improvement from 1 to 2, minor improvement after that)
    • 1 lineage sampled per species
    • 2 lineages sampled per species
    • 5 lineages sampled per species
  • 4 methods of estimation:
    • concatenation
    • minimize deep coalescences (implemented in MESQUITE)
    • BEST (Bayesian Estimation of Species Trees)
    • Maximum Tree (as implemented in STEM - my program)
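
To see the size of this design, the settings can be enumerated with nested loops. This is only a counting sketch: list_conditions and the short labels (short-tips, short-root, short-both, none) are made up for illustration and do not correspond to names in the scripts.

```shell
#!/bin/bash
# Sketch only: enumerate the len7 simulation conditions.
# list_conditions and the branch-length labels are hypothetical.
list_conditions() {
    local tree branch lineages
    for tree in asymmetric symmetric; do
        for branch in short-tips short-root short-both none; do
            for lineages in 1 2 5; do
                printf '%s %s %s\n' "$tree" "$branch" "$lineages"
            done
        done
    done
}

# 2 trees x 4 branch length combinations x 3 sampling levels = 24 conditions
list_conditions | wc -l
```

Each of the 24 conditions is analyzed with all 4 estimation methods, giving 96 combinations in total, or 72 if the combined short-branch setting is dropped.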

Flow of simulation

  1. Model tree
  2. generate gene trees *and tabulate*
  3. generate sequence data
  4. subsample taxa and tabulate included branches
  5. *estimate gene trees*
  6. estimate species trees *with each of four programs*
  7. compute AsymDiff *for results from all programs*

Key: text marked *like this* shows where the scripts are newly modified from len6 to len7.

Additional discussion points:

  • For a short branch at the tips of the tree, concatenation can be statistically inconsistent, but we expect AsymDiff to be small when many taxa are sampled
  • Minimize Deep Coalescences and Maximum Tree methods assume gene trees are estimated without error
    • can look at the impact of this assumption
  • BEST is computationally intensive for large numbers of taxa
    • does it improve accuracy over the other methods, and under what conditions is it most helpful?