Coding period will begin next Monday. It's time to work.

Tree construction module design

The first task of this project is to implement a tree construction module providing three basic tree construction algorithms(UPGMA, NJ and MP). I'll name this module TreeConstruction. Classes design are as follows:

  1. TreeConstructor: basic class for all tree constructors.
  2. DistanceTreeConstructor: This class accepts a DistanceMatrix to create a constructor object and provide two methods, upgma and nj, to construct and return a Tree object. Though we can construct the distance tree directly from a MSA, I think it's better to separate different responsibilities into different classs or methods.
  3. ParsimonyTreeConstructor: This class accepts a MSA to create a constructor object and provide a mp method to construct and return a Tree object. Two assistant methods __parsimony_score and __nni will be used to calculate the parsimony score and to do the Nearest Neighbor Interchanges to search the best tree.
  4. DistanceMatrix: This class accepts a name list and lower triangle matrix to create the object. Some built-in methods __getitem__, __setitem__, __delitem__, __len__ and a insert method will be implemented to assist distance tree construction.
  5. DistanceCalculator: This class accepts a MSA to create the object. Two methods dna_distance and protein_distance can be provided to calculate DNA and protein distances respectively and return a DistanceMatrix object, and two assistant methods dna_pair and protein_pair to calculate pairwise distance.

First week work plan

  • Implement the DistanceMatrix first so that the distance based method can be worked on later. For an object dm of the DistanceMatrix, the expected functions are:

    • dm[1], dm['name']: to get or set the distances related to taxa of the index '1' or the 'name';
    • dm[1,2], `dm['name1','name2']: to get or set the specified distance;
    • del dm[1], del dm['name']: to delete one branch.
    • dm.insert('name', distances): to insert a taxa with related distances.
    • Those functions will be used in UPGMA and NJ algorithms.
  • If there is enough time, try to implement DistanceCalculator. The works include:

    • check and identify the Alphabet of the MSA (why it's SingleLetterAlphabet() no matter what the sequences are?);
    • choose and prepare scoring matrices for dna and protein;
    • write distance methods for dna and proteins.
    • write tests for distance calculation.

Problems and Challenges

I'm sure the DistanceMatrix class can be completed this week. So it won't affect the works for the next few weeks.

For the DistanceCalculator, I estimate it will consume too much time on test design and data preparation.

One problem is how to identify the alphabet of the MSA so as to decide which distance method to use. Let the user define?

Another one is which scoring matrices we should choose. Provide all and let the user select?

Maybe we can implement or improve the DistanceCalculator later if we extent this too much.

Conclusion

Work out the DistanceMatrix and try the DistanceCalculator.