The coding period begins next Monday. It's time to work.
Tree construction module design
The first task of this project is to implement a tree construction module that provides three basic tree construction algorithms (UPGMA, NJ and MP). I'll name this module `TreeConstruction`. The class design is as follows:
`TreeConstructor`: the base class for all tree constructors.
`DistanceTreeConstructor`: This class accepts a `DistanceMatrix` to create a constructor object and provides two methods, `upgma` and `nj`, to construct and return a Tree object. Though we could construct the distance tree directly from an `MSA`, I think it's better to separate different responsibilities into different classes or methods.
`ParsimonyTreeConstructor`: This class accepts an `MSA` to create a constructor object and provides an `mp` method to construct and return a Tree object. Two helper methods will be used: one to calculate the parsimony score, and `__nni` to perform Nearest Neighbor Interchanges in search of the best tree.
`DistanceMatrix`: This class accepts a name list and a lower triangular matrix to create the object. Some built-in methods and an `insert` method will be implemented to assist distance tree construction.
`DistanceCalculator`: This class accepts an `MSA` to create the object. Two methods, one for DNA and `protein_distance` for protein, can be provided to calculate DNA and protein distances respectively and return a `DistanceMatrix` object, along with two helper methods, one for DNA and `protein_pair` for protein, to calculate pairwise distances.
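The class layout described above could be outlined roughly as follows. This is only a skeleton under assumptions: the constructor argument names (`distance_matrix`, `msa`) are hypothetical, and the method bodies are placeholders rather than real algorithms.

```python
class TreeConstructor:
    """Base class for all tree constructors."""


class DistanceTreeConstructor(TreeConstructor):
    """Builds a tree from a distance matrix (UPGMA or NJ)."""

    def __init__(self, distance_matrix):
        # 'distance_matrix' is an assumed argument name for a DistanceMatrix.
        self.distance_matrix = distance_matrix

    def upgma(self):
        # Placeholder: would construct and return a Tree object.
        raise NotImplementedError

    def nj(self):
        # Placeholder: would construct and return a Tree object.
        raise NotImplementedError


class ParsimonyTreeConstructor(TreeConstructor):
    """Builds a tree from an alignment by maximum parsimony."""

    def __init__(self, msa):
        # 'msa' is an assumed argument name for the alignment.
        self.msa = msa

    def mp(self):
        # Placeholder: would score candidate trees and search with NNI.
        raise NotImplementedError
```

Keeping the distance calculation outside the constructors means a distance tree can be built from any matrix, not only from one derived from an alignment.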
First week work plan
Implement `DistanceMatrix` first so that the distance-based methods can be worked on later. For a `DistanceMatrix` object `dm`, the expected functions are:
`dm[1]`, `dm['name']`: to get or set the distances related to the taxon at index 1 or named 'name';
`dm[1,2]`, `dm['name1','name2']`: to get or set the specified distance;
`del dm['name']`: to delete one taxon together with its distances;
`dm.insert('name', distances)`: to insert a taxon with its related distances.
These functions will be used in the UPGMA and NJ algorithms.
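As a rough illustration of the behaviour listed above, here is a minimal sketch of such a class. It is not the final implementation: the attribute names `names` and `matrix`, the helper `_index`, the optional `index` argument of `insert`, and the convention that `distances` is the full row including the zero self-distance are all assumptions made for the example.

```python
class DistanceMatrix:
    """Lower triangular distance matrix indexed by position or taxon name."""

    def __init__(self, names, matrix):
        self.names = list(names)
        # Row i holds the distances to taxa 0..i (the last entry is the diagonal).
        self.matrix = [list(row) for row in matrix]

    def _index(self, key):
        # Accept either an integer position or a taxon name.
        return key if isinstance(key, int) else self.names.index(key)

    def __getitem__(self, key):
        if isinstance(key, tuple):
            i, j = sorted(self._index(k) for k in key)
            return self.matrix[j][i]
        i = self._index(key)
        # Full row of the symmetric matrix for taxon i.
        return [self[i, j] for j in range(len(self.names))]

    def __setitem__(self, key, value):
        if isinstance(key, tuple):
            i, j = sorted(self._index(k) for k in key)
            self.matrix[j][i] = value
        else:
            i = self._index(key)
            for j, dist in enumerate(value):
                self[i, j] = dist

    def __delitem__(self, key):
        i = self._index(key)
        # Drop column i from the later rows, then row i and its name.
        for row in self.matrix[i + 1:]:
            del row[i]
        del self.matrix[i]
        del self.names[i]

    def insert(self, name, distances, index=None):
        # 'distances' is assumed to be the full new row, self-distance included.
        if index is None:
            index = len(self.names)
        self.names.insert(index, name)
        self.matrix.insert(index, [0.0] * (index + 1))
        for row in self.matrix[index + 1:]:
            row.insert(index, 0.0)
        self[index] = distances
```

A short usage example under the same assumptions:

```python
dm = DistanceMatrix(['A', 'B', 'C'], [[0], [1, 0], [2, 3, 0]])
dm['A', 'C']             # distance between A and C
del dm['B']              # remove taxon B
dm.insert('D', [4, 6, 0])  # add taxon D with distances to A, C and itself
```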
If there is enough time, try to implement `DistanceCalculator`. The work includes:
- check and identify the alphabet of the `MSA` (is it `SingleLetterAlphabet()` no matter what the sequences are?);
- choose and prepare scoring matrices for DNA and protein;
- write distance methods for DNA and protein;
- write tests for distance calculation.
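To make the distance calculation step more concrete, here is one possible sketch of a pairwise distance and the lower triangle built from it. It assumes a simple score-ratio distance (one minus the observed score over the maximum possible score), which is only one candidate and not necessarily the method that will be chosen. The function names `pairwise_distance` and `lower_triangle`, the `score` callable and the `skip_letters` parameter are hypothetical; a real scoring matrix could be plugged in through `score`.

```python
def identity_score(a, b):
    """Score 1 for identical letters and 0 otherwise."""
    return 1 if a == b else 0


def pairwise_distance(seq1, seq2, score=identity_score, skip_letters="-"):
    """Return 1 - observed score / maximum score for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    observed = 0
    maximum = 0
    for a, b in zip(seq1, seq2):
        # Ignore columns where either sequence has a gap.
        if a in skip_letters or b in skip_letters:
            continue
        observed += score(a, b)
        maximum += max(score(a, a), score(b, b))
    if maximum == 0:
        return 0.0
    return 1 - observed / maximum


def lower_triangle(sequences):
    """Return the lower triangular distance rows for a list of aligned sequences."""
    return [
        [pairwise_distance(sequences[i], sequences[j]) for j in range(i + 1)]
        for i in range(len(sequences))
    ]
```

The rows returned by `lower_triangle` have the same shape as the lower triangular matrix that `DistanceMatrix` is planned to accept, so the two pieces could be combined directly.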
Problems and Challenges
I'm sure the `DistanceMatrix` class can be completed this week, so it won't affect the work planned for the next few weeks.
As for `DistanceCalculator`, I estimate it will consume a lot of time on test design and data preparation.
One problem is how to identify the alphabet of the `MSA` so as to decide which distance method to use. Let the user define it?
Another is which scoring matrices we should choose. Provide all of them and let the user select?
Maybe we can implement or improve the `DistanceCalculator` later if this takes too long.
Work out the `DistanceMatrix` and try the