Today’s post is about something completely unrelated and different from the stuff I usually post. It is about an university assignment that I found really interesting and it was a lot of fun trying stuff out. It’s basically about writing a genetic algorithm for the 2DHP problem, which is a extremely simplified model for examining protein folding. You basically have an array of proteins, which can be black or white. The black ones don’t like water, so they should be in the middle of the resulting conformation. In this model it’s super easy to calculate how good a conformation is: If there are 2 black protein neighbors, but they are not sequence neighbors, you increment the score by 1. You can see a little example in the picture below. The blue circles mean you’d increment the score by 1 because of that. In the red circle, the 2 black proteins are sequence neighbours, so you don’t increment the score.
What a genetic algorithm is
The genetic algorithm (GA) should, in a way, represent the nature and the evolution. You start with a population (first generation) and then calculate the “fitness” (score mentioned above) of every individual in it. This fitness is used to determine which individuals get selected for the next population (second generation). In order to mix these up and also get some new candidates/individuals, you modify (mutate) and “crossover” them. In the end, the better individuals win and get better and better until you find a very good solution to your problem. The cool thing about it is that you could use the genetic algorithm for pretty much any problem, as long as you can get the fitness function right. However, you can specialize your GA to get better results, which is also what we did.
What we did
In order to represent the conformation (protein folding), we could use a simple 2D array and just create elements in it. However, since the genetic algorithm randomly mutates them, this wouldn’t be a very good solution. We can minimize the possible search space with another encoding. We basically have a list of elements which have directions. These are relative directions to their current position. There are 3 possible directions: Left, Straight, Right. By modifiying them it is super easy for the GA to create new conformations. Another really important aspect of a GA is the selection strategy that decides which individuals get to the next generation. In our project we’ve implemented 4 different strategies. To mention the most important ones: We have a simple roulette wheel selection and also a tournament selection.
Testing the genetic algorithm
To test how good your GA is and how it compares to others, there are certain benchmark sequences. These have, for example, 20, 50 or 100 elements and they also measured the maximum energy of each. So you can easily add one of these sequences to your GA and see what an energy value you can get out of them. Below is a sample sequence. 1 means black (hydrophobe) protein, 0 white (hydropolar) protein.
SEQ20 = "10100110100101100101";
In order to visualize what happens during the algorithm you can create a fitness graph showing the average and/or maximum fitness of the generations. Below is a sample graph that shows the benchmark sequence SEQ50, running through 500 generations. You can see the other values in the image. You can also see that there’s not much going on after ~200 generations anymore.
You can go to the GitHub repository to see the full thing. You can also download an executable of the project. The Visual Studio 2012 solution is also included so you should be able to compile it and try it out very fast. It was developed using C++ and SFML. You can also customize the algorithm by changing some of the parameters:
setUp(int maxGeneration, int populationSize, std::string &chain, float mutationRate, float crossoverRate, Selection *selection, std::string& file = (std::string)"average.txt");
- maxGeneration: How many generations to iterate through
- populationSize: How many individuals a generation has
- chain: What benchmark sequence to use (SEQ20, SEQ36…)
- mutationRate: How many individuals are being mutated every generation (in %)
- crossoverRate: How many individuals are being crossovered every generation (in %)
- selection: What selection method to use
- file: Logfile for saving the values of each generation
Below is a good sample for getting good results with the benchmark sequence SEQ50, but they can vary a lot.
setUp(250, 5000, SEQ20, 0.08, 0.30f, new TournamentSelection(2.0, 80.0f));
In order to test those parameters, we’ve also created a color matrix by incrementing them and let the algorithm run for some time – but for now, this should be enough. I hope you can see how interesting this topic is and how much fun it can be to try out different values and strategies.