Week of 9-27-09


I met with Dr. HG and discussed what the project would specifically entail. With many options open, we decided that I would first need a sturdier background. The meeting was cut short for lack of time, so we settled on my getting started with a tool common to many bioinformatics projects - the programming language Perl. Dr. HG lent me a textbook (James Tisdall's Beginning Perl for Bioinformatics) so that I could familiarize myself with the language and how it is used in bioinformatics. Over the week, I got through the first four chapters without much of a problem, given my computer science and biology background. The first chapters were mostly introduction, bringing readers who are unfamiliar with one or both of the subjects up to speed. I will shortly (probably tomorrow) modify this page to include a brief summary of these chapters. And here's the summary:

Chapter 1: Biology and Computer Science
The author begins by explaining how biology and computer science overlap and can benefit from one another - how evolution, neural networks, annealing, etc. can be modeled on the computer, and how useful analogies can be drawn between the two rapidly changing fields. The next topic is a review of the biology used: the bases, the backbone, the strands and their orientation, and the double helix of DNA. All of these are relatively common biological ideas and are defined elsewhere in this logbook, so I will move on for the moment.

Chapter 2: Getting Started With Perl
Chapter 2 deals with Perl, the programming language that the book uses. As this may be new to biology majors (and is certainly new to this logbook), it requires a greater degree of explanation. The programming style that the book uses is called imperative programming. This poses a slight problem for me: while I have a large amount of programming experience, it is in a different language and style, the object-oriented Java (similar to C++).
Some useful definitions that the author lays out include: __Computer program:__ basically, it's a set of instructions that the computer translates into machine code and executes. Machine code, of course, is binary - zeros and ones representing false and true, or no and yes. __Programming language:__ a set of defined rules for writing programs. Depending on the file type and the language itself, the computer follows different mechanisms for translating a program into machine code, but it is always via that language's compiler or interpreter.
The author then devotes a section to discussing why Perl is the common choice in bioinformatics. For instance, Perl has a relatively gentle learning curve and is not difficult to master. In addition, it works well with many of the file types that are used in bioinformatics. Another benefit is rapid prototyping - that is, a problem can be addressed quickly, and just a few hours of programming can yield a working, if rudimentary, program. Perl can also be run on virtually all computers today, and programs can be transferred (such as from Windows to Mac) with relative ease. Perl also runs relatively quickly, though it is admittedly slower than some other languages, such as C. To edit and save a Perl program, one uses a text editor and saves the file as plain ASCII text only.
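To make this concrete, here is a minimal sketch of what such a plain-text Perl file might look like. The file name first.pl, the subroutine name sequence_length, and the DNA string are all made up for illustration and are not taken from the book.

```perl
#!/usr/bin/perl
# first.pl (hypothetical file name): save as plain text, then run with
#   perl first.pl
use strict;
use warnings;

# A short, made-up DNA sequence stored in a scalar variable.
my $dna = 'ACGTACGTAC';

# Return the number of bases in a sequence string.
sub sequence_length {
    my ($sequence) = @_;
    return length $sequence;
}

print "Sequence: $dna\n";
print "Length:   ", sequence_length($dna), " bases\n";
```

Because Perl is interpreted, there is no separate compile step to run by hand: editing the text file and running it again is the whole cycle, which is a large part of what makes quick prototyping possible.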

Chapter 3: The Art of Programming
The third chapter deals with some basics of programming, which, while review for me, are likely not so for many others in bioinformatics, so I'll cover the chapter in a little more detail. The first point that the author makes is a good one: reading the text can only take one so far. At a certain point, one needs to actually do some programming and get some hands-on experience. Another key point is to save at regular intervals (to avoid losing work to a crash) and to keep at least some of the earlier versions of one's work. That way, one can specialize the program while keeping the general form, or go back and see where things went wrong.
Another essential skill is learning how to interpret error messages. The Perl interpreter is more than a little wordy - the error messages can go on for pages. The key is to look at the top one or two error messages, fix those, and try again. More often than not, the later messages are just consequences of the first few errors, so fixing the top ones will be enough. The next (and most dreaded) necessary skill is debugging. This refers to the process of fixing logic errors, not syntax errors. These are errors that the interpreter won't catch, and they only show up when one gets a nonsensical or otherwise erroneous result. The most useful tool for finding these errors (which sometimes take an inordinate amount of time to find and solve) is the debugger, a tool that allows one to follow the program's actions step by step, which often allows one to spot the error. Putting in print statements (which allow one to follow variables as the program runs, for instance) is also a good tactic.
One important thing to keep in mind is that programming is a problem-solving process. Many times, one will be solving problems that one actually created, but that should not discourage the programmer. One must recognize that programming takes time, effort, and critical thinking skills.
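As a rough sketch of the print-statement tactic (my own example, not the book's; the names count_base and $DEBUG are hypothetical), temporary output can be sent to STDERR so that it stays separate from the program's real output:

```perl
use strict;      # turn common mistakes (e.g. misspelled variables) into errors
use warnings;    # report suspicious constructs while the program runs

# Set to 1 to trace the loop; set to 0 to silence the debugging output.
my $DEBUG = 1;

# Count how many times a single base occurs in a sequence.
sub count_base {
    my ($sequence, $base) = @_;
    my $count = 0;
    foreach my $position (0 .. length($sequence) - 1) {
        my $current = substr($sequence, $position, 1);
        $count++ if $current eq $base;
        # Temporary print statement: follow the variables at each step.
        print STDERR "pos=$position base=$current count=$count\n" if $DEBUG;
    }
    return $count;
}

print "G count: ", count_base('ACGGT', 'G'), "\n";   # prints "G count: 2"
```

The same script can be stepped through in Perl's built-in debugger by running it with the -d switch (perl -d followed by the file name), and perl -c checks the syntax without running the program.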
In addition, one can take cues from other programs and ask the advice of more expert programmers. Beyond that, if one is trying to solve a common problem, it is possible that someone else already has! There is an abundance of freely available source code, or open source, that one can download and often use with little or no modification (although one must be wary of copyrights). Perhaps the best place to find programs such as this is the Comprehensive Perl Archive Network (CPAN) website, at http://www.cpan.org/
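A small illustration of reusing existing code instead of writing it from scratch: the sketch below uses List::Util, a module that ships with Perl and is also distributed on CPAN. The example sequences are made up, and this is my own example rather than one from the book.

```perl
use strict;
use warnings;
use List::Util qw(max sum);   # import two ready-made list functions

# Made-up example sequences.
my @sequences = ('ACGT', 'GGCATTA', 'TT');

# Compute the length of each sequence (length defaults to $_ inside map).
my @lengths = map { length } @sequences;

print "Longest: ", max(@lengths), "\n";   # prints "Longest: 7"
print "Total:   ", sum(@lengths), "\n";   # prints "Total:   13"
```

Modules that are not bundled with Perl have to be installed from CPAN before a use line like the one above will work.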