November 2012

**__PubMed__**
HbA1 (official gene symbol HBA1) is the hemoglobin alpha 1 gene. Searching PubMed brings up 28 articles relating to health, and overall there are 20,303 articles pertaining to the gene; restricting the search to //Homo sapiens// brings up a page just for the human gene. HbA1 is a protein-coding gene that codes for alpha-globin. The gene is found on chromosome 16 in the alpha-globin locus; a locus is simply the region of a chromosome where a gene can be found. Its product, alpha-globin, is a subunit of hemoglobin, the four-subunit protein in red blood cells that carries oxygen throughout the body. Oxygen that attaches to the heme groups in the hemoglobin subunits gives blood its bright red color. The gene lies on the short arm of the chromosome at position 13.3 (16p13.3), from about 226,000 bp to 227,000 bp.



At the far left of the figure, HbA1 is located toward bp 218,640 and ends where the red arrow labeled HbA1 is.

A disorder that stems from HBA1 is alpha thalassemia, which is caused by a mutation in, or the loss of, one or more alpha-globin alleles. It can be very severe and affects many children and fetuses. Fetuses with the most severe form are either stillborn or die soon after birth. Those who lack just one allele are silent carriers and never suffer from the disease, and two missing alleles lead to only minimal complications.

**__OMIM__**
OMIM, which is available through NCBI (the National Center for Biotechnology Information), stands for Online Mendelian Inheritance in Man. It is a database of human genes and the genetic disorders they are responsible for. The website was developed by Johns Hopkins staff.

In OMIM, a symbol in front of the entry number classifies the entry. The # symbol denotes a descriptive entry, usually a phenotype, whose molecular basis is known; it does not represent a unique locus. A + denotes an entry that describes both a gene of known sequence and a phenotype. An * represents an entry for just the gene. A % denotes a described phenotype or locus whose molecular basis is not known. The four symbols (#, %, *, +) help to separate and classify each entry in OMIM for research purposes.

**__NCBI RefSeq__**
RefSeq is a source that gives easy access to non-redundant reference sequences. When HbA1 is searched, 50 results come up in Entrez. RefSeq differs from the rest of the NCBI website in that it includes only sequence records and not the additional information found elsewhere, which makes it more concise than the full spectrum of NCBI resources. Using RefSeq, you can tell the length of the gene's sequence, how many exons are present, when the record was last updated, which chromosome the gene is on, and how to download the sequence. The RNA sequence is 576 bp and the gene is found on chromosome 16. Exons are located at 388..576. An exon is a part of a gene (DNA), or in this case of its RNA transcript, that is kept in the mature RNA and can contain the coding information for a protein or peptide sequence. The last update in RefSeq was on November 18, 2012. An easy way to download the sequence from NCBI is to pass the locus (accession) number to a computer program; most programs with bioinformatics tools can look the number up in the database they link to.

**__UniProt__**
UniProt stands for Universal Protein Resource. It provides protein sequences by publishing them into the public domain on its website. UniProt is separated into two sections, reviewed and unreviewed. The reviewed entries, in UniProtKB/Swiss-Prot, are marked with a star, and the unreviewed entries are left starless. When searching for Hba1 specifically, there are only 123 results, very few compared to the massive size of NCBI. I then accessed the entry HBA_HUMAN, which took me to the description of the hemoglobin subunit alpha. The entry lists the protein existence of HbA1 as "Evidence at protein level", which means that there is clear experimental evidence that the protein exists. It describes the protein's function as being involved in carrying oxygen from the lungs to the peripheral tissues. I then searched Q86YQ5_HUMAN, which is HbA2 in humans. The protein existence of this entry is "Evidence at transcript level", which means that the protein itself has not been directly demonstrated but is believed to exist based on expression data such as northern blots. The function description of the protein is identical to that of HBA_HUMAN.

==__An Overview of Hemoglobin__==

http://sickle.bwh.harvard.edu/hbsynthesis.html



The website listed above is a resource on Harvard's website that gives an overview and in-depth review of hemoglobin synthesis. I will begin with a brief description of what I read in the article's overview of hemoglobin.

Hemoglobin is found in erythrocytes, or red blood cells. Each erythrocyte contains about 250 million hemoglobin molecules. Hemoglobin is an iron-containing protein that picks up oxygen in the lungs and carries it to peripheral tissues to keep cells healthy. After releasing the oxygen to the peripheral tissues, it binds CO2 and returns it to the lungs to be exhaled, and the process begins again. Hemoglobin is made of two kinds of protein subunits, alpha (encoded by HbA1 and HbA2) and beta. Before birth, beta is not expressed and gamma is present instead. Hemoglobin is made from the instructions encoded in DNA. Most individuals possess four genes that code for the alpha chain and two that code for the beta chain. The two kinds of chain are normally produced in equal amounts, and an imbalance between them produces disease. The chains bond and work together in developing red blood cells and then stay together for the entire life of the red blood cell. Mutations can affect hemoglobin, and when they do, disorders such as sickle cell disease or thalassemia arise. There are currently several hundred known abnormal or variant hemoglobins. Most function normally despite being abnormal, but some cannot function properly and produce disorders such as sickle cell disease and anemia.

Defects in a globin gene cause abnormal production of one set of chains, either alpha or beta. If the alpha-globin genes cannot produce enough alpha chain, the beta genes will still produce a normal amount of beta chain. The resulting shortage of one chain relative to the other is called "thalassemia". The disruption can come from a mutation of the HbA1 gene. Thalassemia is an inherited disorder, generally passed on in a recessive manner from the parents' genes, that results in abnormal or defective red blood cells. There are two types, alpha thalassemia and beta thalassemia, named for whether the abnormality is in the alpha-globin or the beta-globin subunit.

In alpha thalassemia, which involves HbA1 and HbA2, there is a decrease in the production of alpha-globin chain, which leads to an excess of beta globin. The excess beta chains join together to form an abnormal hemoglobin called hemoglobin H. Thalassemia leads to abnormal oxygen delivery to peripheral tissues. There are four genes that code for alpha globin. If a child inherits four mutated genes, the child usually cannot live outside the womb and dies before or shortly after birth.

The hemoglobin inside red blood cells is made up of two parts, heme and globin (hence the name hemoglobin). Heme is a cyclic molecule, built mostly of C, H, and N atoms, with an Fe2+ ion at its center. The iron ion is held in place by four nitrogen atoms of the ring. The hemoglobin in the cell is like a solution, and the cell membrane holds the liquid in. Hemoglobin itself is made up of two dimers, each a pair of one alpha chain and one non-alpha chain. The two dimers combine to create a hemoglobin tetramer, which is the functional form of hemoglobin. One dimer alone cannot carry and deliver oxygen properly, so the dimers pair up to create a tetramer. A dimer is simply two subunits bound together; here the two subunits are different, which makes it a heterodimer.

HbA1 can be found on chromosome 16, toward the 3' end of the alpha-globin gene cluster. The alpha-globin chain is encoded there, and multiple genes are found there. Pseudogenes are also present at the site. A pseudogene is a section of the chromosome that is an imperfect (or disrupted) copy of a functional gene. Since each cell has two copies of chromosome 16, there are four copies of the alpha-globin gene per cell, compared with only one beta-globin gene on each chromosome 11, or two per cell. Each of the four genes creates about one-fourth of the alpha-globin chains needed for hemoglobin synthesis. The area where the alpha genes are found is called the alpha-globin locus, which also includes the locus control region (LCR). The LCR is needed for optimal expression of the alpha-globin genes. It can also be seen from the figure on the left that zeta is found in this region. The zeta gene, on chromosome 16, is expressed only during the first few weeks of embryogenesis. After zeta disappears, alpha globin takes its place.

Hemoglobin changes from embryogenesis to birth and infancy to adulthood. On chromosome 11, two gamma genes and an epsilon gene are expressed before birth. Gamma chains pair with alpha chains to form hemoglobin F, or fetal hemoglobin. After infancy, fetal hemoglobin is replaced by adult hemoglobin, Hemoglobin A, which is composed of two alpha chains and two beta chains. Also on chromosome 11 is delta, which contributes a very small amount of hemoglobin in both adults and children. Delta is located between gamma A and beta on chromosome 11. The delta-containing hemoglobin (hemoglobin A2) is composed of two alpha chains and two delta chains. It makes up approximately 3% of hemoglobin in adults.

**__Hemoglobin's Ability to Release Oxygen__**
Hemoglobin releases oxygen in response to its environment, specifically pH and CO2 levels. The peripheral tissues are acidic compared to the oxygen-rich lungs. This acidic environment is what prompts hemoglobin to release its oxygen where it is needed. The environment also allows the reaction between water and CO2 molecules to take place:

CO2 + H2O -> HCO3- + H+

This reaction lets CO2 travel in the blood as bicarbonate to the lungs, where it is released back into the air so that hemoglobin can pick up oxygen again. The effect of acidity and CO2 on oxygen binding is called the Bohr effect. It was discovered by Christian Bohr and describes how, when acid levels are higher (pH is lower) or CO2 levels are higher, hemoglobin holds oxygen less tightly and more oxygen is released from the blood. The Bohr effect is based on proton affinity: when oxygen levels are low, hemoglobin's attraction for protons is stronger.

As illustrated above, oxygen from the lungs enters the erythrocyte and combines with hemoglobin molecules, and it can then be released to peripheral tissues.

**__Alpha Thalassemia__**
Alpha thalassemia is caused by the deletion of HbA1 or HbA2 alleles. It can cause either Hb Bart hydrops fetalis syndrome or HbH disease. Hb Bart syndrome is caused by the deletion of all four alpha-globin alleles. Babies who have Hb Bart syndrome are stillborn or do not live much past birth. The deletion of all four alleles means that oxygen is not carried effectively to the tissues, which causes swelling of the liver, severe anemia, and other symptoms. Someone with only one working allele of the four has HbH disease, which causes mild to moderate anemia with a few other symptoms. Those with HbH disease usually live into adulthood. The deletion of half of the alpha-globin alleles makes a person a carrier with few or no symptoms. The loss of just one alpha-globin allele makes an individual a silent carrier of alpha thalassemia, with no symptoms or signs.

**__MATLAB Connection__**
The next step in the exercise is to access the nucleotide sequences for //Escherichia coli// and //Thermoplasma volcanium// by downloading them from NCBI. The ID numbers are, in order, "NC_000913" and "NC_002689". The exercise allows any of the following programs: Python, Perl, R, Matlab or Octave. It is most easily done using a high-level language, so I will be using Matlab. To access the genome sequences, I will use the Sequence Viewer. To use the Sequence Viewer on the new Matlab version, though, I had to look it up on the MathWorks website.

The following link, http://www.mathworks.com/help/bioinfo/ug/importing-viewing-and-exploring-a-nucleotide-sequence-using-a-graphical-interface.html, is the page that the above picture comes from. It describes how to use the Bioinformatics Toolbox, and the Sequence Viewer specifically. It was helpful in describing what I needed to do, and it let me see the differences between the 2011 and 2012 versions of MATLAB.

I could not find NC_000913 in NCBI, and Matlab would not recognize it as a locus, so I instead looked up //Escherichia coli// on NCBI and used the first link available, which has the locus GQ214053.
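For anyone following along in Python, one of the languages the exercise allows, a record can also be requested by accession number from NCBI's E-utilities `efetch` endpoint. The sketch below is an assumption about one way to do this, not the method used on this page (which relied on Matlab's Sequence Viewer). The function names `efetch_fasta_url`, `parse_fasta`, and `fetch_sequence` are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of NCBI's E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession):
    """Build a URL that asks for one nucleotide record as FASTA text."""
    params = {"db": "nuccore", "id": accession,
              "rettype": "fasta", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def parse_fasta(text):
    """Return (header, sequence) for the first record in FASTA text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; stop reading
            header = line[1:]
        elif header is not None:
            chunks.append(line.upper())
    return header, "".join(chunks)


def fetch_sequence(accession):
    """Download and parse one record (needs a network connection)."""
    with urlopen(efetch_fasta_url(accession)) as response:
        return parse_fasta(response.read().decode("ascii"))
```

Calling `fetch_sequence("NC_002689")` needs a network connection, so it is left out of the block itself; `parse_fasta` can be tried on any FASTA text saved from NCBI.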

The first problem listed under the programming section of my exercises is to find the nucleotide, dinucleotide and trinucleotide frequencies. A nucleotide is made up of a nucleoside bonded to a phosphate, and nucleotides are the building blocks of DNA. A dinucleotide is two adjacent nucleotides in the sequence (the same word also describes molecules such as NAD+), and a trinucleotide is three adjacent nucleotides; read in frame, a trinucleotide is a codon. The nucleotide frequencies are A: 21.2% C: 28.2% G: 29.7% T: 21%
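The percentages above came from the Sequence Viewer's base count. As a rough cross-check in Python (one of the languages the exercise allows), the same kind of table can be computed by sliding a window of length 1, 2, or 3 along the sequence. This is a minimal sketch, assuming the sequence is already loaded as a plain string; `kmer_frequencies` is a hypothetical helper, and it counts overlapping windows, which may differ from how a given tool counts codons.

```python
from collections import Counter

BASES = set("ACGT")


def kmer_frequencies(sequence, k):
    """Return {k-mer: fraction} over all overlapping windows of length k.

    Windows containing anything other than A, C, G, or T (for example N)
    are skipped, so the fractions are taken over unambiguous windows only.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= BASES:
            counts[window] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}
```

With `k=1` this gives the nucleotide frequencies, with `k=2` the dinucleotide frequencies, and with `k=3` the trinucleotide frequencies.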

The second question asks for the G-C content of the sequence. The counts are 5214 G and 4949 C, or 29.7% and 28.2%, for a combined G-C content of about 57.9%. The two percentages are similar, which is the usual pattern: G pairs with C on the complementary strand, so across both strands the G and C counts must match, and on a single strand they are usually close. The dinucleotide G-C occurs 1630 times in the sequence.
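A minimal Python sketch of the two numbers in this question, assuming the sequence is a plain string of bases. `gc_content` and `count_dinucleotide` are hypothetical helper names, not functions from Matlab or NCBI.

```python
def gc_content(sequence):
    """Return the fraction of A/C/G/T bases that are G or C."""
    seq = sequence.upper()
    g = seq.count("G")
    c = seq.count("C")
    total = g + c + seq.count("A") + seq.count("T")
    return (g + c) / total if total else 0.0


def count_dinucleotide(sequence, pair="GC"):
    """Count occurrences of a two-base pattern, allowing overlaps."""
    seq = sequence.upper()
    pair = pair.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == pair)
```

The explicit loop in `count_dinucleotide` is a deliberate choice: Python's `str.count` skips overlapping matches, which does not matter for "GC" but would undercount a pair such as "GG".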



For the sequence of //Thermoplasma volcanium//, NC_002689, the G-C count is larger than 50,000: the G base count is 317,147 and the C base count is 315,483. The frequency of each nucleotide is as follows:

A: 30.2% C: 19.9% G: 20% T: 29.9%

As seen in the figure above, each function is listed in the Sequence Viewer section in the top left-hand corner. Underneath the functions are the base count and the percentage of each nucleotide, and under that is the map view of the whole sequence. In the center is the sequence itself, with all codons shown. The top of the window also gives the locus number and the name of the nucleotide sequence. This is the complete genome of //Thermoplasma volcanium//, which is 1,584,804 bp; it is quite large because it is the entire genome of the organism (an archaeon rather than a bacterium).

Under Sequence Viewer there are 9 different sections to look at. The first is the sequence as a whole. Under that is ORF, open reading frame: a stretch of codons that begins at a start codon and runs to a stop codon, which is the part of the sequence that could be translated into amino acids. There are 6 possible reading frames, 3 forward and 3 backward. The start codon is ATG, and the stop codon is one of TAA, TAG, or TGA. The viewer calculates the length of each ORF. Under that is a full translation of the sequence, then the annotated CDS (coding DNA sequence), and then the CDS with its translation. Next is the complement sequence, the strand that pairs with this one when the double helix is split. Then comes the reverse complement sequence, which is used for things like PCR. Finally, the features and comments give credit to the creators and reference the database the sequence came from.
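The complement, reverse complement, and six-frame ORF views described above can be imitated with a few lines of Python (one of the languages the exercise allows). This is only a sketch under a simplified definition, where an ORF runs from the first ATG in a frame to the next in-frame stop codon; the Sequence Viewer may apply other rules, such as a minimum length. All function names here are hypothetical.

```python
# Translation table that swaps each base for its pairing partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def complement(seq):
    """Return the base-by-base complement, in the same direction."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Return the complement read in the opposite direction."""
    return complement(seq)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) indices of ATG-to-stop ORFs in one frame.

    frame is 0, 1, or 2; end is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield (start, i + 3)
            start = None


def six_frame_orfs(seq):
    """Return {frame label: list of ORF strings} for all six frames."""
    result = {}
    strands = (("+", seq.upper()), ("-", reverse_complement(seq.upper())))
    for label, strand in strands:
        for frame in range(3):
            key = f"{label}{frame + 1}"
            result[key] = [strand[a:b] for a, b in orfs_in_frame(strand, frame)]
    return result
```

The three "-" frames are read from the reverse complement, which is why the viewer reports 3 forward and 3 backward frames for a single sequence.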
