Saturday, 25 July 2020

What causes cancer?

This picture shows the acquisition of mutations through the lifetime of a cell as it divides, from a single-celled fertilised egg through to becoming a cancer cell. The lines and symbols show the timing of the somatic mutations acquired by the cancer cell and the processes that contribute to them, and are explained in detail in the text below.
Only around 5% of cancers are due to constitutional mutations in single high-risk genes. The majority of cancers are caused by an accumulation of somatic mutations over the lifespan of an individual.
Imagine a single-cell fertilised egg. Over time, as a human develops from this fertilised egg into an embryo, a baby, a child and then an adult, hundreds of thousands of cell divisions (mitoses) will occur.
Each time a single cell divides, it must undergo replication of all 3 billion bases contained within the genetic code, followed by chromosome segregation and cell division.
The intrinsic machinery of the cell which controls DNA replication and cell division is not perfect, and every time a cell goes through this process, some errors will occur, introducing mutations into the DNA. This is represented by the yellow line above.
As we develop, we are also exposed to environmental and lifestyle factors, such as ultraviolet radiation from the sun, or cigarette smoking, which have the potential to induce mutations into our DNA. This is represented by the blue line above.
The mutations introduced into the DNA by these intrinsic and extrinsic processes can occur anywhere in the genome.
Most of the time, these mutations will not affect the ability of the cell to divide normally. But occasionally they will affect important genes which control cell division.
If a mutation causes a cell to lose control of the normal mechanisms regulating cell division, cell death or DNA repair, then uncontrolled, unstable, cell growth can occur, leading to tumour formation.
This type of mutation is known as a “driver” mutation because it “drives” tumour development. Driver mutations are represented by the stars in the cells above. The other mutations, which do not affect the ability of the cell to replicate normally, will also be passed down through the cell lineage alongside the “driver” mutations.
These are known as “passenger” mutations and are represented by the circles in the cells above.
Cells which have lost the ability to correctly regulate cell division due to driver mutations can accumulate new mutations rapidly as the cell loses the capacity to repair and regulate DNA replication and chromosome segregation, leading to a mutator phenotype represented by the red line above.
Cancer is the result of these abnormal cells, which grow uncontrollably and can invade other tissues. Chemotherapy may be given to kill the cancer cells. In some cases, mutations will occur which prevent the cell from dying in response to chemotherapy, leading to chemotherapy-resistant recurrence of cancer.
Source:- futurelearn.com

Errors in recombination

Structural abnormalities

Prior to segregation in meiosis, homologous chromosomes pair and genetic material is swapped.
This is known as recombination and is key to ensuring offspring have different genetic traits to their parents.
However, if homologous chromosomes (chromosomes of the same pair) misalign, or if non-homologous chromosomes pair, a number of different structural chromosome abnormalities may result, including chromosome deletions/duplications, translocations and chromosome inversions.
Figure: Example of recombination during meiosis

Deletions or Duplications

This refers to the loss (deletion) or gain (duplication) of genetic material. A deletion or duplication is called interstitial when it occurs in the middle of the chromosome and terminal when it occurs at the tip of the chromosome.
Figure: Example of interstitial deletion and duplication
The phenotypic (clinical) effect of a deletion or duplication depends on the genes involved. Deletions are generally more likely to have a phenotypic effect than duplications, and large deletions are likely to be lethal.
One of the commonest microdeletion syndromes is Di George syndrome, involving deletion of a region of chromosome 22 called 22q11.
Di George syndrome can cause a variety of problems, including heart abnormalities, defects of the palate, problems of immunity and calcium control. Children and adults often have a characteristic facial appearance.
Figure: Child with Di George syndrome
With the advent of array testing, an increasing number of recurrent microdeletion and microduplication syndromes are being recognised. This is discussed further in Week 3.

Translocations

A translocation describes when a portion of one chromosome is transferred to another chromosome.
Translocations can be balanced or unbalanced depending upon whether there is a net gain or loss of genetic material. They are broadly classified into Reciprocal or Robertsonian translocations.
a. Reciprocal translocations arise when any two chromosomes swap non-homologous segments. A carrier of a balanced reciprocal translocation may have offspring with an unbalanced translocation i.e. trisomy of one of the translocated segments and monosomy of the other.
Figure: Example of reciprocal translocation
b. Robertsonian translocations describe when two acrocentric chromosomes are “stuck together”. Acrocentric chromosomes are chromosomes where the centromere is very close to the end and include chromosomes 13, 14, 15, 21, 22 and Y.
Figure: Example of Robertsonian translocation

Inversions

An inversion is when a section of the chromosome has broken away, twisted around 180° (i.e. inverted end to end) and re-inserted into the chromosome. If this section spans the centromere, it is called a pericentric inversion. If the inversion does not include the centromere, it is called a paracentric inversion.
Figure: Example of an inversion
Usually, inversions are not associated with any loss or gain of genetic material, so a carrier is asymptomatic (unless a critical gene is disrupted at the breakpoints, which is rare).
A carrier may only become aware of their carrier status if they have a child with an unbalanced arrangement, or if they have chromosome investigations for infertility or recurrent miscarriages.
Source:- futurelearn.com

Thursday, 23 July 2020

Epigenome

Derived from the Greek “epi” meaning “above”, the epigenome describes modifications to the genome that do not affect the DNA sequence but determine whether genes are switched on or off where and when they are needed.
If the genome is analogous to the script of a play, the epigenome is the interpretation of the play by the director and actors.
Two components of the epigenome are DNA modification and chromatin remodeling.
DNA modification describes the addition of chemical compounds to the DNA bases. Methylation of DNA is the commonest modification.
Methylation at the start of a gene usually switches the gene off. Chromatin remodeling describes how changes to the structure of DNA can affect gene expression.
The DNA is wound around histones in order that it can be packaged efficiently into the nucleus as chromosomes. Gene expression can depend on how tightly the DNA is bound to the histones: when it is loosely bound genes are accessible to transcription factors and can be expressed but if it is tightly bound transcription cannot occur.
The tightness of the histone/DNA binding is determined by chemical modifications (again methylation amongst others) that occur to “tails” that protrude from the histone molecule.
Figure: Chromatin remodeling

Heterochromatin is tightly packaged DNA and genes within heterochromatin are less likely to be expressed than genes in less tightly packaged areas of DNA known as euchromatin.
The epigenome is dynamic and responsive to external stimuli. It is believed that certain external stimuli can cause abnormal DNA modifications which, in turn, disrupt normal gene expression.
For instance, we now know that a defective epigenome can contribute, along with genomic mutations, to the development of cancer.
A different group of disorders that arise from abnormalities of the epigenome are the imprinting disorders. Imprinting is an epigenomic phenomenon whereby certain genes are expressed depending upon their parent-of-origin, i.e. whether a gene is “switched on” or “switched off” depends on whether it was inherited from your mother or your father.
In certain regions of the genome, there is a clustering of imprinted genes which are regulated by imprinting control centres.
Abnormal imprinting patterns are responsible for a few rare disorders including (amongst others) Beckwith Wiedemann syndrome (increased tongue size, abdominal wall defects, earlobe creases/pits and an increased predisposition to developing certain childhood tumours); Prader Willi syndrome (intellectual disability, obesity and hyperphagia) and Angelman syndrome (intellectual disability, seizures and characteristic facial appearance).
Epigenomics is a growing field, and one about which we still know relatively little.
However, the study of the epigenome is expanding rapidly and, like our burgeoning knowledge of the genome, our increasing ability to interpret the epigenome is likely to transform how we diagnose and treat both rare and common diseases.
Source:- futurelearn.com

Wednesday, 22 July 2020

History Of DNA

It all started way back in the late 19th century when a German biochemist discovered that the nucleic acids, DNA and RNA, consisted of long chains of subunits known as nucleotides.
Each nucleotide is made up of a base, a sugar and a phosphate. DNA has deoxyribose as its sugar and the bases can be adenine (A), guanine (G), cytosine (C), or thymine (T).
However, it wasn’t until 1943 that an unassuming American scientist, Oswald Avery, proved that DNA carried genetic information. He was pooh-poohed at first – most people thought that it was proteins that carried genetic information, and that DNA was just a boring collection of bases.
Figure: Photograph of Oswald T Avery


However, soon, despite the fact that much of the world was still at war, his discovery was accepted. The scientific spotlight turned to DNA.
Still, DNA’s exact structure remained a mystery.
During the war, an Austrian biochemist named Erwin Chargaff fled the Nazis to America. Chargaff read Avery’s work and immediately focused the work of his laboratory toward the study of DNA. In 1950, Chargaff discovered that the bases A and T and C and G always occurred in a 1:1 ratio, suggesting that they were paired in some way. But this finding remained largely unknown.
By the early 1950s, the race was on to determine the structure of DNA. The American team was led by Linus Pauling at Caltech, and was widely tipped to be the favourite to find the structure first.
In the meantime, two British teams, one based at King’s College London, and another, at Cambridge, worked hard to find the answer.
The Cambridge team was led by two young scientists: American research fellow James Watson and graduate student Francis Crick. They tried to pinpoint the structure by making physical models to narrow down the possibilities.
On the King’s team were Maurice Wilkins and Rosalind Franklin, two scientists who had a notoriously difficult relationship. The King’s team were taking a more experimental approach than the Cambridge scientists, looking at X-ray diffraction images of DNA obtained by Franklin.
X-ray diffraction was a tool that allowed scientists to determine the structure of crystalline molecules by the way they scattered X-ray beams. Franklin was a world expert in crystallography and pioneered the use of this technique to look at complex crystallised solids. She determined that there were two forms of DNA: the crystalline form and the ‘wet’ form, dissolved in water.
In 1951, Watson took a day trip to London to attend a lecture of Franklin’s, in which she presented her initial findings on her photographs of DNA.
He raced back to Cambridge and relayed what he remembered of the lecture to Crick. The pair then used this information to build a new model of DNA; a triple helix, with the bases on the outside of the molecule. Excited, they invited Franklin and Wilkins to their laboratory to test the structure against Franklin’s pictures.
Figure: Photograph 51, X-ray diffraction image of the double helix
© King’s College London

It was wrong. Embarrassing and wrong. Their head of lab told the humiliated pair to stop DNA research. Was DNA as a helix dead?
Maybe not. Over in California, Pauling was building his own models. He asked to see Franklin’s pictures but Wilkins, keen not to hand them over to a competitor in the race, told him they were not ready to share. Nonetheless, in early 1953, Pauling announced that he had discovered the structure of DNA.
Watson panicked. Pauling was his closest rival. Had he got there first?
He studied Pauling’s structure. It was also a triple helix. Watson knew this was wrong and breathed a sigh of relief. But they still couldn’t relax. Pauling would find out his mistake soon enough. Watson and Crick would still have to hurry if they wanted to beat him.
Back in London, Franklin continued to study her X-ray diffraction pictures. By January 1953, her preliminary findings were that DNA in its wet form did show the characteristics of a helix. However, in her typically cautious style, she was not ready to share these findings.
She wanted to confirm them first. Before she could, and apparently without her knowledge or consent, Wilkins, growing frustrated and impatient, showed her results to Watson.
From there, Watson and Crick took a big conceptual step. They suggested that the DNA molecule was made up of two chains of nucleotides, each in a helix, as Franklin found, but with one chain that went up and another that went down. This is what we now call the double helix.
They used Chargaff’s finding about the 1:1 base ratios to add to the model, determining that matching base pairs A and T, and C and G, interlocked in the centre of the double helix, keeping a constant distance between the chains.
They went on to show that each strand of DNA was a template for the other so that DNA can replicate without changing its structure. This explained one of life’s great mysteries: how genetic information can be inherited.
The double helix structure of DNA fit the experimental data perfectly and the scientific community accepted it almost immediately. It was probably the most important biological work of the last century and it forms the basis for the evolving field of genetics and genomics.
In 1962, Watson and Crick won the Nobel Prize in Physiology or Medicine, sharing it with Wilkins. By then, Franklin had sadly died of ovarian cancer, possibly as a result of her work with X-rays.
Now, few people know her name. She, along with others who contributed much to the discovery of the double helix, such as Chargaff and Avery, died without recognition.
Source:- https://www.futurelearn.com/courses/the-genomics-era

Saturday, 18 July 2020

The race for the human genome

One complains the other is ruthless and avaricious; the other accuses its rival of bureaucracy and inefficiency. Like two warring siblings, in some ways they rely on each other, in others, well, they just get in each other’s way.
The mapping of the human genome is the perfect illustration of this conflict.
The idea of mapping the human genome was born in the US, in the mid 1980s. It was an ambitious proposal, to sequence the entire 3 billion DNA bases that make up the human genome and find all the genes contained therein. In fact, some thought it was impossible.
But imagine the possible benefits, advocates argued. What good could be done if we knew in detail the structure, organization and function of the genome?
How could it be done? Why, the same way you’d eat an elephant. A little bit at a time.
Using little bits of chopped-up DNA, and starting with littler genomes. What if they began by sequencing a yeast genome, and a worm’s? That would be excellent practice for the greater task.
As the seed of the idea grew, funding was sought. In the US, the National Institutes of Health and the Department of Energy lent their support. James Watson, a founding father of the new genomic landscape, came on board and gave the wacky idea significant credibility.
The project had big ideas; it welcomed collaborators from all over the world. After all, it wasn’t simply the ‘American’ genome that would be sequenced; the human genome belonged to us all.
Shouldn’t we all strive to understand our shared molecular heritage? The UK was next to come on board, with the Medical Research Council pledging £11 million to the cause. Other countries followed suit: Japan, China, France and Germany joined the US and the UK in forming the initial HGP team (known as the International Human Genome Sequencing Consortium).1
The work was divided up between 20 institutions in these countries and a total of $3 billion of public funds was poured into the project.
The Human Genome Project (HGP) officially launched in 1990 with the aim of completing the work within 15 years. One of its express principles was that of publishing the sequencing information entirely freely within 24 hours of its completion.
This would mean that scientists from whatever background – whether in academia or in industry, could benefit rapidly from their findings.
Such lofty goals; such a beautiful example of cross-cultural co-operation. The HGP was like the Benetton advert of science.
The Consortium’s public sector harmony, however, did not go down well with everyone.
Opposition came in 1998, in the form of Craig Venter, scientist and founder of what was to become Celera Genomics. He was unimpressed by the work of the HGP.
Why was it taking so long, he asked? Why was it using so much public money?
He could do the same job quicker and cheaper. And he planned to patent up to 6,000 genes before releasing their sequences. Venter’s aims and values were, let’s say, “different” from those of the HGP.
The best part? Venter proposed to sequence and assemble the entire human genome in just three years, finishing in 2001.
However, the HGP was already considering accelerating its work, and with increased funding, they were able to. The race was on.
But how was Venter able to be so confident that he could yield results so quickly? It was all down to his approach. The HGP used a technique called ‘hierarchical shotgun sequencing.’
This involved breaking DNA down into overlapping fragments of around 150,000 base pairs. Each fragment was inserted inside a bacterial artificial chromosome (BAC) and cloned.
It was then possible to see where the fragments overlapped without knowing the actual sequence. The overlapping sections were then used as a guide to create a contiguous map.
This process alone took six years. After that, Sanger sequencing was used to sequence each cloned fragment.
This approach, though rigorous and time-consuming, minimized the chances of misassembly, which was a real risk as the human genome has so many repetitive sections.
Venter, on the other hand, planned to use a strategy called ‘whole genome shotgun sequencing.’ This effectively skipped the mapping and cloning phase entirely.
Instead, the DNA was broken into fragments of varying sizes and sequenced directly. The assembly was done by finding regions of overlap between the sequenced fragments.
From the moment Celera was a contender, both groups worked furiously, sequencing and assembling all over the world. Initial hopes for a collaboration between the two groups quickly faded as Celera insisted that data would have to be locked away for five years.
Talks were abandoned. There was a fair amount of public mudslinging. One cried ‘vanity,’ the other, ‘red tape.’ Public versus private. You know how it goes.
However, in 2000, the Consortium faced governmental pressure to resolve their conflict with Venter. Some suggested that the upcoming US presidential election encouraged the calls for a happy resolution.
After all, it was becoming embarrassing: what should have been celebrated as one of humankind’s greatest achievements was descending into a huge row.
Eventually, on 26 June 2000 at a White House Gala, it was announced that both sides had completed their own working draft of the human genome sequence and would work together to publish soon. The race would be a three-legged one for the final leg. Détente was declared.
In February 2001, both groups published their findings simultaneously. Working drafts of over 90% of the human genome were now available. The HGP had delivered its major aim four years ahead of schedule.
The race had ended in a tie.
Did the competition change the outcome, or would the HGP have been on track to finish early anyway? Was it the collaborative approach itself that accelerated the pace of research, rather than any private sector interference? It’s impossible to know.
What is clear, however, is that the HGP laid the groundwork for many important discoveries: the identification of disease-causing genes, as well as advances in sequencing technologies. As Francis Collins, the director of the National Human Genome Research Institute, noted in 2001, the genome can be thought of as a book with multiple uses: “It’s a history book - a narrative of the journey of our species through time. It’s a shop manual, with an incredibly detailed blueprint for building every human cell.
And it’s a transformative textbook of medicine, with insights that will give health care providers immense new powers to treat, prevent and cure disease.”2
Nonetheless, one thing seems certain. While the Consortium’s collaborative, data-sharing approach was unusual and inspiring, the human drive to compete is with us to stay. And those arguing siblings, Private and Public will probably never willingly embrace - unless Mother tells them to.
Source:- Futurelearn.com

Gene sequencing: From Sanger sequencing to next generation sequencing

In the mid-1970s, a scientist called Fred Sanger developed a DNA sequencing method, eponymously known as Sanger sequencing, which revolutionised molecular biology.
Unravelling the genetic code allowed a vast breadth of scientific applications to take place, from basic science through to translational applications such as diagnostic testing and targeted drug therapy, and enabled the Human Genome Project.
Improvements over the years to Sanger’s original method allowed scientists to sequence sections of DNA up to around 600 bases in length. However, because scientists could only sequence one small section of DNA at once, the length of time, and cost, required to sequence whole genomes became a huge limiting factor.
Next generation sequencing (NGS) methods have since been developed which have solved this problem by allowing hundreds of thousands of fragments of DNA to be sequenced at the same time - known as massively parallel sequencing.
Fundamental to both Sanger sequencing and NGS is the principle of using the DNA to be sequenced as a template for DNA synthesis, reading which nucleotide is incorporated, and hence deducing the original complementary template sequence.
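This shared principle can be shown in a few lines of Python: read off which bases were incorporated during synthesis, then take their complement to recover the template. The sequence below is invented purely for illustration.

```python
# Sequencing by synthesis, in miniature: the bases incorporated into the
# growing strand are complementary to the template being sequenced.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

incorporated = "TACGGTA"   # hypothetical bases read during synthesis

# Each incorporated base pairs with the template base opposite it,
# so complementing the read recovers the original template sequence.
template = "".join(COMPLEMENT[base] for base in incorporated)
print(template)            # ATGCCAT
```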
But the fact that you can read many fragments in parallel during NGS has transformed the speed of sequencing, and hence its potential applications, in both clinical and research settings.
Sequencing the human genome during the Human Genome Project using Sanger sequencing cost $1 billion and took 6-8 years to complete. Using current NGS technologies it is possible to sequence an entire human genome in just 1-2 days at a cost of around $1,000.
Over to you
In the 1970s, Intel cofounder Gordon Moore noted that transistors were becoming smaller so fast that every two years twice as many could fit onto a computer chip. He used this observation to model the projected increase in computer processing power and speed, which became known as ‘Moore’s Law’.
During the era of Sanger sequencing, DNA sequencing costs and speed roughly followed Moore’s law. However, with the advent of NGS technologies, the rate of progress increased exponentially, a dramatic deviation from Moore’s law, as shown below.
Figure: Cost per megabase of DNA sequence
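To make the deviation concrete, here is a toy Python sketch of what a Moore’s-law trend implies: cost halving every two years. The starting cost and dates are purely illustrative, not real sequencing data.

```python
# Toy illustration: a Moore's-law cost trend halves every two years,
#   cost(t) = cost_0 * 0.5 ** (t / 2)
# The starting figure below is hypothetical, for illustration only.

def moores_law_cost(cost_0: float, years_elapsed: float) -> float:
    """Projected cost if it halves every two years."""
    return cost_0 * 0.5 ** (years_elapsed / 2)

cost_2001 = 5000.0  # hypothetical $ per megabase in 2001
for year in range(2001, 2016, 2):
    projected = moores_law_cost(cost_2001, year - 2001)
    print(f"{year}: ~${projected:,.2f} per megabase (Moore's-law projection)")

# After around 2008, real NGS costs fell far below any such projection,
# which is the dramatic deviation from Moore's law described above.
```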

Sunday, 12 July 2020

RNA analysis in molecular diagnostics

Most genetic testing is carried out on deoxyribonucleic acid (DNA): mutations causing human disease can usually be found by DNA sequencing or other methods of analysis, and DNA extracted from white blood cells is largely representative of that found in all the other cells of the body.
However, there are certain situations where analysis of RNA rather than DNA is preferable.
Before we describe one situation where it is preferable to analyse RNA rather than DNA (splicing), it is useful to review the central dogma of molecular biology.
The central dogma states that DNA codes for messenger ribonucleic acid (mRNA), which in turn codes for protein, the functional product of genes. After the mRNA has been made from the DNA template (transcription) it must be processed to remove the stuffer sequences (the introns) located between the exons, in a process called splicing.
This is a highly regulated process and relies on particular sequences of bases around the exon/intron boundaries, the splice sites.

Mutations occurring in splice site regions

Mutations which affect conserved sequences in splice site regions are frequently identified, but it is hard to be certain what the exact effect on the splicing process will be. Possible outcomes are:
• Exon skipping, where an exon is absent from the final mature mRNA. This could cause a frameshift, or simply an in-frame exclusion of the exon’s amino acids (see the sketch after this list).
• Inclusion of intronic sequence. This could cause a frameshift, or lead to additional amino acids being present in the protein.
• Activation of a ‘cryptic splice site’, where the effect of the mutation is to change the sequence so that it mimics that of a naturally occurring splice site.
• No effect on splicing.
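Whether a skipped exon causes a frameshift comes down to simple arithmetic: the reading frame is preserved only if the exon’s length is a multiple of three. Here is a minimal Python sketch of that logic, using made-up exon lengths.

```python
# Minimal sketch: does skipping an exon shift the reading frame?
# An exon whose length is a multiple of 3 can be skipped "in frame";
# any other length causes a frameshift downstream.
# The exon lengths below are hypothetical.

exon_lengths = {"exon 2": 120, "exon 3": 87, "exon 4": 64}

for exon, length in exon_lengths.items():
    effect = "in-frame exclusion" if length % 3 == 0 else "frameshift"
    print(f"Skipping {exon} ({length} bp): {effect}")

# exon 2 (120 bp) and exon 3 (87 bp) -> in-frame exclusion
# exon 4 (64 bp) -> frameshift
```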
It is not possible to anticipate the effect of splice site mutations from the DNA sequence. In order to find out the effect of these mutations, and therefore how likely they are to cause disease, the RNA itself can be analysed.

RNA analysis

RNA analysis can use similar sequencing technologies to those used for DNA analysis, but several factors make this a more taxing undertaking.
Some genes are active (‘expressed’) in most of our cells, so the RNA and protein products of these genes can be extracted from the blood, like DNA.
However, many genes are only expressed in particular tissues so in order to obtain RNA from these genes an alternative source may be needed (relatively easy from skin or saliva, less so from internal organs and the brain).
Furthermore, even if a suitable source of RNA is available, RNA is much less stable than DNA and is prone to being attacked by RNase enzymes, so speed and care are needed during sample handling and RNA extraction.
This inherent delicacy of RNA can be circumvented to some extent with the use of proprietary sample stabilisation products such as PAXgene™ blood collection tubes and RNALater™.
An additional problem with RNA analysis is that, in some cases, if the mutation being sought introduces a ‘stop’ codon, a cellular surveillance system targets the mutant RNA molecules for degradation in a process known as nonsense-mediated decay.

cDNA sequencing

In order to use the same techniques to analyse the sequence of RNA bases which are used for DNA analysis, the RNA is converted back to complementary DNA (cDNA) using a viral enzyme known as ‘reverse transcriptase’.
This reverse transcription reaction uses primers in a similar manner to a PCR reaction; primers can be gene-specific, so that only cDNA from a gene of interest is produced, or can be generic, either complementary to the poly-A tail of the mRNA molecule, or short random primers can be used which will bind in multiple places along the RNA molecules.
Once the cDNA has been made it is much more stable and can be used as a template for subsequent Sanger or next-generation sequencing assays. Sequence data can be aligned to a reference sequence and scientists will look for missing exons or intronic material, indicating that the normal splicing process has been disrupted.
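As a rough sketch of these two steps, reverse transcription followed by a scan for missing exons, here is a toy Python example. The sequences are invented, and real analyses align reads with dedicated tools rather than naive substring checks.

```python
# Toy sketch of cDNA analysis (sequences are invented; real pipelines
# use proper alignment tools, not substring matching).
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def reverse_transcribe(mrna: str) -> str:
    """Return first-strand cDNA: the reverse complement of the mRNA."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna))

mrna = "AUGGCCUUUGGAUGA"          # hypothetical spliced transcript
cdna = reverse_transcribe(mrna)
print(cdna)                        # TCATCCAAAGGCCAT

# Naive check for exon skipping: is each expected exon represented in
# the transcript (here compared against the mRNA-sense sequence)?
expected_exons = ["AUGGCC", "UUUGGA", "UGA"]   # hypothetical exons
for exon in expected_exons:
    status = "present" if exon in mrna else "MISSING (possible skipping)"
    print(exon, status)
```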

Minigene assays

Minigene assays are a useful tool for looking for splicing changes when an RNA sample is unavailable, for instance when the gene of interest is not expressed in a tissue which is easily obtained, or if the patient is deceased.
DNA from the patient is extracted and PCR is used to amplify the gene of interest, including any mutations present in the patient’s sequence. This PCR product can then be cloned into a vector (a circular DNA molecule) and transfected into cultured cells.
The vector includes the necessary signals needed for the cells to express the gene and the RNA can be extracted from the cell culture and reverse-transcribed and sequenced as if from the patient’s own tissue.

Genome-wide association studies (GWAS)

SNP genotyping is the process of finding out which of the four nucleotides is present at a specific location in our genetic code, and SNP arrays are used to genotype thousands to millions of SNPs at the same time. One of the reasons for undertaking a SNP array is to do a Genome-wide Association Study, or GWAS, which we will discuss in this short tutorial. The main topics covered in this tutorial are the type of variation identified by GWAS, the purpose of a GWAS, and GWAS design, i.e. which SNPs we are going to type, or look at, and what the discovery and replication phases are.

Let us first briefly revise the different types of genetic susceptibility to disease. On the x-axis of this graph, we see allele frequency, or rather how rare or common a specific genetic factor is in the population. On the y-axis, we see effect size. Effect size represents how likely a person is to develop the associated genetic condition if they inherit the causative genetic variant. For most of this course, we have been discussing ways to identify high-risk single gene or chromosome disorders, which are individually rare in the population but which, if present, confer a high likelihood of a person developing a genetic condition. These variants fall into the top left-hand corner of this graph.

Genome-wide Association Studies aim to identify genetic variation at the bottom right-hand side of the graph, specifically SNPs. These disease-associated SNPs are common in the population. In some cases, up to half of the population may carry a different nucleotide at a single genomic position. However, the risk of developing the associated condition if you carry the specific SNP is low, perhaps only 1.1 to 1.4 times that of a person who does not carry the variant. If the effect sizes of these variants are so low, then why are we interested in them at all? There are two main reasons.

Firstly, these low-effect variants can sometimes give interesting insights into the biological pathways underlying disease which were previously unknown, and this can occasionally give researchers new therapeutic targets. Secondly, even though each single variant has a low effect size, we know that for a given disease, let’s say breast cancer, many of these lower-risk variants are associated. In breast cancer, over 70 disease-associated SNPs have been identified, and it has been shown that the individual SNP risks multiply together if you carry more than one, increasing the risk of disease. So a person who carried all 70 breast cancer-associated risk SNPs would have a high risk of breast cancer, even though each individual SNP is low risk on its own.
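To see how such small risks compound, here is a toy Python calculation. The per-SNP relative risk of 1.1 is illustrative only, and real polygenic risk scores are calibrated against the population average, so this is an upper-bound caricature rather than a clinical estimate.

```python
import math

# Toy illustration: multiplicative combination of per-SNP relative risks.
# The values are invented for illustration only.
per_snp_risks = [1.1] * 70           # 70 SNPs, each conferring 1.1x risk

combined = math.prod(per_snp_risks)
print(f"All 70 risk SNPs: {combined:.1f}x relative risk")   # ~789.7x

# Carrying only a handful has a far smaller effect:
print(f"5 such SNPs: {math.prod([1.1] * 5):.2f}x")          # ~1.61x
```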

This information can, therefore, be used to stratify women into breast screening groups, depending on how many of the risk SNPs they carry. As SNPs are much more common than rare genetic variants, testing also has potential to help more people across the population, and could be used in public health programmes. Ultimately, it is hoped that if we can use SNPs to identify high-risk individuals, we might even be able to prevent certain types of disease. Over the next few slides, we will discuss how GWAS are undertaken. At the most basic level, GWAS are case-control experiments. We take a single SNP.

Here, for example, we see a C to G substitution, and we can look to see whether the G nucleotide is more common in cases than in controls. If it is, and we can prove this statistically, then we can say that people with the G nucleotide are more likely to get a certain condition than people with the C nucleotide. Sound simple? Great. But unfortunately, here’s where it gets complicated. We know of at least 10 million SNPs in the human genome. Testing all 10 million SNPs in 1,000 cases and 1,000 controls would cost around $10 billion for each disease. So that’s out of the question. Somehow, we have to cut down the number of SNPs we look at.

So how do we decide which of those 10 million SNPs to test? Luckily, these 10 million SNPs are not inherited independently of one another. SNPs which are close together in the genome are more likely to be inherited together than SNPs which are further apart, because SNPs which are close together are less likely to have a recombination event occur between them. And if you’re not sure what a recombination event is, now might be a good time to pause the video and have a quick revision session on when and why recombination occurs.
The fact that some SNPs are almost always inherited together means that we can use a smaller number of SNPs to tag the total number of SNPs. We can do this because we know which SNPs are usually inherited alongside other specific SNPs, and we can therefore use those as a proxy for the others. Confused? Let’s go through it slowly. In this schematic, eight SNPs are present, and the six lines of genetic code represent six different alleles. At SNP position 1, either an A or a G can be present. At SNP position 2, either a C or a T can be present. At SNP position 3, either a G or a C can be present. And so on.

However, we can see that the eight SNPs are not inherited independently of each other. If SNP 1 is an A, then SNP 2 and SNP 3 are always a C and a G respectively. If SNP 1 is a G, then SNP 2 and SNP 3 are always a T and a C respectively. A sequence of SNPs which are always inherited together is known as a haplotype, and the fact that the SNPs are inherited together means that they are in linkage disequilibrium with each other; they are linked SNPs. So those first three SNPs form a block, and we’ll call that Block 1. This linkage disequilibrium breaks down when we get to SNP 4.

It doesn't matter whether you have the ACG haplotype or the GTC haplotype, you can inherit either an A or a G nucleotide at position four. However, we can see that SNPs 4 and 5 are in linkage disequilibrium with each other. An A nucleotide in SNP Four position is always inherited with a G SNP Five, and a G in Position Four, with a C in position Five. So we can call this Block Two. There is one further block of linkage disequilibrium in this diagram. SNPs 6, 7, and 8 are only present as one of two haplotypes, TAT or ACC, but either of these haplotypes can be present, no matter which of the two previous haplotypes are present.

And this is Block 3. You can see now that we can use three SNPs to infer the whole eight-SNP haplotype across this region of the genome. These three SNPs are known as tagging SNPs, and this allows us to genotype a much smaller number of SNPs at a lower cost to undertake our Genome-wide Association Study. Luckily for us, a huge amount of work identifying tagging SNPs across the whole genome has already been undertaken. This graph is a linkage disequilibrium map. At the very top, the black line indicates a region of the genome, and the lines on the map indicate specific SNPs at a given location in that region. The SNPs are identified by their rs numbers.
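The block structure just described can be written down directly. In this toy Python sketch, which mirrors the three blocks in the schematic above, knowing one tagging-SNP allele per block recovers all eight alleles:

```python
# Toy model of the three haplotype blocks described above.
# Within a block, the first SNP (the tagging SNP) determines the rest.
BLOCKS = [
    {"A": "ACG", "G": "GTC"},   # Block 1: SNPs 1-3
    {"A": "AG",  "G": "GC"},    # Block 2: SNPs 4-5
    {"T": "TAT", "A": "ACC"},   # Block 3: SNPs 6-8
]

def reconstruct(tag_alleles):
    """Recover the full 8-SNP haplotype from 3 tagging-SNP alleles."""
    return "".join(block[allele] for block, allele in zip(BLOCKS, tag_alleles))

print(reconstruct(("A", "G", "T")))   # ACGGCTAT
print(reconstruct(("G", "A", "A")))   # GTCAGACC
```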

The coloured boxes in the pyramid indicate how likely it is that one SNP will be inherited along with another, i.e. that there will not have been a recombination event between them. If a box is blue, it means it is highly likely that the two SNPs will be inherited together, whilst orange and yellow boxes indicate that it is less likely. Each box in the pyramid represents the relationship between the two specific SNPs linked by that triangle. For example, this box indicates how likely it is that SNP rs1882478 and SNP rs2285647 will be inherited together.

You can see that these are the two SNPs at the extremes of this genomic region; that is to say, they are far apart. Unsurprisingly, this box is orange, indicating that they are unlikely to be inherited together. This box indicates how likely it is that SNP rs1922243 and SNP rs2373588 will be inherited together. They are in a big block of blue, indicating that this region is highly likely to be inherited as a whole chunk of genome, like one of the blocks seen on the previous slide.

If you like, you can now pause the video and see if you can work out why the tagging SNPs have been placed where they have across the region, and then comment on this with your fellow learners below. Now that we have decided which SNPs to look at in our GWAS (normally around 300,000 tagging SNPs), let us consider the design in more detail. Most GWAS have a two-step or even a three-step process. The first step is the discovery phase, which genotypes a large genome-wide SNP panel in a smaller number of cases and controls.

The purpose of this step is not to prove that a single SNP is statistically associated with disease; it is to generate good candidates for follow-up in a larger series of cases and controls. The best hits are the SNPs with the highest statistical association with disease. This has to be a very low p-value, p < 5 × 10⁻⁵, to compensate for the fact that we are looking at so many SNPs at once and are therefore likely to generate a large number of false-positive hits just by chance.

The second stage in a GWAS is a replication stage, where a smaller number of candidate SNPs are genotyped in a larger case-control series to try to prove statistical association with disease. The replication phase may be repeated more than once if necessary. This is the most cost-effective method for identifying disease-associated SNPs, and the multi-step design is required because early GWAS hugely underestimated the level of statistical significance required for a variant to be consistently associated with a disease. This robust replication approach ensures the association is much more likely to be real. This slide shows how data from the discovery phase is represented: it is known as a Manhattan plot.

Along the x-axis are the chromosomes, and along the y-axis is the negative log of the p-value (-log10 p). The higher up you go along the y-axis, the less likely it is that the association of the SNP with the disease occurred by chance. Each dot represents an individual SNP. So in this study, there are a number of SNPs present in a region on chromosome 9 which look highly statistically likely to be associated with disease. These SNPs will be taken forward into the replication study, where they will be examined in a larger case-control series.
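As a small illustration of that y-axis, the Python sketch below converts some invented p-values to the -log10 scale and flags those crossing the discovery threshold mentioned earlier. The rs numbers and p-values are placeholders, not real results.

```python
import math

# Invented p-values for a handful of SNPs (rs numbers are placeholders).
p_values = {"rs0001": 0.3, "rs0002": 2e-4, "rs0003": 8e-7, "rs0004": 0.04}
THRESHOLD = 5e-5   # discovery-phase cut-off quoted in the text above

for snp, p in sorted(p_values.items(), key=lambda kv: kv[1]):
    height = -math.log10(p)            # the y-axis of a Manhattan plot
    flag = "candidate for replication" if p < THRESHOLD else ""
    print(f"{snp}: -log10(p) = {height:5.2f} {flag}")
```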

Here is an example from a real GWAS undertaken in 2007 and published in the journal Nature. In this study, the initial discovery phase typed 266,722 SNPs in roughly 400 cases and 400 controls. Of these SNPs, 13,023 were chosen to be typed in a larger series of approximately 4,000 cases and 4,000 controls. A further replication study of 31 SNPs in approximately 24,000 cases and 24,000 controls resulted in six SNPs reaching statistical significance for disease association. You can see that as you reduce the number of SNPs being genotyped, you can increase the number of individuals in the case-control study without the study costing a prohibitive amount, as well as increasing the statistical robustness of the study.
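Taking the quoted figures at face value (and treating “samples” as cases plus controls), a quick Python calculation shows how the genotyping burden shrinks at each stage even as the cohort grows:

```python
# Approximate figures quoted above; "samples" = cases + controls.
stages = [
    ("discovery",     266_722,    800),
    ("replication 1",  13_023,  8_000),
    ("replication 2",      31, 48_000),
]

for name, n_snps, n_samples in stages:
    genotypes = n_snps * n_samples
    print(f"{name}: {n_snps:,} SNPs x {n_samples:,} samples "
          f"= {genotypes:,} genotype calls")

# discovery:     ~213 million genotype calls
# replication 1: ~104 million
# replication 2: ~1.5 million
```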

And this is the premise of modern GWAS. In this tutorial, we have looked at the type of variation identified by GWAS, the role of tagging SNPs and the concept of linkage disequilibrium, and discussed the purpose and basic design of a GWAS. There is a lot of information in this tutorial, and you may wish to replay the video, pause it, make your own notes and do your own background reading at the same time. In the next couple of steps, you will have a chance to review a GWAS paper, and then hear from a researcher involved in that study about her experience.

SNP arrays

We have discussed the vast amount of variation contained within the human genome, and two different ways in which this variation can be identified - the use of array CGH to identify copy number (dosage) variants and the use of next generation sequencing to read the code of the DNA itself.
There is another method of identifying variation in DNA which is commonly used in the research setting, known as SNP genotyping. SNP genotyping is usually carried out using SNP arrays, which can genotype millions of SNPs at once.
As stated previously, single nucleotide polymorphisms (SNPs) are the most common type of variation in the human genome, and represent the substitution of a single base with another.
At least 50 million SNPs have been identified in the human genome and they can be found across the entire genome. It is possible to look at these single points in the DNA code to see whether a person has a SNP in that position within their DNA.
Look at the code of a gene from the family below:
The father has two copies of a G nucleotide at position 5 and the mother has two copies of a C nucleotide at position 5.
Figure: parental genotypes at position 5 (© St George’s, University of London)
Their children inherit one copy of this piece of genetic code from their father and one copy from their mother. They have one copy of the G nucleotide at position 5 and one copy of the C nucleotide at position 5.
Figure: the children’s genotypes at position 5 (© St George’s, University of London)
Instead of reading the sequence of the gene, SNP genotyping answers the question: “What nucleotide is present at position 5?”.
In this case the answer would be:
Father = G/G
Mother = C/C
Daughter = G/C
Son = G/C

In this case, the father is HOMOZYGOUS for this SNP (i.e. the same “G” base on both alleles), and the mother is HOMOZYGOUS for this SNP (the “C” base on both alleles). The daughter and son are HETEROZYGOUS for this SNP (i.e. different bases (G/C) on each allele).
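Classifying genotype calls like these is straightforward to express in code. Here is a minimal Python sketch using the family above:

```python
# Genotypes of the family described above (allele pairs at position 5).
family = {"father": ("G", "G"), "mother": ("C", "C"),
          "daughter": ("G", "C"), "son": ("G", "C")}

def zygosity(allele1, allele2):
    """Homozygous if both alleles match, heterozygous otherwise."""
    return "homozygous" if allele1 == allele2 else "heterozygous"

for person, (a1, a2) in family.items():
    print(f"{person}: {a1}/{a2} ({zygosity(a1, a2)})")
```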
SNP arrays can identify the specific nucleotides present at millions of different positions across the genome where SNPs are known to exist.
This technology can be used in different ways in both clinical diagnostic and research settings, but here we will concentrate on one use: Genome-wide association studies (GWAS).
GWAS are used to identify whether common SNPs in the population are associated with disease. This can be done by undertaking a case:control study to see whether a specific SNP is more common in people with a specific condition, compared to those without the condition.
Note that a SNP that is found to be associated with a disease may not in itself be disease-causing, but may instead be “linked” to the disease causing variant, so that it works as a marker of disease.
This is known as being in linkage disequilibrium with a disease-causing variant, and will be discussed further in the next few steps. This “tagging” SNP can then be used as a genomic marker to identify nearby genes or variants that have a role in the biological pathways underlying disease pathogenesis.
Take our position 5 SNP above. The purple circles represent the “G” nucleotide and the white circles represent the “C” nucleotide.
Here we can see people with a condition (cases) are more likely to have the “G” nucleotide than people without the condition (controls) who are more likely to have the “C” nucleotide.
A test can be done to see if this difference is statistically significant. If it is, then the “G” nucleotide is said to be associated with that specific disease.
Figure 1: Case-control study (© St George’s, University of London)
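A standard way to run such a test is a chi-squared test on the 2×2 table of allele counts. The sketch below uses SciPy with invented counts, so the numbers are illustrative only:

```python
from scipy.stats import chi2_contingency

# Invented allele counts for the position-5 SNP:
#               G    C
# cases      [180, 120]
# controls   [120, 180]
table = [[180, 120],
         [120, 180]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")

# A small p-value would suggest the G allele is associated with the
# disease (subject to correction for the many SNPs tested genome-wide).
```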

GWAS look at hundreds of thousands of SNPs across the whole genome to see which of them are associated with a specific disease. Whilst many thousands of SNPs have been found to be associated with many different diseases, the actual level of increased risk conferred by an individual SNP is almost always low, usually between 1.1 and 1.4 times.
The low level of increased risk of disease conferred by individual SNPs means that SNP data is not currently of use clinically. However, this is the main type of test currently undertaken by direct-to-consumer genetic tests freely available over the internet.

Over to you

1) People from different ethnicities have different numbers of SNPs in their genomes. Some SNPs are only present in one ethnic group, and not present in others.
Why might this be?
How might this information be used for anthropological purposes?
2) SNPs are identified through “SNP genotyping”.
Source:-
