A Comparison of kinetoplastid guide RNAs (gRNAs) versus T cell receptor recombination signal sequences, and immunoglobulin recombination signal sequences (RSS) using a G:U intermediate

Alice C. Lichtenstein, M.S.

aclsnippets@gmail.com

 

Introduction

 

Kinetoplastids, which include the infectious agents of diseases such as sleeping sickness and Chagas disease, have an unusual single mitochondrion. It contains thousands of DNA minicircles that each encode a guide RNA (gRNA), as well as roughly 10-50 DNA maxicircles that carry abbreviated (pre-edited) forms of at least 2 rRNA genes and 18 structural genes (Gao et al., 2001). To make an mRNA that can be translated, overlapping gRNAs are aligned along the pre-edited sequences and uridines are inserted to produce the mRNA that will eventually be translated (Horvath et al., 2000; Gao et al., 2001).

It is an involved process that this author will not attempt to explain. Only the overlapping gRNA sequences that create translatable mRNAs, as described in Gao et al. (2001), are used here.

In the simplest description, the gRNA network, when aligned along the edited mRNA, appears like a series of "bridges" whose ends overlap. Essentially, the overlapping gRNAs act as a template for uridine insertions into the mRNA that will be used for translation.

 

As in previous Snippets installments, this author used G:U complement sequences to make comparisons between guide RNAs and recombination signal sequences (RSSs).

 

Methodology and some Basics about V(D)J recombination

 

Using a G:U complement

To make the comparisons between the recombination signals and gRNAs, a G:U complement was made to the gRNAs following certain "rules":

All the nucleotide sequences were considered as "strings of beads" that had no other meaning, for the present, as to function, location, species, etc. They were analyzed in both the 5'-3' and 3'-5' directions.

T and U were used interchangeably, and an uncommon nucleotide like dihydrouridine (D) was considered to be uridine because the only sequence interaction used in this analysis was complementation.

Proteins were not considered in any of the methods.

The method used to compare gRNAs with recombination signal sequences follows the procedure illustrated below. (The anti-PAS sequence of a retroviral primer is used as an example to show how to generate a G:U complement, and has nothing to do with gRNAs or recombination signal sequences.)

In all the graphics for results, the gRNA sequence of interest (white squares) is shown with a complement composed of only G's and U's, (gray squares).

A match is made by generating a reverse complement to the G:U complement/intermediate as follows.

 

Generating a G:U Complement and a Reverse Complement for Matches

Anti-PAS sequence of retroviral primer    C      C      A    G    G    G
G:U complement                            G      G      U    U    U    U
Reverse complement                        C/T/U  C/T/U  A/G  A/G  A/G  A/G
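The chart above can be reproduced with a few lines of Python. This is only a sketch: the mapping (pyrimidines pair with G, purines pair with U) is inferred from the CCAGGG example, and the names gu_complement and reverse_complement_options are made up for illustration.

```python
# Assumed mapping, inferred from the CCAGGG -> GGUUUU example:
# pyrimidines (C, U, T, and D treated as U) pair with G; purines (A, G) pair with U.
GU_COMPLEMENT = {"C": "G", "U": "G", "T": "G", "D": "G", "A": "U", "G": "U"}

# Bases that can pair back with each letter of a G:U complement.
PAIRS_WITH = {"G": ("C", "T", "U"), "U": ("A", "G")}


def gu_complement(seq):
    """Return the G:U complement, position by position (no reversal)."""
    return "".join(GU_COMPLEMENT[base] for base in seq.upper())


def reverse_complement_options(seq):
    """For each position, list the bases that are allowed in a match."""
    return [PAIRS_WITH[letter] for letter in gu_complement(seq)]


if __name__ == "__main__":
    print(gu_complement("CCAGGG"))
    for options in reverse_complement_options("CCAGGG"):
        print(" or ".join(options))
```

Each position of the sequence of interest ends up with two or three allowed bases, which is why more than one match is possible for a single sequence.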

Using this methodology, some of the "matches" for CCAGGG were:

Anti-PAS sequence of retroviral primer    C C A G G G
G:U complement                            G G U U U U
homology                                  C C A G G G
match                                     T C A G G G
match (telomere repeat, humans, mice)     T T A G G G
match                                     C C G G G G
match                                     C T A G G G
match                                     C T G G G G
match (telomere repeat, Tetrahymena)      T T G G G G
match to G:U complement                   G A T T T T
match to G:U complement                   A A C C T C
match to both                             T T T T T T
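The whole-sequence entries in this chart can be checked mechanically. The sketch below assumes that a candidate "matches" the sequence of interest when both produce the same G:U complement, and that it "matches the G:U complement" when it produces the same G:U complement as the complement itself does. The function name kind_of_match is hypothetical.

```python
# Assumed pairing: pyrimidines pair with G, purines pair with U.
GU = {"C": "G", "U": "G", "T": "G", "A": "U", "G": "U"}


def gu_complement(seq):
    """Return the G:U complement, position by position."""
    return "".join(GU[base] for base in seq.upper())


def kind_of_match(candidate, seq_of_interest):
    """Classify a candidate of equal length against the sequence of interest."""
    if len(candidate) != len(seq_of_interest):
        raise ValueError("sequences must have the same length")
    cand = gu_complement(candidate)
    target = gu_complement(seq_of_interest)
    if cand == target:
        return "sequence"      # pairs with the same G:U complement
    if cand == gu_complement(target):
        return "complement"    # lines up with the G:U complement itself
    return "neither"           # possibly a split match, handled separately


if __name__ == "__main__":
    for cand in ["TCAGGG", "TTAGGG", "TTGGGG", "GATTTT", "AACCTC", "TTTTTT"]:
        print(cand, kind_of_match(cand, "CCAGGG"))
```

A candidate such as TTTTTT is reported as "neither" here because it matches the sequence of interest over part of its length and the G:U complement over the rest.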

 

The fact that a match can be found between the retroviral primer anti-PAS site and the human and Tetrahymena telomere repeats does not necessarily mean that there is any known biological relationship between them.

It is important to note that a match can be made to both the positive strand (the sequence of interest, in white squares) and its G:U complement (in gray squares); see the last line of the chart above. This author made the arbitrary decision that each section of a match split between the sequence of interest and its G:U complement had to be at least 4 or 5 nucleotides in length.
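A split match can be described by labelling each position of a candidate as agreeing with the sequence of interest ("S") or with its G:U complement ("C") and then measuring the length of each run. Because every base is either a purine or a pyrimidine, every position agrees with one of the two, so a minimum run length is what keeps a split match from being trivial. The sketch below is one possible way to do this; the function names and the default threshold of 4 are assumptions, and the input is assumed to contain only A, G, C, T or U.

```python
from itertools import groupby

PURINES = set("AG")


def is_purine(base):
    """True for A or G; every other base is assumed to be a pyrimidine."""
    return base.upper() in PURINES


def split_runs(candidate, seq_of_interest):
    """Return runs such as [("S", 2), ("C", 4)] along the candidate."""
    if len(candidate) != len(seq_of_interest):
        raise ValueError("sequences must have the same length")
    labels = [
        "S" if is_purine(c) == is_purine(s) else "C"
        for c, s in zip(candidate, seq_of_interest)
    ]
    return [(label, len(list(group))) for label, group in groupby(labels)]


def meets_min_run(runs, min_run=4):
    """Check that every section of the split is at least min_run long."""
    return all(length >= min_run for _, length in runs)


if __name__ == "__main__":
    runs = split_runs("TTTTTT", "CCAGGG")
    print(runs, meets_min_run(runs))
```

In the six-base example, TTTTTT splits into a two-base section and a four-base section, so a threshold of 4 or 5 is only meaningful for the longer gRNA and RSS sequences.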

In essence, one is substituting a purine for another purine or a pyrimidine for another pyrimidine to make a match. One could use the letters "Y" for pyrimidine and "R" for purine instead of G and U, but this author liked G and U. Even the numbers 1 and 0 could be used, provided they were each defined as a specific set of nucleotide bases (i.e., 1 stands for A or G and 0 stands for C, U, or T).
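The alternative encodings mentioned here are interchangeable with the G:U letters. As a small illustration (the helper names to_ry and to_bits are hypothetical), a sequence and any of its matches collapse to the same string under either encoding:

```python
# Purine/pyrimidine classes; T and U are treated as the same base.
RY = {"A": "R", "G": "R", "C": "Y", "U": "Y", "T": "Y"}
BITS = {"R": "1", "Y": "0"}


def to_ry(seq):
    """Encode a sequence as R (purine) and Y (pyrimidine)."""
    return "".join(RY[base] for base in seq.upper())


def to_bits(seq):
    """Encode a sequence as 1 (A or G) and 0 (C, U or T)."""
    return "".join(BITS[letter] for letter in to_ry(seq))


if __name__ == "__main__":
    for seq in ["CCAGGG", "TTAGGG", "TTGGGG", "GGUUUU"]:
        print(seq, to_ry(seq), to_bits(seq))
```

CCAGGG, TTAGGG and TTGGGG all encode as YYRRRR, while the G:U complement GGUUUU encodes as the opposite pattern, RRYYYY.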

Unlike homologies, there can be more than one match for any given sequence of interest.

One author, working with homologies, used 80% similarity as a criterion, and this author has kept that at the back of her mind when subjectively making matches.

This author used a number of papers from the 1980s because she found that using large genomic databases had the following problem. She would locate a sequence in the large database and then when she went back to find it, the database had been updated and the sequence was not at the same location. (The nucleotide numbers were different). So, if the database had a manuscript reference for the original sequence and the original sequence was still the same as the one in the database she used the original paper as a reference.

This author avoided consensus sequences and used actual sequences directly from the literature whenever possible.

 

 

VDJ Recombination

To simplify the explanation of recombination signals vs. kinetoplastid guide RNAs, this author will concentrate on the somatic VDJ recombination event that creates immunoglobulin heavy chain (IgH) genes in B lymphocytes and, similarly, the recombination event that makes T cell receptors in thymocytes. For a further explanation of how this occurs, I refer you to any basic immunology text. For the moment, I will elaborate on the somatic rearrangement used to make the immunoglobulin heavy chain (IgH), although the literature says that the genes for T cell receptor beta and T cell receptor delta are "fashioned" in the same way.

An immunoglobulin gene cluster in Homo sapiens contains a family of V (variable) genes, a family of D (diversity) genes and a family of J (joining) genes. There is also a series of introns and exons that create a C (constant) region that is part of the final IgH gene sequence, but since the C region does not involve a recombination signal sequence (RSS), constant region sequences were not used in comparisons.

Each final IgH gene has one V gene sequence, one D gene sequence and one J gene sequence that are "brought together" and held together by conserved heptamer and nonamer sequences separated by twelve or twenty-three nucleotides. These are recombination signal sequences (RSSs), whose heptamer and nonamer sequences serve as binding sites for the proteins that will split one V, D or J gene from DNA and ligate them together to form a VDJ sequence.

Selected D and J gene sequences are joined first, followed by a V gene ligated to the D-J sequence. This series is also proposed for the T cell receptor beta (TCRβ) and T cell receptor delta (TCRδ).

The Immunoglobulin kappa and lambda light chains and T Cell receptor genes TCR alpha and TCR gamma do not have a D gene component and so have only a V-J rearrangement.  Hence, the whole mechanism is usually written V(D)J recombination.

As mentioned before, the RSSs come in two sizes, one with 12 nucleotides between the heptamer and nonamer conserved sequences and another with 23 nucleotides between them. Below are various graphics showing the different configurations for VDJ and VJ recombination.

RSSs are the binding sites for the proteins that assemble the selected V(D)J genes.
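The layout of an RSS (heptamer, then a 12- or 23-nucleotide spacer, then nonamer) can be expressed as a simple search. The sketch below looks for exact copies of the conserved CACAGTG heptamer and ACAAAAACC nonamer, which is stricter than the G:U comparisons used in the rest of this installment. The function name find_rss and the toy sequence are made up for illustration.

```python
HEPTAMER = "CACAGTG"
NONAMER = "ACAAAAACC"
SPACERS = (12, 23)


def find_rss(dna, heptamer=HEPTAMER, nonamer=NONAMER, spacers=SPACERS):
    """Return (heptamer_start, spacer_length) for each exact RSS found."""
    dna = dna.upper().replace("U", "T")
    hits = []
    start = dna.find(heptamer)
    while start != -1:
        for spacer in spacers:
            n_start = start + len(heptamer) + spacer
            # A slice that runs past the end is shorter than the nonamer
            # and therefore cannot be equal to it.
            if dna[n_start:n_start + len(nonamer)] == nonamer:
                hits.append((start, spacer))
        start = dna.find(heptamer, start + 1)
    return hits


if __name__ == "__main__":
    # Toy sequence: two flanking bases, heptamer, 12 N's, nonamer.
    toy = "GG" + HEPTAMER + "N" * 12 + NONAMER + "TT"
    print(find_rss(toy))
```

The search reads in one direction only; an RSS on the other strand would need the same search run on the complementary sequence.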

 

 

 

 

VJ gene rearrangements

 

 

 

 

 

 

 

 

Additionally, it should be noted that the immunoglobulin nonamer sequence shows some variation from the most frequently reported ACACAAACC: ACAAAAACC, GCAGAAACC, TCAGAAACC, etc. (Yu et al., 2002).

For T cell receptor results, this author used an octamer conserved sequence rather than a heptamer sequence because, after looking at the list from Rowen et al. (1996) (figure 2), she thought the conserved sequence was CACAGYYY, where Y is a pyrimidine, and not CACAGTG. However, when making comparisons, I used the 8-base sequence reported for each V gene RSS.

 

 

Results

First, some comments on the gRNA sequences used for comparisons from the Gao et al. (2001) paper. Although the authors displayed the overlapping gRNA sequences and the subsequent expanded mRNA as well as the aligned amino acids, they did not say what protein or part of a protein was being coded. Other literature on this topic was confusing for this author, so the results below will not attempt to show any correlation of edited mRNA with protein translation and will simply be confined to matches between gRNAs and RSSs.

Matches for heptamers, octamers and especially nonamers, seemed to occur mostly in the overlap regions of  gRNAs. 

When making matches, it became obvious that certain immunoglobulin and certain TCR sequences were very similar. This was not surprising considering that they both use RAG1 and RAG2 proteins that bind to conserved heptamers and nonamers for recombination. Unfortunately, although there were references describing the use of RAG1 and RAG2 proteins for immunoglobulin somatic DNA rearrangements, this author couldn't find a specific reference for their use in T cell receptor somatic rearrangements. Most of the papers assumed that this was the case.

 

 

Highlights  of gRNA vs RSS comparisons through a G:U complement

If we look at the results in detail, all the matches for heptamers, octamers and nonamers for both immunoglobulin and T cell receptor RSS sequences were found under the overlap regions (bolded letters) of the gRNAs used. In fact, some of the matches extended into overlapping gRNA sequences.

The three matches to gRNA gG4-V below show the variety of comparison results using a G:U complement. (Unlike homologies, you can have more than one match for a given sequence of interest.) The most convincing result is for IgHV1-3, as it is derived entirely from the sequence of interest and no part from the G:U complement itself. If one accepts that "CCC" is excised from the IgHV1-3 sequence, it is also a good match for the IgH V1 family series as reported in Yu et al. (2002).

The IgKJ-2 RSS match is consistent with the fact that the conserved heptamer and nonamer sequences for this gene are the complements of CACAGTG and ACAAAAACC, and there are 23 nucleotides in between them.

The results for gG4-V also include a match for a TCRB V gene RSS that is consistent with the prevailing opinion that T cell receptor somatic recombination uses the same proteins as Immunoglobulin recombination.

 

 

 

The result below for gND9-VI is extremely interesting as it includes an actual gene, IgHD1-1, and uses an overlapping second gRNA as part of the match. If you look at the graphic for V(D)J recombination below, two complementary heptamers flanking a D gene sequence are consistent with the match for a VDJ recombination. To see if there was a match for a nonamer in the 5' direction of the IgHD1-1 gene, this author continued along the 5' overlap (UUAUUGAUAAGUGU) using gND9-VII (not shown) to see if there was a match for a nonamer sequence 12 nucleotides 5' to the 5' heptamer. There was a match for the 5' nonamer, but not for the 12 nucleotides in between.

 

So, the "highlights" graphics show  that matches can be made to all three immunoglobulin gene families as well as T cell receptor genes and that sometimes the matching alignments extend along the overlapping gRNAs. Many other matches were found, and a laundry list of other matches is shown below to demonstrate  the variety and number of matches that could be made. Again, one can see that there is a similarity in some cases between TCR sequences and immunoglobulins, between various V gene families and, of course, within gene families themselves.

Unfortunately, none of them are perfect (i.e., showing a complete match between an RSS and a gRNA from the strand of interest only), but there are thousands of gRNAs and many V, D, and J gene RSSs. This huge number of candidate sequences for comparison again raises the question: would "perfect" results be random events, and if perfect matches are found, do they mean anything?

But first, the laundry list of some other matches that were made.

 

Other matches between gRNAs and RSSs through a G:U complement

 

 

 

 

 

 

 

 

(Author's note: Since the laundry list above proved that I could make matches, I decided to try a different approach and looked only for matches between heptamer and nonamer conserved sequences and gRNA sequences.)

 

Heptamer and Nonamer Conserved sequences vs gRNAs

 

For each gRNA, I found what could be a heptamer sequence (through a G:U intermediate) and then counted 12 or 23 nucleotides further along the gRNA. The results are presented in no special order below. I did not approach this particular project in a methodical way other than to look at the sequences reported in the figures of Gao et al. (2001). In the gND9 series there were two conserved sequences 8 and 9 bases apart.

These results seem more promising for making a preliminary case for matches between gRNAs and RSSs, as they are mostly derived from the sequence of interest only and hint that gRNAs have matches at a similar distance of 12 or 23 nucleotides between conserved heptamers and nonamers. The two results for the gA6 series are notable in that each match is to the G:U complement directly. Unfortunately, only three sequences in this series were reported. It might be useful to compare these particular gRNA sequences to those of a J gene RSS.
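The search described above was done by eye, but a rough automated version might look like the sketch below. It reduces the gRNA, the heptamer and the nonamer to purine/pyrimidine classes (equivalent to matching through a G:U intermediate on the sequence-of-interest side) and reports heptamer-like positions that have a nonamer-like stretch 12 or 23 nucleotides further along. The function name, the default conserved sequences and the toy gRNA are assumptions made for illustration, not sequences from Gao et al.

```python
RY = {"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"}


def ry(seq):
    """Reduce a sequence to purine (R) / pyrimidine (Y) classes."""
    return "".join(RY[base] for base in seq.upper())


def find_spaced_class_matches(grna, heptamer="CACAGTG",
                              nonamer="ACAAAAACC", spacers=(12, 23)):
    """Return (start, spacer) where a heptamer-class match is followed,
    after 12 or 23 nucleotides, by a nonamer-class match."""
    g, h, n = ry(grna), ry(heptamer), ry(nonamer)
    hits = []
    for i in range(len(g) - len(h) + 1):
        if g[i:i + len(h)] != h:
            continue
        for spacer in spacers:
            j = i + len(h) + spacer
            if g[j:j + len(n)] == n:
                hits.append((i, spacer))
    return hits


if __name__ == "__main__":
    # Hypothetical gRNA-like string built to contain one spaced match.
    toy_grna = "UAUAGUG" + "A" * 12 + "GUGGGGGUU"
    print(find_spaced_class_matches(toy_grna))
```

This version checks one reading direction and the sequence of interest only. Matches to the G:U complement, or in the 3'-5' direction, would need additional passes with the classes swapped or the gRNA reversed.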

One cannot make any correlation between the length of an RSS and a gRNA series using the results below.

 

 

 

Is/Was there a 12-mer embedded in the 23-mer sequence?

 

The paper by Yu et al. (2002) reported a list of RSS sequences that were used by RAG1 and RAG2 proteins as binding sites during recombination of IgH V genes (figure 1). The most frequent one is shown below. In the 23-nucleotide sequence, this author found another nonamer, spaced eleven nucleotides 3' to the heptamer, uncovered by using a G:U complement. One cannot say much about this find other than that it exists.

 

 

Discussion

 

There are many problems with defending all the above results.

The first problem is that creating a subjective "best fit" or best alignment for matches could also be wishful thinking. When making a match, one is essentially substituting a purine for a purine or a pyrimidine for a pyrimidine, and the match can come either from the sequence of interest (white squares) or from the G:U complement (gray squares). Thus, the generated matches (in this case, recombination signals for immunoglobulins or T cell receptors) are composed of purines or pyrimidines that each have a 50:50 chance of being at their individual locations if one approaches the comparison in a purely statistical manner. The only thing this author can say is that the resulting matches that were accepted had to have more than 4 or 5 matching bases in a row. Also, the heptamer sequences, and especially the nonamer sequences, were conserved in the gRNA overlap regions if one used a G:U intermediate.
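The 50:50 argument can be turned into a rough number. The sketch below assumes that every position is independent and that purines and pyrimidines are equally likely, which is almost certainly not true of real gRNAs, so the figures are only an order-of-magnitude guide. The function names and the 50-nucleotide length are hypothetical.

```python
def chance_of_class_match(k, either_strand=True):
    """Chance that k random bases have the same purine/pyrimidine pattern
    as a given k-mer (or, optionally, as its G:U complement)."""
    p = 0.5 ** k
    # The pattern and its complement differ at every position, so the two
    # events are mutually exclusive and their chances simply add.
    return 2 * p if either_strand else p


def expected_hits(seq_len, k, either_strand=True):
    """Expected number of chance matches along a sequence of seq_len bases."""
    windows = max(seq_len - k + 1, 0)
    return windows * chance_of_class_match(k, either_strand)


if __name__ == "__main__":
    for name, k in [("heptamer", 7), ("nonamer", 9)]:
        print(name, chance_of_class_match(k), expected_hits(50, k))
```

Under these assumptions, a given heptamer pattern is expected to turn up by chance at about 1 position in 64, which illustrates why isolated heptamer matches are hard to defend on their own.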

The second problem, and probably the reason no perfect match was found, is that even though there is only one mitochondrion per kinetoplastid, it contains thousands of DNA minicircles which generate thousands of different gRNAs. So, the results above are far from perfect when trying to match thousands of gRNAs of different sizes with the sequences of many T cell receptor and immunoglobulin gene families.

The fact that there is a 12 or 23 nucleotide constraint between heptamers and nonamers in immunoglobulins was very useful in trying to locate candidate gRNAs for comparison.

The third problem was that there are many receptor genes that use recombination signal sequences (TCR alpha, TCR beta, TCR gamma and TCR delta, as well as immunoglobulin kappa and lambda light chains and heavy chains) and only one person (me) making the comparisons. So, not all the gRNAs, immunoglobulins or T cell receptors were screened for matches. There might be perfect matches out there, but this author didn't find them given her limited resources.

Another big problem is that comparisons are only as good as the sequences used. Many of the TCR and immunoglobulin V, D, and J gene sequences from the nucleotide databases are predicted. In fact, this author suspects that they are predicted wherever a search engine finds a conserved heptamer sequence, but I don't know for sure how locating the various genes is or was done. Thus it was hard to validate any of the sequences. Also, V, D, and J gene sequences were renamed over the years, so it was hard to track them. In one instance, the TCRB RSSs from Rowen et al. (1996) were found in a 2003 version of the TCRB locus from NCBI's database, but a 2017 version of the same locus was hard to navigate. Thus most of the sequences used in this Snippets installment came from the Rowen et al. (1996) paper (TCRB sequences, figure 1) and the Yu et al. (2002) paper (IgH V gene sequences, figure 1). RSS sequences for kappa chain and lambda chain components came from the NCBI databases.

There is another question that keeps nagging at me. Mitochondria are generally thought to be the remnants of an organism that got "trapped" in another and formed a symbiotic relationship with its host. Sometimes, sequencing the remaining DNA in the mitochondria can give a clue as to what the original organism was. In the case of kinetoplastids, with their single mitochondrion that contains thousands of minicircles and replicates only once along with its host, it would be difficult, but extremely useful, to find the original symbiont(s). Is there a currently known organism that replicates using catenated minicircles? Is there a colony of single cells that replicate in tandem? And so on.

Another thing: the fact that the kinetoplastid mitochondrial mRNA goes from a smaller size to a larger size is reminiscent of an organism coming out of some defensive sporulation.

If one believes the matches and looks at the results from an evolutionary perspective, was there an original set of sequences that was eventually used by both the immune system and very primitive organisms?

 

 

 


References

Corell, R. et al. (1993), Trypanosoma brucei minicircles encode multiple guide RNAs which can direct editing of extensively overlapping sequences, Nucleic Acids Research, vol 21, pp 4313-4320

Gao, G. et al. (2001), Guide RNAs of the recently isolated LEM 125 strain of Leishmania tarentolae: an unexpected complexity, RNA, vol 7, pp 1335-1347

Horvath, A. et al. (2000), Translation of the Edited mRNA for Cytochrome b in Trypanosome Mitochondria, Science, vol 287, pp 1639-1640

Rowen, L. et al. (1996), The Complete 685-Kilobase DNA Sequence of the Human beta T Cell Receptor Locus, Science, vol 272, pp 1755-1762

Yu, K. et al. (2002), The Cleavage Efficiency of the Human Immunoglobulin Heavy Chain VH Elements by the RAG Complex: Implications for the Immune Repertoire, JBC, vol 277, pp 5040-5046