Computational Biology, GlaxoSmithKline Medicine Research Centre, Gunnels Wood Road, Stevenage, SG1 2NY, UK

Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, UK

Abstract

Background

Microarrays are commonly used to investigate both the therapeutic potential and functional effects of RNA interfering (RNAi) oligonucleotides such as microRNA (miRNA) and small interfering RNA (siRNA). However, the resulting datasets are often challenging to interpret as they include extensive information relating to both indirect transcription effects and off-target interference events.

Method

In an attempt to refine the utility of microarray expression data when evaluating the direct transcriptional affects of an RNAi agent we have developed SBSE (Simple Bayesian Seed Estimate). The key assumption implemented in SBSE is that both direct regulation of transcription by miRNA, and siRNA off-target interference, can be estimated using the differential distribution of an RNAi sequence (seed) motif in a ranked 3' untranslated region (3' UTR) sequence repository. SBSE uses common microarray summary statistics (

Results

SBSE has been evaluated using five diverse human RNAi microarray focused investigations. In each instance SBSE unambiguously identified the most likely location of the direct RNAi effects for each of the differential gene expression profiles.

Conclusion

These analyses indicate that miRNA with conserved seed regions may share minimal biological activity and that SBSE can be used to differentiate siRNAs of similar efficacy but with different off-target signalling potential.

Background

RNA interference (RNAi) is an evolutionary conserved mechanism that has been observed as a key component of many cellular development and differentiation processes

It is generally accepted that miRNA regulate gene expression at the post-transcriptional level via translation arrest and mRNA cleavage in association with the RNA-induced silencing complex (RISC)

Microarray technologies provide an unbiased snap-shot of the cellular transcriptional activity, and they are often employed to investigate both the functional and biological characteristics of miRNA and siRNA in various cell-lines, under varying physiological conditions

To address this issue a variety of computational approaches have been developed. For example, a number of algorithms have been used to computationally predict miRNA targets

**Additional supporting SBSE plots and comparative Sylamer plots**.

Click here for file

In an effort to improve on these current limitations we have engineered an alternative and possibly more sensitive 'seed' estimation method that utilises a Bayesian likelihood approach to estimate the probability that a 'seed' motif is over-represented within a differentially expressed gene profile. Significant enrichment scores are interpreted as evidence of 'direct' RNAi and provide a relative estimate of the magnitude of such activity. SBSE has been evaluated using a number of diverse RNAi microarray datasets, several of which are reported here. Analysis of a miRNA time-study allowed us to visualise the transient nature of miRNA directed events and indicates that SBSE could be used to determine the optimum timing of a post-transfection investigation of the direct miRNA transcription effect. Furthermore, our analyses indicates that miRNA with conserved seed regions may share minimal RNAi activity, and that SBSE can be used to differentiate otherwise equivalent siRNAs via estimates of their respective unique miRNA-like off-target profiles.

Results

How the SBSE algorithm was implemented is summarised in cartoon format (See Figures ^{®}. The described microarray datasets were selected as representatives of the diversity of RNAi investigations that would most likely be encountered in a 'typical' RNAi analysis. The 3'UTR human sequences necessary for estimation of the query (seed) motif enrichment were parsed and repetitive nucleotide motifs masked (available as Additional File

A cartoon representation of the SBSE analysis procedure

**A cartoon representation of the SBSE analysis procedure**. **Panel A: **The top row illustrates how each of the six observations (a-f), represented by coloured blocks, are sequentially evaluated by the algorithm. The bottom row represents the -log(p-value) associated with each hypothesis update. Column (a) represents the first evaluation, column (b) the second evaluation **Panel B: **Each of the six differentially expressed genes, sorted from most up-regulated to most down-regulated (**U**) of the plot summarises the -log of the estimated p-values. The optimum partition of data is indicated by a faint vertical dashed blue line (**i***) emerging from the point of the most significant p-value. The right-most section (**R**) of the plot also summarises the -log of the estimated p-values associated with each hypothesis update. The faint horizontal blue line (**j***) indicates the most significant p-value and indicates those transcripts considered important in our estimate of i*. Both the uppermost and rightmost plots use the same scaled axes and may be used to best partition the data for further focussed analyses. In this theoretical expression profile, the most significant differential distribution of the miRNA seed motif is best estimated using data from the top four transcripts and, by inference, any direct miRNA effect restricted to the transcript represented by column six which is located to the right of i*, the largest enrichment score. Note that the order in which each observation is incorporated into the analysis is dictated by the absolute ranked vector and that for large and normally distributed datasets the main section of the summary plot will form a triangle as the algorithm processes the data from most to least dysregulated transcript.

**Parsed and masked 3'UTR human sequences necessary to complete a SBSE estimate**.

Click here for file

**Processed Affymetrix datasets described in this report**.

Click here for file

Case study 1

E-GEOD-6207 comprised 14 Affymetrix GeneChip^{® }Human Genome U133A Plus 2.0 cel files. In this study

- UAAGGCAC

First the ranked, differential expression profile of each respective time point, relative to the 0 hour array, was queried for enrichment of a "GCCTTA" motif (

Plots summarising the estimated location of the direct target transcripts of hsa-miR-124 16 hours post-transfection

**Plots summarising the estimated location of the direct target transcripts of hsa-miR-124 16 hours post-transfection**. For each plot the x-axis represents the differential expression dataset organised by fold change and is ranked from most up-regulated (left) to most down-regulated (right). The central body of the plot represents how the algorithm traversed the dataset with green vertical lines highlighting a perfect match between the target 3' UTR and the RNAi seed sequence, and blue vertical lines the absence of perfect match. The characteristic triangle emphasises the broadly 'normal' distribution of the dataset (**A **and **B **trace the enrichment score and attempts to locate the most significant partitioning of the data throughout the analysis procedure. Note that the maximum enrichment score is indicated by "**A **and **B **also describe the enrichment score, that is, but in this context summarise how the estimate of the enrichment score fluctuates as sequential data is processed. See methods section for further details. **(Panel A) **The data input was the differential expression profile as determined by the LIMMA statistical model. Note that SBSE estimates that the most significant grouping of hsa-miR-124 direct transcript targets are located to the right of the vertical line and are included amongst approximately 15% of the most down-regulated transcripts. **(Panel B) **The equivalent analysis to that described for panel A, but with the expression profile input shuffled. Note that the previous hsa-miR-124 signature has been abrogated and that there is now an absence of a significant estimate or partitioning of the data.

Analysing the data as described gave enrichment profiles that indicated significant enrichment of the query motif in isolation (

Composite profile plots

**Composite profile plots**. To contextualise how the top scoring hsa-miR-124 hexameric seed query (^{^6}) unique hexameric seed queries. The resulting information is then intuitively represented by composite plots of the, previously described, enrichment score. The composite plots succinctly summarise the estimated enrichment scoring of all 4096 unique hexameric (seed) queries. Each of the four **Panels****A-D** represent the analysis of a separate time-point following hsa-miR-124 transfection. In each instance the x-axis represents the ranked transcripts (**Panels****B** and **C**. Furthermore, in **Panels B** and **C** the maximum enrichment score is the query sequence. The selected data clearly summarises how the hsa-miR-124 motif gains in prominence in each of the post-transfection samples and becomes undetectable if the differential expression profile is shuffled (**Panel D**). In particular, note how the overall enrichment score of the datasets fluctuates post-transfection. Initially (**Panel A**) all data forms a homogeneous body with no enrichment score above 40. 16 hours post-transfection (**Panel B**) a large number of up-regulated AT-rich transcripts are obvious (indicated by the blue arrow). After 24 hour post-transfection (**Panel C**) this collection of up-regulated transcripts are no longer apparent.

Another feature of the data was that of significant fluctuations in the observed enrichment scores of a large number of AT-rich motifs (indicated with a blue arrow in Figure

The differential expression profiles of each respective time point was queried with a variety of motifs that encompassed the 5' hsa-mir124 seed region. From the resulting plots it was observed that the hexamer GCCTTA generated the maximum enrichment score of 320 and that the 24 hour post-transfection expression profile was the most enriched for the complement of the hsa-miR-124 seed motif (Figure

This dataset was also used to investigate the effect of binning an expression dataset. A wide range of bin sizes (

Case study 2

Six Affymetrix GeneChip^{® }Human Genome U133 Plus 2.0 cel files were retrieved from database entry E-MEXP-875. This dataset was generated to investigate the effects of FAM33A RNAi knockdown on the gene expression profile in a lung carcinoma cell line

- UGUACA

- AAAUCA

Composite plots summarising the enrichment scores of all 4096 unique hexamer nucleotide queries indicated that enrichment peaks were associated with the down-regulated transcripts of both siRNA differential expression profiles (Figures

Comparison of two FAM33A siRNA transfection studies

**Comparison of two FAM33A siRNA transfection studies**. The first column (**Panels A** and **C**) represent respective composite plots summarising the enrichment scoring of all possible hexameric queries given each of the FAM33A siRNA differential expression profiles (See panel headers for specifics). As before, the highest scoring hexamer is coloured turquoise while specific query motifs are coloured red. The x-axes again represent the bin number. The second column (**Panels B** and **D**) summarises the analysis of each differential expression profile with the highest scoring motifs, of the two FAM33A siRNAs, both of which were identified using the respective composite plots (**Panel B** and 0-35 in **Panel D**)

When each highest scoring hexamer was further investigated it was readily apparent from the respective graphical summaries that the FAM33A_2

Case study 3

E-MEXP-456 consists of six Affymetrix GeneChip^{® }Human Genome U133 Plus 2.0 cel files. In this investigation the effect of an siRNA knock-down (

- UGUAAA

A potential drawback to this dataset is that few of the transcripts are significantly dysregulated at this time point. Hierarchical clustering does not clearly differentiate between control and treatment samples, while a volcano plot reports that only 70 transcripts demonstrate a >1.5-fold change in expression with an associated p-value of <0.05 (Additional File

siRNA knock-down of miR-30a-3p

**siRNA knock-down of miR-30a-3p**. (**Panel A**) A composite plot summarising the enrichment scoring of all possible hexameric queries given the associated hsa-miR-30a-3p differential expression profile. The complement of the hsa-miR-30a-3p seed motif (TTTACA) is highlighted in red. All other higher scoring hexamer profiles represent uncharacterised AT-rich hexamers. Also note the enrichment scores of greater than 100 with a number of AT-rich motifs among the up-regulated transcripts located on the left of the plot. **(Panel B) **This plot summarises the enrichment profile of the highest scoring hexamer (highlighted in turquoise in the composite plot of **Panel A**) following hsa-mir-30a-3p transfection. SBSE analysis estimates this motif (TAATTT) to be enriched in the up-regulated transcripts and the maximum enrichment score is therefore located to the left of the plot.

Case study 4

Six Affymetrix GeneChip^{® }Human Genome U133 Plus 2.0 cel files were retrieved from database entry E-GEOD-16097. This dataset included 3 replicates of a control transfection and 3 replicates of an siRNA transfection designed to knock-down the human BAHD1 transcript in HEK293 cells

- GAACUA

A composite plot summarising the profile estimates of all 4096 unique nucleotide hexamers found no evidence of significant BAHD1_2 or BAHD1_3 complement siRNA seed motifs using nucleotide queries that encompassed both the sense and anti-sense siRNA strands (not shown). However, a modest but most significant enrichment score was observed with the BAHD1_1 siRNA motif GAACTA (Figure

Knock-down of BAHD1 with an siRNA cocktail

**Knock-down of BAHD1 with an siRNA cocktail**. (**Panel A**) Composite plot summarising the enrichment scoring of all possible hexameric queries given the BAHD1 siRNA transfected differential expression profile (as described in the main text). As before, the query sequence is indicated by a red line and in this instance is also observed as the highest scoring hexamer. The x- and y-axes represent bin number and enrichment score, respectively. The most significant score that can be attributed to any of the siRNA transfection pool is GAACTA which is a complementary match of the 5' end of the negative strand of siRNA BAHD1_1 and indicative of off-target interference by the this strand. **(Panel B) **This plot summarises the enrichment profile of the highest scoring hexamer (as highlighted in red in the composite plot of **Panel A**) within the differential expression profile of the BAHD1 siRNA transfection dataset. SBSE analysis estimates the GAACTA motif to be enriched in the down-regulated and by inference is acting in a miRNA-like repressive manner. Both

Case study 5

Previous case studies indicated that our Bayesian estimate scores in combination with composite profiles can be used to differentiate siRNA of the same target specificity by comparison of their respective predicted off-target profiles. An extension of this observation is that transfections comparing reported miRNA orthologues should, in principle, produce similar differential expression profiles (^{® }Human Genome U133 Plus 2.0 cel files, four of which were control replicates (pCDNA3.1), four transfected with hsa-miR-155 and four samples transfected with the KSHV-miR-k12-11 miRNA, a proposed orthologue of hsa-miR-155

- UUAAUGC

Comparing miRNA orthologues

**Comparing miRNA orthologues**. Composite plots summarising the enrichment scoring of all possible hexameric queries given the respective hsa-miR-155 (**Panel A**) and KSHV-miR-k12-11 (**Panel B**) differential expression profiles. In each instance the highest scoring hexamer is highlighted in turquoise while the highest scoring seed hexamer is highlighted in red. Note how the hsa-miR-155 profile is dominated by AT-rich transcripts and that there is minimal enrichment of the miRNA target motif. The paucity of significant motifs in the KSHV-miR-k12-11 plot highlight that few transcripts are differentially expressed relative to the control dataset. Also note that the seed region is conserved in both miRNAs.

When equivalent motif queries derived from the KSHV-miR-k12-11 seed region (

- UUAAUGC

Discussion

The key assumption implemented within SBSE is that both 'direct' miRNA down-regulatory events and siRNA off-target interference can be accurately assessed via estimates of 'seed' motif enrichment in a ranked sequence population. Enrichment estimates are calculated using common microarray summary statistics and a weighted Bayesian analysis of the ranked sequence space. Each estimate is presented as a simple, but intuitive graphical summary to facilitate an understanding of how the RNAi event under investigation may have dictated the observed differential expression profile. The approach is particularly attractive in that it requires minimal assumptions about either the method of inhibition, or the characteristics of the transcript targets (

Our analyses indicate that a SBSE approach can be used to infer the optimum timing, magnitude and likely location of 'direct' RNAi events. Analysis of a hsa-miR-124 post-transfection time-series

Of particular interest to our group is the ability to better understand and minimise siRNA off-target effects, as such events may either limit the utility of a siRNA being used as a therapeutic agent or compromise interpretation of functional knock-down studies. Current opinion is that such undesired responses are the result of innate immune responses

Conclusion

Common microarray summary statistics combined with a simple Bayesian analysis have proven sufficient to estimate the magnitude of the direct RNAi transcription effect. Our analyses indicate that SBSE can be used to infer the optimum timing, magnitude and likely location of 'direct' RNAi events, and is sufficiently sensitive to differentiate siRNAs of similar efficacy but with different off-target signalling potential.

Methods

The SBSE algorithm was implemented as an R script ^{® }Linux and Microsoft Windows XP.

**SBSE R scripts and README.txt required to execute a SBSE estimate**.

Click here for file

Concept

Assume that as part of a miRNA functional characterisation effort we have determined the, relative to control, fold-changes of six gene transcripts. These respective expression values are used to generate a simple ordered (

A second list is derived from the ordered fold-change data, but in this instance the data indicates the absolute ranking of each differentially expressed transcript and for our illustration takes the form

On completion of the analysis the estimated probabilities are plotted alongside a simple graphical representation (See Figure

The following assumptions are made with regard to the dataset: (1) that the 3'UTRs represented the full length transcript, and (2) that only one query (seed) match per 3'UTR was of relevance to the transcript repression mechanism (3) that RNAi transcript targets are down-regulated post-transfection.

Algorithm details

Let "D" denote our ordered binary list of length N (the total number of gene transcripts). In this list a one corresponds to the ^{th }gene's UTR having a seed sequence match, while a zero corresponds to the ^{th }gene's UTR not having a seed sequence match. Now let "A" denote our absolute data list, also of length N. Initially consider the top j transcripts of list A. This value is used to extract and partition the

Now let _{i, j }
_{1...j }(the "left") and D_{j+1...N }(the "right") can be best explained by the distribution of the miRNA seed motif on either side of this division. Let

Given the above definitions we define an updating mechanism that allows the complete dataset to be traversed and our hypothesis to be incrementally evaluated. First, an initial estimate is assigned to the hypothesis _{
i,j-1}, that is, the probability that our hypothesis is correct given that we have observed D_{
j-1 }(

1) transcripts. This probability is updated for each subsequent A[

Note that P{H_{0}
_{
1...j-1}} = 1- P{H_{
i, j-1}|D_{
1...j-1}} is the probability that the differential expression at this division cannot be explained by the observed distribution of the miRNA seed motif (

Should the next most differentially expressed transcript of encode a miRNA seed motif then a logical assumption is that the probability of the next transcript falling to the left of the division is the current ratio of seed sequences matches to the left of the division to the total number of seed sequences matches.

Further, the probability of the next transcript falling to the right of the division is the current ratio of seed sequences matches to the right of the division to the total number of seed sequences matches observed.

However, if the next transcript does not have a seed sequence match, then the probability of the next transcript falling to the left of the division is the current ratio of transcripts without a seed sequence match to the left of the division to the total number of transcripts without a seed sequence match observed.

By extension, the probability of this transcript falling to the right of the division is the current ratio of transcripts without a seed sequence match to the right of the division to the total number of transcripts without a seed sequence match observed.

Under our null hypothesis all of the observed differential expression is assumed to be independent of the miRNA seed motif distribution. Hence the probability that the next transcript falls to the left of the division is the ratio of transcripts to the left of the division to the total number of transcripts.

Likewise, under the null hypothesis the probability that the next transcript falls to the right of the division is quite simply the ratio of transcripts to the right of the division to the total number of transcripts.

The P{H_{
i, 0
}} is given an initial value of 0.5 and equation 1.1 updated until the dataset has been traversed. On completion the highest value of _{* }genes at division _{*}. The estimated probabilities, P{H_{
i, j
}}, for each D[

Miniscule values

Our Bayes formula while mathematically correct, may prove problematic as both P{H_{i, j}
_{1..j
}} and P{H_{0}
_{1...j
}} become miniscule for large datasets. It is therefore pragmatic to work on a logarithmic scale, that is:

And likewise,

Binning

The algorithm as defined requires N(N-1) iterations of formula 1.1 to complete an analysis. Given that a typical microarrays dataset summarises the expression data of several thousand genes, it is a valuable option that the dimension of the dataset be reduced to enable a more rapid execution of the analysis. One proven approach is to group the ordered gene list into M bins and apply equation seven for M(M-1) iterations. Under this scenario most of the utilised formulae remain unmodified. However, the functions used to estimate P{D_{1...j
}|H_{
i, j-1}} and P{D_{1...j
}| H_{
0
}} must be updated to accommodate this additional option. This can be achieved as follows. Let _{j}
_{
i, j-1}, the probability that we will observe _{1...j
}|H_{
i;j-1}} = p^{
x
}q^{
y
}, where

If the bin falls to the right of the division, then P{D_{1...j
}|H_{
i;j-1}} = p^{x}
^{y}

Under the null hypothesis P{D_{1...j
}|H_{0}} = p^{
x+y
}, where

should the bin fall to the left of the division and

should the bins fall to the right of the division.

Composite hexamer plot

The summary plot of the query (seed) distribution (See Figure ^{6}) unique hexamer nucleotides and plot each of the resulting estimates on a composite graphical representation. Such plots allow a simple and succinct graphical representation of how our estimate of a given hexameric nucleotide query motif compares relative to all other hexameric sequences (see Figure

Datasets

To develop and validate SBSE public microarrays datasets were retrieved from the EBI's ArrayExpress

Brief summaries of selected case studies used in the development and evaluation of SBSE are as follows:

(1) The E-GEOD-6207 dataset is comprised of 14 Affymetrix GeneChip^{® }Human Genome U133A Plus 2.0 cel files. In this study hsa-miR-124 was over expressed in HepG cells and RNAs extracted at time points 0, 4, 8, 16, 24, 32, 72 and 120 h post-transfection

(2) Six Affymetrix GeneChip^{® }Human Genome U133 Plus 2.0 cel files were retrieved from E-MEXP-875. This dataset was originally generated to investigate the effects of FAM33A RNAi knockdown on the gene expression profile of a lung carcinoma cell line

(3) The E-MEXP-456 dataset consists of six Affymetrix GeneChip^{® }Human Genome U133 Plus 2.0 cel files. In this investigation the effect of siRNA duplex knock-down of the human miR-30a-3p miRNA precursor was evaluated in HepG2 cells in an attempt to identify hsa-miR-30a-3p target transcripts

(4) The dataset E-GEOD-16097 is comprised of six Human Genome U133Plus 2.0 cel files

(5) The E-GEOD-9264 dataset is comprised of 12 Affymetrix GeneChip^{® }Human Genome U133 Plus 2.0 cel files. Four of these were control replicates (pCDNA3.1), four samples transfected with hsa-miR-155 and four samples transfected with the KSHV-miR-K12-11 miRNA, a proposed ortholog of hsa-miR-155

Abbreviations

(RNAi): RNA interfering; (miRNA): microRNA; (siRNA): small interfering RNA; (SBSE ): Simple Bayesian Seed Estimate; (3' UTR): 3' untranslated region

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PW conceived the project. MP implemented the algorithm. MP and PW performed the analyses. PW wrote the paper. All authors read and approved the final manuscript.

Acknowledgements

This work was funded by the GlaxoSmithKline Medicine Research Centre. We would like to thank Anton Enright for the inspiration and useful discussion that helped us implement our method and to Dr. Philippe Sanseau for his advice and comment while preparing the manuscript.