
Conceived and designed the experiments: HB. Analyzed the data: HB TG. Contributed reagents/materials/analysis tools: MF TG. Wrote the paper: HB TG.

The authors have declared that no competing interests exist.

Single nucleotide polymorphism (SNP) arrays are important tools widely used for genotyping and copy number estimation. This technology utilizes the specific affinity of fragmented DNA for binding to surface-attached oligonucleotide DNA probes. We analyze the variability of the probe signals of Affymetrix GeneChip SNP arrays as a function of the probe sequence to identify relevant sequence motifs which potentially cause systematic biases of genotyping and copy number estimates.

The probe design of GeneChip SNP arrays enables us to disentangle different sources of intensity modulations such as the number of mismatches per duplex, matched and mismatched base pairings including nearest and next-nearest neighbors and their position along the probe sequence. The effect of probe sequence was estimated in terms of triple-motifs with central matches and mismatches which include all 256 combinations of possible base pairings. The probe/target interactions on the chip can be decomposed into nearest neighbor contributions which correlate well with free energy terms of DNA/DNA-interactions in solution. The effect of mismatches is about twice as large as that of canonical pairings. Runs of guanines (G) and the particular type of mismatched pairings formed in cross-allelic probe/target duplexes constitute sources of systematic biases of the probe signals with consequences for genotyping and copy number estimates. The poly-G effect seems to be related to the crowded arrangement of probes which facilitates complex formation of neighboring probes with at minimum three adjacent G's in their sequence.

The applied method of “triple-averaging” represents a model-free approach to estimate the mean intensity contributions of different sequence motifs which can be applied in calibration algorithms to correct signal values for sequence effects. Rules for appropriate sequence corrections are suggested.

Genomic alterations are believed to be the major underlying cause of common diseases such as cancer

The microarray technology utilizes the specific affinity of fragmented DNA to form duplexes with surface-attached oligonucleotide probes of complementary sequence and subsequent optical detection of bound fragments using fluorescent markers. The measured raw probe intensities are subject to large variability, and depend not only on the abundance of allelic target sequences, but also on other factors such as the sequence dependent probe binding affinity. The successful correction of raw probe signals for such parasitic effects is essential to obtain exact genotyping estimates. It requires identification and understanding of the main sources of signal variation on the arrays.

The main purpose of this paper is to analyze the variability of probe signals of Affymetrix GeneChip SNP arrays as a function of the probe sequence and to identify relevant sequence motifs which significantly modulate the probe signals. Such sequence motifs constitute potential building blocks for improved calibration methods which aim at correcting probe signals for sequence effects.

The discovery of characteristic sequence motifs using SNP arrays is also important in a more general context: DNA/DNA duplex formation is the basic molecular mechanism of functioning not only of SNP arrays but also of other array types such as re-sequencing

The presented analysis takes special advantage of the probe design used on GeneChip SNP arrays. Particularly, this technology uses 25meric oligonucleotide probes corresponding to a perfect match for each of the two allele sequences. In addition, a mismatch probe is synthesized for each allele to detect non-specific binding. Combination of this information with the target composition of fractionated genomic DNA used for hybridization on the arrays enables us to deduce the base pairings in the probe/target complexes producing a particular probe intensity. Making use of the hundreds of thousands of signal values per SNP array allows us to extract specific intensity contributions of selected short sequence motifs of two to four adjacent nucleotides via appropriate averaging. The obtained motif-specific intensity contributions characterize the stability of the involved base pairings which include all relevant combinations of canonical Watson-Crick and mismatched pairings. Finally, the systematic analysis of different sequence motifs such as triples of adjacent bases allows us to identify those which account for significant signal variations.

We previously performed an analogous chip study using intensity data of expression arrays to characterize base pair interactions in DNA/RNA hybrid duplexes

The paper is laid out as follows: Section 2 sets out the method and, particularly, explains the classification criteria used to assign the probe intensities to different interaction modes. In Section 3, we analyze different factors which affect the probe intensities such as the number of mismatches, the optical and non-specific background, signal contributions due to different sequence motifs such as different base triples, single and tandem mismatches and their positional dependence along the sequence. In addition, we assess symmetry relations of the motifs and their decomposition into nearest neighbor terms, and compare the results with thermodynamic nearest neighbor parameters characterizing DNA/DNA interactions in solution. In Section 4 we discuss the stability of different mismatches and the possible origin of the “poly-G” effect. Finally, we suggest rules for selecting appropriate sequence motifs to adequately correct the probe signals for sequence effects, which might serve as the basic ingredient of improved calibration methods.

SNP arrays are designed to determine the genotype and copy number of hundreds of thousands of bi-allelic single nucleotide polymorphism (SNP) loci in one measurement. Let us specify each SNP by the alternative nucleotides in the sense DNA-strand of allele A and allele B using the convention B_{A}/B_{B}, where B_{A}/B_{B}∈{A/C, A/G, A/T, C/G, C/T, G/T} stands for one of six SNP types considered on GeneChip SNP microarrays. These SNP types are either complementary (cSNP: A/T, C/G) for substitutions of complementary nucleotides or non-complementary (ncSNP) otherwise.

On Affymetrix 100k GeneChips, each allele is interrogated by ten perfect match (PM)-probes, the 25meric sequence of which perfectly matches the genomic target-sequence at the selected SNP position (see

(a) Each SNP (for example [C/A]) is probed by 25meric probes of complementary sequence. Different offsets δ of the SNP position relative to the middle base (mb) of the probe sequence are used. In addition, each PM probe is paired with one MM probe the middle base of which mismatches the target sequence (not shown). (b) The allele-specific probes intend to detect the respective targets via allele-specific binding which however competes with cross-allelic hybridization of targets of the alternative allele (see also the reaction equation Eq. (6)). (c) Both hybridization modes give rise to four different types of probe/target duplexes formed by the two allele-specific probes. The figure shows the respective base pairings for a selected SNP-triple which consists of the SNP [C/T] and its nearest neighbors. Mismatched non-canonical pairings are indicated by crosses. (d) Each box includes one probe-quartet which consists of two PM/MM-probe pairs interrogating either targets of allele G = A or targets of allele G' = B and vice versa (i.e. G = B and G' = A). Only targets of one allele are assumed to be present in the sample. They hybridize to the probes of both allele sets forming either specific or cross-allelic duplexes, respectively. The three selected probe quartets differ in the offset δ of the SNP position (see arrows and part a of the figure) relative to the middle base of the probe. The different combinations give rise to different numbers and positions of mismatched pairings which are indicated by the bulges. Their number varies between #mm = 0 and #mm = 2 depending on the probe type, hybridization mode and offset position. Complete probe-sets use 10 probe quartets.

Each PM-probe is paired with one mismatch (MM)-probe of identical sequence except the middle base which intends to estimate the contribution of non-specific background hybridization to the respective PM-probe intensity. Note that the mismatched pairing noticeably reduces specific binding of the respective target to the MM probes compared with the respective PM-probe. The middle base is substituted by its Watson-Crick complement as standard (for example A↔T) except for the probes interrogating cSNPs with offset δ = 0, i.e. in the middle of the probe sequences. The non-complementary replacements A↔G and T↔C are realized in this special case to avoid inter-allelic specific binding to the MM (see below).
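The substitution rule for the middle base of the MM probes can be sketched as a small lookup. This is a minimal illustration only: the function names and the string encoding of the SNP type (e.g. "A/T") are assumptions introduced here and are not part of the array annotation.

```python
# Watson-Crick complement, used as the standard MM substitution.
WC_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Non-complementary replacements (A<->G, T<->C) stated in the text
# for complementary SNPs interrogated at offset delta = 0.
NON_COMPLEMENTARY_SWAP = {"A": "G", "G": "A", "T": "C", "C": "T"}
# Complementary SNP types (cSNP) as defined in the text.
COMPLEMENTARY_SNPS = {frozenset("AT"), frozenset("CG")}


def is_complementary_snp(snp_type: str) -> bool:
    """Return True for cSNPs (A/T, C/G); snp_type is a string such as 'A/C'."""
    allele_a, allele_b = snp_type.split("/")
    return frozenset((allele_a, allele_b)) in COMPLEMENTARY_SNPS


def mm_middle_base(pm_middle_base: str, snp_type: str, delta: int) -> str:
    """Middle base of the MM probe derived from its PM partner (hypothetical helper)."""
    if is_complementary_snp(snp_type) and delta == 0:
        return NON_COMPLEMENTARY_SWAP[pm_middle_base]
    return WC_COMPLEMENT[pm_middle_base]
```

For example, `mm_middle_base("A", "A/T", 0)` gives "G", whereas the same middle base in a probe for an A/C SNP gives the complement "T".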

Taken together, each SNP is probed by a set of 20 PM/MM probe pairs. These 40 probes in total split into two sub-sets of 10 probe pairs, one for each allele, which we will term ‘allele-sets’. Each allele-set consists of probes with the SNP interrogation position placed at the sense and antisense strands, and with the 25meric probe sequence moved up and down the target sequence with respect to the SNP locus by different offsets to improve the accuracy of genotyping and copy number estimates.

Both allele sets use the same offset positions. Therefore each particular offset, δ, is probed by one probe pair for each allele. These four probes (i.e. two PM/MM-pairs) addressing each offset position make up the so-called probe-quartet referring to the same 25-meric segment of the target genome (see

SNP microarrays are hybridized with fragmented genomic DNA representing the targets for the probes attached on the chip surface. Let us consider one SNP locus of a heterozygous genotype: The hybridization solution of genomic DNA consequently contains targets of both alleles A and B. The hybridization reactions can be described by three coupled equations for each probe,

In the allele-specific hybridization mode (called S-mode) the probes bind the target which they intend to detect via duplex formation of the type P-A•A and P-B•B, respectively. In the cross-allelic hybridization mode (C-mode) the probes bind targets of the alternative allele in duplexes of the type P-A•B and P-B•A, respectively. The considered probes also bind non-specific genomic fragments not referring to the selected SNP. Such non-specific duplexes are of the type P-A•N and P-B•N where N subsumes all non-specific target sequences with non-zero affinity to the selected probe.

In the S-mode the PM probes completely match the target sequence whereas in the C-mode the PM-sequence mismatches the target at the SNP position. The respective MM probes mismatch the target either only at the middle position (S-mode) or at both the middle and the SNP position (C-mode). The respective base pairings are specified below.

The measured intensity of each probe represents the superposition of contributions originating from the three hybridization modes, and from the optical background caused by the dark signal of the scanner and by residual fluorescent markers not attached to target-fragments,^{PM,N}≈I^{MM,N}). We combine both contributions into one mean background intensity

Three types of targets compete for duplex formation with each probe in the general case considered in Eq. (1). In the special case of homozygous genotypes only targets of one allele are present in the hybridization solution. As a consequence, the types of competing targets per probe reduce to two, namely non-specific and either allele-specific or cross-allelic targets. Particularly, the probes targeting the present allele hybridize specifically (homozygous-present probes) whereas the probes interrogating the alternative allele hybridize in the cross-allelic mode (homozygous-absent probes), i.e. I^{P,C} = 0 and I^{P,S} = 0, respectively (see

In this section we specify the base pairings formed in the probe/target duplexes at two selected sequence positions, namely that of the SNP- and that of the middle-base of the probe sequence. The SNP position is shifted by the offset δ with respect to the middle base. SNP- and middle-base are consequently identical for δ = 0.

In the specific hybridization mode the PM probes perfectly match the respective target-allele forming Watson-Crick (WC) pairings along the whole probe sequence including the two selected positions (

The assignment of the specific and cross-allelic hybridization modes to the six probed bi-allelic SNP types B_{A}/B_{B} (see above) and the two probe types (P = PM, MM) provides the full set of 16 possible base pairings in the probe/target duplexes at their SNP- and/or middle-position (see

The number and the type of the mismatches are not specified in probe/target duplexes formed in the non-specific hybridization mode. Nevertheless, the sequence effect can be described in terms of the properties of canonical WC pairings

As discussed in the previous subsections, the probe/target duplexes are characterized by the hybridization mode (h = S, C, N) and a series of probe attributes: probe-type (P = PM, MM), probe sequence and middle base (B_{13} = A,T,G,C), strand direction (d = s, as), SNP type (B_{A}/B_{B}), and SNP offset (δ = −4,…,0,…,+4). Each particular combination of the hybridization mode with a set of probe attributes unambiguously determines the interaction mode between probe and target. It is characterized by

the base pairing at the SNP position and at the middle position, which includes all 16 pairwise combinations of nucleotides, 4 of which form WC pairings and 12 of which are mismatches;

Watson-Crick pairings at the remaining positions of the probe sequence;

the mutual shift between the middle and the SNP base by up to four bases in both directions (δ);

different numbers of mismatches per duplex varying between #mm = 0 (for P = PM and h = S) and #mm = 2 (P = MM and h = C, only δ≠0);

different relative positions of paired mismatches (#mm = 2) which are either separated by at least two WC pairings (|δ|>1) or form tandem-mismatches (|δ| = 1).
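The bookkeeping of mismatches listed above can be sketched as follows. This is a minimal illustration: the function name and the string codes ("PM"/"MM", "S"/"C") are assumptions made for the example, and the non-specific mode is excluded because its mismatches are not specified. The case of MM probes in the C-mode at δ = 0 is counted as one mismatch, following the remark that both mismatches merge into one for δ = 0.

```python
def count_mismatches(probe_type: str, mode: str, delta: int):
    """Return (#mm, is_tandem) for a probe/target duplex.

    probe_type: "PM" or "MM"; mode: "S" (allele-specific) or "C" (cross-allelic);
    delta: offset of the SNP position relative to the middle base.
    """
    if probe_type not in ("PM", "MM") or mode not in ("S", "C"):
        raise ValueError("unsupported probe type or hybridization mode")
    mismatch_at_middle = probe_type == "MM"  # MM probes mismatch at the middle base
    mismatch_at_snp = mode == "C"            # cross-allelic targets mismatch at the SNP base
    if mismatch_at_middle and mismatch_at_snp:
        # Both positions coincide for delta = 0 (assumption: counted as one mismatch).
        n_mm = 1 if delta == 0 else 2
    else:
        n_mm = int(mismatch_at_middle) + int(mismatch_at_snp)
    is_tandem = n_mm == 2 and abs(delta) == 1
    return n_mm, is_tandem
```

For instance, `count_mismatches("MM", "C", 1)` returns `(2, True)`, i.e. a tandem mismatch, whereas `count_mismatches("PM", "S", 0)` returns `(0, False)`.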

The design of SNP GeneChips thus enables us to study how these interaction modes affect the probe intensities in a systematic way. Vice versa, the probe intensities are related to the amount of bound DNA-targets which, in turn, depends on the stability of the duplexes and thus on the binding constant of the respective interaction mode. Knowledge of the binding constant and of the interaction mode then allows us to compute the genotype call and copy number of a given SNP.

Intensity data of the 100k GeneChip SNP array and supplementary files were downloaded from the supplier's website (

The data are further filtered to remove probe intensities which are dominated by non-specific hybridization, i.e. those with x^{P,N}>0.2 (Eq. (5)). These selection criteria are chosen from the hook plot of the chip data which is briefly described in the supporting text (see

The intensity data are corrected for the optical background intensity and for residual non-specific hybridization before further analysis as described in

We previously used the so-called ‘triple-averaging’ approach to estimate the effective strength of base pairings in probe/target duplexes on GeneChip expression arrays

Let us define the standard triple as the string of three consecutive bases (xBy) in 5′→3′-direction of the probe sequence (x,B,y∈A,T,G,C) where the nearest neighbors (x, y) of the central base B form Watson-Crick pairs in the duplexes with the targets. The position of the triples along the probe sequence was chosen in such a way that its central base (B) agrees either with the middle base (mb) or with the SNP base (see

So-called triple averages of the intensity are calculated as log-mean over all probes within the classes defined by the interaction group of the central base (Ab = At, Aa, Ag or Ac), by the triple motif xBy at offset position (δ = −4,…,0,…, +4) and by the number of mismatches per duplex (#mm = 0, 1 or 2)

The triple sensitivities are defined as the deviation of the triple-averaged intensity from an appropriately chosen mean value over all triples (see below and
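The averaging and the subsequent calculation of sensitivities can be sketched as below. This is a minimal illustration under stated assumptions: the record layout (dictionaries with the keys `group`, `triple`, `nmm` and `intensity`) is hypothetical, intensities are taken as already background-corrected, and the default reference (the unweighted mean of the triple averages within each interaction group and mismatch number) is only one possible choice of the "appropriately chosen mean value". The offset position can be added to the class key in the same way.

```python
import math
from collections import defaultdict


def triple_averages(probes):
    """Log10-mean intensity per class (interaction group, triple motif, #mm)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for probe in probes:
        key = (probe["group"], probe["triple"], probe["nmm"])
        sums[key] += math.log10(probe["intensity"])
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def triple_sensitivities(averages, reference=None):
    """Deviation of each triple average from a reference mean.

    If no reference is given, the unweighted mean over all triples of the
    same (group, #mm) class is used (an assumption of this sketch).
    """
    by_class = defaultdict(list)
    for (group, _triple, nmm), value in averages.items():
        by_class[(group, nmm)].append(value)
    sensitivities = {}
    for (group, triple, nmm), value in averages.items():
        if reference is None:
            values = by_class[(group, nmm)]
            ref = sum(values) / len(values)
        else:
            ref = reference
        sensitivities[(group, triple, nmm)] = value - ref
    return sensitivities
```

Passing an explicit `reference`, for example the total log-average of all single-mismatched probes, reproduces the alternative normalization in which all groups share one common baseline.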

Special selection criteria for triples with one flanking mismatch and of tandem mismatches are given in the scheme shown in

The specific and cross-allelic hybridization modes include perfect matched and mismatched probe/target duplexes with up to two mismatched pairings at the SNP- and/or mb-position (see

Averaged log-intensities for probes of different mismatch-groups and offset-positions (panel a) and mean effect of the number of mismatches (#mm) on the observed intensity (panel b). Panel a: Mean probe intensity (averaged over all probes with a given SNP offset, see the arrow in the schematic drawing in the right part for illustration) as a function of the offset-position of the mismatch with respect to the middle base (δ) for different numbers of mismatches per probe/target duplex (#mm = 0…2). No significant effect of the offset-position was observed for single mismatches within the relevant range |δ|<5. In contrast, the mean intensity decreases with increasing separation between double mismatches (#mm = 2) where one is located in the centre of the probe (middle base, mb) and the second one at offset position δ. Note that both mismatches merge into one for δ = 0. The homozygous-absent data (P-G'•G) were separately calculated for the three groups of mismatches, Aa, Ac and Ag: The respective curves are almost identical. Panel b: Relative decrease of the mean probe intensity as a function of #mm (symbols). The curves are calculated using Eqs. (9) and (10). The data are shown in logarithmic (left axis, upper data) and linear (right axis) scale without (open symbols) and with (solid symbols) background correction.

The SNP base of each probe forms a WC pairing in P-G•G duplexes (hp-mode). The respective averaged intensities per SNP position are consequently pseudo-replicates of different sub-ensembles of probes referring to the same interaction mode, namely perfectly-matched (PM-G•G) or single-mismatched (MM-G•G) probe/target duplexes (see the schematic drawings in panel a of

In P-G'•G duplexes (allele absent/ha-mode) the SNP base forms a mismatched pairing. The averaged intensities consequently refer to the shift of the mismatch relative to the middle base. For the PM probes (PM-G'•G) the position of the respective single mismatch only weakly affects the mean intensity in the relevant range of SNP offsets (panel a of

In contrast, the MM-probes form two mismatches in the homozygous-absent mode (MM-G'•G) at the SNP- (for |δ|>0) and at the middle position. Both mismatches are separated by (|δ|−1) WC pairings in-between. The observed mean intensity decreases with increasing distance between the mismatches (panel a of

The presented results show that the number of mismatched pairings per duplex (#mm) is the most relevant factor which affects the mean intensity of the probes (see the horizontal lines in _{duplex}(#mm) denotes the respective mean association constant of probe/target duplexes with #mm mismatches; x is the fraction of WC pairings in the duplex and γ is a fit-constant depending on the hybridization conditions.

An alternative, simple “mismatch”-function results from the assumption of additive contributions of each base pairing, _{duplex} is related to the free energy of duplex stability and δε is its mean incremental penalty (in units of logK_{duplex}) if one substitutes one WC pairing by a mismatch. This approach predicts an exponential decay of the intensity as a function of the number of mismatches, _{duplex}(0)/25, which has the meaning of a mean additive contribution of one WC pairing to log K_{duplex}(0). Panel b of
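The additive “mismatch” model can be illustrated with a short sketch. It assumes, as a simplification not taken from the equations of the paper, that log K_{duplex}(#mm) = log K_{duplex}(0) − #mm·δε and that the background-corrected intensity is proportional to K_{duplex}, so that the intensity decays as 10^{−#mm·δε}. The function names and the least-squares fit are illustrative choices.

```python
def relative_intensity(n_mm: int, delta_eps: float) -> float:
    """Intensity relative to the perfect match, I(#mm)/I(0) = 10**(-#mm * delta_eps)."""
    return 10.0 ** (-n_mm * delta_eps)


def fit_delta_eps(n_mm_values, log_intensities) -> float:
    """Estimate the mean penalty per mismatch (in log10 units).

    Ordinary least-squares slope of log10(I) versus #mm; the penalty is the
    negative slope. Inputs are sequences of equal length (at least two
    distinct #mm values are required).
    """
    n = len(n_mm_values)
    mean_x = sum(n_mm_values) / n
    mean_y = sum(log_intensities) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(n_mm_values, log_intensities))
    var = sum((x - mean_x) ** 2 for x in n_mm_values)
    if var == 0:
        raise ValueError("need at least two distinct mismatch numbers")
    return -cov / var
```

With made-up mean log-intensities of 3.0, 2.5 and 2.0 for #mm = 0, 1 and 2, `fit_delta_eps` returns a penalty of 0.5 per mismatch, i.e. each additional mismatch reduces the intensity by a factor of about 0.32 under this model.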

The PM probes form exclusively WC pairings in homozygous-present PM-G•G duplexes. We calculated log-mean intensities for all these duplexes containing a certain base (B = A,T,G,C) at each position k = 1…25 of the probe sequence to study the positional effect of WC-base pairings over the whole sequence length (see lines in panel a of
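The positional single-base averages can be computed along the following lines. This is a minimal sketch: the input format (pairs of probe sequence and background-corrected intensity) and the function name are assumptions made for illustration.

```python
import math

BASES = "ATGC"


def positional_profile(probes, length=25):
    """Log10-mean intensity for each base at each sequence position.

    probes: iterable of (sequence, intensity) pairs with sequences of the
    given length. Returns a dict base -> list of means (index 0 corresponds
    to position k = 1); entries are None where a base never occurs.
    """
    sums = {base: [0.0] * length for base in BASES}
    counts = {base: [0] * length for base in BASES}
    for sequence, intensity in probes:
        if len(sequence) != length:
            raise ValueError("unexpected probe length")
        log_intensity = math.log10(intensity)
        for k, base in enumerate(sequence):
            sums[base][k] += log_intensity
            counts[base][k] += 1
    return {
        base: [sums[base][k] / counts[base][k] if counts[base][k] else None
               for k in range(length)]
        for base in BASES
    }
```

Subtracting the mean log-intensity over all probes of the class from each entry converts such a profile into positional sensitivities.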

Panel a: Single base data of allele-specific (S-mode) PM and MM probes. Each data point was calculated as log-intensity average over all probes of the considered class with the indicated base at position k of the probe sequence. It is associated either with WC pairings or with mismatched pairings at the middle base (mb)-position of the MM. These mismatches give rise to markedly larger variability of the intensities than the WC pairings do at the remaining positions. Panel b shows the positional dependence of the sensitivity (deviation of the log-intensity from its mean over all probes of the class) of cross-allelic PM probes (C-mode) with different offsets of the SNP. The base at the SNP position forms a mismatched pairing which shifts along the sequence according to the offset. Note that the mismatch-values are averages over all groups (Aa, Ag, Ac; see

Also the homozygous-present duplexes of the MM-probes, MM-G•G, form predominantly WC pairings except the middle base which forms mismatches of the Aa-interaction group. The single base averaged intensities of these mismatches vary to a much larger degree about their mean compared to the WC pairings (see the arrow in panel a of

Panel b of

To estimate the effect of longer sequence motifs we calculated intensity-averages of probes possessing “homo”-triples, i.e. runs of three consecutive bases of the same type at a certain sequence position (see panel c and d of ^{−0.2}−10^{−0.4}≈0.6−0.4 compared with the mean intensity for most of the sequence positions. In contrast, the mean effect of a single G is almost negligible. The GGG-effect essentially disappears at the mismatch position in the middle of the probe sequence (see panel d of

Comparison of panel c and d of

It is known that the sequence profiles are sensitive to factors such as the optical background correction and saturation

In the next step we neglect the positional dependence of probe intensities and address the sequence-specific effect of base pairings in triple motifs centered about the middle and SNP base of the probes.

The triple averaged and background corrected intensities were used to calculate the 64 triple-sensitivity values for each of the four interaction groups, (Eq. (8)). Particularly, we selected the homozygous-absent PM probes (PM-G'•G) with one mismatched pairing at SNP position and used the base-triples centered about the middle base (At-group) and about the SNP base (Aa, Ag, Ac group, see

The triple values are calculated using (Eq. (8)) and ranked with increasing sensitivity for each center base B forming matched (group At) and different mismatched (groups Aa, Ag and Ac) pairings with the target as indicated in the figure by upper (probe) and lower (target) letters. The sensitivity-values are calculated relative to the total log-average of all single-mismatched probes of the chip. Sub-averages of the interaction groups (see arrows) and of the central base pairings are shown by vertical solid lines. The vertical dashed lines indicate the standard deviation of the triple values about the central-base related mean (see also

Most of the sensitivities of the At-group (WC pairings) scatter relatively tightly about their mean, indicating only a moderate sequence effect. The ‘GGG’-triple however strongly deviates from this rule; it causes a relatively large intensity penalty: One ‘GGG’-motif gives rise to a reduction of the intensity by a factor of about 10^{−0.2}∼0.63 on average compared with the mean intensity. The triples considered refer to offset positions |δ|≤4 about the middle base. The full positional dependence of ‘GGG’ (^{+0.1}∼1.25. We will discuss this puzzling result below.

The substitution of the central WC pairing by mismatches considerably increases the variability of the triple data. The mean variability of each interaction group was estimated in terms of the standard deviation of all 64 combinations of each group (^{−0.25} = 0.55 and ∼10^{+0.25} = 1.8. This result generalizes the trend which is illustrated in

Interaction group | At | Aa | Ag | Ac |

Base pairings (Watson Crick or mismatches) | WC pairings: At, Cg, Gc, Ta | self complementary mismatches: Aa, Cc, Gg, Tt | self paired mismatches: Ag, Ct, Ga, Tc | cross paired mismatches: Ac, Ca, Gt, Tg |

0.04±0.001 | 0.12±0.0005 | 0.13±0.001 | 0.09±0.0005 | |

0.03 | 0.11 | 0.07 | 0.05 | |

0.05 (0.02) | 0.10 (0.08) | 0.08 (0.06) | 0.06 (0.05) | |

0.02 (0.02) | 0.02 (0.02) | 0.03 (0.02) | 0.01 (0.01) | |

0.04 | 0.03 | 0.02 | ||

0.02 (0.033) | 0.015 (0.047) | 0.013 (0.044) | ||

0.06 (0.07) | 0.06 (0.10) | 0.05 (0.08) | ||

0.05 | 0.08 (0.055) | 0.05 |

variability estimates are separately calculated as standard deviation for each Ab-interaction group: SD = √<Δ^{2}>_{Ab}.

variability of the triple averages with respect to the group-mean: Δ = Y_{Ab}(xBy)−<Y_{Ab}(xBy)>_{Ab}; it estimates the variability of interactions due to the choice of the triple; the standard error refers to the variability of the probe level data of each interaction group.

variability of the triple averages after 3′/5′-transformation: Δ = Y_{Ab}(xBy)−Y_{Ab}(yBx).

variability of the triple averages after complementary-transformation: Δ = Y_{Ab}(xBy)−Y_{Ab}(x^{c}B^{r}y^{c}); the values in the brackets are obtained after omitting the GGG-motif.

variability of the residual values after reduction of the model rank NNN→NN: Δ = Δ^{res}_{Ab} (see Eq. (17)).

variability due to flanking mismatches: Δ = Δ^{flank}_{Ab} (see Eq. (15)).

variability due to quadruplet motifs with tandem mismatches (xBB'y)/(yB'Bx) with B∈Aa and B'∈Aa,Ag,Ac. The SD were calculated with respect to the average over the three groups (Δ(xy) = <Y_{Ab}(xBB'y)>_{BB'}−<<Y_{Ab}(xBB'y)>_{BB'}>_{Ab} and Δ(BB') = <Y_{Ab}(xBB'y)>_{xy}−<<Y_{Ab}(xBB'y)>_{xy}>_{Ab}) and with respect to the total mean over all couples (values in the brackets; Δ(xy) = <Y_{Ab}(xBB'y)>_{BB'}−<<Y_{Ab}(xBB'y)>_{BB'}>_{Ab,xy} and Δ(BB') = <Y_{Ab}(xBB'y)>_{xy}−<<Y_{Ab}(xBB'y)>_{xy}>_{Ab,BB'}).

The mean sensitivity over all triples with a given middle base B provides a measure of the average stability of the respective mismatched pairing Bb (see the red lines in ^{r}b^{r} in symmetrical DNA/DNA interactions, i.e. Y_{Ab}(Bb)≈Y_{Ab}(B^{r}b^{r}) (for example for Tc→Ct and Ac→Ca). Note that, in contrast, DNA/RNA interactions are asymmetrical in solution

Comparison of the mean sensitivity values for each central pairing of all three mismatch-groups provides the following ranking of the stability of mismatched pairings:

Other authors report similar rankings of the stability of single-mismatches in DNA/DNA-oligomer duplexes which are obtained from hybridization studies on surfaces (microarrays or special solid supports) or in solution:

Basic agreement of the reference studies with our ranking is highlighted using bold letters. Accordingly, the consensus-ordering of the array-studies comprises Ct, Ca, Cc as low stability mismatches; Ag, Tg, Tt as high stability mismatches and Gt and Aa at the intermediate position. A major difference between the previous rankings occurs for Gg which is the least stable in the study of Naiser et al. ^{6}) exceeds the probe number used in previous studies by about three orders of magnitude (10^{3} ^{3}

Note also that the reported references

The triple sensitivities shown in ^{c} = T) and bond-reversals for the more general situation which includes also mismatched pairings (e.g. A^{r} = G and A^{r} = A for mismatches of the Ag and Aa groups, respectively).

Perfect 3′/5′-symmetry of the triple sensitivities (i.e. Y(xBy) = Y(yBx)) is expected if the base pairings are independent of their nearest neighbors. Stacking interactions between adjacent nucleotides however make an essential contribution to the stability of DNA/DNA-duplexes
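The deviation from 3′/5′-symmetry can be quantified as sketched below, using Δ = Y(xBy) − Y(yBx) and SD = √<Δ^{2}> for one interaction group. The function name and the dictionary input are assumptions of this illustration; symmetric triples (x = y) contribute Δ = 0 and are kept in the average, which is also an assumption.

```python
import math


def asymmetry_sd(sensitivities):
    """Standard deviation of Y(xBy) - Y(yBx) over all triples of one group.

    sensitivities: dict mapping a triple string 'xBy' to its sensitivity Y.
    Triples whose mirrored partner 'yBx' is missing are skipped.
    """
    deltas = []
    for triple, value in sensitivities.items():
        mirrored = triple[::-1]  # 3'/5'-transformation: xBy -> yBx
        if mirrored in sensitivities:
            deltas.append(value - sensitivities[mirrored])
    if not deltas:
        raise ValueError("no triple with a mirrored partner found")
    return math.sqrt(sum(d * d for d in deltas) / len(deltas))
```

A value of zero would indicate that the central base pairing is independent of the orientation of its nearest neighbors; larger values point to direction-dependent stacking contributions.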

In contrast, the complementarity relation keeps the strand direction unchanged. Perfect complementarity of the triple sensitivities (i.e. Y(xBy) = Y(y^{c}B^{r}x^{c})) is expected if both interacting strands are physically equivalent and if their reactivity is not selectively perturbed by parasitic reactions such as intramolecular folding and/or bulk dimerization

The triple sensitivities, Y(xBy), of each interaction group are ranked in decreasing order and shown by thick lines. For each base-triple three sensitivity values are shown according to Eq. (14) to reveal 3′/5′-asymmetry, Y(yBx), and complementarity, Y(y^{c}B^{r}x^{c}), respectively (symbols are assigned in the figure). The abscissa labels indicate the xBy-triple. The letter-triples in the boxes indicate special triples the sensitivity values of which reveal considerable asymmetry, for example xBy/yBx/y^{c}B^{r}x^{c} = TCG/GCT/AGC of the At-group. Note that GGG-motifs are highly non-complementary in all four interaction groups. Note also the markedly different widths of the scattering funnels of the different interaction groups given by their standard deviation (see dotted lines and also

Both 3′/5′- and complementary asymmetries roughly behave in parallel. They are, by far, smallest for the At-group and largest for the Aa-group which agrees with the ranking of the variability of the triple sensitivities between the groups. Also the SD values roughly agree (see

Hence, among the considered groups, the effect of the central mismatch of the Aa-group is most strongly modulated by stacking interactions and complementary asymmetries, causing the largest variability of the associated probe intensities. Note that just this type of self-complementary mismatches was selected to design MM probes on microarrays of the GeneChip-type. Our results suggest that this design is suboptimal because it is associated with a relatively high variability of mismatch stability. The effect introduces additional noise into the MM intensities which intend to correct the PM signals for background contributions.

Examples for symmetry relations are explicitly indicated in ^{c}B^{r}x^{c} are given within the boxes, the abscissa labels indicate the xBy-triple only): For example, the combination AGC/CGA/TCG taken from the At-group shows marked 3′/5′-asymmetry beyond the limits of the mean scattering funnel. The data clearly show that the by far largest complementary asymmetries are associated with triple-G motifs in the probe sequence for all interaction groups (see solid triangles surrounded by the circles). They make a contribution of up to 50% to the mean variability of the respective interaction groups (

The context of adjacent WC pairs considerably modifies the effect of the central mismatch: For example, the intensity ratio of two triples with a central Cc-mismatch (Aa-group) flanked either by two C's or by two A's is about I(CCC)/I(ACA)|_{Aa}≈10^{+0.2}∼1.6 whereas the respective intensity ratio for the triples with a central Cg-pair (At-group) is only I(CCC)/I(ACA)|_{At}≈10^{+0.1}∼1.25.
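Because the sensitivities are defined on a decadic logarithmic scale, a difference of sensitivities translates into an intensity ratio by exponentiation. The following one-line helper, whose name is chosen here for illustration only, makes the arithmetic behind the quoted factors explicit.

```python
def fold_change(delta_log10: float) -> float:
    """Intensity ratio corresponding to a difference of log10-sensitivities."""
    return 10.0 ** delta_log10


# Differences of +0.2 and +0.1 correspond to ratios of about 1.6 and 1.3;
# a penalty of -0.2 corresponds to a reduction to about 0.63.
ratios = [round(fold_change(d), 2) for d in (0.2, 0.1, -0.2)]
```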

To generalize this result we average the triple sensitivities of each mismatch group over the central base,

Mean sensitivity values were calculated as averages over triple sensitivities shown in

Tandem mismatches occur in homozygous-absent duplexes of the MM-probes (MM-G'•G) with SNP offsets δ = +1 and −1 (see

The quadruplets were analyzed in terms of independent duplets of the WC-couples xy (part a), of tandem mismatches BB' (part b) and of mixed NN-couples xB/Bx and yB'/B'y (part c and d). Note that B refers to the Aa-group whereas B' to the Aa-, Ag- or Ac-group (see legends in the figure). Along the x-axis the respective pairings are ordered with decreasing mean sensitivity which is averaged over the three groups Aa, Ag and Ac of B' (see the thick decaying curve). Part a and b: The central tandem mismatches formed by B and B' cause considerably larger scattering than the adjacent WC pairings formed by x and y. The thin dotted curves running parallel to the thick line illustrate the standard deviation of the dots about their mean (see also

We calculate the sensitivities of all possible combinations for each of the three possible options of B' (referring either to the Aa-, Ag- or Ac-group) using the background-corrected intensities relative to the mean log-intensity of the probes with two mismatches (#mm = 2) with at least one WC pairing in-between, Y_{Ab}(xBB'y) = log(I_{Ab,#mm = 2,|δ| = 1}(xBB'y))−<log(I)>_{#mm = 2,|δ|>1} (see also part a of

The average values of the obtained sensitivities of the tandem mismatches are positive (see the horizontal dashed lines in part a and b of

The 16^{2} possible quadruplet combinations were reduced to 2×16 values for each of the three possible pairings of B' by calculating the average either over the edging WC pairings xy or over the mismatches BB', <Y_{Ab}(xBB'y)>_{xy} and <Y_{Ab}(xBB'y)>_{BB'}, respectively. We consider all 16 combinations of xy and BB' in xBB'y because both members of each couple are not equivalent (B'∈Aa, Ag, Ac and B∈Aa). The obtained values thus characterize the effect of the edging base couples xy (part a of _{Aa}(xB)+Y_{Aa}(Bx)>_{B'y} and ½<Y_{Ab}(B'y)+Y_{Ab}(yB')>_{xB} , respectively (see part c and d of

The couples of edging bases x and y cause considerably smaller variability of the probe sensitivities than the couples of adjacent mismatches (compare part a and b of

Part a of

Part b of

Alternatively, we decomposed the quadruplets with the central tandem mismatch into two consecutive NN-terms as described above (

Triples with flanking mismatches of the type w(xBy)m (B∈At; “w” and “m” denote a WC- and a mismatched pairing, respectively, i.e. w∈At and m∈Aa,Ag,Ac) were selected according to the scheme shown in ^{c}B^{c}y^{c})m (the superscript “c” denotes the WC-complement). These, in total four options (for example (CGT)m, m(TGC), (GCA)m, m(ACG)) are averaged to provide the mean effect of the flanking mismatch adjacent to ‘y’ and ‘y^{c}’ on the selected triple.

The respective probes with flanking triples are selected according to ^{c}B^{c}y^{c})m and that at the upper axis m(yBx)w/m(y^{c}B^{c}x^{c})y. The thick line refers to the total mean over all three mismatch groups m∈Aa,Ag,Ac. The excess values are consistently positive and negative for adjacent y = A,T and y = C,G, respectively.

In analogy with the NN free energy contributions in models describing the stability of DNA/DNA-oligonucleotide duplexes in solution (see _{Ab}(xBy), into two nearest neighbor (NN) terms, Y_{Ab}(x_{Ab}(

We first examined the adequacy of the decomposition (Eq. (16)) in terms of the residual contribution_{Ab}^{res}(xBy) = 0. Especially the propensity of selected sequence motifs for intramolecular folding of the probes and/or the targets and also for the formation of special intermolecular complexes are expected to involve longer runs of subsequent nucleotides causing deviations from the additivity assumption (Eq. (16)).
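The additivity test can be illustrated numerically. The following minimal sketch assumes that the decomposition has the form Y(xBy) ≈ Y_left(xB) + Y_right(By), with separate terms for the couple on either side of the centre base; the exact form of Eq. (16) is not reproduced here. The function name `nn_decompose` and the synthetic sensitivities are illustrative assumptions, not the implementation used in this study. `numpy.linalg.lstsq` solves the rank-deficient system with an SVD-based routine and returns the minimum-norm solution.

```python
import itertools
import numpy as np

BASES = "ACGT"
TRIPLES = ["".join(t) for t in itertools.product(BASES, repeat=3)]
COUPLES = ["".join(c) for c in itertools.product(BASES, repeat=2)]
COL = {c: i for i, c in enumerate(COUPLES)}


def nn_decompose(triple_sens):
    """Fit Y(xBy) ~ left(xB) + right(By) by least squares (illustrative).

    triple_sens maps each of the 64 triples to a sensitivity value.
    Returns (left terms, right terms, residuals, rank of the design matrix).
    """
    design = np.zeros((len(TRIPLES), 2 * len(COUPLES)))
    for row, t in enumerate(TRIPLES):
        design[row, COL[t[:2]]] = 1.0                  # couple xB
        design[row, len(COUPLES) + COL[t[1:]]] = 1.0   # couple By
    y = np.array([triple_sens[t] for t in TRIPLES])
    coef, _, rank, _ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef                          # residual contribution
    left = {c: coef[COL[c]] for c in COUPLES}
    right = {c: coef[len(COUPLES) + COL[c]] for c in COUPLES}
    return left, right, dict(zip(TRIPLES, resid)), rank


# Synthetic demo (made-up values): additive sensitivities plus a GGG penalty.
rng = np.random.default_rng(0)
true_left = {c: rng.normal(0, 0.05) for c in COUPLES}
true_right = {c: rng.normal(0, 0.05) for c in COUPLES}
toy = {t: true_left[t[:2]] + true_right[t[1:]] for t in TRIPLES}
toy["GGG"] -= 0.3  # hypothetical non-additive intensity drop

left, right, residual, rank = nn_decompose(toy)
worst = min(residual, key=residual.get)  # triple with the most negative residual
```

In this toy setting the fit absorbs part of the GGG drop into the GG terms, so GGG retains the most negative residual while triples of the type xGG and GGy acquire small positive residuals. This is a property of the least-squares projection on synthetic data and is meant only to illustrate how non-additive motifs show up in the residuals.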

The symbols refer to the mismatched interaction groups. The triples are ranked with decreasing residual contributions of the At-group. The horizontal dashed lines mark the average standard deviation of the data about the abscissa. The two NNN-lists indicate the largest positive (left list) and negative (right list) residual-values of the At-group. Note that triple GGG provides by far the largest (negative) residual contribution (see red circles). Positive contributions are obtained for triples containing the couple ‘GG’ which indicates that the respective NN-terms underestimate their contribution to the triple sensitivities.

However, motifs containing couples of adjacent GG are prone to positive deviations from additivity indicating that the respective GG-term systematically underestimates the contribution of two adjacent guanines to the triple term. On the other hand, runs of three guanines, ‘GGG’, give rise to the strongest negative residual terms of all interaction groups. The triple sensitivities Y_{Ab}(GGG) are negative for all interaction groups (see

The NN-terms are calculated via decomposition of the triple terms using SVD (Eq. (16)) where the base couples are ordered with respect to the centre base B of the triples. The base couples are indicated as abscissa labels x^{r}x^{c}. NN-terms related to ‘GG’-motifs are indicated by red circles. They strongly deviate from the complementary condition.

The 32 NN-couples of At-groups can be further reduced to 16 NN-terms making use of the symmetry-relation Y_{At}(X_{At}(_{At}(XY) = 0.5⋅(Y_{At}(X_{At}(_{At}(CC)−Y_{At}(GG)>0.06 indicates that the complementarity between CC and GG is clearly disrupted. On the other hand, the sensitivity values of the remaining complementary couples (XY/Y^{c}X^{c} = AA/TT, CT/AG, TC/GA, AC/GT and CA/TG; see full and open symbols) are relatively close to each other (mean difference |Y(XY)−Y(Y^{c}X^{c})|≈0.01), which justifies utilization of the complementarity condition to a good approximation. The linear regression coefficient slightly improves (R = 0.92) after averaging over the complementary couples. Hence, except for GG-motifs, the interactions of canonical WC pairings estimated from the probe intensities of SNP GeneChip microarrays correlate acceptably, on a relative scale, with free energies in solution.
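The complementarity check can be sketched as follows. The helper names and the numerical values are hypothetical and serve only to show how each couple XY is paired with Y^{c}X^{c}, how their difference measures the asymmetry, and how averaging reduces the 16 couples to 10 distinct terms (six complementary pairs plus the four self-complementary couples AT, TA, CG and GC).

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complementary_couple(xy):
    """Return Y^c X^c, the reverse complement of the couple XY."""
    return "".join(COMPLEMENT[b] for b in reversed(xy))


def asymmetry(nn_terms):
    """Y(XY) - Y(Y^c X^c) for one representative of each complementary pair."""
    out = {}
    for xy in sorted(nn_terms):
        partner = complementary_couple(xy)
        if xy < partner:  # skips self-complementary couples and duplicates
            out[f"{xy}/{partner}"] = nn_terms[xy] - nn_terms[partner]
    return out


def symmetrize(nn_terms):
    """Average each NN term with that of its complementary couple."""
    return {
        xy: 0.5 * (value + nn_terms[complementary_couple(xy)])
        for xy, value in nn_terms.items()
    }


# Made-up NN terms: all zero except a disrupted CC/GG pair.
toy_terms = {x + y: 0.0 for x in "ACGT" for y in "ACGT"}
toy_terms["CC"] = 0.04
toy_terms["GG"] = -0.03

diffs = asymmetry(toy_terms)
averaged = symmetrize(toy_terms)
```

With these toy values only the CC/GG entry of `diffs` is non-zero, which mirrors the kind of deviation from the complementarity condition discussed above.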

The figure shows the sensitivity NN-terms of the At- (part a) and Aa- (part b) groups obtained in this study (Eq. (16)) with NN-stacking free energy terms for DNA/DNA-duplexes in solution taken from ref. ^{r}x^{c} are shown by different triangles. Only selected NN-motifs are assigned. The apparent mean stabilities of the mismatched pairings rank differently for chip (see vertical bar) and solution (horizontal bar) data.

Part b of _{Aa}(x_{Aa}(^{r}x^{c}) (solid symbols). As for the At-group, the double-guanine terms strongly deviate from the regression line and were excluded from the linear fit (R = 0.65). Additional exclusion of double-thymines further increases the regression coefficient (R = 0.75) which indicates satisfactory correlation between solution free energy data and most of the NN-sensitivities. A recent study also reports clear correlation between solution and array estimates of hybridization free energies using a specially designed Agilent microarray containing sets of PM and MM probes with #mm = 1 and 2 mismatches upon duplexing

Note that the mean stability of self-complementary mismatches ranks according to CC<TT≈AA<GG in solution but according to Cc<Gg≈Aa<Tt on the chip (see

In this study we analyzed the probe intensities taken from a 100k GeneChip SNP array in terms of selected sequence motifs forming well-defined WC and mismatched base pairings in the probe/target duplexes. The particular probe design of these GeneChip SNP arrays enables one to disentangle different sources of intensity modulations such as the number of mismatches per duplex, the particular matched or mismatched base pairings, their nearest and next-nearest neighbors, their position along the probe sequence and the relative position of a second mismatch. As the elementary sequence motif we chose triples of subsequent nucleotides centered about the middle base of the probe and/or about the SNP base and calculated log-averages of the intensities over thousands of probes with identical motifs to average out the effect of the remaining sequence. These averages are measures of the stability of the base pairings formed by the selected triple in the probe sequence with the corresponding base triple in the target sequence. The former triple is defined by the probe sequence whereas the target triple can be deduced from the genotype and the hybridization mode. We analyzed the log-averaged intensities, their difference to selected reference values, the so-called sensitivity, and their variability in subsets of triple-motifs. In addition to triple motifs, we also considered special motifs such as flanking mismatches adjacent to the triples and tandem mismatches which were analyzed in terms of quadruplets including the edging WC pairings.
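The triple-averaging step summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `triple_sensitivities`, the input layout (pairs of probe sequence and background-corrected intensity) and the choice of the mean log-intensity of all supplied probes as reference value are hypothetical; the reference values actually used depend on the probe class considered.

```python
import math
from collections import defaultdict


def triple_sensitivities(probes, center):
    """Log-average intensities per triple motif, relative to the overall mean.

    probes: iterable of (sequence, background-corrected intensity) pairs.
    center: 0-based index of the central base B of the triple xBy.
    Returns {triple: <log10 I>_triple - <log10 I>_all probes}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    total = 0.0
    n = 0
    for seq, intensity in probes:
        log_i = math.log10(intensity)
        triple = seq[center - 1:center + 2]
        sums[triple] += log_i
        counts[triple] += 1
        total += log_i
        n += 1
    mean_all = total / n  # assumed reference value
    return {t: sums[t] / counts[t] - mean_all for t in sums}


# Made-up 7-mer "probes" for illustration only; real probes are longer.
toy_probes = [
    ("AAGGGTT", 100.0),
    ("CAGGGTA", 100.0),
    ("AACATTT", 1000.0),
    ("CCCATGG", 1000.0),
]
sens = triple_sensitivities(toy_probes, center=3)
```

Averaging over many probes that share only the selected triple is what removes the influence of the remaining sequence; with a handful of toy probes, as here, no such averaging-out takes place.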

The first question of our analyses addresses the impact of different interaction motifs on the observed probe intensities. It turns out that

a) the number of mismatches per probe/target duplex exerts the largest effect on the intensity. One mismatch is associated with a logarithmic intensity change of δlogI = −0.5 to −0.6, which is equivalent to a decrease of the intensity by a factor of about F = 0.3–0.25 per mismatch.

b) the effect of mismatches is strongly modulated by the adjacent WC pairings which give rise to a mean logarithmic increment of δlogI ≈ ±0.1 or, equivalently, to an average modulation factor of 0.8<F<1.25 (see

c) duplexes with tandem mismatches are more stable than double mismatches which are separated by at least one WC pairing (δlogI≈+0.1 and F≈1.25).

d) flanking mismatches adjacent to the considered triples only weakly modulate their intensities (|δlogI|<0.025; 0.95<F<1.05).

e) the mean variability due to sequence effects in triples of WC pairings is markedly smaller than the effect in triples with a central mismatch (δlogI = ±0.05; 0.9<F<1.1; compare with b).

f) runs of three guanines in the probe sequence forming nominally WC pairings represent a special motif which decreases the intensity to an exceptionally strong extent (δlogI = −0.2 to −0.35; F = 0.6–0.45). Also mismatched duplexes with runs of guanines possess relatively small intensity values which are virtually incompatible with expected interaction symmetries in DNA/DNA-duplexes.

g) the positional dependence of triple-averaged intensities along the probe sequence is relatively weak (see

h) especially small (e.g., for probes with two mismatches, #mm = 2) and large intensity values are prone to background and saturation effects, respectively (see
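The factors F quoted in items a) to f) follow from the logarithmic increments via F = 10^{δlogI}, assuming decadic logarithms as used throughout. A short sketch of this conversion (the function name and the dictionary labels are illustrative):

```python
def modulation_factor(delta_log_i):
    """Convert a decadic log-intensity increment into a multiplicative factor."""
    return 10.0 ** delta_log_i


# Increments quoted in items a) to f).
increments = {
    "a) one mismatch": (-0.5, -0.6),
    "b) adjacent WC pairings": (-0.1, 0.1),
    "c) tandem mismatch": (0.1, 0.1),
    "d) flanking mismatch": (-0.025, 0.025),
    "e) WC triples": (-0.05, 0.05),
    "f) triple-G": (-0.2, -0.35),
}

for label, (first, second) in increments.items():
    f1, f2 = modulation_factor(first), modulation_factor(second)
    print(f"{label}: F = {f1:.2f} to {f2:.2f}")
```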

Our analyses also address the question whether the number of considered sequence motifs can be reduced by utilizing symmetry relations and/or by decomposing the triple averages into nearest neighbor terms in analogy with interaction models for oligonucleotide duplexes in solution. It turned out that

i) triples of WC pairings (At-group) can be reasonably well decomposed into NN-terms which also meet the complementary condition to a good approximation and correlate well (R = 0.85) with the independent NN-free energy terms derived from duplex-data in solution

j) also the triples with a central mismatch (Aa-, Ag- and Ac-group) to a good approximation decompose into NN-terms except special motifs containing at least doublets of guanines. The mismatch motifs partly obey the symmetry relations, however, with larger residual variability compared with WC pairings. Comparison with NN-terms of solution free energies

k) tandem mismatches can be decomposed into two NN-terms referring to a combination of mismatched and WC pairings. These values correlate (R = 0.59) with the NN-terms obtained from the triple data, suggesting the use of a unified set of NN-terms (see j). For tandem mismatches one must, however, consider their systematically larger stability compared with duplexes containing two mismatches which are separated by at least one WC pairing.

In the following subsections we discuss the physical origin of selected effects in more detail and derive rules for appropriate correction of parasitic intensity errors to obtain unbiased genotyping estimates.

The intensity of microarray probes is directly related to the effective association constant for duplexing, ∼K_{duplex} after correction for parasitic effects (or their neglect, if justified) such as the optical background, non-specific hybridization and saturation (see Eq. (2)). The effective association constant is a function of different reaction constants characterizing relevant molecular processes such as the bimolecular stacking of unfolded probes and targets (P•T, P•P, T•T), and their unimolecular folding propensities (P-fold, T-fold) _{surface}<1 is a factor taking into account surface effects, such as electrostatic and entropic repulsions which effectively reduce target concentrations near the array surface. According to Eq. (18), the effective constant of duplex formation is reduced by the factor F_{array}<1 compared with the stacking interaction constant K^{P•T}. Folding and/or self-dimerization of probe and/or target become relevant at

Stacking interactions are mainly governed by the pairings formed between the nucleotides in the target and probe and their nearest-neighbors along the sequence. The decomposition of the corrected intensity into different interaction modes associated with single target-types enables assignment of the probe sequence to canonical and mismatched base pairings with the target. We analyzed triple motifs which represent a reasonable choice to study stacking interactions on an elementary level. Note that also the reduction factor F_{array} depends on the probe and target sequences, however in a more subtle fashion because, for example, folding reactions comprise longer sequence motifs.

The duplex-association constants can be multiplicatively decomposed into a triple-related factor which modulates the total (average) contributions

Comparison with Eq. (8) and considering the direct relation between the corrected intensity and K_{duplex} provides the relation between the analyzed observables and the binding constants,^{P•T}∼−logK^{P•T}, which applies also to the triple terms, i.e., ΔΔG^{P•T}(xBy) = ΔG^{P•T}(xBy)−<ΔG^{P•T}>∼−logk^{P•T}. With this definition and Eq. (20) one finds

The sensitivity and free energy change in opposite directions, i.e. larger stability of interactions is associated with larger Y but smaller (more negative) ΔG. After decomposition into NN-terms we found acceptable correlation between the estimates from chip data and solution data taken from the literature for most of the motifs (see

The proportionality constant in Eq. (21) is estimated by the slope of the regression lines in ^{−1} roughly one order of magnitude smaller than the proportionality constant predicted by the thermal energy ∼1/(RT⋅ln10)≈0.7 (T≈40°C). We previously argued that non-linear (in logarithmic scale, as, e.g., predicted by Eq. (18)) and sequence dependent contributions to log(f_{array}(xBy)) can cause proportionality constants less than unity
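The thermal reference value quoted above can be checked with a short calculation. The sketch assumes that free energies are expressed in kcal/mol, so that the gas constant is R ≈ 1.987×10^{−3} kcal/(mol·K); the function name is illustrative.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K); unit choice is an assumption


def thermal_slope(temp_celsius):
    """Return 1/(R*T*ln10) in mol/kcal for a temperature given in deg C."""
    temp_kelvin = temp_celsius + 273.15
    return 1.0 / (R_KCAL * temp_kelvin * math.log(10.0))


slope_40 = thermal_slope(40.0)  # close to the value of about 0.7 quoted above
```

A sensitivity increment predicted from a free energy difference would then be the product of this slope and −ΔΔG, which is the idealized relation against which the much smaller regression slopes are compared.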

The stabilities of most of the mismatched pairings (Eq. (12)) rank in similar order as the results of previous chip and solution studies (Eq. (13)). ^{r} and B'•b'^{r} with b^{r} = B' and b'^{r} = B, respectively). Our figure was designed similar to

Relative stabilities of the 10 possible contexts of complementary triples containing the 16 possible central base pairings (mismatches or Watson-Crick base pairs, see legend in the figure). The sensitivities of the pairs of complementary triples xBy/y^{c}B^{r}x^{c} (B^{r} = B') are averaged using the triple data shown in

The stability of mismatched pairings is governed by the propensity of the paired nucleotides to form hydrogen bonds (e.g., two bonds (T, A) versus three bonds (G, C) in canonical WC pairings), by steric factors such as the size of the aromatic moiety (one ring of the pyrimidines (C,T) versus two rings of the purines (G, A)) as well as stacking effects associated with nearest neighbors.

Stable mismatched base pairs such as GT or GA form two H-bonds and only slightly disrupt the structure of the oligonucleotide-DNA duplex. In particular, the former purine/pyrimidine mismatch GT is usually slightly more stable than the latter purine/purine mismatch GA because a two-ringed guanine better fits with a single-ringed thymine than with a double ringed adenine

The second self complementary single ringed TT mismatch with low stacking propensity is, in contrast to CC, however stabilized by two H-bonds. The two purine/purine self complementary mismatches GG and AA have a relatively high stacking potential and form either two (GG) or only one (AA) H-bond. One expects therefore the stability-series AA≈TT<GG which is confirmed in solution experiments

Consideration of the neighboring bases shows that the apparent low stability of Gg-mismatches is accompanied by triple G-motifs in the probe sequence. These runs of guanines are associated with low intensities in triples with both central WC (At-group) and mismatched (Aa-, Ag- and Ac-group) pairings. The stability of central Gg-pairings in the context of adjacent ‘non-G’-bases, on the other hand, roughly agrees with the predictions from solution data (see

Our analyses reveal the following effects of triple-G on the observed probe intensities:

(i) The GGG-effect is non-complementary, i.e. the complementary triples (e.g. CCC for perfect matches) do not show exceptionally small intensities as probes with GGG do.

(ii) Exceptionally small intensities are also observed for triple-G with central mismatches independent of the nominal pairing of the central base (see the arrows in

(iii) The effect is non-additive, i.e. the intensity drop due to GGG is inconsistent with the decomposition into GG-contributions in the context of all triple-motifs.

(iv) The effect depends on the sequence position, being typically smaller near the ends of the probe sequence (see

(v) For probes with one mismatched pairing one observes, in contrast to (iv), that terminal GGG at the solution end of the probes gain intensity, i.e. the sign of the effect reverses compared with the remaining sequence positions.

(vi) The intensity drop due to one triple-G corresponds roughly to 50% of the intensity loss due to one mismatched pairing (see

The observations (i) and (ii) strongly indicate that the triple-G effect is not associated with the nominal base pairings deduced from the binding mode because otherwise one expects equal intensity changes for complementary sequence motifs. Observation (iii) indicates that the effect exceeds the range of stacking interactions with the nearest neighbors. Observation (vi) shows that the magnitude of the effect is relatively large compared with the variability due to other base-specific effects but smaller than the variability due to single mismatches.

To get further insight into the properties of poly-G motifs we calculated the mean sensitivity for runs of identical bases of length one to five, e.g. G, GG,…,GGGGG averaged over all sequence positions of homozygous-present PM-probes (PM-G•G, see

The sensitivity values are averaged over all sequence positions of homo-motifs of length 1 to 5 of homozygous-present probes (PM-G•G, see also

Previous studies also reported abnormal intensity responses of probes containing multiple guanines in a row (called G-runs or G-stacks) compared with other probes in different chip assays including Affymetrix expression and SNP arrays

The structural rationale behind the poly-G effect has been concordantly assigned to the propensity of poly-G motifs to arrange into stacks of stable molecular bundles of guanine tetrads. These structures potentially affect the efficiency of oligonucleotide synthesis and/or the hybridization of the probes to their target sequences accounting for the abnormal performance of G-runs on the array

As mentioned above, there are two dimensions which potentially affect the performance of probes containing poly-G motifs: firstly, their ability to be correctly synthesized on an array, and secondly, the ability of correctly synthesized probes to bind their targets.

Let us discuss the first option. The GeneChip arrays are fabricated by ^{Y(GGG)}≈0.4–0.5 for GGG motifs (with Y(GGG) = −0.2…−0.3; see

Also the second option of modified target binding to correctly synthesized probes provides a tentative explanation of the GGG-effect

Note that both discussed potential interpretations of the GGG-effect give rise to a common cause of the observed small intensity values, namely the reduced number of available binding sites for target binding either via truncation or via complexation of part of the probe oligomers. Both interpretations are compatible with our observations (i) and (ii) because the reduced amount of full-length probes and also probe-probe complexes are independent of the respective complementary target sequence upon allele-specific hybridization and independent of the respective mismatched target motif upon cross-allelic hybridization. Also the onset of the increased sensitivity increment per additional guanine for triple-G motifs shown in

Tethering of the involved oligonucleotides to the surface and zippering effects towards both ends of the probes are expected to modify their propensity for G-tetrad formation in a positional dependent fashion in analogy with the positional dependence of base pairings in probe/target dimers _{array}<1 (Eq. (18)). The GGG-profile of homozygous-absent probes (PM-G'•G, see part d of

The suggested mechanisms explain the decreased intensity of probes containing runs of consecutive guanines. The effect (v) however seems puzzling because terminal poly-G's increase the intensity of the respective probes, instead. On expression arrays one even observes much stronger intensity gains for poly-G containing probes

In summary, our data support the hypothesis that runs of consecutive guanines facilitate the formation of stable G-quadruplexes between neighboring probes, which ultimately reduce the number of probe oligomers available for target binding via two alternative mechanisms: firstly, the reduced synthesis yield of full-length probes and/or, secondly, the formation of complexes of neighboring full-length probes. Both hypotheses are compatible with the observed intensity drop of probes containing runs of guanines on SNP arrays.

GGG-runs are relatively common on SNP arrays: About 11% of all probes on the studied 100k GeneChip SNP arrays contain at minimum one triple GGG motif and nearly 30% of the allele-sets contain at minimum one of these probes. We conclude that the discussed effect cannot be neglected in appropriate correction methods.
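Fractions of this kind can be obtained by a simple scan of the probe sequences. The sketch below uses hypothetical helper names and made-up sequences; it assumes that probes are available as plain strings and that allele-sets are given as lists of such strings.

```python
def contains_run(seq, base="G", length=3):
    """True if seq contains at least `length` consecutive copies of `base`."""
    return base * length in seq


def fraction_of_probes_with_run(sequences, base="G", length=3):
    """Fraction of probe sequences containing the run at least once."""
    sequences = list(sequences)
    hits = sum(contains_run(s, base, length) for s in sequences)
    return hits / len(sequences)


def fraction_of_sets_with_run(probe_sets, base="G", length=3):
    """Fraction of probe sets in which at least one probe contains the run."""
    probe_sets = list(probe_sets)
    hits = sum(any(contains_run(s, base, length) for s in ps) for ps in probe_sets)
    return hits / len(probe_sets)


# Made-up sequences for illustration only.
toy_sets = [
    ["ACGGGT", "ACGTAC"],
    ["GGAGGA", "TTGACC"],
    ["TTGGGG", "CCGGGA"],
    ["ACACAC", "TGTGTG"],
]
all_probes = [s for ps in toy_sets for s in ps]
probe_fraction = fraction_of_probes_with_run(all_probes)
set_fraction = fraction_of_sets_with_run(toy_sets)
```

Because a set is counted as soon as one of its probes carries the motif, the set-level fraction is never smaller than what a single probe per set would give, which is consistent with the larger percentage reported for allele-sets than for individual probes.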

The SNP-specific sequence bias transforms into systematic errors of the genotyping characteristics derived from the signals of single probes. Note that the sequence context of a particular SNP, and consequently also the respective bias, is very similar for all probes of a selected probe set addressing the same SNP. As a consequence, the averaging of the probe signals into set-related allele values only weakly reduces the systematic signal error after the summarization step. SNP arrays differ in this respect from expression arrays where the sequences of the set of probes interrogating the expression of the same gene or exon can be chosen independently to a larger degree.

One central task of the preprocessing of signals of SNP probes is consequently their correction for sequence effects and in particular for SNP-specific biases. The detailed presentation and verification of an appropriate algorithm is beyond the scope of the present work and will be given elsewhere. The results of our systematic study, however, make it possible to identify relevant sequence motifs which significantly modulate the probe intensities. The intensity contributions of such motifs constitute the building blocks of an appropriate intensity model. In particular, our results suggest the following rules for sequence correction of SNP probe intensities:

(i) Sequence effects due to WC pairings between probe and target are well approximated using nearest-neighbor (NN) motifs in analogy with accepted NN-free energy models for oligonucleotide-duplexing in solution

(ii) The anisotropy of probe/target interactions due to the fixation of the probes at the chip surface and end-opening (zippering effects)

(iii) The modulation of probe intensities by mismatched pairings can be considered using triple-motifs which consist of the central mismatch and the two adjacent WC pairings.

(iv) Nominal base pairings according to (i) and (iii) can be deduced from the hybridization mode of the respective probes which, in turn, provides selection criteria of the probes for parameter estimation. The mean intensity penalty owing to one and two mismatches can be estimated from the respective class of probes.

(v) Runs of triple guanines (GGG) represent a special motif which markedly modulates the intensities of the respective probes. The underlying effect does not originate from probe/target (pairwise) interactions but obviously results from the formation of collective complexes presumably of four neighboring probes. Therefore it affects essentially all probes with triple G-motifs independently of the hybridization mode.

(vi) Also tandem mismatches represent a special motif of MM-probes with a modified intensity penalty compared with other MM-probes possessing two mismatches with at least one WC pairing in-between. This sequence effect can be taken into account in a first order approximation by decomposing the quadruple formed by the tandem mismatch and the two adjacent WC pairings into two NN-terms referring to a WC- and a mismatched pairing each, or more roughly, by explicitly considering the two adjacent WC pairings.

(vii) The shift of mismatch motifs by a few sequence positions about the middle base of the probe and the effect of flanking mismatches adjacent to triples with a central mismatch can be neglected to a good approximation.

(viii) Background intensity contributions (optical background and “chemical” background due to non-specific hybridization) should be considered especially for probes forming at least one mismatched pairing.
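How such rules might enter an additive intensity model can be indicated by a deliberately simplified sketch. It is not the correction algorithm referred to above, which is presented elsewhere; it merely assumes that the sequence-specific log-intensity increment of a probe with WC pairings is the sum of position-independent NN-terms plus a separate penalty per GGG occurrence. All names and parameter values are hypothetical, and mismatch triples, positional dependence and background are ignored.

```python
def sequence_increment(seq, nn_terms, ggg_penalty):
    """Assumed additive log-intensity increment of a probe sequence.

    nn_terms: {couple: NN sensitivity term}; missing couples count as 0.
    ggg_penalty: extra (negative) term per GGG occurrence; overlapping
    occurrences, e.g. in GGGG, are each counted.
    """
    nn_sum = sum(nn_terms.get(seq[i:i + 2], 0.0) for i in range(len(seq) - 1))
    n_ggg = sum(1 for i in range(len(seq) - 2) if seq[i:i + 3] == "GGG")
    return nn_sum + ggg_penalty * n_ggg


def correct_log_intensity(log_intensity, seq, nn_terms, ggg_penalty):
    """Subtract the assumed sequence-specific increment from log10 intensity."""
    return log_intensity - sequence_increment(seq, nn_terms, ggg_penalty)


# Made-up parameters and probe for illustration only.
toy_nn = {"GG": 0.05}
toy_penalty = -0.25
increment = sequence_increment("AGGGT", toy_nn, toy_penalty)
corrected = correct_log_intensity(2.0, "AGGGT", toy_nn, toy_penalty)
```

Treating GGG as a separate term rather than as a sum of GG contributions reflects the non-additivity of the triple-G effect; in practice the parameters would have to be estimated from appropriately selected probe classes.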

Established preprocessing algorithms for GeneChip SNP arrays explicitly consider the mean intensity penalty per mismatch

Our present analysis has focused on sequence effects. Note for the sake of completeness that an elaborate correction algorithm should also consider additional sources of intensity variation not taken into account here, such as the fragment length and the GC-content of the targets

Single mismatched pairings formed in cross-allelic probe/target duplexes and runs of poly-G motifs in the probe sequence are, with the exception of the number of mismatches per duplex, the main sources of signal variability on SNP arrays. These effects must be considered in appropriate calibration methods of the probe intensities to improve the accuracy of genotyping and copy number estimates. The poly-G effect seems to be related to the crowded arrangement of probes on high density oligonucleotide arrays which facilitates the formation of G-quadruplexes between neighboring probes and in this way reduces the amount of free probes available for target binding either via incomplete synthesis of full-length oligomers and/or via complexation of full-length probes. The probe/target interactions on the chip can be decomposed into nearest neighbor contributions which in most cases correlate well with the respective free energy terms describing DNA/DNA-interactions in solution. The effect of mismatches is about twice as large as that of canonical pairings for unknown reasons. Triple-averaging represents a model-free approach to estimate the mean intensity contributions of different sequence motifs which can be applied in improved calibration algorithms to correct signal values for sequence effects.

Hybridization modes and base pairings for probe selection. The supporting text provides an overview of the hybridization modes, probe attributes and interaction groups; of base pairings in probe/target duplexes at the middle and SNP position of the probe sequences; and of how probes are selected for triple-averaging (including the ‘hook’ criteria and background correction).

(0.44 MB PDF)