SUPPORTING INFORMATION
Methods
Eq. 2 in the main manuscript requires the probability that an MTase has a certain property given that it has a particular substrate type. We calculated this value in one of two ways, depending on whether the property in question was a continuous or a categorical variable. In both cases we started from the number of known MTases with a given substrate and a given property, and the number of all known MTases with that substrate type. Then (i) we assumed that the training set was a sample from a larger population and used the hypergeometric probability distribution to estimate the probability for the whole population; (ii) additionally, for continuous properties, we smoothed the probability function to avoid rapid changes between intervals.
Two applied approaches:
(i) Since MTases with known substrates are only a sample of the whole population of MTases, we do not use the probabilities P(property | substrate_i) computed for the set of known MTases, but rather estimate the true value for the whole population. Estimating these probabilities is especially crucial for combinations of properties and substrates that do not occur among known MTases, that is, for which the sample probabilities equal 0. To estimate P(property | substrate_i) needed in Eq. 1 in the main manuscript, we assume that the population size is 1000 and determine n such that it minimizes the absolute value of:
$$\sum_{i=0}^{n-1} P(k \mid i) - \sum_{i=n+1}^{1000} P(k \mid i), \qquad \text{(S1)}$$
where k is the observed number of MTases with a given combination of feature and substrate specificity, i is the number of such MTases in the population of 1000, and P(k | i) is the probability of observing k such MTases when sampling from a population of 1000 containing i of them, computed using the hypergeometric probability distribution.
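The search for n in Eq. S1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the study: the function names are hypothetical, `draws` stands for the number of known MTases with the given substrate (taken here as the sample size), and converting n to a probability as n/1000 is an assumed final step.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """P(k | i): probability of k hits in `draws` draws without replacement
    from `population` items, `successes` of which are hits."""
    if k < 0 or k > draws or k > successes or draws - k > population - successes:
        return 0.0
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


def estimate_population_count(k, draws, population=1000):
    """Return n minimizing |sum_{i<n} P(k|i) - sum_{i>n} P(k|i)| (Eq. S1)."""
    likelihood = [hypergeom_pmf(k, population, i, draws)
                  for i in range(population + 1)]
    total = sum(likelihood)
    best_n, best_gap = 0, float("inf")
    lower = 0.0  # running sum of P(k|i) for i < n
    for n in range(population + 1):
        upper = total - lower - likelihood[n]  # sum of P(k|i) for i > n
        gap = abs(lower - upper)
        if gap < best_gap:
            best_n, best_gap = n, gap
        lower += likelihood[n]
    return best_n


def estimate_probability(k, draws, population=1000):
    """Assumed final step: express the estimated count as a fraction."""
    return estimate_population_count(k, draws, population) / population
```

With k = 0 the estimate stays strictly positive, which is the behaviour the text asks for when a combination of property and substrate is absent from the known MTases. With k = 5 out of 10 draws the likelihood is symmetric around the middle of the population, so the sketch returns n = 500.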
(ii) In the present case, the only continuous properties considered were pI and time of expression onset. We observed that for both of these properties and for each substrate specificity, the domain can be divided into several intervals, within each of which the probability distribution function takes similar values. Therefore, we decided to model these probability distributions as a step function, smoothed to avoid rapid changes of probability between intervals. After some experimentation, we decided to use a high-order even exponential as a smoothing function, specifically $e^{-f(x)^{44}}$, where f(x) is a linear function of x. Thus, the whole probability distribution for a continuous variable is a linear combination of several exponential terms, each corresponding to a chosen interval, with weights derived from the average probability within that interval and with the function f determined by the beginning and end of that interval. Examples of the specific functions used are given in Fig. S3 and S4.
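A smoothed step function of this kind can be sketched as below. This is only an illustration under assumptions: the text states that f is linear and set by the ends of the interval, and the sketch takes the specific choice that maps an interval [start, end] onto [-1, 1]. The default order of 44, the function name, and the interval values in the usage note are likewise assumptions.

```python
from math import exp


def smoothed_step(x, intervals, order=44):
    """Sum of weight * exp(-f(x)**order) over (start, end, weight) intervals.

    f is assumed to be the linear map sending `start` to -1 and `end` to +1,
    so each term is close to `weight` inside its interval and close to 0
    outside it, with a smooth transition around the interval ends.
    """
    total = 0.0
    for start, end, weight in intervals:
        f = (2.0 * x - start - end) / (end - start)
        try:
            total += weight * exp(-(f ** order))
        except OverflowError:
            # f**order too large to represent: the term is effectively zero.
            pass
    return total
```

For two hypothetical pI intervals, `[(4.0, 7.0, 0.2), (7.0, 11.0, 0.6)]`, the function is flat at 0.2 and 0.6 well inside the intervals and passes continuously between those plateaus around x = 7. A lower even order would give a wider, softer transition.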
Threshold optimization
We optimized thresholds for each model by likelihood maximization using the Powell method [1]. The starting points for optimization were quantiles of the property values across all MTases, chosen to suit each property variant; for example, for pI divided into two intervals, the median was the starting point.
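The choice of starting points can be illustrated with a short sketch. It assumes that a property divided into m intervals is started from the m - 1 equally spaced quantiles of its values; the text confirms this only for the two-interval case (the median). The function name and the example values are hypothetical.

```python
from statistics import quantiles


def starting_thresholds(values, num_intervals):
    """Return num_intervals - 1 quantile cut points as initial thresholds."""
    if num_intervals < 2:
        return []  # a single interval has no threshold to optimize
    return quantiles(values, n=num_intervals)
```

For two intervals this reduces to the median of the values. The resulting list could then serve as the initial guess for a derivative-free optimizer, for example `scipy.optimize.minimize` with `method="Powell"`, applied to the negative log-likelihood.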
Tested sets of properties
We tested all combinations of the properties described in Table S4 except combinations including:
any two of pI, pI min and pI max,
both Localization and any of the Nucleus, Nucleolus, Mitochondrion or Other localization properties,
both Fold and any of the binary properties Rossmann-like, SET, SPOUT or Other fold,
both Expression cluster and any of the Ox, R/B, R/C or No cluster binary properties.
Versions of the same property (pI or time) divided into intervals differently were also not tested within the same model.
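The exclusion rules above can be expressed as a filter over property combinations. The sketch below is illustrative only: the property names are taken from the list above, but the full property list lives in Table S4 and is not reproduced here, and the rule about differently divided versions of pI or time is not modelled. The function and constant names are hypothetical.

```python
from itertools import combinations

# At most one of these may appear in a model.
PI_VARIANTS = {"pI", "pI min", "pI max"}

# A general property may not be combined with its own binary variants.
GENERAL_TO_BINARY = {
    "Localization": {"Nucleus", "Nucleolus", "Mitochondrion", "Other localization"},
    "Fold": {"Rossmann-like", "SET", "SPOUT", "Other fold"},
    "Expression cluster": {"Ox", "R/B", "R/C", "No cluster"},
}


def is_allowed(combo):
    """Return True if the combination violates none of the listed rules."""
    chosen = set(combo)
    if len(chosen & PI_VARIANTS) > 1:
        return False
    for general, binary in GENERAL_TO_BINARY.items():
        if general in chosen and chosen & binary:
            return False
    return True


def allowed_combinations(properties):
    """Yield every non-empty combination of `properties` that is allowed."""
    for size in range(1, len(properties) + 1):
        for combo in combinations(properties, size):
            if is_allowed(combo):
                yield combo
```

For example, `is_allowed(("Fold", "SET"))` is false, while `is_allowed(("SET", "SPOUT"))` is true because the rules only forbid mixing a general property with its binary variants.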
Akaike Information Criterion (AIC)
The Akaike Information Criterion [7] was designed to help achieve a balance between model accuracy and complexity. It is given by the equation:
$$\mathrm{AIC} = 2k - 2\ln(L), \qquad \text{(S2)}$$
where k is the number of parameters in the model and L is the maximized value of the likelihood function for the estimated model. The penalty term based on the number of parameters is added to counteract the effect of a more complex model always fitting the data better. The number of parameters of a model was calculated by summing the number of parameters from each property (Table S4) and adding 2 to account for prior probabilities (2 independent parameters are needed to describe the a priori probabilities of protein, RNA and other substrate specificity).
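The AIC calculation and the parameter count described here can be sketched in a few lines. The function names are hypothetical, and the per-property parameter counts in the usage note are invented placeholders, since Table S4 is not reproduced in this section.

```python
from math import log


def count_parameters(params_per_property, prior_params=2):
    """Sum the parameters of each property and add 2 for the priors."""
    return sum(params_per_property) + prior_params


def aic(num_params, max_likelihood):
    """AIC = 2k - 2 ln(L), with L the maximized likelihood."""
    return 2 * num_params - 2 * log(max_likelihood)


def aic_from_log_likelihood(num_params, log_likelihood):
    """Same criterion, taking ln(L) directly to avoid underflow of L."""
    return 2 * num_params - 2 * log_likelihood
```

For a model built from two properties with, say, 1 and 3 parameters, `count_parameters([1, 3])` gives k = 6. Working with ln(L) rather than L is usually preferable in practice, because the product of many small probabilities can underflow to zero.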
SUPPORTING INFORMATION REFERENCES
1. Powell MJD (1964) An efficient method for finding the minimum of a function of several variables without calculating derivatives. Computer Journal 7: 155-162.
2. Wlodarski T, Kutner J, Towpik J, Knizewski L, Rychlewski L, et al. (2011) Comprehensive structural and substrate specificity classification of the Saccharomyces cerevisiae methyltransferome. PLoS One 6: e23168.
3. Tu BP, Kudlicki A, Rowicka M, McKnight SL (2005) Logic of the yeast metabolic cycle: temporal compartmentalization of cellular processes. Science 310: 1152-1158.
4. Rowicka M, Kudlicki A, Tu BP, Otwinowski Z (2007) High-resolution timing of cell cycle-regulated gene expression. Proc Natl Acad Sci U S A 104: 16892-16897.
5. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25-29.
6. Lehninger A, Nelson DL, Cox MM (2000) Principles of Biochemistry. New York: Worth Publishers.
7. Akaike H (1974) A new look at the statistical model identification. System identification and time-series analysis. IEEE Trans Automatic Control AC-19: 716-723.