The authors have declared that no competing interests exist.
Conceived and designed the experiments: GS CDG AC DMC JPTH. Performed the experiments: GS CDG AC DMC JPTH. Analyzed the data: CDG AC. Contributed reagents/materials/analysis tools: GS CDG AC. Wrote the paper: GS CDG AC DMC JPTH.
Systematic reviews that collate data about the relative effects of multiple interventions via network meta-analysis are highly informative for decision-making purposes. A network meta-analysis provides two types of findings for a specific outcome: the relative treatment effect for all pairwise comparisons, and a ranking of the treatments. It is important to consider the confidence with which these two types of results can enable clinicians, policy makers and patients to make informed decisions. We propose an approach to determining confidence in the output of a network meta-analysis. Our proposed approach is based on methodology developed by the Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group for pairwise meta-analyses. The suggested framework for evaluating a network meta-analysis acknowledges (i) the key role of indirect comparisons; (ii) the contributions of each piece of direct evidence to the network meta-analysis estimates of effect size; (iii) the importance of the transitivity assumption to the validity of network meta-analysis; and (iv) the possibility of disagreement between direct evidence and indirect evidence. We apply our proposed strategy to a systematic review comparing topical antibiotics without steroids for chronically discharging ears with underlying eardrum perforations. The proposed framework can be used to determine confidence in the results from a network meta-analysis. Judgements about evidence from a network meta-analysis can be different from those made about evidence from pairwise meta-analyses.
A network meta-analysis produces inferences regarding the relative effectiveness or safety of multiple treatments
The GRADE approach leads to judgements about the confidence with which an estimate of treatment effect for a particular outcome can be believed, using four levels: high, moderate, low and very low. When the evidence arises from randomized trials – as is usually the case in network meta-analysis – the body of evidence is initially assigned a high quality rating. Then five components are considered: study limitations, inconsistency, indirectness, imprecision and publication bias. For each component, the quality of the evidence can be maintained or downgraded by up to two levels, subject to a maximum downgrade of three levels (to very low quality) across the five components.
Some authors have applied
In our proposal we draw a key distinction between two types of findings from network meta-analysis for a specific outcome: a) effect sizes for pairwise comparisons of treatments (such as odds ratios), and b) a ranking of the treatments. The pairwise effect sizes are estimated using all relevant evidence in the network of treatment comparisons, and may be reinterpreted to aid decision-making, for example using ‘assumed’ and ‘estimated’ risks of an event as in Summary of Findings tables
To illustrate the ideas presented in this paper, we use an example network of topical antibiotics without steroids for chronically discharging ears with underlying eardrum perforations
Edges are weighted according to the inverse of the variance of the direct summary ln(OR) (presented along the edges) and nodes are weighted according to the number of studies.
Comparison  No. studies  Direct evidence OR (95% CI)  Variance of ln(OR)  I^{2} (p-value)  τ^{2}
AB: Quinolone antibiotic vs no treatment  2  0.09 (0.01, 0.51)  0.83  69% (0.07)  1.22
AD: Antiseptic vs no treatment  1  1.42 (0.65, 3.09)  0.16  NE  NE
BC: Non-quinolone antibiotic vs quinolone antibiotic  7  1.46 (0.80, 2.67)  0.10  48% (0.07)  0.31
BD: Antiseptic vs quinolone antibiotic  5  3.47 (1.71, 7.07)  0.13  66% (0.02)  0.39
CD: Antiseptic vs non-quinolone antibiotic  4  1.69 (0.59, 4.83)  0.28  67% (0.03)  0.75
The data underlying a network meta-analysis, as in
A distinction between the two types of output (pairwise comparisons and overall ranking) is important when assessing our confidence in the evidence that they convey. Ranking measures involve inferences about the network of evidence as a whole, whereas pairwise effect sizes are derived from complex weighted averages of particular sources of direct and indirect evidence, with direct evidence usually contributing more weight. Consider a simple triangular network with high quality evidence for AB but low quality evidence for BC and AC. We might be able to award high confidence to the effect size AB, but only low confidence to the overall treatment ranking.
The aim of this paper is to make suggestions about how to evaluate these two types of output from a network meta-analysis. We consider each component of GRADE separately (study limitations, inconsistency, indirectness, imprecision and publication bias). Then we summarize across all five components to obtain a
Judgements about our confidence in an estimated effect size can be made through consideration of the quality of all pieces of evidence that contribute to it. For instance, confidence in the mixed evidence for AB in
Consider a simple triangular network ABC. We first summarize all AB, AC and BC studies separately to obtain the effect estimates from the direct evidence alone using standard meta-analysis. We denote these as
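For readers who want the arithmetic, a minimal sketch of how an indirect estimate is formed on the log odds ratio scale (the direct summaries below are illustrative numbers, not taken from the example network):

```python
import math

def indirect_estimate(lnor_ab, var_ab, lnor_ac, var_ac):
    """Indirect comparison of C vs B via the common comparator A:
    ln(OR_BC) = ln(OR_AC) - ln(OR_AB); the variances of the two
    independent direct summaries add."""
    return lnor_ac - lnor_ab, var_ab + var_ac

# Illustrative direct summaries (ln odds ratios and their variances).
lnor_bc, var_bc = indirect_estimate(lnor_ab=0.5, var_ab=0.04,
                                    lnor_ac=1.1, var_ac=0.09)
se_bc = math.sqrt(var_bc)
print(round(lnor_bc, 2), round(se_bc, 2))  # 0.6 0.36
```

Note that the indirect estimate is always less precise than either of the direct summaries it is built from, since the variances add.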
Consider our example in
Rows correspond to network meta-analysis ORs (separated for mixed and indirect evidence) and columns correspond to direct meta-analysis ORs. The contribution of each direct comparison to the total network evidence that provides the ranking of the treatments is presented separately (row named Entire network). The sizes of the boxes are proportional to the percentage contribution of each direct estimate to the network meta-analysis estimates (rows 1–6) and to the entire network (row 7). The last row shows the number of included direct comparisons. The names of the treatments are given in
Generally, the largest contribution to each network estimate is provided by the respective direct evidence, but when direct evidence is missing or is imprecise more information is obtained indirectly. These contributions may be interpreted as weights and should be taken into account when evaluating the quality of network evidence for each pairwise comparison.
We can estimate the importance of information from each direct estimate to the entire network as well as to each pairwise comparison. Using the methodology outlined in
To determine our confidence in each estimate of effect size from a network meta-analysis, we follow the standard GRADE approach but make some modifications to reflect specific issues in network meta-analysis. These include (i) the key role of indirect comparisons (suggesting a reconsideration of the ‘indirectness’ component of GRADE); (ii) the contributions of each piece of direct evidence to the network meta-analysis estimates of effect size; (iii) the importance of the transitivity assumption to the validity of network meta-analysis; and (iv) the possibility of inconsistency between direct evidence and indirect evidence.
In the GRADE approach, randomized trials are evaluated according to generation of the allocation sequence, concealment of the allocation sequence, blinding, incomplete accounting of participants and outcome events, selective outcome reporting bias and other limitations
Evaluate each piece of direct evidence in the network and classify it as low, moderate or high risk of bias according to the usual GRADE guidelines.
For each pairwise network estimate, consider the contribution of all direct estimates feeding into it. For a formal statistical approach to this, we recommend using the contributions matrix.
Illustrate the risk of bias assessments according to the contributions of each source of direct evidence to each network metaanalysis effect estimate, for example using a bar chart. We conventionally use green, yellow and red to represent low, moderate and high risk of bias.
For each pairwise comparison, integrate the risk of bias judgements and the respective contributions into a single judgement about study limitations and consider whether to downgrade the quality of the evidence. This can be done informally by interpreting the illustration in step (c). Alternatively, a highly quantitative approach would be to assign numerical scores to each risk of bias judgement (e.g. 0 for low, −1 for moderate and −2 for high risk of bias), and take a weighted average of these using the contribution of each direct estimate to the network estimates from the contributions matrix.
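The quantitative option just described can be sketched in a few lines; the contribution percentages and risk-of-bias scores below are illustrative, not those of the worked example:

```python
# Risk-of-bias scores per direct comparison (0 = low, -1 = moderate, -2 = high)
# and the illustrative share of each direct comparison's contribution to one
# network meta-analysis estimate, as read off a contributions matrix.
rob_score = {"AB": -1, "BD": 0, "CD": -2}
contribution = {"AB": 0.50, "BD": 0.30, "CD": 0.20}  # shares sum to 1

weighted_rob = sum(contribution[c] * rob_score[c] for c in rob_score)
print(weighted_rob)  # -0.9
```

A weighted score of -0.9 lies closer to 'moderate' (-1) than to 'low' (0) concern, which would support downgrading this estimate by one level for study limitations.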
In our example we have decided that the direct evidence for the AD, AB and BC comparisons has moderate risk of bias, direct evidence for the CD comparison has high risk of bias, and only the evidence for the BD comparison has low risk of bias.
Calculations are based on the contributions of direct evidence. The colours represent the risk of bias (green: low, yellow: moderate, red: high). The initial judgements about the risk of bias in the direct estimates are shown on the right side of the figure (there is no direct evidence for AC). The names of the treatments are given in
The standard GRADE guidance for indirectness considers (i) differences between the populations, treatments and outcomes in the studies to hand compared with the populations, treatments and outcomes targeted by the meta-analysis; and (ii) the use of indirect comparisons (5). The first set of issues is just as important when evaluating network meta-analysis estimates as when evaluating pairwise meta-analysis estimates. However, while we recognise the widespread concern about the validity of indirect comparisons, we argue against the idea of downgrading by default due to indirect evidence in the context of a network meta-analysis. Not only are indirect comparisons integral to the methodology of network meta-analysis, but under certain conditions indirect estimates can be just as reliable as direct comparisons (and in some probably uncommon situations even more reliable)
A key point of our proposal for considering indirectness is that we recommend that the populations, treatments and outcomes be examined specifically for differences across different sources of direct evidence. The transitivity assumption underlying indirect comparisons, and hence network meta-analysis, requires that the distribution of effect modifiers is similar for all sources of direct evidence. We therefore propose that face validity of the transitivity assumption be assessed as part of the consideration of indirectness. In essence, the directness of each contributing source of direct evidence must be consistently close to the research question for the network meta-analysis to provide high quality evidence.
We advocate empirical comparison of the distribution of important effect modifiers across the comparisons present in the network. Unfortunately these effect modifiers can be unknown, unobserved or unreported in studies, so that transitivity might be difficult to judge. When similarity of these effect modifiers cannot be assured, researchers should downgrade due to concerns over intransitivity. Clinical understanding of the context, and familiarity with the trials to be synthesized, are necessary to make inferences about transitivity, particularly in the absence of data on effect modifiers. Other, conceptual approaches (e.g. using directed acyclic graphs; see
Each pairwise meta-analysis in the network can be evaluated following standard GRADE. Subsequently, the contributions of direct evidence to each pairwise network meta-analysis estimate can be taken into account, considering the most influential direct comparisons. To summarize the judgments from direct evidence, steps analogous to (b) to (d) described in
The research question of this systematic review was to evaluate topical antibiotics (excluding steroids) for treating chronically discharging ears with underlying eardrum perforations. The authors of the review did not indicate that the studies were lacking relevance to their research question in terms of populations, treatments and outcomes. We are unable to undertake a detailed examination of the distribution of effect modifiers because most comparisons include few studies. As we lack convincing evidence for the plausibility of the transitivity assumption, we would recommend downgrading each pairwise comparison as well as the ranking of the treatments by one level.
In the usual GRADE approach, inconsistency refers to variability in the magnitude of effects across studies for a specific comparison that remains unexplained after accounting for important differences between subgroups. This variability is commonly known as heterogeneity. In the network meta-analysis context, the term inconsistency is frequently used specifically to refer to disagreement between direct and indirect evidence. For clarity, we will therefore use the term heterogeneity to describe disagreement between estimates within the same comparison, and the term inconsistency for disagreement between estimates coming from different sources (e.g. direct and indirect evidence, or different routes of indirect evidence). We regard the two notions as very closely connected; inconsistency can be viewed as the extension of heterogeneity across studies evaluating different comparisons. Both are statistical consequences of between-study differences in populations, treatments, outcomes and biases, and inconsistency often appears as large heterogeneity in models that ‘force’ sources of evidence to be consistent. Thus we suggest joint consideration of both notions here. Some technical considerations are necessary before presenting our proposal to consider downgrading for inconsistency.
In the majority of network meta-analysis applications, an assumption is made that every source of direct evidence has the same heterogeneity variance. That is, there is a single heterogeneity variance for the whole network, pertaining to every one of the direct comparisons. The assumption simplifies the analysis and allows for heterogeneity to be incorporated for direct comparisons with only one study. The assumption has implications for the estimation of effect sizes in the network meta-analysis, because any heterogeneity in one direct comparison gets propagated through the whole network. For example, a direct comparison that appears homogeneous when considered alone may have a non-zero heterogeneity variance imposed on it in the network meta-analysis if other evidence in the network displays heterogeneity. A full consideration of heterogeneity in network meta-analysis should therefore include both the magnitude of heterogeneity within each direct comparison
The magnitude of a heterogeneity variance (often denoted τ^{2}) can be difficult to interpret. For binary outcomes, we recommend referring to empirical distributions of heterogeneity values typically found in meta-analyses
The estimates from network metaanalyses are valid only under the assumption of transitivity
There are several statistical approaches to evaluate network inconsistency; for a review see Dias et al
Our proposed procedure for considering whether to downgrade a particular network meta-analysis effect estimate under the GRADE component of inconsistency is as follows.
Evaluate the extent of
If a common heterogeneity variance is being assumed, evaluate the impact of this variance on each network meta-analysis estimate by comparing the heterogeneity variance from the direct evidence in step (a) with the heterogeneity variance from the network meta-analysis.
Consider the magnitude of the heterogeneity estimate (or estimates) from the network meta-analysis for each effect size of interest. A particularly convenient way to do this is to look at predictive intervals for the effect in a new study of each comparison
Assess the involvement of each comparison in any
Make a judgment about downgrading for heterogeneity and/or inconsistency based on steps (b), (c) and (d) above. This might start by judging heterogeneity as described in steps (a) to (c); if important heterogeneity is found, the evidence might be downgraded by one or two levels according to standard GRADE guidance. If heterogeneity is moderate or low, consideration of inconsistency would proceed as in step (d). In case of moderate heterogeneity and inconsistency, the evidence might be downgraded by two levels. In case of low heterogeneity and inconsistency it might be downgraded by one level (or two levels if inconsistency is substantial). If neither inconsistency nor heterogeneity is found, no downgrading is needed.
Statistical evaluation of inconsistency has very little power in the presence of substantial heterogeneity, and hence step (d) is conditional on observing low or moderate heterogeneity in the network.
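The predictive-interval check in step (c) can be sketched numerically. Assuming a normal approximation (a t-quantile is often preferred when the number of studies is small), a 95% predictive interval for the effect in a future study adds the heterogeneity variance to the variance of the summary estimate; the numbers below are illustrative:

```python
import math

def predictive_interval_or(lnor, se, tau2, z=1.96):
    """Approximate 95% predictive interval on the OR scale:
    exp( lnor +/- z * sqrt(se^2 + tau2) )."""
    half = z * math.sqrt(se**2 + tau2)
    return math.exp(lnor - half), math.exp(lnor + half)

# Illustrative summary: OR = 0.8, standard error 0.2, tau^2 = 0.3.
lo, hi = predictive_interval_or(math.log(0.8), 0.2, 0.3)
# The interval spans OR = 1, so a new study could plausibly favour
# either treatment -- a signal of important heterogeneity.
```

With tau^2 set to zero the same function returns an ordinary confidence interval, which makes the widening effect of heterogeneity easy to see.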
Information about heterogeneity in the example network is reported in the last two columns of
In
The names of the treatments are given in
The comparisons included in the network form two closed loops of evidence. For neither was there statistically significant evidence of inconsistency (the discrepancy between direct and indirect evidence in the ABD loop is 1.56 on the log odds ratio scale, p = 0.10; the discrepancy in the BCD loop is 0.19, p = 0.81). However, power to detect important inconsistency is low and these results should not be interpreted as evidence of consistency; the point estimate for the ABD loop is very large, suggesting that a true OR of 1 might be estimated to be 4.76.
In summary, we might want to downgrade two network estimates, for AC and BD, based only on our observations about heterogeneity, and to consider downgrading evidence strongly influenced by studies involved in the ABD loop because of concerns over inconsistency in this loop (the point estimate might be considered large).
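The loop-specific check used above can be sketched as follows; the direct and indirect summaries in the call are illustrative values rather than those of the example:

```python
import math

def loop_inconsistency(d_direct, var_direct, d_indirect, var_indirect):
    """Inconsistency factor for a closed loop: absolute difference between
    the direct and indirect ln(OR) for the same comparison, with an
    approximate two-sided p-value from a normal (z) test."""
    diff = abs(d_direct - d_indirect)
    se = math.sqrt(var_direct + var_indirect)
    z = diff / se
    p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
    return diff, p

# Illustrative values: direct lnOR 0.3 (var 0.10), indirect lnOR -0.4 (var 0.15).
ifactor, p = loop_inconsistency(0.3, 0.10, -0.4, 0.15)
# ifactor = 0.7 with p around 0.16: the test alone cannot rule out
# inconsistency, so the magnitude of the difference matters as well.
```

As the text stresses, a non-significant p-value from such a test is weak reassurance; a large inconsistency factor can still warrant downgrading.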
In the standard GRADE guidance, imprecision is evaluated primarily by examination of 95% confidence intervals, and specifically whether these intervals exclude clinically relevant effect sizes. Rules of thumb are proposed for the consideration of appropriate sample sizes. In a network meta-analysis, we recommend focusing on the confidence intervals. Because of the complex contributions of different sources of evidence to network meta-analysis estimates of effect size, convenient rules of thumb for considering sample sizes are not currently available. Otherwise, we suggest that the same criteria are applied to network meta-analysis estimates to decide whether downgrading by one or two levels (if any) is necessary.
To evaluate imprecision in the network estimates we consider the ORs and their confidence intervals presented in
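As a trivial check, the criterion can be applied programmatically; the CD interval below is the one reported in the table of direct estimates:

```python
def crosses_null(ci_low, ci_high, null=1.0):
    """An OR is imprecise in the GRADE sense if its 95% CI includes the
    null value, i.e. values favouring either treatment."""
    return ci_low < null < ci_high

# CD (antiseptic vs non-quinolone antibiotic): OR 1.69, 95% CI (0.59, 4.83).
print(crosses_null(0.59, 4.83))  # True -> consider downgrading for imprecision
```

By the same check, the BD interval (1.71, 7.07) excludes the null, so imprecision alone would not trigger downgrading for that comparison.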
Even after a meticulous search for studies, publication bias can occur, and usually it tends to lead to overestimation of an active treatment’s effect compared with placebo or another reference treatment. Several approaches have been proposed to investigate the presence of publication bias, including funnel plots, regression methods and selection models, but each has limitations and their appropriateness is often debated. Making judgements about the presence of publication bias in a network meta-analysis is usually difficult. We suggest that for each observed pairwise comparison, judgements about the presence of publication bias are made using standard GRADE. We recommend that the primary considerations are non-statistical (by considering how likely it is that studies may have been performed but not published) and we advocate the use of contour-enhanced funnel plots, which may help in identifying publication bias as a likely explanation of funnel plot asymmetry
The first consideration is the completeness of the search. The original review employed a comprehensive search strategy and the authors report that they sought unpublished data. Pairwise comparisons include at most seven studies, so contour-enhanced funnel plots may not be very informative about the possibility of publication bias. We do not recommend downgrading because of publication bias in this particular example.
To determine our confidence in an overall treatment ranking from a network meta-analysis, we again follow the standard GRADE approach, making some modifications to reflect the same specific issues in network meta-analysis.
The main consideration for study limitations in a network meta-analysis as a whole is again to ensure that the relative contributions of different sources of direct evidence (which may have different study limitations) are accounted for appropriately. Our proposed procedure is as follows.
Evaluate each piece of direct evidence in the network and classify it as low, moderate or high risk of bias according to the usual GRADE guidelines
Illustrate the risk of bias assessments according to the contributions of each source of direct evidence to the network meta-analysis as a whole, for example using a pie chart. For a formal statistical approach to this, we recommend using the contributions matrix as described in section
Integrate the contributions and judgements of direct pieces of evidence into a single judgement about study limitations, and consider whether to downgrade the ranking evidence. A highly quantitative approach to this integration could also be employed.
The colours represent the risk of bias (green: low, yellow: moderate, red: high). The names of the treatments are given in
We propose that judgments are made across all studies and all comparisons, considering potential differences between the populations, treatments and outcomes in the studies to hand compared with the populations, treatments and outcomes targeted by the network meta-analysis. This should include particular consideration of whether there are differences between studies making different comparisons, since such differences may invalidate transitivity assumptions made across the network. If some pieces of evidence only indirectly address the research question, then the quality of any treatment ranking is likely to be affected and we would consider downgrading for indirectness. Again, it would be possible to use the contributions matrix to describe the precise contribution of each direct estimate. Note however that it is possible for all of the evidence to be indirectly relevant to the research question but for it still to provide good evidence for a treatment ranking within a particular context, for example if all studies are in a particular subpopulation (e.g. men) of a wider population of interest.
In the absence of evidence for an uneven distribution of effect modifiers, we decide that no downgrading is necessary for any of the direct comparisons, and consequently no downgrading of confidence for reasons of indirectness should take place for the overall ranking.
To assess inconsistency in the network as a whole, we again need to consider heterogeneity and network inconsistency. For the latter we suggest the implementation of statistical methods that evaluate the assumption of consistency in the entire network (e.g. comparisons of model fit, design-by-treatment global test, see
Evaluate the extent of heterogeneity in the network. This is straightforward if a common heterogeneity variance is assumed. For dichotomous outcomes, we can refer to the empirical distribution of heterogeneity, as in section
Evaluate inconsistency in the network as a whole, for instance using statistical methods that provide a single inference about the plausibility of assuming consistency throughout the network. The power of such global tests of inconsistency may be expected to be higher than that of local tests. However, power can still be low, and interpretation of the test result requires the usual caution. An alternative to a test is to estimate a global inconsistency parameter, such as the variance of the differences between direct and indirect evidence as described by Lu and Ades
Consider downgrading the confidence in the ranking by one or two levels depending on the presence and magnitude of heterogeneity and/or network inconsistency from steps (a) and (b). Network inconsistency is considerably more important than heterogeneity in assessing confidence in treatment rankings, because the ranks are based primarily on mean effects and so heterogeneity of effects around this mean may be less important.
To consider whether to downgrade confidence in the network as a whole due to inconsistency, we need to consider the network heterogeneity parameter and the presence of network inconsistency. A common heterogeneity variance was assumed in the analysis, with an estimated value that suggests the presence of moderate to low heterogeneity. The design-by-treatment interaction inconsistency model
Imprecision in a ranking of treatments can be understood as uncertainty in the relative order of the treatments for the specific outcome. The ranking of treatments is often estimated by calculating ranking probabilities, with rankograms used to present the probability that each treatment achieves a particular rank
Rank  Network ranking with high imprecision  Network ranking with high precision
Best  28  24  24  24  97  1  1  1 
Second  24  28  24  24  1  97  1  1 
Third  24  24  28  24  1  1  97  1 
Last  24  24  24  28  1  1  1  97 
The rankograms for the example, illustrated in
On the horizontal axes are the possible ranks and on the vertical axis the probability that each treatment achieves each rank.
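Ranking probabilities of this kind can be approximated by resampling the estimated treatment effects. The following minimal sketch uses purely illustrative means and standard errors (not the example's estimates), treating a lower ln(OR) as better:

```python
import random

random.seed(1)
# Illustrative effects vs reference A: (mean ln(OR), standard error).
effects = {"A": (0.0, 0.0), "B": (-1.5, 0.4), "C": (-1.1, 0.5), "D": (0.3, 0.4)}
n_sim = 10000
counts = {t: [0] * len(effects) for t in effects}

for _ in range(n_sim):
    # Draw one plausible set of effects, then record each treatment's rank.
    draw = {t: random.gauss(m, s) for t, (m, s) in effects.items()}
    for rank, t in enumerate(sorted(draw, key=draw.get)):  # lower = better
        counts[t][rank] += 1

# Rankogram: probability that each treatment achieves each rank.
rankogram = {t: [c / n_sim for c in counts[t]] for t in counts}
```

With well-separated effects the probability mass concentrates on one rank per treatment (high precision); with overlapping effects it spreads across ranks, which is the situation the table labels high imprecision.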
Judgments about the potential impact of publication bias in the ranking of the treatments require, as before, consideration of the comprehensiveness of the search for studies and the likelihood that studies may have been conducted and not published. A statistical approach to detecting bias is offered in certain situations by the
The comparison-adjusted funnel plot in
Each observation is the difference between a study estimate and its direct metaanalysis mean effect. Studies on the right hand side ‘overestimate’ the effect of newer treatments.
We have proposed a strategy for considering the confidence of results from a network meta-analysis, building on ideas developed by the GRADE Working Group. At the heart of our proposal is the separation of an assessment for each pairwise estimate of treatment effect and for a ranking of treatments across the whole network. Both outputs are important and we summarize our suggested strategies in
Evaluate the confidence in a specific pairwise effect estimated in network meta-analysis
GRADE domain  Domain assessment in NMA  Description of procedure  Instructions for downgrading 
Study limitations  Determine which direct comparisons contribute to estimation of the NMA treatment effect.  Use standard GRADE considerations to inform judgment.
Joint consideration of indirectness and intransitivity  Evaluate indirectness of populations, interventions and outcomes as in standard GRADE. Evaluate transitivity by comparing the distribution of known effect modifiers across comparisons that contribute evidence to estimation of the NMA treatment effect.  If
Joint consideration of statistical heterogeneity and statistical inconsistency  (a) Judge the extent of heterogeneity, considering the comparison-specific heterogeneity variance, the NMA estimate of variance, a prediction interval and/or other relevant metrics such as I^{2}. (b) Evaluate the extent to which the comparison under evaluation is involved in inconsistent loops of evidence.  (a) If important heterogeneity is found, downgrade. If heterogeneity is low do not downgrade. (b) Power to detect inconsistency may be low; downgrade in the absence of statistical evidence for inconsistency when direct and indirect estimates imply different clinical decisions.
Imprecision  Focus on the width of the confidence interval.  Assess uncertainty around the pairwise estimate. Downgrade if the confidence interval crosses the null value or includes values favouring either treatment.
Publication bias  Non-statistical consideration of the likelihood of non-publication of evidence that would inform the pairwise comparison. Plot pairwise estimates on a contour-enhanced funnel plot.  Use standard GRADE to inform judgment.
Evaluate the confidence in the treatment ranking from a network meta-analysis
Study limitations  Integrate risk of bias assessments from each direct comparison to formulate a
Use standard GRADE considerations to inform judgment.
Joint consideration of indirectness and intransitivity  Evaluate indirectness of populations, interventions and outcomes as in standard GRADE. Evaluate transitivity across the network by comparing the distribution of known effect modifiers across comparisons.  If
Joint consideration of statistical heterogeneity and statistical inconsistency  (a) Judge the extent of heterogeneity, considering primarily the NMA variance estimate(s) used and other network-wise metrics such as Q for heterogeneity in a network. (b) Evaluate inconsistency in the network using statistical methods (such as global tests of inconsistency, or a global inconsistency parameter).  (a) If important heterogeneity is found, downgrade. If heterogeneity is low do not downgrade. (b) For overall treatment rankings, inconsistency should be given greater emphasis, since ranks are based on mean effects and the uncertainty with which they are estimated. Downgrade in the absence of statistical evidence for inconsistency when several direct and indirect estimates imply different clinical decisions.
Imprecision  Visually examine ranking probabilities (e.g. rankograms) for overlap to assess the precision of treatment rankings.  If probabilities are similarly distributed across the ranks, downgrade for imprecision.
Publication bias  Non-statistical consideration of the likelihood of non-publication for each pairwise comparison. If appropriate, plot NMA estimates on a comparison-adjusted funnel plot and assess asymmetry.  As asymmetry does not provide concrete evidence of publication bias, downgrading should only be considered jointly with the non
When integrating assessments about direct comparisons into a judgement about an NMA treatment effect or the ranking, more weight should be given to assessments from direct comparisons that contribute more information. We recommend use of the contributions matrix to quantify how much information each direct comparison contributes to the estimation of the NMA treatment effect under evaluation or the ranking.
On application of our ideas to an example network of antibiotics for discharging ears, we found the suggestions to be workable, but subjective. Some of the subjectivity can be alleviated by taking a highly quantitative approach to considering the contributions of each piece of direct evidence, and weighting standard GRADE assessments for direct (pairwise) comparisons according to the influence they have on network meta-analysis estimates. There are advantages and disadvantages to the quantitative approach: it can be systematically applied, and it is transparent and replicable, but it can be misinterpreted or overinterpreted. Furthermore, the quantitative measures of the contributions of each piece of direct evidence are only approximate when Bayesian methods are used for the network meta-analysis.
We have discussed each of the five GRADE domains and suggested possible strategies that can be used to form a judgement for each domain separately. Decisions about downgrading by one or two levels for a specific GRADE component relate to the degree to which it compromises the summary estimate and the ranking. For instance, important inconsistency in the network can prompt investigators to downgrade the evidence by two levels. There is no unanimously agreed definition of what constitutes ‘important’ inconsistency and, while tests and measures can be used to facilitate judgement, the potential to bias the summary estimate should be the primary consideration.
Comparison  Nature of the evidence  Confidence  Downgrading due to
AB: Quinolone antibiotic vs no treatment  Mixed  Low  Study limitations
AC: Non-quinolone antibiotic vs no treatment  Indirect  Low  Study limitations
AD: Antiseptic vs no treatment  Mixed  Very low  Study limitations
BC: Non-quinolone antibiotic vs quinolone antibiotic  Mixed  Very low  Study limitations
BD: Antiseptic vs quinolone antibiotic  Mixed  Moderate  Inconsistency
CD: Antiseptic vs non-quinolone antibiotic  Mixed  Very low  Study limitations
Dominated by evidence at high or moderate risk of bias.
No convincing evidence for the plausibility of the transitivity assumption.
Predictive intervals for treatment effect include effects that would have different interpretations (there is additionally no convincing evidence for the plausibility of the transitivity assumption).
Confidence intervals include values favouring either treatment.
60% of the information is from studies at moderate risk of bias.
Moderate level of heterogeneity, and some evidence of inconsistency in the network.
None of the effect estimates was accompanied by high confidence, one had moderate confidence, three low confidence and one very low confidence. Notably, the one comparison for which there was no direct evidence was given low confidence, while one comparison that had been investigated in four studies was given very low confidence. Our confidence in the ranking of the four treatments is low, due to downgrading for study limitations and for inconsistency.
We have provided tables and figures that offer some possibilities for presenting GRADE assessments and the information that informs them.
Grading the evidence from a network meta-analysis assumes that the analysis is technically adequate. The assumption of transitivity is key to a network meta-analysis, and assessment of this assumption within the indirectness component of the GRADE framework is critical. Some degree of inconsistency might be present in the data, and appropriate statistical methods should be employed to detect it. Investigators should refrain from network meta-analysis in the presence of important inconsistency. To account for small or moderate disagreement between the sources of evidence, methods that encompass inconsistency should be employed to estimate effect sizes and rankings. However, particular care is needed when interpreting the results from such models.