The authors have declared that no competing interests exist.
Usually, the launch of the diffusion process is triggered by a few early adopters, i.e., seeds of diffusion. Many studies have assumed that all seeds are activated at once to initiate the diffusion process in social networks and have therefore focused on finding optimal ways of choosing these nodes under a limited budget. Despite the advances in identifying influential spreaders, the strategy of activating all seeds at the beginning might not be sufficient for accelerating diffusion and maximising its coverage. Also, it does not capture real scenarios in which marketing campaigns continuously monitor and support the diffusion process by seeding more nodes. More recent studies investigate the possibility of activating additional seeds as the diffusion process goes forward. In this work, we further examine this approach and search for optimal ways of distributing seeds during the diffusion process according to a preallocated seeding budget. Theoretically, we show that a universally best solution does not exist, and we prove that finding an optimal distribution of supporting seeds over time for a particular network is an NP-hard problem. Numerically, we evaluate several seeding strategies on different networks with regard to maximising the coverage and minimising the spreading time. We find that each network topology has a best strategy given some spreading parameters. Our findings can be crucial in identifying the best strategies for budget allocation in different scenarios such as marketing or political campaigns.
The increasing number of people who use social media has presented a new channel for marketing campaigns. Viral marketing targets potentially influential nodes in social networks to spread the word about certain products or services. However, the optimisation problem of selecting influential nodes as seeds is NP-hard [
Despite the advances in the search for a set of influential nodes in a network, the diffusion itself is usually treated as a purely stochastic process once it has been triggered by the selected influential nodes. This assumes that all seeds (influential nodes) are activated at the beginning, that diffusion then takes place, and that marketing campaigns never interfere with it. In many real cases, however, marketing campaigns actively monitor the diffusion process and intervene in it by activating more seeds to ensure the continued adoption of their products or services. Recently, this scenario has been addressed in many studies. Researchers attempted to use the extra knowledge about ongoing information-spreading processes [
Following the increasing attention directed toward continuous seeding, new questions arise about when additional seeds should be used and how many of them should be distributed over time, assuming that the campaign must finish within a given period. These questions are very important for marketing campaigns, since seeding budgets are allocated in advance. Hence, a decision maker should know when to spend the budget and how to activate seeds effectively and in a timely manner so as to increase the coverage. In this study, we investigate how the distribution of additional seeds affects the coverage of the network. We address this problem theoretically and numerically to search for optimal distributions of seeds over time in different network topologies. First, we theoretically prove that optimising the distribution of seeds over time is NP-hard and that a universally best solution does not exist. Given this finding, we look for approximate solutions by examining a number of heuristics on different network topologies, hoping to find the best distribution for each class of networks. We use four well-known distributions as heuristics: (1) linear; (2) Gaussian; (3) ascending geometric; and (4) descending geometric distributions. We benchmark our heuristics against the standard seeding strategy that activates all the seeds at the beginning of the process. To simulate the spread of information, we use the independent cascade model [
We begin with a simple illustrative example based on a single primary seed that triggers the spread of information and a single supporting seed that is added after the spread takes place in a network with 43 nodes, presented in
The figure presents four different variants of the diffusion process consisting of four stages, with propagation probability
Note that while the propagation probability equal to one makes the example presented in
The example also illustrates the importance of our assumption that the diffusion process has to end after a certain period. Given the chance, the diffusion process presented in
We now move to the formal definition of the problem and investigate some of its theoretical properties.
We now present basic network notations, formally define the main problem of this study and analyse some of its theoretical properties.
Let
First, we formally define the problem of finding an optimal way of supporting seeding.
In other words, we want to investigate how one should distribute supporting seeds throughout the entire process in order to achieve maximal network coverage. It might seem that at least for simple strategies of choosing the seeds, such as selecting the node with highest degree, there should exist simple observations, such as:
The proofs of all theorems can be found in the supporting materials.
Mainly, the theorem shows that in a general case it is not possible to find
Even when focusing on a particular network, it might be computationally infeasible to find the optimal distribution of supporting seeds, as we show in the following theorem.
In other words, there exist network structures for which no polynomial-time algorithm finds the optimal distribution of supporting seeds (unless P = NP). However, one might hope that such structures rarely occur in real-life networks. To shed some light on the practical aspects of utilizing supporting seeding, we perform experiments on a number of real-life network datasets. The considered strategies of distributing supporting seeds and a detailed description of the experimental setup are presented in the following sections.
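To give a sense of why exhaustive search over distributions is unattractive in practice, the following sketch counts the ways of splitting a budget of identical supporting seeds across ordered seeding stages. It assumes that a distribution is fully described by the number of seeds assigned to each stage; the function names are ours and purely illustrative. The count follows the standard stars-and-bars formula and is not part of the hardness proof.

```python
from itertools import product
from math import comb


def count_distributions(seeds, stages):
    """Number of ways to split `seeds` identical seeds over `stages` ordered stages."""
    # Stars and bars: choose positions of (stages - 1) separators.
    return comb(seeds + stages - 1, stages - 1)


def enumerate_distributions(seeds, stages):
    """Brute-force listing; only feasible for tiny inputs (used as a cross-check)."""
    return [c for c in product(range(seeds + 1), repeat=stages) if sum(c) == seeds]


if __name__ == "__main__":
    # The closed formula and the brute-force enumeration agree on a small case.
    print(count_distributions(5, 3), len(enumerate_distributions(5, 3)))
```

Each candidate distribution would additionally have to be evaluated by repeated stochastic simulation, so the cost of a full search grows quickly with both the budget and the number of stages.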
We now describe four strategies of distributing supporting seeds among supporting seeding stages considered in our experiments, namely linear distribution, geometric ascending distribution, geometric descending distribution and Gaussian distribution. Scenarios utilizing these distributions are presented in
Plots illustrate the coverage of the network increasing during the seeding process, while insets illustrate the number of supporting seeds used in each supporting seeding stage. Each arrow represents a single seeding stage, with the red arrow representing primary seeding and the other arrows representing supporting seeding. The size of each arrow corresponds to the number of seeds used in the particular seeding stage. Panel (A) presents the linear distribution of supporting seeds, with the same number of supporting seeds used in each stage. Panel (B) presents the ascending geometric distribution of supporting seeds, with a low number of seeds used at the beginning of the process and high usage in later stages. Panel (C) presents the descending geometric distribution of supporting seeds, with a high number of supporting seeds used at the beginning and a significant drop in the later stages. Panel (D) presents the Gaussian distribution of supporting seeds, with maximal intensity in the middle of the process.
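The four shapes described in the panels can be sketched as an allocation of a fixed seed budget over a fixed number of stages. The code below is only an illustration under our own assumptions: the geometric ratio of 2, the Gaussian centred on the middle stage with a spread of one sixth of the number of stages, and largest-remainder rounding are all hypothetical choices, not parameters taken from the experiments.

```python
import math


def stage_weights(kind, stages, ratio=2.0):
    """Unnormalised weight of each stage for one of the four shapes (assumed forms)."""
    if kind == "LN":  # linear: the same share in every stage
        return [1.0] * stages
    if kind == "GA":  # ascending geometric: few seeds early, many late
        return [ratio ** t for t in range(stages)]
    if kind == "GD":  # descending geometric: many seeds early, few late
        return [ratio ** (stages - 1 - t) for t in range(stages)]
    if kind == "GS":  # Gaussian: peak in the middle of the process
        mu = (stages - 1) / 2
        sigma = stages / 6  # assumed spread
        return [math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) for t in range(stages)]
    raise ValueError(f"unknown distribution: {kind}")


def allocate(kind, total_seeds, stages, ratio=2.0):
    """Split `total_seeds` over `stages` so that the counts sum exactly to the budget."""
    weights = stage_weights(kind, stages, ratio)
    scale = sum(weights)
    exact = [total_seeds * w / scale for w in weights]
    counts = [math.floor(x) for x in exact]
    leftover = total_seeds - sum(counts)
    # Largest-remainder rounding; ties go to the earlier stage.
    order = sorted(range(stages), key=lambda t: (-(exact[t] - counts[t]), t))
    for t in order[:leftover]:
        counts[t] += 1
    return counts


if __name__ == "__main__":
    for kind in ("LN", "GA", "GD", "GS"):
        print(kind, allocate(kind, 20, 5))
```

Whatever the shape, the counts always sum to the preallocated budget, so the strategies differ only in when the seeds are spent.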
We can see that different distributions of supporting seeds over time can result in varying total coverage of the network in
The second distribution is an ascending geometric distribution of supporting seeds (GA) that is presented in
The third distribution is a descending geometric distribution of supporting seeds (GD) that is presented in
The results of using the fourth distribution are shown in
We expect that the way of distributing the supporting seeds over time can affect the total coverage of the network in two different ways. The usage of supporting seeds at the beginning of the process may lead to selecting as supporting seeds nodes with high potential of being activated by the natural process,
Next, we move to describing the setup of our experiments in which we compare the coverage of the network using different types of distributions.
We now describe the parameters of our experimental setting.
We run our experiments on nine different real-life networks (parameter N), such as scientific collaboration networks. We use the following networks: N1: Condensed matter collaborations 1999 [
N  Nodes  Edges  DG (avg)  CL (avg)  PR (avg)  EV (avg)  CC (avg)  BT (avg)  Ref
N1  16726  47594  2.85  0.00036  0.00006  0.0054  0.6380  33238.58  [ 
N2  8361  15751  1.88  0.00046  0.00013  0.0032  0.4856  13478.94  [ 
N3  4941  6594  1.33  0.05368  0.00020  0.0048  0.0801  44433.29  [ 
N4  1589  2742  1.73  0.00075  0.00068  0.0137  0.6937  251.35  [ 
N5  5242  14496  3.30  0.15145  0.00016  0.0225  0.0109  11468.14  [ 
N6  12591  49743  3.95  0.00989  0.00008  0.0095  0.1166  21217.52  [ 
N7  6474  13895  2.15  0.27657  0.00015  0.0103  0.2522  8754.74  [ 
N8  1706  6207  3.64  0.01038  0.00059  0.0335  0.0012  2946.64  [ 
N9  3133  6726  2.15  0.00379  0.00033  0.0192  0.0658  4917.14  [ 
To model the flow of information in the network we use the Independent Cascade model with propagation probability (parameter PP) between 0.05 and 1.00. The number of nodes selected as primary seeds is determined by parameter PS, and equals
Symbol  Parameter  Distinct values  Values 

N  Network  9  Networks N1–N9 presented in 
PP  Propagation probability  20  0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0 
PS  Percentage of nodes of the network used as primary seeds  1  1% 
SS  Number of supporting seeds as percentage of number of primary seeds  5  100%, 200%, 300%, 400%, 500% 
SR  Ranking method used for seed selection  4  DG—degree, PR—PageRank, EV—eigenvector, BT—betweenness 
SD  Supporting seeds distribution  4  LN—linear, GA—ascending geometric distribution, GD—descending geometric distribution, GS—Gaussian distribution 
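The size of the experimental space follows directly from the table: 9 × 20 × 1 × 5 × 4 × 4 = 14,400 configurations. A minimal sketch of enumerating it is given below; the variable names and the encoding of each parameter value as a string or number are our own illustrative choices.

```python
from itertools import product

# Parameter values as listed in the table above.
N = [f"N{i}" for i in range(1, 10)]              # networks N1..N9
PP = [round(0.05 * i, 2) for i in range(1, 21)]  # 0.05, 0.10, ..., 1.00
PS = [0.01]                                      # primary seeds: 1% of nodes
SS = [1.0, 2.0, 3.0, 4.0, 5.0]                   # supporting seeds: 100%..500% of primary seeds
SR = ["DG", "PR", "EV", "BT"]                    # seed ranking method
SD = ["LN", "GA", "GD", "GS"]                    # supporting seeds distribution

# Cartesian product N x PP x SD x PS x SS x SR.
configurations = list(product(N, PP, SD, PS, SS, SR))

if __name__ == "__main__":
    print(len(configurations))
```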
With the parameters described above, the experimental space N x PP x SD x PS x SS x SR contains 14,400 configurations. In order to obtain the number of activation stages determining the number of supporting seeding stages
We now present the results of our experiments. For each combination of parameters in the experimental space N x PP x SD x PS x SS x SR described in the previous subsection we perform 100 simulation runs and present average results of coverage. The main goal of the simulations is to compare the performance of supporting seeding using the suggested distributions and quantify the impact of each predefined parameter of the experimental space on the final coverage.
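A single simulation run can be sketched as follows. This is a simplified illustration rather than the code used in the experiments: it assumes an undirected graph stored as an adjacency dictionary, degree-based ranking, supporting seeds added after the natural spreading step of each stage, and a process that stops once the schedule is exhausted. The function names and the `schedule` argument (number of supporting seeds per stage) are hypothetical.

```python
import random


def degree_ranking(graph):
    """Nodes sorted by decreasing degree (ties broken by node id)."""
    return sorted(graph, key=lambda v: (-len(graph[v]), v))


def run_cascade(graph, n_primary, schedule, pp, rng):
    """One independent cascade run with supporting seeding; returns coverage in [0, 1]."""
    ranking = degree_ranking(graph)
    primary = ranking[:n_primary]
    active = set(primary)
    frontier = list(primary)  # nodes activated in the previous stage
    for extra in schedule:
        newly = []
        # Natural spreading: each freshly active node gets one attempt per neighbour.
        for u in frontier:
            for v in graph[u]:
                if v not in active and rng.random() < pp:
                    active.add(v)
                    newly.append(v)
        # Supporting seeding: top-ranked nodes that are still inactive.
        support = [v for v in ranking if v not in active][:extra]
        active.update(support)
        frontier = newly + support
    return len(active) / len(graph)


def average_coverage(graph, n_primary, schedule, pp, runs=100, seed=0):
    """Average coverage over repeated runs (100 by default, as in the text)."""
    rng = random.Random(seed)
    total = sum(run_cascade(graph, n_primary, schedule, pp, rng) for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    # Toy path graph 0-1-2-3-4-5-6, one primary seed, one supporting seed in stage 2.
    path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
    print(average_coverage(path, 1, [0, 1], 0.5))
```

Because supporting seeds are drawn only from nodes that are still inactive, a seed spent late is never wasted on an already active node, but it also has fewer remaining stages in which to start a new cascade.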
We start by comparing the aggregate performance of supporting seeding and primary seeding. The results are sorted by increasing coverage of primary seeding.
(A) Total coverage of a network in experiments with only primary seeding and with supporting seeding presented for every combination of experiment parameters. All averaged results from all runs for all 14400 simulation cases are presented for all possible combinations of parameters listed in the
Now, we discuss the performance of all our strategies of distributing supporting seeds over time.
To further compare the performance of different distributions of supporting seeds,
To determine whether the differences between the distributions and the linear distribution are statistically significant, we use the Wilcoxon group test and the Hodges–Lehmann estimator. The linear support compared with ascending geometric support shows H = 1.0748 with statistical significance (p-value < 2.2
We now analyse the average number of additionally activated nodes per supporting seed. For linear support this value is equal to 3.01,
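For readers unfamiliar with the Hodges–Lehmann estimator, the sketch below shows one common form of it. We assume here the paired variant associated with the Wilcoxon signed-rank test, in which the estimate is the median of all pairwise (Walsh) averages of the differences; whether this is the exact variant used in the analysis is an assumption on our part, and the input values in the example are made up for illustration.

```python
from statistics import median


def hodges_lehmann_paired(x, y):
    """Median of the Walsh averages of the paired differences x[i] - y[i]."""
    diffs = [a - b for a, b in zip(x, y)]
    walsh = [
        (diffs[i] + diffs[j]) / 2
        for i in range(len(diffs))
        for j in range(i, len(diffs))  # includes i == j
    ]
    return median(walsh)


if __name__ == "__main__":
    # Hypothetical coverage values for two strategies under matched configurations.
    linear = [3.0, 5.0, 8.0]
    ascending = [1.0, 2.0, 4.0]
    print(hodges_lehmann_paired(linear, ascending))
```

The estimator gives a robust measure of the typical shift between two strategies, which complements the p-value of the rank test.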
So far we presented aggregated results of various simulations. Next, we present results of using different distributions of seeds over time in one type of networks.
(A) Coverage. (B) Average linear distribution of supporting seeds. (C) Average ascending geometric distribution of supporting seeds. (D) Average descending geometric distribution of supporting seeds. (E) Average Gaussian distribution of supporting seeds.
We now analyse how the average performance of supporting seeding varies for different values of the propagation probability (parameter PP), the key characteristic of the information-spreading process in the experiments.
(A) Total coverage in supported and not supported processes for different distributions of supporting seeds as a function of propagation probability. (B) Total coverage for different support intensities as a function of propagation probability. (C) Average results for all networks in experiments with and without supporting seeding.
Finally, we found that the performance of supporting seeding is highly dependent on the characteristics of the network itself. The average coverage for each considered network in experiments with and without supporting seeding is presented in
Detailed numerical average results for all combinations of parameters are presented in the
Each result indicates the size of coverage in the process with supporting seeding, in comparison to the process with only primary seeding.
Parameter  Value  Avg Linear  Avg GeomA  Avg GeomD  Avg Gaussian  p-value Linear  p-value GeomA  p-value GeomD  p-value Gaussian
PP  0.05  2.20  1.81  1.83  2.39  2.00  3.08  1.98  < 2
PP  0.10  1.97  1.66  1.67  2.09  3.96  1.82  1.25  9.45
PP  0.15  1.76  1.56  1.56  1.83  1.25  1.96  2.13  1.65
PP  0.20  1.61  1.44  1.44  1.66  5.71  2.68  3.07  1.51
PP  0.25  1.58  1.43  1.43  1.62  1.10  3.74  3.73  4.66
PP  0.30  1.59  1.45  1.45  1.62  9.74  2.50  2.40  4.73
PP  0.35  1.61  1.48  1.48  1.63  6.11  1.31  1.14  3.49
PP  0.40  1.62  1.50  1.51  1.64  4.92  8.07  6.83  2.93
PP  0.45  1.66  1.56  1.57  1.70  1.50  2.12  1.66  5.82
PP  0.50  1.67  1.58  1.58  1.70  1.29  1.32  1.07  5.45
PP  0.55  1.67  1.60  1.60  1.70  1.20  7.36  7.86  5.52
PP  0.60  1.68  1.62  1.61  1.70  1.13  4.86  5.97  5.37
PP  0.65  1.72  1.64  1.64  1.73  4.08  2.75  2.58  2.89
PP  0.70  2.12  1.85  2.07  2.05  3.38  9.88  1.91  2.79
PP  0.75  2.17  2.02  2.12  2.14  < 2  8.32  3.13  < 2
PP  0.80  2.35  2.16  2.28  2.28  < 2  < 2  < 2  < 2
PP  0.85  2.55  2.43  2.49  2.56  < 2  < 2  < 2  < 2
PP  0.90  2.52  2.52  2.49  2.53  < 2  < 2  < 2  < 2
PP  0.95  2.48  2.47  2.45  2.48  < 2  < 2  < 2  < 2
PP  1.00  2.34  2.33  2.33  2.35  < 2  < 2  < 2  < 2
N  N1  1.26  1.18  1.14  1.29  4.78  2.45  1.10  1.25
N  N2  1.38  1.26  1.22  1.43  < 2  4.28  2.73  < 2
N  N3  4.69  4.01  4.32  4.70  < 2  < 2  < 2  < 2
N  N4  4.54  4.35  4.41  4.59  < 2  < 2  < 2  < 2
N  N5  1.12  1.08  1.07  1.13  3.72  2.10  3.40  1.73
N  N6  1.06  1.04  1.02  1.07  4.98  9.69  1.92  2.49
N  N7  1.09  1.06  1.06  1.11  1.05  3.71  4.88  4.38
N  N8  1.17  1.13  1.10  1.20  3.01  4.00  6.81  7.09
N  N9  1.18  1.13  1.11  1.21  2.30  2.56  6.12  5.33
SR  DG  1.33  1.24  1.21  1.37  < 2  < 2  < 2  < 2
SR  PR  1.46  1.33  1.33  1.51  < 2  < 2  < 2  < 2
SR  EV  3.59  3.38  3.52  3.58  < 2  < 2  < 2  < 2
SR  BT  1.39  1.27  1.26  1.43  < 2  < 2  < 2  < 2
SS  100%  1.45  1.40  1.42  1.47  < 2  < 2  < 2  < 2
SS  200%  1.87  1.77  1.79  1.89  < 2  < 2  < 2  < 2
SS  300%  2.00  1.86  1.88  2.03  < 2  < 2  < 2  < 2
SS  400%  2.11  1.93  1.98  2.15  < 2  < 2  < 2  < 2
SS  500%  2.27  2.06  2.08  2.32  < 2  < 2  < 2  < 2
In general, supporting seeding delivers the best relative increase in coverage in experiments with the lowest propagation probability, PP = 0.05. For PP = 0.05, the Gaussian support delivers 8.62% better results than the linear support; this advantage drops to 2.25% for PP = 0.25. Compared with the descending and ascending geometric distributions, the Gaussian approach showed its highest increase in performance for PP = 0.10, with a 30.53% increase over the descending and a 31.69% increase over the ascending geometric distribution. The lowest increase was observed for PP = 0.05, with a 7.74% and a 9.24% increase respectively. In terms of the networks used, supporting seeding delivered the best results for networks N3 and N4, with a more than fourfold increase for all used distributions when compared to the information-spreading process without any support. The lowest increase was observed for network N6 for all used distributions of supporting seeds. Performance was also related to the seed selection strategy. The best performance of supporting seeding was observed when the selection was based on eigenvector centrality, where coverage was more than 3 times higher than in the information-spreading process without support. With support equal in size to the initial seeding, performance was similar for all used strategies. The highest differences were observed when the support was five times larger than the initial seeding: Gaussian support delivered the highest increase, 2.32 times, compared with 2.27 times for linear support.
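The percentage comparisons between two distributions can be reproduced approximately from the ratios in the table above. The helper below is a small illustration (the function name is ours); because the table entries are rounded to two decimals, the recomputed percentages can differ slightly from those quoted in the text.

```python
def relative_gain(ratio_a, ratio_b):
    """Percentage by which strategy A's coverage ratio exceeds strategy B's."""
    return (ratio_a / ratio_b - 1.0) * 100.0


if __name__ == "__main__":
    # Gaussian vs linear at PP = 0.05 (table values 2.39 and 2.20).
    print(round(relative_gain(2.39, 2.20), 2))
    # Gaussian vs linear at support intensity 500% (table values 2.32 and 2.27).
    print(round(relative_gain(2.32, 2.27), 2))
```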
While the emphasis of our research has been put on real networks, we also performed experiments using synthetic networks to search for strategies that could be generalized to a class of networks. Synthetic networks with 5,000 nodes were generated according to the Barabási–Albert (BA) model, the Watts–Strogatz (WS) model and the Erdős–Rényi (ER) model. Experiments were based on the same propagation probabilities PP, seed selection strategies (degree, PageRank, eigenvector, betweenness) and support intensities (100%, 200%, 300%, 400%, 500%) as for the real networks. For the Barabási–Albert model the number of edges added with each node was 2, while the size of the initial clique was 2. For the Watts–Strogatz model the expected degree of a node was 4, while the rewiring probability was set to
Results based on the networks following BA model are presented in Table A in
Results from simulations performed on networks following WS model are presented in Table B in
Simulations based on ER network model are presented in Table C in
Our further analysis of randomly generated networks is focused on how network characteristics, such as average degree, closeness or betweenness, affect the performance of supporting seeding. The results are presented in the Figs E–G in
The increasingly important role of social media in marketing strategies requires new analytical and decision-support solutions. Most of the earlier studies related to viral marketing are focused on seed selection and initialisation of the information diffusion process, without considering additional support. At the same time, real-life marketing campaigns are based on continuous monitoring of performance, using an additional budget to boost campaign dynamics and coverage. However, the budget assigned to supporting campaigns can be allocated according to different strategies. One possible strategy is to assign the same number of supporting seeds at each stage, while another strategy can add more supporting seeds at the beginning or close to the end of a campaign. Spending the additional budget at the beginning of the campaign may result in activating nodes that would be reached anyway by the natural diffusion processes. Postponing supporting seeding to the end of the process enables activating nodes that are difficult to reach with the natural processes, but it might then be too late to fully exploit the potential of these nodes for initiating new information cascades.
We theoretically investigated whether it would be possible to find one universally useful strategy of distributing supporting seeds for any network structure, and we found that the best strategy of distributing supporting seeds depends on the network structure. Moreover, finding the optimal distribution for a given network structure is NP-hard. Therefore, we explore the problem numerically for a number of real networks and use different combinations of parameters: propagation probabilities, total number of supporting seeds, seed selection criterion, and supporting seeds distributions.
By using different distributions of supporting seeds over time, we show that the performance of different ways of distributing additional seeding is highly dependent on the selected parameters. For many cases, we obtain the best results using the Gaussian distribution, with lower usage of additional seeds at the beginning and the end of the process. The distribution avoids seeding nodes with the potential to be activated on their own and gives seeds enough time to explore new nodes.
Several extensions of the proposed approach can be planned for future work. Other distributions of supporting seeding can be explored on different networks. Another possibility is to consider supporting seeding with diffusion models other than independent cascade, e.g., the linear threshold model. New directions can be based on other assumptions, e.g., conditional support and the use of information related to the dynamics of spreading processes, detection of optimal points at which to provide support, or prediction of the time when the process dynamics drop.