^{†}Corresponding author. Email: hustlyu@hust.edu.cn
^{*}Project supported by the National Natural Science Foundation of China (Grant No. 61231010) and the Fundamental Research Funds for the Central Universities, China (Grant No. HUST No. 2012QN076).
Identifying influential nodes in complex networks is of both theoretical and practical importance. Existing methods identify influential nodes based on their positions in the network and assume that the nodes are homogeneous. However, node heterogeneity (i.e., different attributes such as interest, energy, age, and so on) ubiquitously exists and needs to be taken into consideration. In this paper, we conduct an investigation into node attributes and propose a graph signal processing based centrality (GSPC) method to identify influential nodes considering both the node attributes and the network topology. We first evaluate our GSPC method using two realworld datasets. The results show that our GSPC method effectively identifies influential nodes, which correspond well with the underlying ground truth. This is compatible to the previous eigenvector centrality and principal component centrality methods under circumstances where the nodes are homogeneous. In addition, spreading analysis shows that the GSPC method has a positive effect on the spreading dynamics.
Complex networks have attracted much attention in recent years since they meet the urgent need to understand the structures and dynamics of many real systems.^{[1]} In this research field, identifying influential nodes is an essential task.^{[2]} A small fraction of influential nodes can greatly affect the dynamics of networks, such as disease spreading, ^{[3]} information propagation, ^{[4, 5]} election, ^{[6]} and cascading failures.^{[7]} Moreover, the identification of the influential nodes plays an important role in controlling rumor spreading, defining new marketing strategies, and even predicting the total sale.^{[8, 9]} All of these practical applications contribute to the importance of identifying influential nodes in complex networks.
In previous work, nodes occupying important positions in the network topology are regarded as important in complex networks. Known methods of ranking node centrality include degree centrality, closeness centrality, ^{[10]} betweenness centrality, ^{[11]} cluster centrality, ^{[12]} and eigenvector centrality.^{[13]} The degree centrality is straightforward and identifies the most connected nodes. The closeness centrality identifies those near to other nodes most. The betweenness centrality identifies those located on the most traveled paths. The cluster centrality identifies those with high clustering coefficient and influential neighbors. The eigenvector centrality (i.e., EVC, including PageRank^{[14]} and subsequent LeaderRank^{[15]}) considers nodes connected to other high degree nodes as highly central. Since influential nodes identified by the EVC tend to locate in a restricted region of the network topology, the principal component centrality (PCC) method^{[16]} using several principal eigenvalues is proposed as a global view.
The prior methods mainly focus on the network topology assuming that the nodes are homogeneous. However, in reality, node heterogeneity exists.^{[17– 19]} For example, in wireless sensor networks, some nodes may be of different batteries to prolong their lifetime and reliability. Even homogeneous sensors have different levels of initial energy, depletion rate, etc. As a matter of fact, node heterogeneity greatly affects routing, epidemic spread, and safety deployment.^{[17– 19]} Besides, the topological equivalence of homogeneous networks (e.g., regular networks, smallworld networks^{[20]}) makes the network topology based node centrality approximately equal. Therefore, it is necessary to adopt node attributes as a supplement for ranking node centrality. Moreover, owing to the accumulated prior knowledge of complex networks, node attributes entitle us to rank the node centrality more comprehensively. In other words, node heterogeneity needs to be taken into account in the influential node identification.
In this paper, we propose a graph signal processing based centrality (GSPC) method to identify the influential nodes considering both the node heterogeneity and the network topology. The first intuitive step is to derive an overall evaluation of node heterogeneity (i.e., different attributes such as interest, energy, age, and so on), which is called the node signal. We categorise node attributes into three types and propose a modulated 2norm based model to obtain the node signal. In order to integrate the node signals with the network topology, we apply the graph signal processing method. In particular, the node signals are processed by the graph Fourier transform. Signal processing techniques and methodologies are used to pick up the most principal eigenvectors. Then, the node centrality is calculated by the combination of the node signal and the selected eigenvectors. By this means, we identify the influential nodes with both node attributes and the network topology.
We first evaluate our proposed method in two realworld datasets with node heterogeneity information and underlying ground truth. We have three major findings. First, identified nodes based on our GSPC method corresponds well with the underlying ground truth. Second, GSPC degenerates to the previous EVC and PCC methods under the circumstance where the nodes are homogeneous. Furthermore, the results under varying parameters and null models prove the stability of our GSPC method. In addition, spreading in a synthetic network and a real social network is investigated, which indicates that the GSPC method has a positive effect on the spreading dynamic.
The rest of this paper is organized as follows. Section 2 presents the modulated 2norm based evaluation model of node heterogeneity. In Section 3, we propose a GSPC method which integrates both node attributes and the network topology. The data description and centrality measures of two networks are, respectively, given in Subsections 4.1 and 4.2, compatibility analysis and stability analysis are, respectively, discussed in Subsections 4.3 and 4.4. Then, spreading is analyzed in Subsection 4.5. Section 5 ends this paper with some conclusions.
Node behaviors are significantly affected by node heterogeneity (i.e., different node attributes) in various networks. For example, individuals in social networks determine their interactions concerning age, gender, profession, opinions, and so on, while nodes in wireless networks forward packets due to trust, delay, battery, and so on. Therefore, in this section, node attributes are investigated and a modulated 2norm based model is proposed to achieve an overall evaluation of each node.
We divide node attributes into three types. Firstly, since dynamics like rumor control, immunization, and opinion diffusion involve opposite sides, parts of the node attributes are used to decide the sign, which is denoted as g. The sign type refers to whether or not you could support, ignore, or oppose an event, which can be modeled as a triple g ∈ (+ 1, 0, − 1), where + 1, 0, − 1 stand for endorsement, disinterest, and opposition, respectively. For example, for opinions diffusion in social networks, if node i stands for one opinion, g is positive. And if the node is opposed to the opinion, it should be negative. Otherwise, if it pays no attention, g = 0, that is, it is insignificant in the network dynamics.
Besides, analogous to the trust evaluation of nodes, ^{[21]} each node has two other types of node attributes: confidence and ability. The confidence type corresponds to the probability at which nodes function like online time, activity, and frequency of tweets, denoted as t. For a simplified example, the online time in social networks could be modeled as a probabilistic model. For a specific hour in a day, if a node gets online α times in β days, then the probability that this node gets online in this hour is t = α /β according to the minimum unbiased estimation.
The ability type corresponds to the potential service provided by the nodes, denoted as a. A high ability means that the node has more potential to respond to dynamics on the networks fast and wide. Owing to the development of computer science, empirical analysis^{[9, 22]} of various social networks have sprung up. Empirical data of a tweet indicates that a user generally follows experts on various topics of her/his interest.^{[23]} Elite users like media, celebrities, bloggers, and organizations roughly are 0.05% of the population and account of about 50% attention.^{[24, 25]} Other personal information like age, gender, location, and so on is less mentioned. Thus identity (e.g., celebrities, actors, writers), role (e.g., connectors, experts, and salesman) and interest (e.g., favorite book, music, and so on) are regarded as some of the more important node attributes in social networks. In this way, we roughly classify the ability attributes into two categories in Table 1.
Because ability attributes have different importances, coefficients η _{i} are introduced to modulate the significance of each attribute a_{i} based on two rules. First, make sure that η _{1} > η _{2} if a_{1} belongs to the important category while a_{2} belongs to the less important category. Attributes that belong to the same category are assigned the same coefficients. Second, ∑ _{i}η _{i} = 1. Examples and details of η will be discussed in the stability analysis in Subsection 4.4.
Then, node ability is codified as the distance to the zero point and is calculated by a modulated 2norm model. That is, a = ‖ η a‖ _{2} = η _{1}a_{1}+ · · · + η _{N}a_{N} (∑ _{i}η _{i} = 1, N is the number of the ability attributes). Without loss of generality, we just take 3dimension as a simple example, as shown in Fig. 1.
In conclusion, the overall evaluation of node i is calculated as
It could be either positive or negative, the higher the absolute value is, the more important role the node plays in networks.
With an overall evaluation of node attributes at each node, denoted as the node signal, a complex network can be intuitively represented as a graph with node signals on it, as in the example of the Zachary club network shown in Fig. 2. Each node has a value f (i). The blue line means positive, and the red line means negative. The height represents the absolute value of each node. Nodes 10, 17, and 19 have zero signal because they do not support either node 1 or node 34 in the Zachary club network. By this means, we can integrate node heterogeneity with the network topology to identify influential nodes in the next section.
In order to unite node attributes with the network topology together to identify influential nodes, we propose a graph signal processing centrality method in this section.
Signal processing on graphs focuses on the interplay between the graph topology and the graph signals.^{[26]} For signal processing on graphs, a spectral graph theory has been explored as a tool to define expansion bases for graph Fourier transforms.^{[27]} Many important mathematical ideas and intuitions can be extended from the classical Fourier analysis to the graph setting. By this means, main bases are picked up to do compression, filter, and so on.
Let A denote the adjacency matrix of a network with N vertexes, E edges, and a node evaluation set f, graph G = (N, E, f). When a link is presented between two nodes v_{i} and v_{j}, both A_{i, j} and A_{j, i} are set to 1, otherwise they are set to 0. Since A is a real symmetric matrix and diagonalizable, it has a complete set of orthogonal eigenvectors denoted by u_{l} (l = 1, … , N) and corresponding eigenvalues λ _{l} (l = 1, … , N, λ _{1} ≤ λ _{2} ≤ · · · λ _{N}), which satisfy Au_{l} = λ _{l}u_{l}.
According to the classic Fourier transform, it turns the time domain signal into the frequency domain signals, which turns complex things into independent components. That is,
According to the spectrum, the main frequency band is reserved by an appropriate filter. Thus, the main component of the spectrum is picked up, which could be used to reconstruct the original signal f (t).
Analogously, treating the evaluation of node attributes as signals, the graph Fourier transform is utilized to find the principal eigenvectors of the adjacency matrix A. Since the eigenvectors of the adjacency matrix are used to measure the topology importance of the nodes, ^{[14– 16, 28]} we define the graph Fourier transform to identify influential nodes as
This graph Fourier transform gives us a way to represent both the network topology and the node attributes in the graph spectral domain. According the the Fourier analysis, the signals are compressible when the graph Fourier coefficients decay rapidly. If it is compressible, then several principal eigenvectors selected according to the signals in the spectral domain are used to represent the network under the node heterogeneity.
Thus, according to the spectrum, p appropriate eigenvectors are selected to represent the main information of the network topology. Choosing an appropriate p is important. We choose p analogous to that in the EVC or PCC method. The first idea is similar to the EVC method, we choose the eigenvector with the highest spectral value, that is,
This is suitable for the case where one spectral value is much more higher than the others. Otherwise, we list spectral values in descending order and then choose p according to
That is, if the next value f̃ (λ _{p+ 1}) is much less than current one f̃ (λ _{p}), then the spectral values after the pth would be ignored, and so are the corresponding eigenvectors. In this way, the principle eigenvectors are picked up.
Combing the node heterogeneity, the centrality of the nodes is calculated as
where ⊙ is the Hadamard operator.
Equation (6) states that the node centrality is determined by both the node signals and the network topology. The larger the node signal is, the higher the node centrality is. The more important the network position a node occupies, the higher the node centrality is.
We now analyz the time complexity. The GSPC method contains four parts step by step, that is, eigenvalues and eigenvectors calculation (O(N^{2})), Fourier transform (O(N log N)), selection (O(N log N)), and centrality computing (O(pN), where p is the number of selected eigenvectors and p ≪ N). Then, the time complexity of the GSPC method is O(N^{2}).
The time complexity of the most simple degree centrality is O(E). For the most popular PageRank algorithm, its time complexity is O(EL), where L stands for the iterations. The time complexity for the betweenness centrality is O(NE). The time complexity of the recently proposed PCC algorithm is O(N^{2}log^{2}N) or O(N^{2}log^{3}N) to different networks topologies and PCC could be used in largescale networks like Facebook data with 3097165 users.^{[16]} The time complexity of the GSPC method is higher than the former two and lower than the latter two.
In this section, we first identify influential nodes by our GSPC method in two small real networks with ground truth. These networks contain not only the network topology information but also the node attributes information. Then the compatibility and stability of our GSPC method are analyzed. At last, spreading analysis is implemented both in a larger synthetic network and a larger real social network.
The Zachary club network^{[29]} is a social network of friendships among 34 members of a karate club at a US university in the 1970s. During the observation period, for the disagreement between the administrator and instructor of the club, the club splits into two groups and one group leaves to start their own club. Mapping the network to the topological graph, the administrator and the instructor are represented by vertices 1 and 34, respectively.
Figure 3(a) shows the well known Zachary club network. It shows that two groups surround their own centers, nodes 1 and 34, respectively, and it has a very clearly divided community structure.
First, we obtain the node signal as follows. As shown in Table 2, the support attribute belongs to the sign category and determines the sign to be negative or positive. In this case, nodes are set as + 1, − 1, and 0 corresponding to their support on nodes 1, 34, and none, respectively. The strength attribute belongs to the ability category and determines the degree of support. Thus, the node evaluation is
By the graph Fourier transform, we calculate f̃ (λ _{l}) and obtain the spectrum in Fig. 3(b). Given that the spectral value at λ _{33} is much higher than the rest, we set p = 1. Then we pick up λ _{33} and its corresponding eigenvector to rank the node centrality as
The results in Table 2 show that the node centrality obtained by the GSPC method conforms to the real situation of the Zachary club network because the centrality of nodes 1 and 34 are the two highest. In addition, we can find out that the network topology based methods distinguish nodes 1 and 34, which may be due to the Zachary karate club displaying a divided community structure and hence a distinct topology.
The KrackHighTec managers network^{[30]} was collected from the managers of a hightec company on the west coast of the United States. It has just over 100 employees with 21 managers. Mapping to topological graph, it has 21 nodes and the most important is node 7.
The topology is shown in Fig. 4(a) in which the nodes are too mixed with each other to tell the central node intuitively.
Table 3 involves four attributes of the managers: age (in years), tenure (length of time employed by the company, in years), level in corporate hierarchy (coded 1, 2, 3), department of the company (coded 0, 1, 2, 3, 4). These attributes all belong to the ability category and we use the modulated 2norm model to obtain the node signal.
In particular, age and department attributes are not in accordance with the tenure and level attributes (contribute to role of an individual a lot), so their coefficients are set rather small. For the level attribute, the smaller, the better, we turn the values of level upside down, that is, a_{level} = 4− a_{level}. Then the node signal is
Then by the graph Fourier transform, we obtain the spectrum in Fig. 4(b). According to Fig. 4(b), the spectral value at λ _{21} exceeds the rest by a lot, thus p = 1 and λ _{21}, u_{21} is picked up. The node centrality is calculated as
In this way, we obtain the node centrality of the KrackHighTec network shown in Table 3. It shows that the node centrality obtained by our GSPC method conforms the ground truth of the KrackHighTec managers because the centrality of node 7 is the highest. For network topology based methods, the centrality of node 15 is the highest, which suggests that node 15 is the most important person in the network and turns out to be false. This indicates that the node attributes are dispensable in identifying influential nodes in complex networks.
We first test the compatibility of our GSPC method to the eigenvector based centrality methods such as the PCC and EVC methods. Assume that the nodes are all homogeneous and the node signals are all set to be 1. Then, our GSPC method is degenerated to
which is of the same form as that in the PCC method. And when p = 1, PCC equals a scaled version of EVC. This indicates that our GSPC method is compatible to both the PCC and EVC methods under different p selection and could be effective with only topological information.
The GPSC method under homogenous nodes is used to calculate the node centrality in the above two networks, which is then compared to the PCC and EVC methods. The parameter r_{1} stands for the correlation coefficient, if it is more closer to 1, then the variables are more correlated; and r_{2} for testing the hypothesis of no correlation, the smaller the better.
Table 4 shows the correlation of our GSPC with the PCC and EVC in the Zachary club network. Since we chose p = 1 in our GSPC method, the result equals to that of the EVC method and is highly correlated to that of the PCC method. We also find that the top five nodes are the same, 1, 2, 3, 33, 34.
Similar to Table 4, Table 5 shows that our method is highly correlated with the PCC and EVC methods in the KrackHighTec network. We then find that the top five nodes are the same too, 2, 3, 15, 18, 20. However, the most important node 7 is not spotted, which suggests that the node attributes do affect the node centrality and are dispensable for the node centrality ranking.
This subsection is mainly divided into two parts. First, we analyze the modulation of parameters of our GSPC model. Second, we analyze the stability of the topology and node signals.
In the first place, we vary the value of η to test the stability of our method satisfying the two rules aforementioned in Section 2. The test group includes [0.05, 0.45, 0.45, 0.05], [0.1, 0.4, 0.4, 0.1], and [0.2, 0.3, 0.3, 0.2] in the KrackHighTec network. It turns out that as long asη for the important attributes are larger than the others, we can still maintain the same node centrality ranking order and correctly find the influential nodes.
We test the effect of p on the node centrality ranking. The results in Table 6 and Table 7 show that when p varies from 1 to 3, both in the Zachary club network and the KrackHighTec network, the influential nodes are picked out in accordance to the ground truth.
Then, we analyze the stability of topology and node signals. For a null model, we randomly shuffle the edges while conserving the degree distribution. We then shuffle the node signals. The results are shown in Tables 6 and 7, respectively.
For the Zachary club network, the most influential nodes always change in both conditions. As in both cases, the node signals in one community are of different signs. The sign of the node signal greatly affects the node centrality ranking.
For the KrackHighTec network, we find that the node attributes play a more important role in node centrality ranking. When we shuffle the edges randomly, we still identify node 7 to be the most influential, and when we shuffle the attributes randomly, the node with the attributes of original node 7 is found to be the most influential.
In other words, it indicates that both network topology and node heterogeneity play an important role in influential node identification. In networks with a divided community, like the Zachary club network, the sign attributes are essential because they chaos the network topology so that neighbor nodes may be opponents instead of supporters. In networks with homogeneous topology, like the KrackHighTec network, the node heterogeneity becomes much more important in influential node identification.
As the previous works^{[12, 15]} have pointed out, the influential nodes are verified by their efficiency in dynamics like spreading. That is, the nodes are taken as sources and their corresponding spreading efficiency is used to measure their influential importance. We apply the SI model^{[31, 32]} to test the spreading efficiency.
In the SI model, the nodes are classified into two states, infected or suspectable. At the beginning, some nodes are chosen as the source and in the infected state while all others are in the susceptible state. The source nodes will spread to all her neighbors. In every time step and for every link connected to the infected nodes, the suspectable nodes become infected at a fixed probability λ .
We use two datasets in this section. The first one is a synthetic network developed by a BA model with gradual aging. The BA model^{[33]} captures the essential scalefree property of real networks by growth and preferential attachment. Since nodes have a finite lifetime or a finite capacity, ^{[34]} the probability a new node connecting to an old one is not only proportional to its degree but also depends on its age, decaying as τ ^{− γ }, γ < 1. Thus, with age decaying considered, the node signal is calculated as τ ^{− γ }, γ = 0.5.
We then apply a real social network, a Facebook like forum network^{[35]} which originates from an online community for students at the University of California, Irvine. The dataset includes 899 users and the numbers of messages posted by the users. Then, the number of posted messages is referred to as the node signal.
By using the methods aforementioned and our GSPC method, we obtain the rank of nodes. Then several influential nodes are chosen as the source nodes for the SI model.
Figure 5 shows that in both cases, nodes spread faster where the sources are the most influential nodes chosen by our GSPC method compared to the other methods. As shown in Fig. 5(a), the GSPC method outperforms the other methods because it adds node attributes into consideration. The closeness and PCC methods gain a better performance than the other three. Since the BA model based networks have small cluster coefficients, the influential nodes obtained by the closeness method are scattered. The PCC methods identify influential nodes at a global view. The EVC method performs the worst, as the influential nodes under the EVC method gather in one restricted region of the network topology.
As shown in Fig. 5(b), the GSPC method performs well in spreading. It is applicable in real social networks and indicates that the top five influential nodes are identified effectively. The EVC method results in bad performance, as the influential nodes gather in one restricted region of the network topology, which is the same as that in Fig. 5(a). Due to the positive correlation between the number of posted messages and the network topology in the Facebook like forum network, the influential nodes identified by PCC, betweenness, closeness, and cluster centrality are close to those by the GSPC method.
In summary, the GSPC method has a positive role in spreading. It gains the insight of balance between the network topology and the node heterogeneity. Thus, it is very useful in network dynamics where both the network topology and the node heterogeneity matter.
Identifying influential nodes for complex networks is an important task. A graph signal processing based centrality method is proposed in this paper, which combines the network topology and the node attributes together to distinguish the influential nodes. First a modulated 2norm model is presented to obtain an overall evaluation of the node attributes. Then, the graph Fourier transform is utilized to select principal eigenvectors. Together, the node centrality is calculated with both the network topology and the node attributes. The results show that the GSPC method is effective in influential node identification and it is compatible to the previous eigenvector based centrality methods, the PCC and EVC methods. In addition, we verify the stability of the GSPC method by varying parameters and null models. Also, the spreading performance is upgraded.
Our GSPC method takes node heterogeneity into account and extends the scope of the node centrality calculation. This paper shows that our method can identify influential nodes in signed graphs (the Zachary club network). There are some possible applications. For example, many real complex networks can be regarded as weighted networks. Ranking node centrality in weighted networks attracts lots of interest. By converting the weights on edges into the graph signal or use the weight matrix other than the adjacency matrix for the graph Fourier transform, our method is easy to transplant to weighted networks. For many real networks, nodes with constraints (for example, limited online time in social networks, node with limited capacity in power grid) have a strong impact on the node behaviors. Our GSPC method could explore these constraints and calculate the node centrality combining the node properties with the network topology.
1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 

21 

22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

32 

33 

34 

35 
