Graph Data Management: Course Notes 2

2021-02-04

Network Properties and Random Graph Model

I. Key Network Properties

Problem: How to measure a network?

Answer: Using Network Properties.

1. Degree Distribution P(k)

P(k) is the probability that a randomly chosen node has degree k.

Let $N_k$ be the number of nodes with degree k and $N$ the total number of nodes. Then $P(k)=\frac{N_k}{N}$ .
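The definition above can be sketched in a few lines of Python. This is only an illustration: the adjacency-dictionary representation and the `toy` graph are assumptions made for the example, not part of the course material.

```python
from collections import Counter


def degree_distribution(adj):
    """Return {k: P(k)} for an undirected graph given as {node: set_of_neighbors}."""
    n = len(adj)
    # counts[k] plays the role of N_k: the number of nodes with degree k.
    counts = Counter(len(neighbors) for neighbors in adj.values())
    return {k: n_k / n for k, n_k in sorted(counts.items())}


# Hypothetical toy graph with edges 0-1, 1-2, 2-3 and 1-3.
toy = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}}
print(degree_distribution(toy))
```

Here one node has degree 1, two have degree 2 and one has degree 3, so the three probabilities are 1/4, 2/4 and 1/4.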

2. Path p, Average Distance and Diameter

A path p is a sequence of nodes in which each node is linked to the next one.

Shortest Path Distance: the number of edges along the shortest path connecting two nodes, written $h_{ij}$ for nodes $i$ and $j$.
Diameter: the maximum shortest-path distance between any pair of nodes in a graph. That is, $d = \max_{i\neq j} h_{ij}$ .
Average path length for a connected graph, or for a strongly connected component (SCC) of a directed graph, with n nodes is defined as: $\overline{h}=\frac{\sum_{i\neq j}h_{ij}}{n\times (n-1)}$ . In practice, we compute $\overline{h}$ only over the connected pairs of nodes.
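One possible way to compute these quantities for a small unweighted graph is a breadth-first search from every node. The sketch below assumes an adjacency-dictionary representation and averages only over connected pairs, as the text suggests; the function names and the `toy` graph are made up for the example.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_average_distance(adj):
    """Return (diameter, average path length) over connected ordered pairs i != j."""
    longest, total, pairs = 0, 0, 0
    for s in adj:
        for t, h in bfs_distances(adj, s).items():
            if t != s:
                longest = max(longest, h)
                total += h
                pairs += 1
    return longest, (total / pairs if pairs else 0.0)


# Hypothetical toy graph with edges 0-1, 1-2, 2-3 and 1-3.
toy = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}}
print(diameter_and_average_distance(toy))
```

Running one BFS per node costs O(n(n+m)) time, so this is only practical for small graphs; large networks are usually handled by sampling source nodes.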

3. Clustering Coefficient C

A vertex's clustering coefficient measures how densely the vertex's neighbors are connected to each other. Let $E_i$ be the number of edges between the neighbors of vertex $v_i$ , and $k_i$ be the degree of $v_i$ .

The local clustering coefficient can be calculated as $C_{local}(v_i)=\frac{2\times E_i}{k_i \times(k_i-1)}$ , which is defined for $k_i\geq 2$ .

Globally, the average clustering coefficient is $C_{avg}(G)=\frac{\sum_{i=1}^nC_{local}(v_i)}{n}$ .
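The two formulas can be checked on a small example. The sketch below is only illustrative: it assumes an adjacency-dictionary graph and adopts the convention (an assumption, not stated in the notes) that a vertex with fewer than two neighbors has coefficient 0.

```python
from itertools import combinations


def local_clustering(adj, v):
    """2 * E_v / (k_v * (k_v - 1)), with 0.0 assumed when k_v < 2."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    # Count the edges that exist between pairs of neighbors of v.
    links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local coefficients over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Hypothetical toy graph with edges 0-1, 1-2, 2-3 and 1-3.
toy = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}}
print([local_clustering(toy, v) for v in toy], average_clustering(toy))
```

In this toy graph, vertex 1 has three neighbors with a single edge (2-3) among them, giving 2·1/(3·2) = 1/3, while vertices 2 and 3 each sit in a triangle and get 1.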

4. Connectivity

To measure the connectivity of a graph, one can calculate the size of its largest connected component. The largest component is also known as the giant component.
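A minimal sketch of this measurement, assuming an undirected graph stored as an adjacency dictionary (the function name and the example graph are hypothetical):

```python
def largest_component_size(adj):
    """Size of the largest connected component of an undirected graph."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        # Depth-first traversal of the component that contains `start`.
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


# Hypothetical graph with components {0, 1, 2}, {3, 4} and {5}.
parts = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}, 5: set()}
print(largest_component_size(parts))
```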

II. Measure Real-world Networks

In this part, we use the MSN network as an example to show some properties of real-world networks.

1. Degree Distribution

Power law distribution: $P(k)\propto k^{-\gamma}$ , where $\gamma$ is a parameter whose value is typically between 2 and 3. The graph degree distribution is heavily skewed.

2. Clustering Coefficient

The average clustering coefficient of the real graph (0.1140) is very large compared to that of a random graph.

3. Connected Components

Nearly all of the vertices are in one largest(giant) connected component.

4. Diameter

The average path length is 6.6, and 90% of the nodes can be reached in fewer than 8 hops.

5. Small World Effect (Six Degrees of Separation), 1967

A small-world network is a type of mathematical graph where most nodes are not neighbors of one another, but most nodes can be reached from every other by a small number of hops or steps.

In mathematical terms, if L is the distance between two randomly chosen nodes and N is the number of nodes in the network, then $L\propto \log N$ .

III. Graph Generation Model

There are four kinds of graph generation models.

1. Random Graph Model (Erdős-Rényi Graph)

(1) Generation:

$G_{np}$ : undirected graph on n nodes where each edge (u,v) appears independently (i.i.d.) with probability p.
$G_{nm}$ : undirected graph with n nodes, and m edges picked uniformly at random.
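Both variants are short to generate. The following sketch uses only the Python standard library; the function names `gnp` and `gnm` and the adjacency-dictionary output are choices made for this example.

```python
import random
from itertools import combinations


def gnp(n, p, rng):
    """G_np: each of the n*(n-1)/2 possible edges is kept independently with probability p."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def gnm(n, m, rng):
    """G_nm: exactly m distinct edges chosen uniformly at random."""
    adj = {v: set() for v in range(n)}
    all_pairs = list(combinations(range(n), 2))
    for u, v in rng.sample(all_pairs, m):
        adj[u].add(v)
        adj[v].add(u)
    return adj


rng = random.Random(0)  # fixed seed so that runs are repeatable
g1 = gnp(100, 0.05, rng)
g2 = gnm(100, 250, rng)
print(sum(len(nb) for nb in g1.values()) // 2, sum(len(nb) for nb in g2.values()) // 2)
```

The edge count of `g1` varies from run to run around p·n(n-1)/2, whereas `g2` always has exactly m edges. Enumerating all pairs takes O(n²) time, which is fine for a demonstration but not for large sparse graphs.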

(2) Degree Distribution P(k)

Binomial Distribution: $P(k)=\binom{n-1}{k}p^k(1-p)^{n-1-k}$ .
$\overline{k}=p(n-1), \sigma^2=p(1-p)(n-1)$ .
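As a quick numerical sanity check of the mean and variance formulas, the sketch below evaluates the binomial degree distribution directly. The values n = 101 and p = 0.05 are arbitrary example parameters.

```python
from math import comb


def binomial_degree_pmf(n, p):
    """P(k) for k = 0 .. n-1 in G_np: a node has n-1 potential neighbors."""
    return [comb(n - 1, k) * p**k * (1 - p) ** (n - 1 - k) for k in range(n)]


pmf = binomial_degree_pmf(n=101, p=0.05)
mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
print(mean, var)
```

Up to floating-point error, the mean should match p(n-1) = 5 and the variance p(1-p)(n-1) = 4.75.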

(3) Clustering Coefficient

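Let $e_i$ be the number of edges between the neighbors of $v_i$ . Each of the $\frac{k_i(k_i-1)}{2}$ pairs of neighbors is linked with probability p, so $E[e_i] = p \frac{k_i(k_i-1)}{2}$

Thus, $E[C_i]=\frac{2E[e_i]}{k_i(k_i-1)}=\frac{pk_i(k_i-1)}{k_i(k_i-1)}=p$

And $p=\frac{\overline{k}}{n-1}$ , so for a fixed average degree $\overline{k}$ , $C$ decreases with the graph size n.

Note: $C\approx 8\times 10^{-8}$ if $N=180M$ .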

(4) Path Length

Randomly pick a node $v_i$ . If every node has about $k$ neighbors (the average degree), $v_i$ will have roughly:

$k$ points whose distance is 1
$k^2$ points whose distance is 2
$k^3$ points whose distance is 3
...
$k^{d_{max}}$ points whose distance is $d_{max}$

At the same time, the total number of vertices is $N$ . It means that: $\sum_{i=1}^{d_{max}}k^i\leq N$ .

So $d_{max}=O(\frac{\ln N}{\ln k})$ .
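As a worked step, assuming $k>1$ is the average degree used above, the bound follows because the last term of the sum alone cannot exceed $N$ :

$$k^{d_{max}} \leq \sum_{i=1}^{d_{max}} k^i = \frac{k\,(k^{d_{max}}-1)}{k-1} \leq N \quad\Longrightarrow\quad d_{max} \leq \frac{\ln N}{\ln k}$$

For instance, under these assumptions $N=10^6$ and $k=10$ give $d_{max}\leq 6$ .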

In an E-R random graph, $d_{max}$ increases slowly with N.

(5) Giant Component

When the average degree $\overline{k}=p(n-1)$ reaches 1, the giant connected component emerges.

When $\overline{k}\approx \ln N$ , the graph becomes connected, i.e., no isolated nodes remain.
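The threshold at average degree 1 can be observed with a small simulation. The sketch below is illustrative only: it generates $G_{np}$ graphs with a fixed seed and reports the fraction of nodes in the largest component. The graph size of 1000 and the sampled average degrees are arbitrary choices for the example.

```python
import random
from itertools import combinations


def giant_fraction(n, avg_degree, rng):
    """Fraction of nodes in the largest component of one G_np sample with p = avg_degree / (n - 1)."""
    p = avg_degree / (n - 1)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        # Depth-first traversal of the component that contains `start`.
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best / n


rng = random.Random(42)
for k in (0.5, 1.0, 2.0, 4.0):
    print(k, round(giant_fraction(1000, k, rng), 3))
```

Below the threshold the largest component should contain only a small fraction of the nodes, while well above it most nodes end up in a single component. Exact numbers depend on the seed.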

(6) Problems

The degree distribution differs from that of real-world graphs.
The giant component in most real networks does NOT emerge through a phase transition.
No local structure: the clustering coefficient is too low.

Conclusion: Real-world networks are not random!

2. Small-world Model [Watts-Strogatz ’98]

Problem: The clustering coefficient of an E-R random graph is too low! Need: high clustering and low diameter.

(1) Start with a low-dimensional regular lattice, which has a high clustering coefficient.

(2) Rewire to introduce randomness:

Add / remove edges to create shortcuts that join remote parts of the lattice.
For each edge, with probability p, move the other endpoint to a random node.

The larger the rewiring probability p, the smaller the clustering coefficient will be.
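The construction can be sketched as follows. The notes only say "low-dimensional regular lattice", so this example assumes a ring lattice in which each node is linked to its `k` nearest neighbors (with `k` even), which is one common choice. The parameter names `n`, `k` and `p` are also assumptions of the sketch.

```python
import random


def watts_strogatz(n, k, p, rng):
    """Ring lattice (each node linked to its k nearest neighbors, k even), then rewiring."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for j in range(1, k // 2 + 1):
            v = (u + j) % n
            adj[u].add(v)
            adj[v].add(u)
    # Visit every lattice edge (u, u + j) once; with probability p move its far endpoint.
    for j in range(1, k // 2 + 1):
        for u in range(n):
            if rng.random() < p:
                v = (u + j) % n
                # Avoid self-loops and duplicate edges when picking the new endpoint.
                candidates = [w for w in range(n) if w != u and w not in adj[u]]
                if candidates:
                    w = rng.choice(candidates)
                    adj[u].discard(v)
                    adj[v].discard(u)
                    adj[u].add(w)
                    adj[w].add(u)
    return adj


lattice = watts_strogatz(20, 4, 0.0, random.Random(0))   # p = 0: pure lattice
rewired = watts_strogatz(20, 4, 0.5, random.Random(0))   # p = 0.5: many shortcuts
print(sorted(len(nb) for nb in rewired.values()))
```

Rewiring keeps the number of edges fixed at n·k/2 and changes only where they point, so any change in clustering or path length comes from the shortcuts rather than from added density.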

3. Barabási-Albert (BA) Model

Problem: How to model the power-law distribution of node degree?

Solution: Introduce Growth and rich-get-richer.

(1) Assumptions

Growth: the graph grows continuously
Preferential attachment (i.e., rich-get-richer): nodes with higher degree (larger connectivity) tend to receive new edges

(2) Model Definition

Start with a small graph of $m_0$ vertices generated randomly.
At each step, add a new vertex with $m$ edges connecting to $m$ distinct vertices already present in the graph. For each connection, the existing vertex is selected with probability proportional to its current degree: $\Pr[v \text{ is attached}]=\frac{d(v)}{\sum_{w}d(w)}$ , where $d(v)$ is the degree of $v$ and the sum runs over all existing vertices $w$ .
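A minimal sketch of this growth process is given below. Two simplifications are assumptions of the example rather than part of the notes: the seed graph is a complete graph on $m_0 \geq 2$ vertices (the notes say it is generated randomly), and the $m$ distinct targets are obtained by redrawing duplicates, which only approximates the stated probability after the first draw.

```python
import random


def barabasi_albert(n, m, m0, rng):
    """Grow a graph to n nodes; each new node attaches to m distinct existing nodes."""
    if not (1 <= m <= m0 <= n and m0 >= 2):
        raise ValueError("need 1 <= m <= m0 <= n and m0 >= 2")
    # Assumed seed graph: complete graph on m0 vertices.
    adj = {v: set(range(m0)) - {v} for v in range(m0)}
    # Node v appears d(v) times in `endpoints`, so a uniform draw from the list
    # selects v with probability d(v) / sum_w d(w).
    endpoints = [v for v in adj for _ in adj[v]]
    for new in range(m0, n):
        chosen = set()
        while len(chosen) < m:  # redraw until m distinct targets are found
            chosen.add(rng.choice(endpoints))
        adj[new] = set()
        for v in chosen:
            adj[new].add(v)
            adj[v].add(new)
            endpoints.extend((new, v))
    return adj


ba = barabasi_albert(n=200, m=3, m0=4, rng=random.Random(0))
print(max(len(nb) for nb in ba.values()), min(len(nb) for nb in ba.values()))
```

Keeping one list entry per edge endpoint makes each degree-proportional draw O(1), at the cost of memory proportional to the number of edges.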