The Power of Being Single: Accelerating web scale graph on a ...

Scale-Free Graph Processing on a NUMA Machine
Tanuj Kr Aasawat* (*now at RIKEN AIP, Tokyo), Tahsin Reza, Matei Ripeanu
Networked Systems Lab, The University of British Columbia

Slide 2: Graphs are Everywhere
- The web: 3.5B pages, 128B hyperlinks
- Social networks: 0.7B users, 137B friendships
- The brain: 100B neurons, 700T connections
- Petabytes of data

Slide 3: Challenges in Graph Processing
- Highly irregular, data-dependent memory access pattern: neighbors are scattered in memory, which leads to poor data locality
- Low compute-to-memory-access ratio: processing is memory bound
- Large memory footprint: real-world graphs such as the current Facebook graph (137 billion edges) and the WDC Web Graph (128 billion edges) need at least 2 TB of memory, and the entire graph must be in memory for high performance
- Hard to obtain balanced partitions (needed for load balance and overall performance): the power-law degree distribution means a few vertices have very high degree while most have low degree

Slide 4: Hardware Platform
- Alternatives: a large shared-nothing cluster, a shared-memory SMP machine, or a shared-memory NUMA machine (here: Intel Xeon, 4 sockets, 1.5 TB memory)
- On this machine, local accesses are up to 52% (latency) and 40% (throughput) faster than remote ones, and sequential access is 6x to 9x faster than random access
- Question: can a distributed graph processing approach provide better performance on NUMA?
  - Distributed approach (e.g. Totem): requires explicit graph partitioning and explicit communication
  - SMP approach (e.g. Ligra): easy to program and doesn't require partitioning or communication, but is usually programmed NUMA-obliviously and suffers a communication penalty
  - NUMA-aware approach that treats the sockets like distributed memory (e.g. Polymer)

Slide 5: Hypothesis
A distributed-memory-like middleware provides better performance on NUMA machines.
- Explicit graph partitioning and explicit inter-partition communication
- Advantages: controlled data placement, better locality
- Disadvantages: memory overhead, communication overhead
- Trade-offs to explore

Slide 6: Bulk Synchronous Parallel (BSP) Processing Model
- Preprocessing step: partition the graph across the processing units
- Processing step: a sequence of supersteps; each superstep consists of a computation phase, a communication phase, and a barrier-synchronization phase

Slide 7: Benefits of BSP in Graph Processing on NUMA
- Benefit 1: explicit partitioning. Easy design of and experimentation with partitioning strategies, for better load balance and overall performance; we explore two popular strategies and introduce a new one
- Benefit 2: choice of communication designs. Explicit communication (embracing distributed-systems design) versus communication trade-offs (embracing the shared-memory nature of NUMA)
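The superstep structure above can be sketched as a small driver loop (an illustrative Python sketch, not the authors' implementation; the `compute` and `communicate` callbacks are hypothetical):

```python
# Minimal BSP driver sketch: each superstep runs a computation phase on
# every partition, then a communication phase; the barrier is implicit
# here because the loop is sequential. Names are illustrative only.

def bsp_run(partitions, compute, communicate, max_supersteps=100):
    """Run supersteps until no partition reports further work.
    Returns the number of supersteps executed."""
    for superstep in range(max_supersteps):
        # Computation phase: each partition updates its local state and
        # reports whether it still has active work.
        active = [compute(p, superstep) for p in partitions]
        # Communication phase: exchange messages between partitions.
        communicate(partitions)
        # Synchronization phase: stop when every partition is done.
        if not any(active):
            return superstep + 1
    return max_supersteps
```

In a real engine each `compute(p, ...)` call would run in parallel on its own NUMA domain, with a barrier between the phases.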

Slide 9: Benefit 1: Explicit Partitioning
- Makes designing and experimenting with partitioning strategies easy
- Popular partitioning strategies:
  - Random partitioning (Google's Pregel, GraphLab)
  - Sorted/degree-aware partitioning (Totem and others)
- New hybrid partitioning strategy: partition randomly, then sort each partition by degree
- Success criteria: better load balance and overall performance improvement
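The hybrid strategy above can be sketched in a few lines (illustrative Python, not the paper's code; the function name and interface are made up):

```python
import random

def hybrid_partition(vertices, degree, num_parts, seed=0):
    """Hybrid strategy from the slide: assign vertices to partitions at
    random (which balances load well on power-law graphs), then sort each
    partition's vertices by degree (which restores the locality benefit
    of degree-aware placement). Illustrative sketch only."""
    rng = random.Random(seed)
    parts = [[] for _ in range(num_parts)]
    for v in vertices:
        parts[rng.randrange(num_parts)].append(v)      # random assignment
    for p in parts:
        p.sort(key=lambda v: degree[v], reverse=True)  # degree-sort within partition
    return parts
```

Random assignment fixes the imbalance of pure degree-sorting, while the per-partition sort keeps the few high-degree vertices clustered for cache locality.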

Slide 10: Impact of Partitioning: PageRank, Random strategy
[Bar chart: computation time (sec) per partition, PID-0 through PID-3; load imbalance 3%]
Workload: RMAT-31 (64 billion undirected edges, 512 GB)

Slide 11: Impact of Partitioning: PageRank, Sorted strategy
[Bar chart: computation time per partition; load imbalance 46%]

Slide 12: Impact of Partitioning: PageRank, Hybrid strategy
[Bar chart: computation time per partition; load imbalance 5%]

Slide 13: Impact of Partitioning: PageRank
[Bar chart comparing the three strategies; performance gain of Hybrid: 2x and 1.18x]

Slide 14: Impact of Partitioning: BFS-DO
[Bar chart comparing the three strategies; performance gain of Hybrid: 1.55x and 5.3x]
All on RMAT-31 (64 billion undirected edges, 512 GB).
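The slides do not spell out how the load-imbalance percentages are computed; one common definition, which the sketch below assumes, is how far the slowest partition's time is above the mean:

```python
def load_imbalance(times):
    """Percent by which the slowest partition exceeds the average time.
    (An assumed, common definition; the slides do not define theirs.)"""
    mean = sum(times) / len(times)
    return (max(times) / mean - 1.0) * 100.0
```

With four perfectly balanced partitions this yields 0%, and a single straggler raises it in proportion to how far it lags the mean.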

Slide 15: Benefit 1: Summary
- Better load balance (more balanced partitions) does not necessarily mean better performance (compare the Random and Sorted strategies)
- The Hybrid strategy improves both load balance and performance
- Our infrastructure enables easy implementation of, and experimentation with, different partitioning strategies

Slide 16: Benefit 2: Communication Designs
- Explicit communication, embracing distributed-systems design:
  - NUMA 2-Box: allocate a message box on both the source and the destination
- Communication trade-offs, embracing the shared-memory nature of NUMA:
  - NUMA 1-Box: allocate a message box on one partition only, and pass a pointer to the box to the other partition
  - NUMA 0-Box: drop the communication infrastructure entirely, since NUMA is a shared-memory system

Slide 17: Explicit Communication: NUMA 2-Box
- Treats NUMA as a shared-nothing distributed system; maintains two message boxes (an outbox on the source and an inbox on the destination)
- Zero remote memory accesses during computation; message reduction in the outbox
[Diagram: two CPUs/partitions with CSR vertex (V), edge (E), and state (S) arrays, plus outbox and inbox buffers]
- BSP superstep: (1) computation: the kernel manipulates local state only, and updates to remote vertices are aggregated locally in the outbox; (2) communication: Comm1 transfers the outbox buffer to the remote inbox buffer, Comm2 merges the inbox with local state; (3) barrier synchronization

Slide 18: Explicit Communication: NUMA 2-Box: Evaluation
[Bar charts: execution time/iteration (sec), split into computation and communication]
- PageRank: Totem 44.34, numactl 25.57, NUMA-2B 20.76 + 0.64 communication; 2.07x gain
- BFS-DO: Totem 3.81, numactl 2.84, NUMA-2B 1.71 + 0.63 communication; 1.63x gain
Workload: RMAT-31 (64 billion undirected edges, 512 GB)
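A minimal sketch of the 2-Box exchange (illustrative Python; dicts stand in for the per-socket outbox/inbox buffers, and `combine` is the algorithm's reduction, e.g. `min` for BFS levels or addition for PageRank contributions):

```python
def two_box_superstep(partitions, combine):
    """NUMA 2-Box sketch: updates to remote vertices are first aggregated
    in a local outbox (message reduction), then each whole outbox is
    copied to the remote partition's inbox (Comm1: one bulk, sequential
    transfer), then merged into the remote local state (Comm2)."""
    # Comm1: bulk transfer of each outbox to the remote inbox.
    for p in partitions:
        for dst_id, outbox in p["outbox"].items():
            partitions[dst_id]["inbox"].update(outbox)  # stands in for memcpy
            outbox.clear()
    # Comm2: merge inbox messages into local vertex state.
    for p in partitions:
        for v, msg in p["inbox"].items():
            p["state"][v] = combine(p["state"].get(v, msg), msg)
        p["inbox"].clear()
```

Because the cross-socket transfer is one bulk copy, all random accesses stay local, which is exactly the property the design trades memory overhead for.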

Slide 19: Communication Trade-Off: NUMA 0-Box
- Treats NUMA as a distributed shared-memory system: explicit partitioning (the "distributed" part), implicit communication (the "shared-memory" part)
- No communication infrastructure at all; useful for applications that communicate via selected edges only
[Diagram: two CPUs/partitions with CSR vertex (V), edge (E), and state (S) arrays and no message boxes]
- BSP superstep: (1) computation overlapped with communication: the kernel updates local state and writes remote state directly; (2) barrier synchronization
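The 0-Box design can be sketched in the same toy style (illustrative; each tuple is an update the kernel applies directly, whether the target vertex is local or remote):

```python
def zero_box_update(partitions, updates, combine):
    """NUMA 0-Box sketch: no message boxes at all. The kernel applies each
    update straight to the owning partition's state; every cross-partition
    update therefore costs one random remote access, which is why this
    design pays off mainly for algorithms that communicate over only a
    subset of the edges."""
    for owner, vertex, value in updates:
        state = partitions[owner]["state"]
        state[vertex] = combine(state.get(vertex, value), value)
```

Compared with the 2-Box sketch, there is no aggregation step: the trade is communication infrastructure (and its memory overhead) for random remote accesses.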

Slide 20: NUMA 0-Box: Evaluation
[Bar charts: execution time/iteration (sec), split into computation and communication]
- PageRank: Totem 44.34, numactl 25.57, NUMA-2B 20.76 + 0.64 communication, NUMA-0B 23.44
- BFS-DO: Totem 3.81, numactl 2.84, NUMA-2B 1.71 + 0.63 communication, NUMA-0B 1.89
- Labeled gains: 2.07x and 1.63x
Workload: RMAT-31 (64 billion undirected edges, 512 GB)

Slide 21: Evaluation on Real-World Graphs
[Bar charts: execution time (sec), split into computation and communication]
- BFS Top-Down (BFS-TD) on Twitter (|V| = 51 million, |E| = 3.9 billion, 16 GB): Totem 1514.44, numactl 1684.06, NUMA-2B 986.56 + 179.53 communication (1.29x), NUMA-0B 954.82
- PageRank on clueWeb12 (|V| = 978 million, |E| = 74 billion, 286 GB): Totem 12651.64, numactl 7687.96, NUMA-2B 5061.16 + 2399.50 communication (1.69x), NUMA-0B 8440.94

Slide 22: Strong Scaling
[Bar chart: speedup against 1 socket, per benchmark: 3.7x, 2.9x, 2.7x, 2.8x]

Workloads: RMAT-30 (32B undirected edges, 256 GB) and RMAT-29 (weighted, 16B undirected edges, 192 GB)

Slide 23: Evaluation against Polymer

Algorithm  Workload  Time (sec)          Speedup  Memory (GB)         Efficiency
                     Polymer   NUMA-xB            Polymer   NUMA-xB
PageRank   RMAT30    26.2      8.29      3.16x    1401      330       4.24x
           RMAT31    -         21.40     -        -         674       -
           Twitter   1.53      0.75      2.04x    144       21.8      6.61x
BFS        RMAT30    34.63     4.94      7.01x    1302      322       4.04x
           RMAT31    -         12.63     -        -         653       -
           Twitter   13.1      0.94      13.9x    93        19.2      4.84x
SSSP       RMAT29    24.95     9.56      2.61x    886       232       3.82x
           RMAT30    -         21.79     -        -         452       -
           Twitter   5.3       8.4       0.63x    115       41        2.8x

One Polymer run ended with "Segmentation fault (core dumped)".
*We were able to run up to Scale 32 as well as the clueWeb graph, but Polymer couldn't.

Slide 24: Conclusion
A distributed-memory-like middleware provides:
- The opportunity to design partitioning schemes on a single node: better load balance (improvement of up to 9x) and better overall performance (up to 5.3x)
- The opportunity to explore communication strategies: we introduced a hybrid of distributed and SMP system designs
- Better scalability: up to 3.7x on a 4-socket machine
- High performance: up to 2.25x faster than a NUMA-oblivious framework and 13.9x faster than a NUMA-aware framework
- Graph500: SSSP world rank 2 (ISC, June 2018) and world rank 3 (SC, 2017); BFS among the top 3 single nodes

Slide 25: Thank You

Slide 26 (backup): Graph Partitioning Strategies
[Log-scale charts of #vertices and #edges per partition (PID1-PID4); imbalance 1.00017x for the Random strategy vs. 10827x for the Sorted/degree-aware strategy]

Workload: RMAT-31 (64 billion undirected edges, 512 GB)

Slide 27: Conclusion / Publications
- Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu, "Scale-Free Graph Processing on a NUMA Machine," IEEE Workshop on Irregular Applications: Architectures and Algorithms (IA3), co-located with SC, November 2018.
- Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu, "How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform?," IEEE Workshop on High-Performance Big Data, Deep Learning, and Cloud Computing (HPBDC), co-located with IPDPS, Vancouver, May 2018.
- Graph500 submission, up to RMAT32 (128 billion undirected edges; edge list size 1 TB): SSSP world rank 2 (ISC, June 2018) and world rank 3 (SC, November 2017); BFS among the top 3 single nodes
- Headline comparisons: 13.79x vs. Polymer (SJTU), 2.25x vs. Totem (UBC); related frameworks: Ligra (CMU), X-Stream (EPFL), GraphMat (Intel), Gunrock (UC Davis), nvGRAPH (Nvidia), and work from UT Austin

Slide 28: High-Level View of the System Implementation
- BSP engine; user inputs: the graph, the graph_kernel to run, and the partitioning and communication strategy
- Graph partitioning: Random, Degree-aware, or Hybrid, producing partitions pid0..pid3
- Computation: one parent thread is launched on each NUMA domain to initiate the superstep for the local partition; graph_kernel(pid) is processed by child threads within that NUMA domain
- Communication: libnuma for NUMA-aware allocation; NUMA 2-Box, 1-Box, or 0-Box designs
- Barrier synchronization; supersteps continue until the global finish flag is set

Slide 29: Challenges in Partitioning
- Why not a specific partitioning algorithm such as min-cut? It is an NP-complete problem, and partitioning would take more time than the benchmark execution itself. Partitioning is complex, and we are looking for simple strategies; our infrastructure is flexible enough that one can plug in any partitioning scheme. We picked one that is simple and works very well.
- These simple techniques assume there are no natural communities/clusters in these graphs, and we are skeptical that natural clusters exist in many of them.
- Some real-world graphs are highly skewed and need more supersteps to converge. A possible solution is the edge-balanced, vertex-cut approach adopted by GraphLab and several other distributed frameworks; supporting vertex-cut partitioning, however, would require additional changes to our core processing engine and is a candidate for future work.

Slide 30: Variability across Communication Designs
- Variability arises because the 2-Box and 0-Box designs have very different memory access patterns.
- NUMA 2-Box gains from local accesses; its remote accesses are sequential and cost very little (data is copied in bulk, followed by local random accesses).
- NUMA 0-Box makes all remote accesses random, one per remote edge. Different algorithms move different volumes of data: some update all neighbors, some only a subset. This explains why some algorithms do better with one design and some with the other.

Slide 31: Testbed Characteristics
- System: 4x Intel Xeon E7-4870 v2 (IvyBridge); 60 cores; 1536 GB DDR3; 120 MB last-level cache
- Throughput (MB/s), local vs. remote: sequential read 2464 vs. 2069; random read 286 vs. 226; sequential write 1438 vs. 1024; random write 238 vs. 188 (local up to 40% faster; sequential 6x to 9x faster than random)
- Memory latency (ns): local 119, remote 178 (local 52% faster)

Slide 32: Memory Bandwidth
[Line chart: memory bandwidth (GB/s) vs. workload size from 1 MB to 1024 MB, for local/remote, random/sequential reads and writes]

Slide 33: Latest Intel Xeon
- System: 8x Intel Xeon E7-8800 v4; 192 cores; 24 TB DDR4; 480 MB last-level cache

Slide 34: Remote Vertices vs. Remote Updates in BFS-DO
[Bar chart: frequency (millions) of #remote updates vs. #remote vertices; 22x difference]
Workload: RMAT-31 (64 billion undirected edges, 512 GB)

Slide 35: Applications and Workloads
- PageRank: high computational intensity; stable workload per superstep
- BFS Top-Down: memory bound; suffers from high write traffic
- BFS Direction-Optimized (BFS-DO): requires hand-tuning the switching parameters
- BFS Graph500

- SSSP: requires a distance vector
- SSSP Graph500: requires a distance vector and the SSSP tree

Dataset     #Vertices  #Edges  Size (GB)
RMAT28      256 M      8 B     64
RMAT29      512 M      16 B    128
RMAT30      1 B        32 B    256
RMAT31      2 B        64 B    512
RMAT32      4 B        128 B   1024
Twitter     51 M       3.9 B   15
clueWeb12   978 M      74 B    286

Slide 36: Explicit Communication: NUMA 2-Box (with cost model; V = vertices per partition, E = local edges, E' = remote edges, N = number of partitions)
- Computation: the kernel manipulates local state only; NUMA is treated as a shared-nothing distributed system with two message boxes, and updates to remote vertices are aggregated locally
[Diagram: two CPUs/partitions with CSR V/E/S arrays plus outbox and inbox buffers]
- Computation cost (local random updates): V * WriteSeqLocal + (E + E') * ReadRandLocal
- Communication cost (Comm1: memcpy of the outbox to the remote inbox; Comm2: local random updates merging the inbox with local state): memcpy + (N-1) * V * (ReadRandLocal + WriteSeqLocal)

Slide 37: Communication Trade-Off: NUMA 1-Box
- Since NUMA is a shared-memory system, we can pass a pointer instead: only one box is physically allocated
- Computation is unchanged: the kernel manipulates local state and aggregates updates to remote vertices locally; cost V * WriteSeqLocal + (E + E') * ReadRandLocal
- Communication (pointer copy, then merging with local state directly from the single box): (N-1) * V * (ReadRandLocal + WriteSeqRemote)

Slide 38: Communication Trade-Off: NUMA 0-Box
- NUMA as a distributed shared-memory system: explicit partitioning, implicit communication, no communication infrastructure
- Overlaps computation with communication: the kernel updates remote state directly
- Cost (local random updates + remote random updates): V * WriteSeqLocal + E * ReadRandLocal + E' * ReadRandRemote

Slide 39 (backup, repeats slide 34): Remote Vertices vs. Remote Updates in BFS-DO
[Bar chart: frequency of #remote updates vs. #remote vertices; 22x difference]
Workload: RMAT-31 (64 billion undirected edges, 512 GB)

Slide 40: Supersteps Required for Different Algorithms

Algorithm  Workload   Supersteps
BFS-TD     RMAT30     7
           Twitter    15
           clueWeb12  133
BFS-DO     RMAT30     8
           Twitter    9
           clueWeb12  125
SSSP       RMAT30     11
           Twitter    33
           clueWeb12  129

Slide 41: Explicit Partitioning and Implicit Communication
[Bar charts: execution time/iteration (sec), split into computation and communication, for Totem, numactl, NUMA-2B, NUMA-1B, NUMA-0B]
- PageRank: Totem 44.34, numactl 25.57, NUMA-2B 20.76 + 0.64, NUMA-1B 21.08 + 1.87, NUMA-0B 23.44 (labeled gains 2.07x and 1.93x)
- BFS-DO: Totem 3.81, numactl 2.84, NUMA-2B 1.71 + 0.63, NUMA-1B 1.78 + 1.09, NUMA-0B 1.89 (labeled gains 1.63x and 1.33x)

Slide 42: PageRank Throughput
[Bar chart: billion TEPS for Totem, numactl, NUMA-2B, NUMA-1B, NUMA-0B across RMAT28-RMAT32, Twitter, clueWeb]

Slide 43: PageRank Speedup (baseline Totem = 1x)

Workload  numactl  NUMA-2B  NUMA-1B  NUMA-0B
RMAT28    1.51x    1.62x    1.54x    1.16x
RMAT29    1.74x    1.88x    1.77x    1.47x
RMAT30    1.67x    1.83x    1.69x    1.55x
RMAT31    1.73x    2.08x    1.93x    1.89x
RMAT32    1.84x    2.05x    ...      ...
Twitter   1.28x    1.63x    1.56x    1.50x
clueWeb   1.64x    1.69x    1.59x    1.49x

Slide 44: BFS-DO Throughput
[Bar chart: billion TEPS for Totem, numactl, NUMA-2B, NUMA-1B, NUMA-0B across RMAT28-RMAT31, Twitter, clueWeb]

Slide 45: BFS-DO Speedup (baseline Totem = 1x)

Workload  numactl  NUMA-2B  NUMA-1B  NUMA-0B
RMAT28    1.44x    1.61x    1.45x    2.25x
RMAT29    1.26x    1.53x    1.34x    2.01x
RMAT30    1.27x    1.34x    1.23x    1.67x
RMAT31    1.34x    1.63x    1.33x    2.02x
Twitter   1.04x    0.91x    0.86x    1.24x
clueWeb   1.14x    0.55x    0.48x    0.62x

Slide 46: BFS-TD Throughput
[Bar chart: billion TEPS for Totem, numactl, NUMA-2B, NUMA-1B, NUMA-0B across RMAT28-RMAT31, Twitter, clueWeb]

Slide 47: BFS-TD Speedup (baseline Totem = 1x)

Workload  numactl  NUMA-2B  NUMA-1B  NUMA-0B
RMAT28    1.19x    1.33x    1.33x    1.34x
RMAT29    1.26x    1.35x    1.35x    1.39x
RMAT30    1.15x    1.22x    1.29x    1.32x
RMAT31    1.08x    1.42x    1.38x    1.30x
Twitter   0.90x    1.29x    1.23x    1.58x
clueWeb   1.04x    0.72x    0.81x    1.10x

Slide 48: SSSP Throughput
[Bar chart: billion TEPS across the same workloads]

Slide 49: SSSP Speedup (baseline Totem = 1x)

Workload  numactl  NUMA-2B  NUMA-1B  NUMA-0B
RMAT28    2.33x    1.22x    1.18x    2.27x
RMAT29    2.08x    1.33x    1.09x    2.12x
RMAT30    1.76x    1.19x    1.02x    1.77x
RMAT31    1.07x    1.10x    0.85x    1.45x
Twitter   1.35x    1.11x    1.14x    1.73x
clueWeb   1.18x    1.05x    0.88x    1.26x

Slide 50 (repeats slide 40): Supersteps Required for Different Algorithms

Slide 51: Strong Scaling w.r.t. Resources
[Bar chart: speedup against 1 socket for 2, 3, and 4 sockets; values include 3.38x, 3.7x, 3.43x, 3.15x, 2.9x, 2.02x, 2.2x, 2.33x, 2.14x, 2.67x, 2.73x, 2.39x, 2.51x, 2.77x, 2.72x, 2.07x, 1.74x, 1.31x]

Slide 52: Degree Distribution of Twitter (|V| = 52,579,683; |E| = 3,926,527,642)
[Log-log plot of #vertices vs. degree: 84.4% of vertices have degree < 50; max degree 3.69 M]

Slide 53: Degree Distribution of clueWeb12 (|V| = 978,408,098; |E| = 74,744,358,622)
[Log-log plot: 65% of vertices have degree < 50; max degree 75.6 M]

Slide 54: Degree Distribution of RMAT30 (|V| = 1,073,741,823; |E| = 34,359,738,336)
[Log-log plot: 86.9% of vertices have degree < 50; max degree 14.51 M]

Slide 55: Remote/Local Edges and Vertices
[Table: per-workload percentages of remote edges, local edges, local vertices, remote vertices; extracted value pairs: RMAT28: 74.93, 37.88; RMAT29: 74.89, 36.93; RMAT30: 75.05, 36.05; RMAT31: 75.13, 35.22]

Slide 56: Performance Model for the Three Designs (V = vertices per partition, E = local edges, E' = remote edges, N = number of partitions)
- NUMA 2-Box: computation (local random updates): V * WriteSeqLocal + (E + E') * ReadRandLocal; communication (local random updates + memcpy): (N-1) * V * (WriteSeqLocal + ReadRandLocal) + memcpy()
- NUMA 1-Box: computation: V * WriteSeqLocal + (E + E') * ReadRandLocal; communication (pointer copy + local random updates): (N-1) * V * (WriteSeqRemote + ReadRandLocal)
- NUMA 0-Box: overlapped computation and communication (local + remote random updates): V * WriteSeqLocal + E * ReadRandLocal + E' * ReadRandRemote
- Use case, PageRank (cost model vs. empirical, normalized to 2-Box): 2-Box 1x / 1x; 1-Box 1.04x / 1.06x; 0-Box 1.27x / 1.23x

Slide 57: Applications (BFS Top-Down, BFS-DO, Graph500 BFS, PageRank, SSSP, Graph500 SSSP)
[Example graph with vertices 0-9 used in the following walkthrough]
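The communication terms of this model can be evaluated numerically to compare the designs for a given graph and machine (a sketch; the cost keys and the per-access cost numbers used below are placeholders, not the testbed's measurements):

```python
def comm_cost(design, v, e_remote, n, t):
    """Per-superstep communication cost from the slide's model.
    v: boundary vertices per partition, e_remote: remote edges (E'),
    n: number of partitions, t: per-element access costs (placeholder
    keys: wsl/wsr = write seq local/remote, rrl/rrr = read rand
    local/remote, memcpy = bulk copy)."""
    if design == "2box":
        # Comm1 bulk memcpy plus Comm2 local merge (ReadRandLocal + WriteSeqLocal)
        return (n - 1) * v * (t["wsl"] + t["rrl"] + t["memcpy"])
    if design == "1box":
        # merge straight out of the single box: ReadRandLocal + WriteSeqRemote
        return (n - 1) * v * (t["wsr"] + t["rrl"])
    if design == "0box":
        # one random remote read per remote edge, overlapped with compute
        return e_remote * t["rrr"]
    raise ValueError(design)
```

Plugging in measured per-access costs would reproduce the model's prediction that 0-Box loses ground as the number of remote edges grows, while the box designs scale with boundary vertices instead.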

Slides 58-61: BFS Top-Down walkthrough on the example graph, from source 5
[Animation frames: frontier {5} -> next frontier {8,1,3} -> {7,2,9,6} -> {4} -> empty; vertices are labeled with their BFS levels]

Slides 62-65: BFS Direction-Optimized walkthrough on the same graph
[Animation frames: frontier {5} -> {8,1,3} -> {7,2,9,6}; once the frontier is large, the traversal switches direction and the remaining unvisited vertices pull from the frontier, finishing with an empty next frontier]

Slides 66-70: SSSP walkthrough on the same graph
[Animation frames: edge weights and per-vertex distance labels; the final (Graph500 SSSP) frame shows each vertex with its (distance, parent) pair]

Slide 71 (repeats slide 52): Degree Distribution of Twitter (|V| = 52,579,683; |E| = 3,926,527,642)
[Log-log plot: 84.4% of vertices have degree < 50; 0.1% of vertices account for 25.5% of edges; max degree 3.69 M]
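The top-down walkthrough can be reproduced with a level-synchronous BFS (Python sketch; the adjacency list below is constructed to match the slides' frontiers, since the deck's exact edge set is not recoverable from the transcript):

```python
def bfs_top_down(adj, source):
    """Level-synchronous top-down BFS as in the walkthrough: expand the
    current frontier, mark unvisited neighbors as the next frontier, and
    repeat until the next frontier is empty. Returns the frontiers."""
    visited = {source}
    frontier = [source]
    frontiers = [frontier]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj.get(u, []):
                if v not in visited:   # each vertex enters one frontier only
                    visited.add(v)
                    nxt.append(v)
        if not nxt:
            break
        frontiers.append(nxt)
        frontier = nxt
    return frontiers

# Adjacency constructed to reproduce the slides' frontiers (illustrative):
adj = {5: [8, 1, 3], 8: [7], 1: [2], 3: [9, 6], 7: [4]}
```

Running `bfs_top_down(adj, 5)` yields the frontier sequence [5], [8,1,3], [7,2,9,6], [4], matching slides 58-61.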

Slide 72 (repeats slide 53): Degree Distribution of clueWeb12 (|V| = 978,408,098; |E| = 74,744,358,622)
[Log-log plot: 65% of vertices have degree < 50; 0.1% of vertices account for 15.4% of edges; max degree 75.6 M]

Slide 73 (repeats slide 54): Degree Distribution of RMAT30 (|V| = 1,073,741,823; |E| = 34,359,738,336)
[Log-log plot: 86.9% of vertices have degree < 50; 0.1% of vertices account for 48.7% of edges; max degree 14.51 M]

Slide 74: BFS-DO Throughput with Cache Hit Rates
[Bar chart: billion TEPS for Totem, numactl, NUMA-2B, NUMA-1B, NUMA-0B across RMAT28-RMAT31, Twitter, clueWeb; annotated cache hit rates: 47.1%, 62.0%, 30.7%, 23.4%]

Recently Viewed Presentations

  • WAWASAN NUSANTARA - Gunadarma

    WAWASAN NUSANTARA - Gunadarma

    Pada waktu itu berkembang paham perdagangan bebas (Merchantilism). Menurut mereka ukuran keberhasilan ekonomi suatu negara adalah seberapa besar surplus ekonominya, terutama diukur dengan seberapa banyak emas yang dimiliki oleh negara itu. LANJUTAN e.
  • The Disney Case

    The Disney Case

    Van Gorkom? Van Gorkom unilaterally negotiated the merger with Pritzker The Disney Case Professor D. Gordon Smith University of Wisconsin Law School The Players "Triad of Fiduciary Duty" Traditional "Bad Faith" in Corporate Law Illegal Fraudulent Ultra vires The Good...
  • Revising Rivers - vle.wildern.hants.sch.uk

    Revising Rivers - vle.wildern.hants.sch.uk

    Revising Rivers Section A of the Key Themes Exam Case Studies include - The River Severn Boscastle Floods 2004/2007 Mozambique Floods 2000 What does the unit include?
  • Learning Analytics Update Paul Bailey and Michael Webb

    Learning Analytics Update Paul Bailey and Michael Webb

    Paul Bailey and Michael Webb. Learning Analytics Update. Go to 'View' menu > 'Header and Footer…' to edit the footers on this slide (click 'Apply' to change only the currently selected slide, or 'Apply to All' to change the footers...
  • The 3.8 Paragraph

    The 3.8 Paragraph

    Example: The little town of Romney was an important hub for the east-west railroad lines. The area changed hands 56 times during the course of the war because each side wanted control of the B&O Railroad. Sentence 8: The conclusion...
  • Candidate Pack Fisheries Licensing Officer Welcome About the

    Candidate Pack Fisheries Licensing Officer Welcome About the

    Candidate Pack Fisheries Licensing Officer
  • Overlapping Communities for Identifying Misbehavior in Network Communications

    Overlapping Communities for Identifying Misbehavior in Network Communications

    A modularity-based community detection algorithm is not useful. Our alternative definitionis being . member of multiple communities. Algorithms which find overlapping communities can be used for intrusion detection. Non-overlapping communities can be enhanced with . auxiliary communities. for intrusion detection....
  • Moving Toward Catholicism

    Moving Toward Catholicism

    EMERGING VS EMERGENT "Emerging is the wider, informal, global, ecclesial (church centered) focus of the movement, while Emergent is an official organization in the U.S. and the U.K. Emergent Village, the organization, is directed by Tony Jones, a Ph.D. student...