GitHub - jian-ru/Project1-CUDA-Flocking: An introduction to CUDA programming by way of a Boids Flocking simulation

University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking

Jian Ru
Tested on: Windows 10, i7-4850 @ 2.30GHz 16GB, GT 750M 2GB (Personal)

Results

Parameters
- Number of particles: 40,000
- Blocks: 40, 1, 1
- Threads: 128, 1, 1
- Rule distances: 5.0, 3.0, 5.0
- Rule scales: 0.01, 0.1, 0.1
- Scene scale: 100.0
- Delta time: 0.2

Analysis

Simulation Time vs. Number of Particles
- For the brute-force version, the simulation time grows polynomially as the particle count increases. This is expected: each thread does only O(n) work, but there is one thread per particle, so the total work is O(n^2), and with this many particles the GPU cannot run all of the threads at once. The running time therefore still grows polynomially, though less steeply than it would for a sequential implementation.
- The scattered and coherent grid versions still show slight polynomial growth, but they grow much more slowly and look almost linear. This is expected because each particle has far fewer neighbours to examine in each step. Statistically, the number of neighbours per particle grows linearly with the particle count, and the number of threads grows at the same time, so the overall cost should be a little more than O(n).
Simulation Time vs. Block Size
- The relationship between simulation time and block size looks somewhat irregular, but this is expected. Because the GPU guarantees that each block executes on a single SM, putting more threads that access the same memory region with a similar access pattern into one block should improve performance through a higher cache hit rate. On the other hand, putting too many threads into a single block may hurt performance if the SM cannot execute all of the block's threads at once.
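
The trade-off can be sketched with some simple launch arithmetic. The C++ helper below is a rough, hypothetical estimate (`estimateLaunch` and `LaunchEstimate` are not part of this project): it assumes residency is limited only by the per-SM thread and block limits, ignores register and shared-memory pressure, and takes the SM count and limits as caller-supplied values that must be looked up for the actual GPU.

```cpp
#include <algorithm>
#include <cassert>

struct LaunchEstimate {
    int blocks;                // blocks needed to cover every particle
    int warpsPerBlock;         // warps occupied by one block
    int residentBlocksPerSM;   // blocks one SM can hold at the same time
    int residentThreadsPerSM;  // threads one SM holds with that many blocks
    int waves;                 // rounds of blocks needed to finish the launch
};

// Simplified model: only the thread and block limits per SM are considered.
LaunchEstimate estimateLaunch(int numParticles, int blockSize, int numSMs,
                              int maxThreadsPerSM, int maxBlocksPerSM,
                              int warpSize = 32) {
    assert(numParticles > 0 && blockSize > 0 && numSMs > 0 && warpSize > 0);
    assert(blockSize <= maxThreadsPerSM && maxBlocksPerSM > 0);

    LaunchEstimate e{};
    e.blocks = (numParticles + blockSize - 1) / blockSize;   // ceiling division
    e.warpsPerBlock = (blockSize + warpSize - 1) / warpSize; // partial warps count as full
    const int byThreads = maxThreadsPerSM / (e.warpsPerBlock * warpSize);
    e.residentBlocksPerSM = std::min(maxBlocksPerSM, byThreads);
    e.residentThreadsPerSM = e.residentBlocksPerSM * blockSize;

    const int concurrentBlocks = e.residentBlocksPerSM * numSMs;
    e.waves = (e.blocks + concurrentBlocks - 1) / concurrentBlocks;
    return e;
}
```

Under this model, very small blocks run into the per-SM block limit and leave thread slots unused, while very large blocks leave room for only one or two blocks per SM. That gives one plausible reason why timings do not change monotonically with block size.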
Coherent Grid vs. Scattered Grid
- In my experiments, the coherent grid performs better than the scattered grid. This is expected: reordering the position and velocity arrays has a cost, but in this case the gain from a higher cache hit rate outweighs the cost of the copies and the additional kernel calls. Because adjacent threads tend to share neighbouring cells, they tend to access the same memory regions as they execute. Even a single thread benefits, because after sorting, the data of the particles in one cell is stored in a single contiguous memory region.
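
The reordering step can be sketched on the CPU as follows. This is a minimal C++ illustration under stated assumptions, not the project's GPU implementation: `makeCoherent` and `CoherentData` are hypothetical names, each particle's cell index is assumed to be computed already, and a `std::stable_sort` on the host stands in for whatever key-value sort the GPU version uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct Vec3 { float x, y, z; };

struct CoherentData {
    std::vector<int>  cell;       // cell index of the particle stored in each slot
    std::vector<Vec3> pos;        // positions reordered so each cell is contiguous
    std::vector<Vec3> vel;        // velocities reordered the same way
    std::vector<int>  cellStart;  // first slot of each cell, -1 if the cell is empty
    std::vector<int>  cellEnd;    // one past the last slot of each cell, -1 if empty
};

CoherentData makeCoherent(const std::vector<int>& cellIndex,
                          const std::vector<Vec3>& pos,
                          const std::vector<Vec3>& vel,
                          int numCells) {
    assert(cellIndex.size() == pos.size() && pos.size() == vel.size());
    const std::size_t n = cellIndex.size();

    // Sort particle indices by cell index (the scattered grid stops here and
    // reads pos[order[slot]] indirectly during the neighbour search).
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return cellIndex[a] < cellIndex[b]; });

    CoherentData out;
    out.cellStart.assign(numCells, -1);
    out.cellEnd.assign(numCells, -1);
    for (std::size_t slot = 0; slot < n; ++slot) {
        const std::size_t src = order[slot];
        const int c = cellIndex[src];
        assert(c >= 0 && c < numCells);
        // The extra copy: gather particle data into cell order.
        out.cell.push_back(c);
        out.pos.push_back(pos[src]);
        out.vel.push_back(vel[src]);
        if (slot == 0 || out.cell[slot - 1] != c) out.cellStart[c] = static_cast<int>(slot);
        out.cellEnd[c] = static_cast<int>(slot) + 1;
    }
    return out;
}
```

After this step, a neighbour search over cell `c` reads `out.pos[cellStart[c]]` through `out.pos[cellEnd[c] - 1]` as one contiguous range, which is the access pattern the paragraph above credits for the higher cache hit rate.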

