Home
Welcome to the NeuroGPU wiki!
The general idea of NeuroGPU is to take a simulation of a detailed biophysical neuron model, built in the NEURON simulation environment, and port it to CUDA. A common use for these kinds of models is to simulate a neuron that we recorded from a living brain: to fit the model to the recorded data we need to tune the parameters of the model until they mimic the real data. This process is done using an optimization algorithm and requires many model simulations with different parameters. NeuroGPU takes a model and creates many instances of it with different parameters, which it reads from the file ./Data/AllParams.csv (the first line indicates the number of models to be simulated).
Translating a model from NEURON to CUDA - this is done via Python, and the code discussed here assumes a model has already been translated. I am using a model from the Blue Brain Project portal, as they would probably be one of the institutes that would profit most from accelerating these kinds of models.
Data folder - there are several necessary files, but I think the only one that is relevant here is AllParams.csv. When using one GPU, 1024 models utilize all the SMs, since a lower number of models takes a similar time to compute.
Simulating neurons - the general idea is to divide the neuron into several compartments/segments (there is a slight difference between the two; treat them as the same for simplicity) and model each piece of the neuron membrane as an electrical circuit (compartment). Each circuit is then described by a differential equation, and the coupled system can be represented as a quasi-tridiagonal matrix - quasi in the sense that some elements fall off the diagonals wherever the neuron tree branches. At each time step of the simulation we update the matrix and solve it to get the voltage of each segment in the neuron, then continue to the next time step.
In main (kernel.cu) we call RunByModelP (Main.cu), which reads all the data to the host and calls steFork2Main (the name stands for staggered Euler, which was the original method for solving the tridiagonal matrix; it is now Crank-Nicolson).
steFork2Main (CudaStuff.cu) is in effect the main function. It first initializes all the framework data. We divided those data structures as follows:
constants - variables that never change, mostly holding the data needed to solve the tridiagonal matrix in parallel. Constants are initialized in initFrameWork (CudaStuff.cu).
frequently changed data structures - these are stored in shared memory in the second half of callKernel.
local memory (registers) - the parameters are stored in local memory; this probably causes register spilling, but I am not sure what else to do, as they are necessary.
After initializing everything we call the device code in NeuroGPUKernel; the whole simulation runs there:
it first arranges the shared memory,
then runs the main loop that iterates over all the time steps (line #379 in CudaStuff.cu).
At each time step we update the matrix by calling all the mechanisms that are used in the specific compartment. They can be viewed as the different elements in that compartment's electrical circuit.
Then we solve the matrix in parallel to get the voltages at the current step, in the functions beforeLU and bksub.
When calling the global kernel, the x dimension of the block is set to 32 no matter how big the neuron is (how many compartments it has). If the neuron had only 32 compartments, each thread in the block would be responsible for updating the tridiagonal-matrix elements of its own segment. Usually neurons have many more than 32 compartments/segments, but I make sure the count is a multiple of 32, and then we use instruction-level parallelism to simulate multiple compartments in the same thread.
In this version we have only one global kernel that calls several device kernels. The main changes in this code are that ModelStates was moved from registers to global memory, and that grid-stride loops are now used.
In this version the code is divided into several global kernels, which brings some overhead from copying memory but should be better at hiding the latency of memory transfers.
In each of the code folders, go into the src folder and type:
make
This will create a bin folder; to run, cd into bin and type:
./neuroGPU
In the Code folder there is a Python file named test_output.py. Copy it to the Data folder you are testing and run:
python test_output.py
If you get "pass results are good", then the output is good! Otherwise there might be a problem, and the script prints:
error:: results are out of physiological bounds
In that case, please shoot me an email: roy.benshalom@ucsf.edu
There are two profiling reports, one for each version of the code. I think the main performance issue is the low utilization of the GPU: in the oneKernel version we get ~10%,
and in the splitted version we get 3.6%, although runtime is very similar.
After running, the terminal output should look like this:

```
Asking 8736 bytes, 0+8736*1
kernel not ran yet
kernel ran before memcpyasync currkernel run is 1024
done copying*&&&&&&
done synch0
it took 20902.156250 ms
printing ../Data/VHotP.dat size is 3244032
```