I am wondering what the best way is to do the following in CUDA. Imagine you have a long array and want the sum of all its elements to be below 1. If the sum is above 1, you divide every element by 2 and calculate the sum again. Both the division by two and the summation are done on the GPU. My question is: what is the best way to check on the CPU side whether the sum is below 1? I could call cudaMemcpy in every iteration, but I have also read (and seen) that it is better to do as few transfers between host and device memory as possible. I have found
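To make the setup concrete, here is a minimal sketch of the loop described above, in the variant that copies the sum back with cudaMemcpy on every iteration. It is only an illustration under assumptions: the names `halve`, `sum_kernel`, and `scale_until_below_one`, the block size of 256, and the `max_iters` safety cap are all hypothetical and not part of any existing code. The reduction leaves the sum in a single device float, so each check transfers only 4 bytes.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: halve every element in place.
__global__ void halve(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;
}

// Hypothetical kernel: add the array sum into *sum (caller zeroes *sum first).
// Each block reduces its share in shared memory, then issues one atomicAdd.
__global__ void sum_kernel(const float* data, int n, float* sum) {
    __shared__ float partial[256];  // assumes blockDim.x == 256
    int tid = threadIdx.x;
    float acc = 0.0f;
    // Grid-stride loop so the whole array is covered for any grid size.
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        acc += data[i];
    partial[tid] = acc;
    __syncthreads();
    // Tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(sum, partial[0]);
}

// Halves d_data (a device pointer) until its sum is below 1.
// Returns the number of halvings performed, or -1 on a CUDA error.
// max_iters is an assumed safety cap, not part of the original problem.
int scale_until_below_one(float* d_data, int n, int max_iters) {
    if (n <= 0) return 0;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    float* d_sum = nullptr;
    if (cudaMalloc(&d_sum, sizeof(float)) != cudaSuccess) return -1;

    int halvings = 0;
    for (;;) {
        cudaMemset(d_sum, 0, sizeof(float));
        sum_kernel<<<blocks, threads>>>(d_data, n, d_sum);

        // Copy back a single float; this call also waits for the kernels above.
        float h_sum = 0.0f;
        if (cudaMemcpy(&h_sum, d_sum, sizeof(float),
                       cudaMemcpyDeviceToHost) != cudaSuccess) {
            halvings = -1;
            break;
        }
        if (h_sum < 1.0f || halvings >= max_iters) break;

        halve<<<blocks, threads>>>(d_data, n);
        ++halvings;
    }
    cudaFree(d_sum);
    return halvings;
}
```

In this sketch the array itself never leaves the device, so the per-iteration cost on the host side is the synchronization implied by the blocking copy rather than the amount of data moved.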