**What API design would you like to have changed or added to the library? Why?**

Is it possible to allow placing every tensor attribute of a scheduler on a CUDA device?

In https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_lcm.py, attributes like `scheduler.alphas_cumprod` are tensors on CPU, but `scheduler.set_timesteps()` allows placing `scheduler.timesteps` on a GPU/CUDA device. Doesn't this cause a device mismatch when `scheduler.step()` indexes `scheduler.alphas_cumprod` with `scheduler.timesteps`?

Below is a snippet showing that the pipeline indexes a CPU tensor (`alphas_cumprod`) with a GPU tensor (`timestep`). I simply added the following lines at the beginning of `scheduler.step()` to print the type and device of `timestep` and `self.alphas_cumprod`:

```
print("Printing scheduler.step() timestep")
print(type(timestep))
print(isinstance(timestep, torch.Tensor))
print(timestep.device)
print("Printing scheduler.step() self.alphas_cumprod")
print(type(self.alphas_cumprod))
print(isinstance(self.alphas_cumprod, torch.Tensor))
print(self.alphas_cumprod.device)
```

Output when running text-to-image:

```
Printing scheduler.step() timestep
<class 'torch.Tensor'>
True
cuda:0
Printing scheduler.step() self.alphas_cumprod
<class 'torch.Tensor'>
True
cpu
```

**What use case would this enable or better enable? Can you give us a code example?**

We are using a modified LCMScheduler (99% identical to the original LCMScheduler) for video generation, which generates frames repeatedly in a loop. Most of the time this step does not cause a performance issue, but we do see intermittent high CPU usage and latency at `alpha_prod_t = self.alphas_cumprod[timestep]`. The torch.profiler and tracing output also show high latency for this specific step, and we are wondering whether this cross-device indexing is the performance bottleneck.
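As a code example, here is a minimal sketch of the kind of manual workaround this currently implies: moving `alphas_cumprod` onto the same device as `timesteps` before the frame loop, so that `step()` indexes a CUDA tensor with a CUDA tensor. The model id and the direct attribute assignment are illustrative assumptions, not an official diffusers API:

```
import torch
from diffusers import LCMScheduler

device = torch.device("cuda")

# Model id is illustrative; any LCM checkpoint with a scheduler subfolder works.
scheduler = LCMScheduler.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", subfolder="scheduler"
)

# set_timesteps() can place `timesteps` on CUDA, but `alphas_cumprod` stays on CPU.
scheduler.set_timesteps(num_inference_steps=4, device=device)

# Manual workaround (assumption, not an official API): move the attribute once
# up front so `self.alphas_cumprod[timestep]` inside step() stays on one device
# instead of forcing a host/device round-trip on every call in the frame loop.
scheduler.alphas_cumprod = scheduler.alphas_cumprod.to(device)

for t in scheduler.timesteps:
    # ... UNet forward pass producing `model_output` for latents `sample` ...
    # scheduler.step(model_output, t, sample)
    pass
```

If `set_timesteps(..., device=...)` (or a `scheduler.to(device)` helper) moved all tensor attributes, this manual assignment would not be needed.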