Thanks for the good work! Could you explain why aggregating the feature outputs of adjacent layers improves video generation results? Since CogVideo's training data is not publicly available, I wonder whether this benefit might instead come from differences in training data distribution between RepVideo and CogVideo, which would make direct comparisons difficult. Are there any experimental results for RepVideo without the aggregation operation?