PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos. Yufei Zhang, Jeff Kephart, et al. CVPR 2024.
QAttn: Efficient GPU Kernels for Mixed-Precision Vision Transformers. Piotr Sebastian Kluska, Adrián Castelló, et al. CVPR 2024.
Machine Unlearning in Computer Vision: Foundations and Applications. Sijia Liu, Yang Liu, et al. CVPR 2024.
Resource-Efficient Transformer Pruning for Finetuning of Large Models. Fatih Ilhan, Gong Su, et al. CVPR 2024.
Grounding Everything: Emerging Localization Properties in Vision-Language Transformers. Walid Bousselham, Felix Petersen, et al. CVPR 2024.
Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance. Phuc Nguyen, Tuan Duc Ngo, et al. CVPR 2024.
What, When, and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions. Brian Chen, Nina Shvetsova, et al. CVPR 2024.
Overload: Latency Attacks on Object Detection for Edge Devices. Erh-Chung Chen, Pin-Yu Chen, et al. CVPR 2024.
SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge. Andong Wang, Bo Wu, et al. CVPR 2024.