Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99384

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 傅立成 | zh_TW |
| dc.contributor.advisor | Li-Chen Fu | en |
| dc.contributor.author | 李哲維 | zh_TW |
| dc.contributor.author | Che-Wei Lee | en |
| dc.date.accessioned | 2025-09-10T16:07:21Z | - |
| dc.date.available | 2025-09-11 | - |
| dc.date.copyright | 2025-09-10 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-04 | - |
| dc.identifier.citation | X. Tian et al., ‘Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving’, arXiv:2304.14365, 2023.
A.-Q. Cao and R. de Charette, ‘MonoScene: Monocular 3D Semantic Scene Completion’, arXiv:2112.00726, 2021.
M. Pan et al., ‘UniOcc: Unifying Vision-Centric 3D Occupancy Prediction with Geometric and Semantic Rendering’, arXiv:2306.09117, 2023.
Y. Zhang, Z. Zhu, and D. Du, ‘OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction’, arXiv:2304.05316, 2023.
Y. Wei, L. Zhao, W. Zheng, Z. Zhu, J. Zhou, and J. Lu, ‘SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving’, arXiv:2303.09551, 2023.
H. Jiang et al., ‘Symphonize 3D Semantic Scene Completion with Contextual Instance Queries’, arXiv:2306.15670, 2023.
Y. Li et al., ‘VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion’, arXiv:2302.12251, 2023.
S. Wang et al., ‘Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation’, arXiv:2404.11958, 2024.
Z. Yu et al., ‘Context and Geometry Aware Voxel Transformer for Semantic Scene Completion’, arXiv:2405.13675, 2024.
Z. Li et al., ‘FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation’, arXiv:2307.01492, 2023.
J. Huang, G. Huang, Z. Zhu, Y. Ye, and D. Du, ‘BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View’, arXiv:2112.11790, 2021.
Z. Li et al., ‘BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers’, arXiv:2203.17270, 2022.
Y. Huang, W. Zheng, Y. Zhang, J. Zhou, and J. Lu, ‘Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction’, arXiv:2302.07817, 2023.
Z. Yu et al., ‘FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin’, arXiv:2311.12058, 2023.
Z. Yu et al., ‘Panoptic-FlashOcc: An Efficient Baseline to Marry Semantic Occupancy with Panoptic via Instance Center’, arXiv:2406.10527, 2024.
H. Chen, J. Wang, Y. Li, N. Zhao, J. Cheng, and X. Yang, ‘Improving 3D Occupancy Prediction through Class-balancing Loss and Multi-scale Representation’, arXiv:2405.16099, 2024.
D. Lee, J. Park, and J. Kim, ‘Resolving Class Imbalance for LiDAR-based Object Detector by Dynamic Weight Average and Contextual Ground Truth Sampling’, arXiv:2210.03331, 2022.
Y. Huang, W. Zheng, Y. Zhang, J. Zhou, and J. Lu, ‘GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction’, arXiv:2405.17429, 2024.
Y. Huang, A. Thammatadatrakoon, W. Zheng, Y. Zhang, D. Du, and J. Lu, ‘GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction’, arXiv:2412.04384, 2024.
H. Caesar et al., ‘nuScenes: A multimodal dataset for autonomous driving’, arXiv:1903.11027, 2019.
Y. Li et al., ‘SSCBench: A Large-Scale 3D Semantic Scene Completion Benchmark for Autonomous Driving’, arXiv:2306.09001, 2023.
Y. Liao, J. Xie, and A. Geiger, ‘KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D’, arXiv:2109.13410, 2021.
M. Buda, A. Maki, and M. A. Mazurowski, ‘A systematic study of the class imbalance problem in convolutional neural networks’, arXiv:1710.05381, 2017.
Y. Aurelio, G. Almeida, C. Castro, and A. Braga, ‘Learning from Imbalanced Data Sets with Weighted Cross-Entropy Function’, Neural Processing Letters, vol. 50, 2019.
K. R. M. Fernando and C. P. Tsokos, ‘Dynamically Weighted Balanced Loss: Class Imbalanced Learning and Confidence Calibration of Deep Neural Networks’, IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 7, pp. 2940–2951, 2022.
J. Hou et al., ‘FastOcc: Accelerating 3D Occupancy Prediction by Fusing the 2D Bird’s-Eye View and Perspective View’, arXiv:2403.02710, 2024.
J. Philion and S. Fidler, ‘Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D’, arXiv:2008.05711, 2020.
H. Moravec and A. Elfes, ‘High resolution maps from wide angle sonar’, in Proceedings of the 1985 IEEE International Conference on Robotics and Automation, vol. 2, pp. 116–121, 1985.
S. Thrun, ‘Probabilistic robotics’, Communications of the ACM, vol. 45, no. 3, pp. 52–57, 2002.
K. He, X. Zhang, S. Ren, and J. Sun, ‘Deep Residual Learning for Image Recognition’, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, ‘Feature Pyramid Networks for Object Detection’, arXiv:1612.03144, 2016.
J. Dai et al., ‘Deformable Convolutional Networks’, arXiv:1703.06211, 2017.
A. G. Howard et al., ‘MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications’, arXiv:1704.04861, 2017.
J. Huang and G. Huang, ‘BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection’, arXiv:2203.17054, 2022.
J. Huang and G. Huang, ‘BEVPoolv2: A Cutting-edge Implementation of BEVDet Toward Deployment’, arXiv:2211.17111, 2022.
F. Chollet, ‘Xception: Deep Learning with Depthwise Separable Convolutions’, arXiv:1610.02357, 2017.
Y. Li et al., ‘BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection’, arXiv:2206.10092, 2022.
X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, ‘Deformable DETR: Deformable Transformers for End-to-End Object Detection’, arXiv:2010.04159, 2020.
W. Shi et al., ‘Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network’, arXiv:1609.05158, 2016.
Y. Tian, S. Bai, Z. Luo, Y. Wang, Y. Lv, and F.-Y. Wang, ‘MambaOcc: Visual State Space Model for BEV-based Occupancy Prediction with Local Adaptive Reordering’, arXiv:2408.11464, 2024.
I. Loshchilov and F. Hutter, ‘Decoupled Weight Decay Regularization’, arXiv:1711.05101, 2017.
O. Russakovsky et al., ‘ImageNet Large Scale Visual Recognition Challenge’, arXiv:1409.0575, 2014.
K. Chen et al., ‘MMDetection: Open MMLab Detection Toolbox and Benchmark’, arXiv:1906.07155, 2019.
F. Rosenblatt, ‘The perceptron: a probabilistic model for information storage and organization in the brain’, Psychological Review, vol. 65, no. 6, pp. 386–408, 1958.
A. Vaswani et al., ‘Attention Is All You Need’, arXiv:1706.03762, 2017.
A. Dosovitskiy et al., ‘An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale’, arXiv:2010.11929, 2020.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ‘Gradient-based learning applied to document recognition’, Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
D. V. Godoy, ‘Image via Wikimedia Commons’. [Online]. Available: https://creativecommons.org/licenses/by/4.0.
Y. Guo, Y. Li, R. Feris, L. Wang, and T. Rosing, ‘Depthwise Convolution is All You Need for Learning Multiple Visual Domains’, arXiv:1902.00927, 2019.
Z. Ye, T. Jiang, C. Xu, Y. Li, and H. Zhao, ‘CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction’, arXiv:2409.13430, 2024.
Y. Wang, Y. Chen, X. Liao, L. Fan, and Z. Zhang, ‘PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation’, arXiv:2306.10013, 2023.
H. Liu et al., ‘Fully Sparse 3D Occupancy Prediction’, arXiv:2312.17118, 2023.
Y. Lu, X. Zhu, T. Wang, and Y. Ma, ‘OctreeOcc: Efficient and Multi-Granularity Occupancy Prediction Using Octree Queries’, arXiv:2312.03774, 2023. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99384 | - |
| dc.description.abstract | 隨著自動駕駛的快速發展,三維感知已成為智慧汽車的核心能力。基於攝影機的系統必須準確理解環境的結構和語義,以支援運動規劃和決策等下游任務。在複雜動態的交通場景中,如何高效且準確地偵測物體,是確保行車安全的重要挑戰。
傳統方法利用三維體素表示進行編碼和解碼,雖然能達到良好性能,但常伴隨龐大的記憶體消耗與高昂的計算成本。為提升效率,部分研究將三維體素空間投影至二維鳥瞰圖(BEV)表示,有效降低記憶體開銷並維持合理性能。然而,在反向過程中,即自 BEV 特徵重建回三維空間時,常因缺乏垂直訊息而導致準確率下降。此外,現有的三維佔據資料集普遍存在嚴重的類別不平衡問題,對於如行人與摩托車等低出現頻率但對安全至關重要的目標,其表徵能力明顯不足。針對上述挑戰,我們提出 HintOcc,一個高效的三維佔據預測框架,旨在強化 BEV 特徵到三維體素空間的重建能力,並改善類別不平衡問題,同時提升真實世界資料集中代表性不足類別的性能。首先,我們引入二維垂直視圖分支,以結構提示形式提供高度訊息,輔助網路進行二維至三維特徵的重建。其次,我們採用可變形深度可分離預測頭進行空間自適應解碼,同時降低參數開銷。最後,我們提出批次動態加權策略,根據每個訓練批次中類別的出現情況,自適應地強化稀有類別的學習。我們在 Occ3D-NuScenes 基準上評估了我們的方法。實驗結果表明,在相當的計算限制條件下,HintOcc 的表現優於現有方法,且相較於基準模型提升了代表性不足類別的準確率。消融研究進一步驗證了各組件在增強基於二維鳥瞰圖特徵的三維佔據預測方面的有效性。 | zh_TW |
| dc.description.abstract | With the rapid advancement of autonomous driving technologies, 3D visual perception has become a critical component in the design of intelligent vehicles. Given the inherent complexity of real-world driving environments, camera-based 3D perception systems must provide accurate and comprehensive spatial and semantic understanding to support downstream tasks such as motion planning and decision-making. Accurately and efficiently identifying objects is vital for ensuring safety in complex and dynamic driving scenes.
Traditional approaches utilize 3D voxel representations for both encoding and decoding. Despite their notable performance, these methods often incur excessive memory consumption and high computational complexity. To improve efficiency, several works project the 3D voxel space onto a 2D bird's-eye view (BEV) representation, reducing memory consumption while preserving reasonable performance. Nonetheless, the inverse process, lifting 2D BEV features back into 3D, often lacks vertical information, leading to degraded accuracy. Moreover, existing 3D occupancy datasets often suffer from severe class imbalance: less frequent but safety-critical objects, such as pedestrians and motorcycles, are significantly underrepresented compared with dominant classes like drivable areas and vegetation. To strengthen 2D BEV-to-3D reconstruction and mitigate the class imbalance problem in 3D occupancy prediction, we propose HintOcc, an efficient framework that enhances BEV-to-3D reconstruction while improving performance on underrepresented classes in real-world datasets. First, we introduce a 2D vertical-view branch that provides structural hints to enrich height information for 2D-to-3D reconstruction. Second, we employ a deformable depthwise separable head for spatially adaptive decoding while reducing parameter overhead. Finally, we propose a batch-wise dynamic weighting strategy that adaptively emphasizes rare classes based on the classes present within each batch. (Hedged illustrative sketches of the prediction head, the weighting strategy, and the channel-to-height operator appear after the metadata table below.) We evaluate our method on the Occ3D-NuScenes benchmark. Experimental results show that HintOcc outperforms existing methods under comparable computational constraints and improves accuracy for underrepresented classes compared with the baseline model. Ablation studies further verify the effectiveness of each component in enhancing BEV-based 3D occupancy prediction. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-09-10T16:07:21Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-09-10T16:07:21Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements i
摘要 iii
Abstract v
Contents ix
List of Figures xiii
List of Tables xv
Chapter 1 Introduction 1
1.1 Background 1
1.1.1 3D Occupancy Prediction 2
1.2 Motivation 3
1.3 Challenges 4
1.3.1 Bird's-Eye View Reconstruction Lacks Vertical Information 4
1.3.2 Occupancy Prediction Head 5
1.3.3 Class-wise Performance Imbalance 5
1.4 Objectives 8
1.4.1 Vertical-View and BEV (Bird's-Eye View) Fusion 9
1.4.2 Leveraging Deformable and Depthwise Separable Convolution in the Prediction Head 9
1.4.3 Batch-wise Dynamic Class Weighting 10
1.5 Related Work 10
1.5.1 3D Voxel-based Occupancy Prediction 10
1.5.2 2D BEV-based Occupancy Prediction 12
1.5.3 Weighted Loss Function 13
1.6 Contribution 14
1.6.1 Enriching the BEV Reconstruction 14
1.6.2 Deformable Depthwise Separable Prediction Head 14
1.6.3 Dynamic Batch-wise Class Weighting 14
1.7 Thesis Organization 15
Chapter 2 Preliminaries 17
2.1 CNN 19
2.1.1 Convolution Layer 19
2.1.2 Pooling Layer 20
2.1.3 Activation Function 21
2.1.4 Loss Function 25
2.2 Variations of CNN 26
2.2.1 Depthwise Separable Convolution 26
2.2.2 Feature Pyramid Network 27
2.2.3 Deformable Convolution 28
2.3 Transformer 30
2.3.1 Attention 30
2.4 Backbone 32
2.4.1 ResNet 32
Chapter 3 Methodology 35
3.1 Architecture Overview 36
3.2 Feature Extraction 37
3.3 View Transformation 38
3.4 Temporal Fusion 40
3.5 BEV Encoder 40
3.6 Vertical-View Encoder 42
3.7 Deformable Depthwise Separable Prediction Head 44
3.8 Channel-to-Height Operator 45
3.9 Batch-wise Dynamic Weighting Strategy 46
3.10 Loss Function 48
Chapter 4 Experiments 50
4.1 Benchmark 50
4.2 Metrics 51
4.3 Implementation Details 52
4.4 Result 52
4.5 Ablation Study and Analysis 53
4.5.1 Ablation Study on the Effect of Proposed Components 54
4.5.2 Comparison on Weighting Strategies 54
4.5.2.1 Analysis on Efficiency 55 | - |
| dc.language.iso | en | - |
| dc.subject | 電腦視覺 | zh_TW |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | 鳥瞰圖 | zh_TW |
| dc.subject | 類別不平衡 | zh_TW |
| dc.subject | 三維佔據網路 | zh_TW |
| dc.subject | 3D Occupancy Prediction | en |
| dc.subject | Class Imbalance | en |
| dc.subject | Bird’s-Eye View | en |
| dc.subject | Computer Vision | en |
| dc.subject | Deep Learning | en |
| dc.title | HintOcc:透過空間感知與動態類別平衡提升鳥瞰圖轉三維的佔據預測 | zh_TW |
| dc.title | HintOcc: Enhancing BEV-to-3D Reconstruction in Occupancy Prediction with Spatial-Awareness and Dynamic Class Balancing | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 蕭培墉;黃世勳;施吉昇;傅楸善 | zh_TW |
| dc.contributor.oralexamcommittee | Pei-Yung Hsiao;Shih-Shinh Huang;Chi-Sheng Shih;Chiou-Shann Fuh | en |
| dc.subject.keyword | 深度學習,電腦視覺,三維佔據網路,類別不平衡,鳥瞰圖 | zh_TW |
| dc.subject.keyword | Deep Learning, Computer Vision, 3D Occupancy Prediction, Class Imbalance, Bird’s-Eye View | en |
| dc.relation.page | 65 | - |
| dc.identifier.doi | 10.6342/NTU202503090 | - |
| dc.rights.note | 同意授權(限校園內公開) | - |
| dc.date.accepted | 2025-08-08 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 資訊工程學系 | - |
| dc.date.embargo-lift | 2030-07-30 | - |
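
Editor's note: the abstract above describes a deformable depthwise separable prediction head (Section 3.7 in the table of contents), but the thesis body is embargoed. The following is only a minimal sketch of that idea, assuming PyTorch and torchvision: a depthwise deformable convolution for spatially adaptive sampling, followed by a 1x1 pointwise convolution to keep the parameter count low. The class name, channel sizes, and offset predictor are illustrative assumptions, not the authors' architecture.

```python
# Hedged sketch of a deformable depthwise-separable block, NOT HintOcc's
# released code. Assumes PyTorch + torchvision.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableDepthwiseSeparable(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # Predict 2 (x, y) sampling offsets per kernel tap from the input,
        # which is what makes the decoding spatially adaptive.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        # Depthwise deformable conv: groups=in_ch gives one filter per channel,
        # keeping parameters far below a dense convolution.
        self.depthwise = DeformConv2d(in_ch, in_ch, kernel_size=k,
                                      padding=k // 2, groups=in_ch)
        # Pointwise 1x1 conv mixes channels, completing the separable pattern.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x, self.offset(x)))


if __name__ == "__main__":
    head = DeformableDepthwiseSeparable(64, 18 * 16)  # hypothetical sizes
    print(head(torch.randn(1, 64, 200, 200)).shape)   # (1, 288, 200, 200)
```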
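The batch-wise dynamic weighting strategy re-weights the loss according to which classes actually appear in each training batch, so rare classes are emphasized when they show up. Below is a minimal sketch of one plausible reading of that idea, assuming voxel-wise cross-entropy; the inverse-frequency formula, the smoothing term, and the normalization are illustrative assumptions rather than the thesis's exact scheme.

```python
# Hedged sketch of batch-wise dynamic class weighting (assumption: PyTorch,
# voxel logits of shape (B, C, X, Y, Z)). Not the authors' implementation.
import torch
import torch.nn.functional as F


def batch_dynamic_weights(labels: torch.Tensor, num_classes: int,
                          smooth: float = 1.0) -> torch.Tensor:
    """Per-class weights from the label frequencies of the current batch:
    classes rare in this batch get larger weights; classes absent from the
    batch get zero weight (they contribute no voxels anyway)."""
    flat = labels.flatten()
    flat = flat[flat < num_classes]  # drop any ignore-index voxels
    counts = torch.bincount(flat, minlength=num_classes).float()
    present = counts > 0
    weights = torch.zeros(num_classes, device=labels.device)
    inv = 1.0 / (counts[present] + smooth)  # inverse batch frequency
    # Normalize so the mean weight over present classes is 1, keeping the
    # loss scale comparable to unweighted cross-entropy.
    weights[present] = inv * present.sum() / inv.sum()
    return weights


def weighted_occupancy_loss(logits: torch.Tensor,
                            labels: torch.Tensor) -> torch.Tensor:
    """Voxel-wise cross-entropy with weights recomputed every batch."""
    w = batch_dynamic_weights(labels, num_classes=logits.shape[1])
    return F.cross_entropy(logits, labels, weight=w)


if __name__ == "__main__":
    # Smoke test: 18 classes (Occ3D-nuScenes has 17 semantics plus free space).
    logits = torch.randn(2, 18, 8, 8, 4)
    labels = torch.randint(0, 18, (2, 8, 8, 4))
    print(weighted_occupancy_loss(logits, labels))
```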
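Finally, Section 3.8 of the table of contents lists a Channel-to-Height operator, the reshaping introduced by the cited FlashOcc paper (arXiv:2311.12058) that folds part of the BEV channel dimension into a vertical axis, so a purely 2D pipeline can emit 3D voxel predictions without 3D convolutions. A short sketch under that formulation follows; the tensor shapes in the usage example are illustrative.

```python
# Sketch of the Channel-to-Height operator as formulated in FlashOcc
# (arXiv:2311.12058); a pure reshape, so no parameters are involved.
import torch


def channel_to_height(bev: torch.Tensor, z_bins: int) -> torch.Tensor:
    """Reshape BEV features (B, C*Z, H, W) into voxel features (B, C, Z, H, W)."""
    b, cz, h, w = bev.shape
    if cz % z_bins != 0:
        raise ValueError("BEV channels must be divisible by the number of height bins")
    return bev.view(b, cz // z_bins, z_bins, h, w)


if __name__ == "__main__":
    bev = torch.randn(1, 18 * 16, 200, 200)  # e.g. 18 classes x 16 height bins
    voxels = channel_to_height(bev, z_bins=16)
    print(voxels.shape)  # torch.Size([1, 18, 16, 200, 200])
```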
| Appears in Collections: | 資訊工程學系 | |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf (restricted access) | 5 MB | Adobe PDF | View/Open |
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
