Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88980

Full metadata record

| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 王鈺強 | zh_TW |
| dc.contributor.advisor | Yu-Chiang Frank Wang | en |
| dc.contributor.author | 賴永玄 | zh_TW |
| dc.contributor.author | Yung-Hsuan Lai | en |
| dc.date.accessioned | 2023-08-16T16:37:43Z | - |
| dc.date.available | 2023-11-09 | - |
| dc.date.copyright | 2023-08-16 | - |
| dc.date.issued | 2023 | - |
| dc.date.submitted | 2023-07-25 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/88980 | - |
| dc.description.abstract | 音頻–影像之跨模態學習是多模態機器學習中一個重要的研究領域。目前該領域主要聚焦於「模態對齊」的設定,即假設音頻和影像模態能「同時」提供預測目標的信號。然而在實際應用中,我們更常遇到「非模態對齊」的情況,即只能從音頻或影像「單一」模態中得到目標信號。為了深入研究非模態對齊情境,我們研究「音頻–影像影片事件分析」任務,此任務的目的是在非模態對齊且弱監督式學習的情境下,需識別影片中所有的聲音和影像事件(非模態對齊),並預測事件發生的時間。但是在訓練模型時,僅能使用影片層級的弱標籤(弱監督式學習),意即只能從此弱標籤知道影片中發生了哪些事件,但無法得知這些事件是經由哪種模態(聲音、影像或兩者同時)感知的,也無法確定事件發生的時間。為了應對這一挑戰,本研究提出一種簡單、有效且通用的方法——「影像–音頻標籤細化(VALOR)」。我們分別在音頻和影像模態引入大規模對比式預訓練模型作為模態教師模型,從而獲取含有事件模態和時間資訊的偽標籤,進一步對模型做偽標籤訓練。實驗結果顯示,VALOR 方法相較於基準方法使模型的平均 F-score 提升了 8.0。有趣的是,我們發現使用模態獨立的教師模型產生偽標籤在表現上優於使用模態融合的教師模型,這是因為前者能夠更好地抵抗非模態對齊的干擾。此外,我們的最佳模型在所有指標上都顯著優於前沿水平,使平均 F-score提升了 5.4。最後,我們將 VALOR 方法推廣至「音頻–影像事件定位」任務,同樣在該任務上取得了超越其他方法和模型的最新成果,展現出卓越的普適性。 | zh_TW |
| dc.description.abstract | Audio-visual learning has been a major pillar of multi-modal machine learning, where the community mostly focused on its modality-aligned setting, i.e., the audio and visual modality are both assumed to signal the prediction target. With the Look, Listen, and Parse dataset (LLP), we investigate the under-explored unaligned setting, where the goal is to recognize audio and visual events in a video with only weak labels observed. Such weak video-level labels only tell what events happen without knowing the modality they are perceived (audio, visual, or both). To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as the modality teachers. A simple, effective, and generic method, termed Visual-Audio Label Elaboration (VALOR), is innovated to harvest modality labels for the training events. Empirical studies show that the harvested labels significantly improve an attentional baseline by 8.0 in average F-score (Type@AV). Surprisingly, we found that modality-independent teachers outperform their modality-fused counterparts since they are noise-proof from the other potentially unaligned modality. Moreover, our best model achieves the new state-of-the-art on all metrics of LLP by a substantial margin (+5.4 F-score for Type@AV). VALOR is further generalized to Audio-Visual Event Localization and achieves the new state-of-the-art as well. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-08-16T16:37:43Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2023-08-16T16:37:43Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Abstract i
List of Figures iv
List of Tables vi
1 Introduction 1
2 Preliminary 5
2.1 Audio-Visual Video Parsing (AVVP) 5
2.2 Baseline Model 5
3 Proposed Method 7
3.1 Zero-Shot Transfer of Contrastive Pre-trained Models 7
3.2 Harvesting Training Signals 9
4 Related Work 11
4.1 Audio-Visual Video Parsing with Look, Listen, and Parse 11
4.2 More Audio-Visual Learning 12
5 Experiments 14
5.1 Experimental Setup 14
5.2 Unified Label Elaboration for State-of-the-Art Audio-Visual Video Parsing 16
5.3 Ablation Studies 17
5.4 Qualitative Comparison with Previous AVVP Works 22
5.5 Generalize VALOR to Audio-Visual Event Localization 23
6 Conclusion 27
Reference 28 | - |
| dc.language.iso | en | - |
| dc.subject | 音頻-影像學習 | zh_TW |
| dc.subject | 跨模態學習 | zh_TW |
| dc.subject | 深度學習 | zh_TW |
| dc.subject | Deep Learning | en |
| dc.subject | Audio-Visual Learning | en |
| dc.subject | Cross-Modality Learning | en |
| dc.title | 利用模態獨立模型於弱監督式音頻-影像事件分析 | zh_TW |
| dc.title | Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 111-2 | - |
| dc.description.degree | Master's (碩士) | - |
| dc.contributor.oralexamcommittee | 孫紹華;陳祝嵩 | zh_TW |
| dc.contributor.oralexamcommittee | Shao-Hua Sun;Chu-Song Chen | en |
| dc.subject.keyword | 音頻-影像學習, 跨模態學習, 深度學習 | zh_TW |
| dc.subject.keyword | Audio-Visual Learning, Cross-Modality Learning, Deep Learning | en |
| dc.relation.page | 38 | - |
| dc.identifier.doi | 10.6342/NTU202301932 | - |
| dc.rights.note | Authorization granted (open access worldwide) | - |
| dc.date.accepted | 2023-07-27 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電信工程學研究所 | - |
| Appears in Collections: | 電信工程學研究所 | |
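
The abstract in the record above outlines VALOR's core idea: harvesting segment-level, modality-specific pseudo labels from two frozen, modality-independent contrastive teachers (a CLIP-style visual model and a CLAP-style audio model) under weak video-level supervision. The Python sketch below only illustrates that idea; the helper callables, prompt templates, and thresholding rule are hypothetical assumptions, not the thesis's exact procedure.

```python
# Illustrative sketch only: segment-level pseudo-label harvesting with two
# frozen, modality-independent contrastive teachers. `visual_score` and
# `audio_score` are hypothetical callables standing in for CLIP-style and
# CLAP-style zero-shot scoring; the prompt templates and the threshold are
# assumptions, not the thesis's exact recipe.
from typing import Callable, Dict, List


def harvest_pseudo_labels(
    segments: List[Dict],                # per-second segments: {"frame": ..., "audio": ...}
    weak_labels: List[str],              # video-level event labels (modality unknown)
    visual_score: Callable[[object, List[str]], List[float]],
    audio_score: Callable[[object, List[str]], List[float]],
    threshold: float = 0.5,
) -> List[Dict[str, List[str]]]:
    """Return, for every segment, the subset of weak labels that each teacher
    independently supports. The visual teacher never sees audio and vice
    versa, so an audible-but-invisible event should survive only in the
    audio pseudo labels."""
    v_prompts = [f"a photo of {c}" for c in weak_labels]          # assumed visual prompt template
    a_prompts = [f"this is a sound of {c}" for c in weak_labels]  # assumed audio prompt template
    pseudo = []
    for seg in segments:
        v_scores = visual_score(seg["frame"], v_prompts)
        a_scores = audio_score(seg["audio"], a_prompts)
        pseudo.append({
            "visual": [c for c, s in zip(weak_labels, v_scores) if s >= threshold],
            "audio":  [c for c, s in zip(weak_labels, a_scores) if s >= threshold],
        })
    return pseudo
```

Keeping the two teachers independent, with each scoring only its own modality, is what the abstract credits for robustness to unaligned events: an event that is audible but not visible is retained only in the audio pseudo labels rather than leaking into the visual ones.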
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-111-2.pdf | 13.82 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
