Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/70749
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 謝舒凱(Shu-Kai Hsieh) | |
dc.contributor.author | Ying-Yu Chen | en |
dc.contributor.author | 陳盈瑜 | zh_TW |
dc.date.accessioned | 2021-06-17T04:37:05Z | - |
dc.date.available | 2025-08-25 | |
dc.date.copyright | 2020-09-03 | |
dc.date.issued | 2020 | |
dc.date.submitted | 2020-08-26 | |
dc.identifier.citation | Austin, J. L. (1975). How to do things with words (Vol. 88). Oxford University Press.
Baecchi, C., Uricchio, T., Bertini, M., & Del Bimbo, A. (2016). A multimodal feature learning approach for sentiment analysis of social network multimedia. Multimedia Tools and Applications, 75(5), 2507–2525.
Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2018). Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443.
Barbieri, F., Ballesteros, M., Ronzano, F., & Saggion, H. (2018). Multimodal emoji prediction. arXiv preprint arXiv:1803.02392.
Bateman, J. (2008). Multimodality and genre: a foundation for the systematic analysis of multimodal documents. Springer.
Bateman, J. (2014). Text and image: a critical introduction to the visual/verbal divide. Routledge.
Bateman, J., & Schmidt, K.-H. (2013). Multimodal film analysis: how films mean. Routledge.
Bongers, B., & van der Veer, G. C. (2007). Towards a multimodal interaction space: categorisation and applications. Personal and Ubiquitous Computing, 11(8), 609–619.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (1999). Using adaptive bagging to debias regressions (Tech. Rep. 547). Statistics Dept., UCB.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Castro, S., Hazarika, D., Pérez-Rosas, V., Zimmermann, R., Mihalcea, R., & Poria, S. (2019). Towards multimodal sarcasm detection (an _obviously_ perfect paper). arXiv preprint arXiv:1906.01815.
Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.
Chancellor, S., Kalantidis, Y., Pater, J. A., De Choudhury, M., & Shamma, D. A. (2017). Multimodal classification of moderated online pro-eating disorder content. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems.
Chen, F., Ji, R., Su, J., Cao, D., & Gao, Y. (2017). Predicting microblog sentiments via weakly supervised multimodal deep learning. IEEE Transactions on Multimedia, 20(4), 997–1007.
Condon, C., Perry, M., & O'Keefe, R. (2004). Denotation and connotation in the human–computer interface: the 'save as...' command. Behaviour & Information Technology, 23(1), 21–31.
David, P. (1998). News concreteness and visual-verbal association: do news pictures narrow the recall gap between concrete and abstract news? Human Communication Research, 25(2), 180–201.
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 837–845.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dwivedi, P. (2019). Understanding and coding a ResNet in Keras. Online.
Goffman, E., et al. (1978). The presentation of self in everyday life. Harmondsworth London.
Hancher, M. (1992). Bailey and after: illustrating meaning. Word & Image, 8(1), 1–20.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Hidderly, R., & Rafferty, P. (1997). Democratic indexing: an approach to the retrieval of film. Proceedings of Library and Information Studies: Research and Professional Practice.
Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832–844.
Illendula, A., & Sheth, A. (2019). Multimodal emotion classification. In Companion Proceedings of The 2019 World Wide Web Conference.
Jakobson, R. (1960). Linguistics and poetics. In Style in language (pp. 350–377). MA: MIT Press.
Jeni, L. A., Cohn, J. F., & De La Torre, F. (2013). Facing imbalanced data–recommendations for the use of performance metrics. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. IEEE.
Jin, X., Gallagher, A., Cao, L., Luo, J., & Han, J. (2010). The wisdom of social multimedia: using Flickr for prediction and forecast. In Proceedings of the 18th ACM International Conference on Multimedia.
Kloepfer, R. (1976). Komplementarität von Sprache und Bild am Beispiel von Comic, Karikatur und Reklame (La complémentarité de la langue et de l'image: l'exemple des bandes dessinées, des caricatures et des réclames). Sprache im Technischen Zeitalter, Stuttgart, (57), 42–56.
Kruk, J., Lubin, J., Sikka, K., Lin, X., Jurafsky, D., & Divakaran, A. (2019). Integrating text and image: determining multimodal document intent in Instagram posts. arXiv preprint arXiv:1904.09073.
Lazarsfeld, P. F., Berelson, B., & Gaudet, H. (1944). The people's choice.
Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning.
LeCun, Y., Bengio, Y., et al. (1995). Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10), 1995.
LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., & Jackel, L. D. (1990). Handwritten digit recognition with a backpropagation network. In Advances in Neural Information Processing Systems.
Lee, C., & Chau, D. (2018). Language as pride, love, and hate: archiving emotions through multilingual Instagram hashtags. Discourse, Context & Media, 22, 21–29.
Lemke, J. (1998). Multiplying meaning. Reading Science: Critical and Functional Perspectives on Discourses of Science, 87–113.
Levin, J. R. (1981). On functions of pictures in prose. Neuropsychological and Cognitive Processes in Reading, 203.
Levin, J., & Mayer, R. (1993). Understanding illustrations in text. In B. K. Britton, A. Woodward, & M. Brinkley (Eds.), Learning from textbooks. Hillsdale: Erlbaum.
Liaw, A., Wiener, M., et al. (2002). Classification and regression by randomForest. R News, 2(3), 18–22.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: common objects in context. In European Conference on Computer Vision. Springer.
Liu, S., & Jansson, P. (2017). City event identification from Instagram data using word embedding and topic model visualization.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Mahoney, J., Feltwell, T., Ajuruchi, O., & Lawson, S. (2016). Constructing the visual online political self: an analysis of Instagram use by the Scottish electorate. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems.
Marsh, E. E., & White, M. D. (2003). A taxonomy of relationships between images and text. Journal of Documentation.
Mazloom, M., Rietveld, R., Rudinac, S., Worring, M., & Van Dolen, W. (2016). Multimodal popularity prediction of brand-related social media posts. In Proceedings of the 24th ACM International Conference on Multimedia.
Meffert, J. J. (2009). Key opinion leaders: where they come from and how that affects the drugs you prescribe. Dermatologic Therapy, 22(3), 262–268.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mirsarraf, M., Shairi, H., & Ahmadpanah, A. (2017). Social semiotic aspects of Instagram social network. In 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA). IEEE.
Morency, L.-P., & Baltrušaitis, T. (2017). Multimodal machine learning: integrating language, vision and speech. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts.
O'Halloran, K. L., Tan, S., Wignell, P., Bateman, J. A., Pham, D.-S., Grossman, M., & Moere, A. V. (2019). Interpreting text and image relations in violent extremist discourse: a mixed methods approach for big data analytics. Terrorism and Political Violence, 31(3), 454–474.
Phan, T.-T., Muralidhar, S., & Gatica-Perez, D. (2019). #drink or #drunk: multimodal signals and drinking practices on Instagram. In Proceedings of the 13th EAI International Conference on Pervasive Computing Technologies for Healthcare.
Rahman, W., Hasan, M. K., Zadeh, A., Morency, L.-P., & Hoque, M. E. (2019). M-BERT: injecting multimodal information in the BERT structure. arXiv preprint arXiv:1908.05787.
Razali, M. S., Halin, A. A., Norowi, N. M., & Doraisamy, S. C. (2017). The importance of multimodality in sarcasm detection for sentiment analysis. In 2017 IEEE 15th Student Conference on Research and Development (SCOReD). IEEE.
Recanati, F. (2004). Literal meaning. Cambridge University Press.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Safavian, S. R., & Landgrebe, D. (1991). A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3), 660–674.
Schifanella, R., de Juan, P., Tetreault, J., & Cao, L. (2016). Detecting sarcasm in multimodal social platforms. In Proceedings of the 24th ACM International Conference on Multimedia.
Searle, J. R., Searle, J. R. S., Vanderveken, D., Willis, S., et al. (1985). Foundations of illocutionary logic. CUP Archive.
Singla, K., Mukherjee, N., Koduvely, H. M., & Bose, J. (2018). Evaluating usage of images for app classification. In 2018 15th IEEE India Council International Conference (INDICON). IEEE.
Sismondo, S. (2015). How to make opinion leaders and influence people. CMAJ, 187(10), 759–760.
Soergel, D. (1974). Indexing languages and thesauri: construction and maintenance. Melville, Los Angeles.
Soleymani, M., Garcia, D., Jou, B., Schuller, B., Chang, S.-F., & Pantic, M. (2017). A survey of multimodal sentiment analysis. Image and Vision Computing, 65, 3–14.
Stager, M., Lukowicz, P., & Troster, G. (2006). Dealing with class skew in context recognition. In 26th IEEE International Conference on Distributed Computing Systems Workshops (ICDCSW'06). IEEE.
Summaries, P. E., & Panel, C. (n.d.). ESRC Centre for Corpus Approaches to Social Science (CASS).
Surowiecki, J. (2005). The wisdom of crowds. Anchor.
Swain, P. H., & Hauska, H. (1977). The decision tree classifier: design and potential. IEEE Transactions on Geoscience Electronics, 15(3), 142–147.
Taylor, C., & Marchi, A. (2018). Corpus approaches to discourse: a critical review. Routledge.
Wang, Z., Zhang, J., Zhang, Y., Huang, C., & Wang, L. (2020). Short-term wind speed forecasting based on information of neighboring wind farms. IEEE Access, 8, 16760–16770.
Wason, P. C., & Jones, S. (1963). Negatives: denotation and connotation. British Journal of Psychology, 54(4), 299–307.
Yu, J., & Jiang, J. (2019). Adapting BERT for target-oriented multimodal sentiment classification.
Zappavigna, M. (2016). Social media photography: construing subjectivity in Instagram images. Visual Communication, 15(3), 271–292.
Zeppelzauer, M., & Schopfhauser, D. (2016). Multimodal classification of events in social media. Image and Vision Computing, 53, 45–56.
Zhang, M., Hwa, R., & Kovashka, A. (2018). Equal but not the same: understanding the implicit relationship between persuasive images and text. arXiv preprint arXiv:1807.08205. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/70749 | - |
dc.description.abstract | 時至今日,社群媒體(如Instagram)趨向結合圖片以及文字表徵,建構出一種新的「多模態」溝通方式。利用計算方法分析多模態關係已成為一個熱門的主題,然而,尚未有研究針對台灣的百大網紅發文中的多模態圖文配對(Image-caption Pair)來分析文本意圖和圖文關係。利用文字和圖片的多模態表徵,本研究沿用 Kruk et al. (2019) 的圖文關係分類方法(contextual relationship/semiotic relationship/author's intent),對此三種分類提出新的圖文表徵方式(Sentence-BERT 及 image embedding),並利用計算模型(Random Forest、Decision Tree Classifier)精準分類以上三種圖文關係。研究結果顯示正確率高達 86.23%。 | zh_TW |
dc.description.abstract | A majority of posts on social media platforms such as Instagram combine visual and textual content in the same message, building up a modern way of communication. Multimodal messages are central to almost all types of social interaction, especially in the context of online social multimedia content. Effective computational approaches to understanding documents with multiple modalities are therefore needed to identify the relationships between those modalities. This study extends recent advances in intent classification by putting forward an approach based on Image-caption Pairs (ICPs). Machine learning algorithms such as the Decision Tree Classifier (DTC) and Random Forest (RF), together with encoders such as Sentence-BERT and image embeddings, are applied to classify the relationships between the two modalities along three dimensions: 1) contextual relationship, 2) semiotic relationship, and 3) author's intent. This study points to two results. First, although prior studies suggest that combining the two synergistic modalities in a single model improves accuracy on relationship classification tasks, we found that a simple fusion strategy, which linearly projects the encoded vectors from both modalities into the same embedding space, may not substantially outperform a single-modality model. This suggests that integrating text and image requires more elaborate strategies for the modalities to complement each other. Second, we show that these text-image relationships can be classified with high accuracy (86.23%) using the text modality alone. In sum, this study demonstrates a computational approach to multimodal documents and contributes to a better understanding of how to classify the relationships between modalities. | en |
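To make the fusion strategy described in the abstract concrete, the following is a minimal sketch, not the thesis's actual implementation: it assumes the sentence-transformers and scikit-learn libraries, uses the "bert-base-nli-mean-tokens" checkpoint as a stand-in Sentence-BERT model, and substitutes random vectors and toy captions/labels for the thesis's Pic2vec features and annotated Instagram data.

```python
# Sketch of the fusion-and-classify pipeline: Sentence-BERT text vectors
# concatenated with image vectors, then fed to a Random Forest.
# Captions, labels, and the random "image embeddings" are toy placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier

captions = [
    "sunset run along the river #nofilter",
    "our new collection drops tomorrow, link in bio",
]
image_vecs = np.random.rand(len(captions), 2048)  # stand-in for Pic2vec features
labels = ["expressive", "promotive"]              # hypothetical intent classes

# Encode captions with a pretrained Sentence-BERT model.
text_model = SentenceTransformer("bert-base-nli-mean-tokens")
text_vecs = text_model.encode(captions)

# "Simple fusion": place both modalities in one feature space by concatenation.
fused = np.concatenate([text_vecs, image_vecs], axis=1)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(fused, labels)
print(clf.predict(fused))  # for the text-only baseline, fit on text_vecs instead
```

Fitting the same classifier on `text_vecs` alone reproduces the single-modality setup that the abstract reports as the stronger configuration.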
dc.description.provenance | Made available in DSpace on 2021-06-17T04:37:05Z (GMT). No. of bitstreams: 1 U0001-2008202019480500.pdf: 5653673 bytes, checksum: 62b221102a851a11cf01308ebfad7d04 (MD5) Previous issue date: 2020 | en |
dc.description.tableofcontents |
Chinese Abstract
English Abstract
Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Research Purposes
1.3 Overview
2 Literature Review
2.1 Taxonomies
2.1.1 The Contextual Taxonomy
2.1.2 The Semiotic Taxonomy
2.1.3 The Intent Taxonomy
2.2 Multimodal Document Intent
2.2.1 Multimodality
2.2.2 Multimodal Document Understanding
3 Taxonomies, Datasets, and Annotation
3.1 KOLs, Instagram
3.1.1 KOLs
3.1.2 Instagram
3.2 Taxonomies
3.2.1 The Contextual Taxonomy
3.2.2 The Semiotic Taxonomy
3.2.3 The Intent Taxonomy
3.3 Datasets Collection and Selection
3.4 Annotation
3.4.1 Data Labeling
3.4.2 Inter-annotator Agreement
3.5 Data Representation
3.5.1 Text Embedding: Sentence-BERT
3.5.2 Image Encoder: Pic2vec
4 Automatic Classification
4.1 Datasets for Training and Testing
4.2 Model Training
4.2.1 Decision Tree Classifier
4.2.2 Random Forest
4.3 Classifier Evaluation
4.4 Implementation
4.4.1 Results
5 Conclusion
References | |
dc.language.iso | en | |
dc.title | 文本意圖的多模態分析:以Instagram為例 | zh_TW |
dc.title | An Analysis of Multimodal Document Intent in Instagram Posts | en |
dc.type | Thesis | |
dc.date.schoolyear | 108-2 | |
dc.description.degree | Master's | |
dc.contributor.oralexamcommittee | 張瑜芸(Yu-Yun Chang),陳正賢(Cheng-Hsien Chen) | |
dc.subject.keyword | 多模態文本分析,自然語言處理 | zh_TW |
dc.subject.keyword | multimodal document understanding, contextual relationship, semiotic relationship, author's intent, Natural Language Processing | en |
dc.relation.page | 65 | |
dc.identifier.doi | 10.6342/NTU202004151 | |
dc.rights.note | Paid authorization (有償授權) | |
dc.date.accepted | 2020-08-26 | |
dc.contributor.author-college | 文學院 | zh_TW |
dc.contributor.author-dept | 語言學研究所 | zh_TW |
Appears in Collections: | 語言學研究所
Files in This Item:
File | Size | Format | |
---|---|---|---|
U0001-2008202019480500.pdf (currently not authorized for public access) | 5.52 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.