Machine Learning-Enhanced Text Mining as a Support Tool for Research on Climate Change: Theoretical and Technical Considerations

DOI: 10.4018/978-1-6684-8634-4.ch004

Abstract

In this chapter, the authors explore the theoretical and practical aspects of using text mining approaches supported by machine learning for the automatic interpretation of bulk literature on a contemporary issue—that of climate change risk analysis. The strengths, weaknesses, and opportunities associated with these approaches are investigated. Text mining provides a way to automate and enhance the analysis of text data. However, contrary to popular belief, text mining analysis is not a completely automated process. As with computer-assisted (or -aided) qualitative data analysis software (CAQDAS), it is an iterative method requiring input from a researcher with expert knowledge and a deliberate approach to the analysis. Given the heterogeneity that generally characterizes climate disclosures, the authors postulate that hybrid methodologies are ideal for analysing textual data related to climate change discourse. The authors also demonstrate that text mining is an open and evolving field, in the sense that it can be combined with other approaches to shed new light on the climate discourse.

1. Introduction

Climate change is an extremely complex global phenomenon but is often experienced locally. It has gone from being analysed as a mainly physical phenomenon to being a social, cultural, political, ethical and communicational phenomenon (Zaccai, 2012; Moser, 2016; Pearce, Grundmann, et al., 2017; Atwoli, Muhia, et al., 2022; Kumpu, 2022; Toivonen, 2022).

Understanding the scope of the problem, and keeping pace with the leading thinking on how to identify and manage its key elements, is challenging given the wealth of academic and grey literature and other bodies of knowledge that exist on any given topic and often grow exponentially. The sheer volume and diversity of these data make it difficult for human experts to process them and draw meaningful insights.

Text documents play a significant role among existing information resources. Analysing and processing such resources is a relatively simple task for a human expert but difficult to automate. The main reasons are the loose coherence of text structures and the lack of obvious methods for their interpretation.

Consequently, communication related to climate-related risks and policies is becoming a major challenge for experts in various fields, including economics, discourse analysis and, in particular, language and computer sciences.

In the many debates that take place in different areas, one observes a multitude of voices, points of view, values, and interests related to climate change. Many actors seek to identify risks, formulate key questions and decide on priorities for action. Their points of view may vary from country to country and from company to company.

Language plays an essential role in the conceptualization and framing of discourses related to global climate change risks. These discourses unfold in a wide variety of text documents and constitute heterogeneous material, appearing across different types of disclosures.

In this chapter, given this heterogeneity in climate disclosures, we postulate that hybrid methodologies are ideal for the analysis of textual data related to the climate change discourse. We would also like to show that text mining is an open field, in the sense that it blends with other approaches like machine learning and, thereby, helps to shed new light on the climate discourse.

Text mining is defined here as a set of concepts, methods and algorithms for processing textual resources that facilitate the automated processing of documents written in natural languages. It draws on the methods and techniques of data mining and combines them with the logic of text content analysis (Jo, 2019). This approach is interdisciplinary, as it builds on knowledge from fields such as:

  • a) Data mining (DM),
  • b) Machine learning (ML),
  • c) Natural language processing (NLP),
  • d) Information extraction (IE) and information retrieval (IR),
  • e) Statistics,
  • f) The theory of algorithms and data structures.

The relationships between these areas of research are presented in Fig. 1.

Figure 1.

An overview of the disciplines that contribute to the field of text mining

At this point, it is worth briefly discussing the similarities and differences between text mining and each of the fields as specified in Fig. 1.
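
To make the interplay of these disciplines concrete, the following minimal sketch (in Python, assuming the scikit-learn library) combines NLP-style preprocessing, statistical term weighting (TF-IDF) and unsupervised machine learning (k-means clustering); the toy documents are invented for illustration and are not drawn from any study cited here.

    # A minimal, illustrative text-mining pipeline: NLP-style preprocessing,
    # statistical term weighting (TF-IDF) and unsupervised machine learning
    # (k-means clustering). The toy documents below are hypothetical.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    documents = [
        "Rising sea levels threaten coastal infrastructure.",
        "Carbon pricing policies aim to reduce emissions.",
        "Coastal flooding risk increases with sea level rise.",
        "Emission reduction targets drive climate policy.",
    ]

    # Statistics / NLP: lowercase, drop stop words, weight terms by TF-IDF.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)

    # Machine learning: group the documents into two thematic clusters.
    model = KMeans(n_clusters=2, n_init=10, random_state=0)
    print(model.fit_predict(X))  # e.g. [0 1 0 1]: sea-level vs. policy texts

As with CAQDAS, the resulting clusters are not self-explanatory: a domain expert still has to inspect and label them, in line with the iterative, expert-guided workflow described in the abstract.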

Key Terms in this Chapter

Information Retrieval (IR): IR is the process of searching for and locating textual information in a database based on a user query. Searches can be based on a document's existing metadata, its full text or its content indexing.
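
As an illustration of the full-text case, the following Python sketch (assuming scikit-learn) ranks an invented three-document corpus against a user query by cosine similarity over TF-IDF vectors; corpus and query are hypothetical.

    # Minimal full-text retrieval: rank documents against a query by
    # cosine similarity over TF-IDF vectors. Corpus and query are invented.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = [
        "Climate risk disclosure in annual reports.",
        "Machine learning methods for text classification.",
        "Physical climate risks and financial reporting.",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    doc_vectors = vectorizer.fit_transform(corpus)
    query_vector = vectorizer.transform(["climate risk reporting"])

    # Print the highest-scoring documents first.
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    for i in scores.argsort()[::-1]:
        print(f"{scores[i]:.2f}  {corpus[i]}")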

Deep Learning: Deep learning is a subfield of machine learning. Deep neural networks (DNNs) are an extension of artificial neural networks (ANNs). Deep learning methods and techniques scale up the size and complexity of ANNs to produce increasingly richer functionality and very often achieve better results in disciplines such as text mining or computer vision.

Statistics: In the context of text mining, 'statistics' refers to the quantitative methods used to analyze and interpret text data. This can involve various techniques such as frequency analysis (counting the occurrence of words or phrases), topic modeling (identifying the main topics in a text), and association rules (finding relationships between words or phrases). These statistical methods help to transform unstructured text data into a structured, quantifiable form.
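
The following Python sketch (assuming scikit-learn) illustrates two of the techniques named above on an invented toy corpus: frequency analysis with a plain counter, and topic modeling with latent Dirichlet allocation (LDA).

    # Frequency analysis and topic modeling on a hypothetical toy corpus.
    from collections import Counter
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    texts = [
        "emissions policy carbon tax emissions",
        "flood risk sea level adaptation",
        "carbon policy emissions trading scheme",
        "sea level rise coastal flood risk",
    ]

    # Frequency analysis: count word occurrences across the corpus.
    print(Counter(" ".join(texts).split()).most_common(3))

    # Topic modeling: fit a two-topic LDA model on the term-document matrix.
    vectorizer = CountVectorizer()
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(vectorizer.fit_transform(texts))
    terms = vectorizer.get_feature_names_out()
    for topic in lda.components_:
        print([terms[i] for i in topic.argsort()[-3:]])  # top words per topic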

Information Extraction (IE): IE is the process of identifying and extracting content from documents written in natural language, based on analytically generated or predefined knowledge patterns.

Tokenization: Tokenization of a text document is the division of the input text into sentences, words, punctuation marks (commas, full stops, etc.) and non-text characters. Other elements of a document that have the status of regular expressions can also be used as tokens. Tokenization is usually an automatic process, depending only on the formal language of the text mining programme.
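
A minimal regex-based tokenizer in Python is sketched below; real text mining programmes typically ship more sophisticated, language-aware tokenizers, so this is illustrative only.

    # Split text into word tokens and punctuation marks, as described above.
    import re

    text = "Climate change is global; its impacts, however, are local."
    # \w+ matches word tokens; [^\w\s] matches single punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    print(tokens)
    # ['Climate', 'change', 'is', 'global', ';', 'its', 'impacts', ',', ...]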

Lemmatisation: Dictionary-based morphological analysis that reduces the inflected forms of a word to one basic form – a lemma. A lemma is the canonical, simplest form of a lexeme used for its dictionary representation.
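
A short Python sketch using NLTK's WordNet lemmatizer (assuming the wordnet corpus has been downloaded, e.g. via nltk.download('wordnet')) shows the reduction of inflected forms to lemmas; the part-of-speech hint guides the reduction.

    # Reduce inflected word forms to their lemmas with NLTK's WordNet data.
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("emissions"))        # 'emission' (noun default)
    print(lemmatizer.lemmatize("rising", pos="v"))  # 'rise'
    print(lemmatizer.lemmatize("warmer", pos="a"))  # 'warm'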

Entropy: In information theory, the average amount of information attributable to a single message from an information source.
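
For a discrete source emitting messages with probabilities p_i, the entropy is H = -Σ p_i log2 p_i bits; the self-contained Python function below illustrates the computation.

    # Shannon entropy of a discrete probability distribution, in bits.
    import math

    def entropy(probabilities):
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin is maximally uncertain
    print(entropy([0.9, 0.1]))  # ~0.47 bits: a predictable source carries less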

Probability Distribution: A function used to compute the likelihood of occurrence of the various possible observation outcomes.

Prediction: In machine learning, the calculation of the value of an output variable for input values that fall within the range covered by the training data.

Labelled Data: In machine learning, data that have already been categorized.
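
A minimal supervised-learning sketch in Python (assuming scikit-learn) ties this term to 'Prediction' above: labelled data train a classifier, which then predicts a category for unseen text. The training texts and labels are invented for illustration.

    # Labelled data (texts paired with categories) train a classifier;
    # prediction then assigns a category to text the model has not seen.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = [
        "carbon tax and emissions trading",
        "sea level rise floods coastal towns",
        "new emissions policy announced",
        "coastal flooding damages infrastructure",
    ]
    labels = ["policy", "physical", "policy", "physical"]  # the labelled data

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    print(model.predict(["coastal floods and sea level rise"]))  # ['physical']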

Algorithmics: A subfield of computer science; the study of the design and efficiency of algorithms.
