3 Methodology

3.1 Introduction to Methodology

A thorough understanding of the data collection, research design, and analysis procedures is required to ensure the validity and reliability of the study’s conclusions. This chapter therefore describes the study’s methodology, which aims to analyze the similarities and contrasts among leaders’ speeches and their links to features of wisdom leadership.

This study uses a mixed-methods strategy, integrating qualitative and quantitative text analysis tools to assess a broad collection of commencement addresses. This method allows us to derive insights from the speeches, discover common themes and patterns, and investigate their relevance to wisdom leadership. The mixed-methods design lets us triangulate our findings and overcome the limitations of individual approaches, resulting in a more robust understanding of the subject matter. This chapter is divided into sections that describe the research design, data collection and selection, text-mining techniques, the data analysis process, ethical considerations, and limitations.

3.2 Research Design: Selection of Methodology

In this study, we use a mixed-methods research strategy that combines qualitative and quantitative text-mining tools to examine a collection of commencement speeches. This strategy enables us to identify commonalities across these addresses and to investigate their relationships to the dimensions of wisdom leadership.

Our analysis will follow a systematic method that includes data collection, preparation, analysis, and result interpretation to achieve the study objectives. Our study design is intended to ensure that the findings are credible and valid.

The quantitative part of our study focuses on discovering and measuring trends within the speeches. Word frequency analysis, for example, provides objective, quantifiable statistics by highlighting the most frequently used words and phrases across the speeches. Furthermore, we can quantify topic occurrence and sentiment scores, allowing us to compare and contrast speeches and identify statistically significant trends. The quantitative analysis provides a more systematic and objective view of the similarities between the speeches, reinforcing the findings of the qualitative analysis.

Many text-mining techniques operate on the bag-of-words model. As explained in Chapter 2, this approach simplifies a text into a collection of n-grams. However, the bag-of-words model has its limitations. Primarily, it ignores the context and order of words, focusing instead on their frequency or presence. As a result, the extracted n-grams might not make much sense or represent the text accurately. Interpreting these findings without accounting for the broader context in which the n-grams occur can therefore be unrealistic and even impractical.
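The loss of word order can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not the study's own implementation; the two sentences are invented for illustration, and the helper names `bag_of_words` and `ngrams` are hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, ignoring word order entirely."""
    return Counter(text.lower().split())


def ngrams(text, n):
    """Return the contiguous n-grams (as tuples) found in a text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Two invented sentences with different meanings but identical word counts.
a = "failure is not final but success is"
b = "success is not final but failure is"

# Unigram counts cannot tell the sentences apart ...
same_bag = bag_of_words(a) == bag_of_words(b)
# ... whereas bigrams retain some local order and do differ.
same_bigrams = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))

print(same_bag, same_bigrams)
```

Longer n-grams recover some local context, but they still say nothing about the wider passage in which a phrase occurs, which is why the counts need to be read alongside the source text.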

A more robust approach is to couple this with qualitative analysis, examining the context in which these words and phrases appear to grasp the nuances of the text. For example, sentiment analysis can help determine the emotion or opinion conveyed in a text, while topic modeling can uncover the underlying themes. A mixed-methods approach thus enhances the depth of the speech assessment, as it blends the strengths of qualitative and quantitative techniques.

3.3 Research Instrument: Data Collection and Selection

The data collection and selection procedure constitutes a critical phase in our study, as it forms the foundation for our subsequent analysis and interpretation of the speeches and topic modeling outcomes. In this section, we outline the criteria employed in selecting the speeches for our study, the sources of our data, and the data preprocessing techniques we deployed to ensure data consistency and quality.

3.3.1 Speech Selection Criteria

To ensure a diverse and representative sample of speeches, we selected 32 commencement addresses (summarized in Table 3.2) from a range of influential figures spanning the following fields: law and politics, business, arts and literature, entertainment, and academia. Table 3.1 shows the number of influential figures within each profession. The number of speeches was limited by time constraints but was deemed sufficient for the study’s purposes. More information about the speeches can be found in Appendix I.

Table 3.1: The number of speakers associated with each profession
Profession	Number of speakers
Business	9
Law and Politics	8
Entertainment	7
Arts and Literature	5
Academia	3

Initially, we contemplated analyzing biographies of renowned leaders such as Gandhi, Mandela, and Steve Jobs. However, we pivoted towards commencement addresses for two reasons. Firstly, preparing the text of books, especially those not available in digitized format, requires substantial time and resources. Extracting text from physical books involves manual transcription or scanning and optical character recognition, all of which prolong the data collection phase of the research. Commencement speeches, by contrast, are readily accessible on university websites, which allowed us to expedite data collection significantly. Secondly, commencement addresses are typically more concise and focused, with speakers endeavoring to convey their wisdom to graduates, while books often contain extensive material that offers little insight and could confound our study, given that text mining largely operates with the bag-of-words approach and frequency counts.

The speakers were selected for their established recognition as influential figures within their respective fields, having demonstrated exceptional vision and significantly impacted their communities or the world. We relied on the evaluations of various universities in choosing these leaders, under the assumption that such institutions would be highly discerning in selecting individuals possessing valuable qualities and wisdom to address their graduates.

The choice of speakers was also influenced by the recognition and reputation of the universities from which the commencement speeches were sourced. For instance, 12 speeches were derived from Harvard, and 9 from Stanford, both of which are globally renowned for their academic excellence and high standards. It was our assessment that prestigious universities, ranked highly in global education rankings, frequently invite distinguished individuals of profound wisdom and accomplishment to deliver commencement addresses.

Furthermore, the frequency of speeches from certain institutions also arose from our intent to gather as comprehensive a dataset as possible. Given the prestige of these universities and the high-profile nature of their chosen speakers, their commencement addresses were often readily available and well-documented, allowing for easier access. Therefore, the selection of speakers for this research was not only based on their individual merits as leaders but also on the prestige of the inviting universities. This dual criterion aimed to ensure that the chosen speeches would provide insightful, high-quality data for our analysis of wisdom leadership across various professional fields.

Table 3.2: The list of commencement addresses
Speaker	Location	Profession	Year	Title	Speech length
Chimamanda Ngozi Adichie	Harvard	Arts and Literature	2018	Cultivating a sense of empathy	1823
Jacinda Ardern	Harvard	Law and Politics	2022	The Fragility of Democracy	1777
Barack Obama	Notre Dame	Law and Politics	2009	The World as It Should Be	1692
Martin Baron	Harvard	Law and Politics	2020	Imperfect though	1678
Jeff Bezos	Princeton	Business	2010	Two key questions	1647
Bill Gates	Harvard	Business	2007	Global inequality, technology and innovation	1603
Mike Bloomberg	Johns Hopkins	Business	2021	Communication in the Digital Age	1541
Sterling K. Brown	Stanford	Entertainment	2018	Let Your Light Shine	1538
Ken Burns	Stanford	Arts and Literature	2016	Reflections on History	1507
Tim Cook	Stanford	Business	2019	Responsibility and Building	1436
France Cordova	Stanford	Academia	2020	Navigating a Changing World	1429
Mariano Florentino Cuellar	Stanford	Entertainment	2017	The Gift of Progress	1353
Ellen DeGeneres	Tulane	Entertainment	2009	Follow Your Passion, Stay True to Yourself	1334
Richard Engel	Stanford	Law and Politics	2015	Taking the Leap	1323
Neil Gaiman	University of the Arts	Arts and Literature	2012	Make good art	1319
Atul Gawande	Stanford	Academia	2021	Finding Purpose in Life	1304
Reed Hastings	Stanford	Business	2022	Keys to Progress and Change the World	1283
Steve Jobs	Stanford	Business	2005	Commencement Address at Stanford	1248
John Lewis	Harvard	Law and Politics	2018	Importance of equality	1189
Angela Merkel	Harvard	Law and Politics	2019	Anything can change	1170
Michelle Obama	Oberlin	Law and Politics	2015	Engage with the world around	1102
Elon Musk	Caltech	Business	2012	Think Big and Dream Even Bigger	1099
Conan O’Brien	Dartmouth	Entertainment	2011	Failure and Invention	1091
Natalie Portman	Harvard	Entertainment	2015	Embracing Inexperience	1076
J.K. Rowling	Harvard	Arts and Literature	2008	The Fringe Benefits of Failure	1074
Sheryl Sandberg	Berkeley	Business	2016	Finding Gratitude and Appreciation	1022
Ruth J. Simmons	Harvard	Academia	2021	Fight inequality	999
Steven Spielberg	Harvard	Entertainment	2016	A villain to vanquish	992
David Foster Wallace	Kenyon	Arts and Literature	2005	This is Water	884
Oprah Winfrey	Harvard	Entertainment	2013	An internal emotional G.P.S	837
Fareed Zakaria	Harvard	Law and Politics	2012	We live in an age of progress	742
Mark Zuckerberg	Harvard	Business	2017	Purpose and Community	591

3.3.2 Data Source Evaluation

The speeches were collected from various sources, including official university websites. We initially collected over 40 speeches. To lessen potential bias arising from our personal perceptions of the speeches, formed by listening to them or reading their transcripts, we narrowed our selection to the 32 speeches obtained directly from university websites and other reliable sources, ensuring authenticity and accuracy.

The data collected from these sources presents several advantages:

Rich and Varied Content: The speeches cover a wide range of topics, providing ample material for identifying common themes and analyzing their connection to dimensions of wisdom leadership.

Authenticity: The speeches represent the speakers’ genuine thoughts and experiences, providing an authentic and reliable source of information.

Transfer of Life Experience and Wisdom: Commencement addresses provide prominent speakers with an important platform to share their life experiences, wisdom, and insights with a new generation of graduates eager to learn and grow. These speeches offer valuable insights and advice drawn from the speakers’ personal and professional journeys, which can inspire and guide the graduates as they embark on their own life paths. This transfer of wisdom broadens the graduates’ worldview and lays the groundwork for their future success.

Importantly, the dimensions of wisdom leadership that emerge from our analysis should be understood as reflecting the perspectives of these speakers and may not necessarily be objective. The biases and limitations of this study are discussed further in Section 3.7.

By providing this in-depth overview of our data collection and selection procedure, we aim to establish a transparent approach to our study.

3.3.3 Data Preparation

To prepare the data for analysis, we performed several steps to ensure data consistency and quality:

Data cleaning: We removed any irrelevant information from the transcripts, such as annotations or stage directions, and tried to retain only the speaker’s words.

Formatting and standardization: We standardized the transcripts by converting them to a consistent .txt format, ensuring uniformity in the text representation.

Language verification: We verified that all speeches were delivered in English.

Metadata Extraction: We extracted the metadata, such as speaker name, year, and location, from the text.

Following these data collection and preprocessing steps, we obtained a high-quality dataset of speeches that reflects the diversity of leadership perspectives and is suitable for our text-mining analysis.
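The data cleaning step can be sketched as follows. This is an illustrative Python example, not the study's own code (which is in R; see Appendix II). It assumes that annotations appear in square brackets or as parenthesized audience reactions such as "(Laughter)"; the pattern, the sample sentence, and the function name `clean_transcript` are hypothetical, and real transcripts may mark stage directions differently.

```python
import re

# Assumed annotation styles: anything in square brackets, plus parenthesized
# audience reactions. Other parentheses are kept, since they may be the
# speaker's own words.
ANNOTATION = re.compile(
    r"\[[^\]]*\]|\((?:applause|laughter|cheers)[^)]*\)",
    re.IGNORECASE,
)


def clean_transcript(raw):
    """Remove annotations and collapse whitespace, keeping the speaker's words."""
    text = ANNOTATION.sub(" ", raw)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


sample = "Thank you. [Applause]  It is an honor (laughter) to be here."
print(clean_transcript(sample))
```

A conservative pattern of this kind errs on the side of keeping text, so any annotations it misses would still need to be removed by hand.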

3.3.4 NLP Techniques for Data Preprocessing

Natural Language Processing (NLP), as indicated in Section 2.8, is concerned with the development of algorithms that allow computers to interpret and produce human language (Kalyanathaya, Akila, & Rajesh, 2019). NLP plays a crucial role in text mining and analysis, as it provides the necessary tools to preprocess, manipulate, and analyze textual data. To prepare the speeches for subsequent topic modeling and sentiment analysis, we employed several of the NLP techniques described in Chapter 2: text cleaning, stop-word removal, tokenization, and lemmatization. We opted for lemmatization over stemming, as it provides a more accurate representation of the speeches’ content and ensures a higher-quality analysis of the commonalities among great leaders’ speeches and their connection to dimensions of wisdom leadership.

The complete code for preparing our data frame can be found in Appendix II. By employing these NLP techniques, we were able to preprocess and transform the speeches into a structured format suitable for subsequent topic modeling and sentiment analysis. The application of NLP techniques not only facilitated the extraction of meaningful insights from the speeches but also contributed to a more robust and accurate analysis of the commonalities among great leaders’ speeches and their connection to dimensions of wisdom leadership.
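As a rough illustration of these preprocessing steps, the following Python sketch tokenizes a sentence, removes stop words, and lemmatizes the remaining tokens. It is not the code from Appendix II: the stop-word list and the lemma dictionary are tiny hypothetical stand-ins for the full resources a real pipeline would use, and `crude_stem` is a deliberately naive suffix stripper included only to show why stemming can produce non-words.

```python
import re

# Hypothetical miniature resources, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "we", "our"}
LEMMAS = {
    "leaders": "leader",
    "studies": "study",
    "choices": "choice",
    "made": "make",
    "better": "good",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def preprocess(text):
    """Tokenize, drop stop words, then map each token to its lemma."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]


def crude_stem(token):
    """Naive suffix stripping, shown only for contrast with lemmatization."""
    for suffix in ("ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


print(preprocess("The leaders made better choices."))
print(crude_stem("studies"), crude_stem("choices"))
```

In this toy example the lemmatizer maps "studies" and "choices" to the dictionary forms "study" and "choice", whereas the naive stemmer truncates them to fragments that are not words.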

3.4 Data Analysis

In our study, we have chosen word frequency analysis, topic modeling, and sentiment analysis as the primary text-mining techniques to extract commonalities among great leaders’ speeches and analyze how these commonalities connect to dimensions of wisdom leadership. These techniques and the process through which they will help achieve our research objectives are as follows:

3.4.1 Word Frequency Analysis

Word frequency analysis is a fundamental text-mining technique that involves counting the occurrence of words or word combinations in a text corpus (Rayson, 2015). In this study, we employed word frequency analysis to identify the most commonly used n-grams in the selected speeches, providing valuable insights into the leaders’ communication patterns, focus areas, and ideas. We used unigrams, bigrams, and trigrams to capture different levels of word relationships and context. This gave us a comprehensive view of the language used in the speeches, highlighting the most prevalent words and phrases. It also provided a better understanding of the speakers’ communication models, the key topics they addressed, and the linkages between the different aspects of wisdom leadership.
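Counting n-grams across a corpus can be sketched as below. The example is in Python and is illustrative only; the three short token lists are invented and stand in for preprocessed speeches, and `ngram_counts` is a hypothetical helper, not a function from the packages used in the study.

```python
from collections import Counter


def ngram_counts(documents, n):
    """Count contiguous n-grams over a list of tokenized documents."""
    counts = Counter()
    for tokens in documents:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Invented, already preprocessed token lists standing in for speeches.
corpus = [
    ["make", "good", "art"],
    ["make", "good", "choice"],
    ["follow", "passion", "make", "good", "art"],
]

unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

print(unigrams.most_common(3))
print(bigrams.most_common(1))
print(trigrams.most_common(1))
```

N-grams are counted within each document separately, so a phrase never spans the boundary between two speeches.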

3.4.2 Topic Modeling

Topic modeling is an unsupervised machine-learning technique that aims to discover hidden patterns or themes in a collection of documents. In this study, we employed three popular topic modeling algorithms, Latent Dirichlet Allocation (LDA) (Blei et al., 2003), Top2Vec (Le & Mikolov, 2014), and the Structural Topic Model (STM) (Roberts et al., 2019), to identify and analyze commonalities among the selected speeches. These algorithms were chosen for their ability to extract meaningful topics from text corpora and for their complementary strengths in revealing thematic structures within the speeches.

By incorporating LDA, Top2Vec, and STM topic modeling techniques in this study, we were able to effectively identify and analyze commonalities among the selected speeches. This approach not only allowed for a more robust extraction of topics but also provided complementary perspectives on the thematic structure of the speeches, eventually contributing to a greater understanding of the connection between great leaders’ speeches and dimensions of wisdom leadership.
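For reference, the generative model assumed by LDA can be written in its commonly used smoothed form as follows. The notation is an assumption introduced here and is not defined elsewhere in this chapter: $K$ is the number of topics, $D$ the number of documents, $N_d$ the number of words in document $d$, and $\alpha$ and $\beta$ the Dirichlet hyperparameters.

```latex
\begin{align*}
\phi_k &\sim \operatorname{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \operatorname{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \operatorname{Categorical}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \phi &\sim \operatorname{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Here $\phi_k$ is the word distribution of topic $k$, $\theta_d$ is the topic mixture of document $d$, $z_{d,n}$ is the topic assigned to the $n$-th word of document $d$, and $w_{d,n}$ is the observed word. Fitting the model amounts to inferring $\theta$ and $\phi$ from the observed words alone.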

3.4.3 Sentiment Analysis

As mentioned in Section 2.8, sentiment analysis, also known as opinion mining or emotion AI, is a domain of natural language processing that aims to determine the sentiment or emotional tone behind a series of words. In this research, sentiment analysis was employed to identify and quantify the emotional tone present in the selected speeches, offering insights into how great leaders convey their messages and the emotions they evoke in their audiences.

The sentiment analysis process involved several steps:

Sentiment Scoring: Using the trained machine learning models and the lexicon-based methods, each word or phrase in the speeches was assigned a sentiment score. These scores typically range from negative sentiment to positive sentiment, with 0 representing a neutral sentiment.

Aggregation and Visualization: The sentiment scores were aggregated for each speech to provide an overall sentiment score, which was then visualized using bar charts or other graphical representations. This allowed for the comparison of sentiments across different speeches and professions.

Interpretation: The sentiment analysis results were interpreted in the context of broader research objectives, examining how the identified sentiments relate to the dimensions of wisdom leadership and the commonalities among great leaders’ speeches.

Sentiment of Words Associated with Topics: To assess the sentiment of each topic, the sentiment scores of the words assigned to each topic were aggregated. This allowed for the determination of the average sentiment score for each topic, reflecting the general emotional tone associated with the underlying theme. By comparing the sentiment scores across topics, we could identify which themes were more positively or negatively charged.

Sentiment of Documents in Topics: For each document classified under a specific topic, a sentiment analysis was performed to determine the overall sentiment score of the document. This information was then used to calculate the average sentiment score for all documents within each topic. This approach provided a more granular understanding of the sentiment distribution within each topic and highlighted potential variations in sentiment among the speeches associated with a particular theme.

By analyzing the sentiment of both the topic words and the documents classified under each topic, this study offers a deeper understanding of the emotional content of the speeches and their relationship with the identified themes. This additional layer of analysis can help illuminate potential connections between the emotional tone of the speeches and the dimensions of wisdom leadership, as well as inform our understanding of how great leaders effectively communicate their messages and engage their audiences.
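The scoring and aggregation steps described above can be sketched for the lexicon-based case as follows. This Python example is illustrative and is not the study's implementation: the lexicon, its scores, the speeches, and the topic assignments are all invented, and the mean is assumed as the aggregation function.

```python
from statistics import mean

# Hypothetical miniature lexicon; the scores are invented for illustration.
LEXICON = {"hope": 1.0, "love": 1.0, "grateful": 0.5, "fear": -1.0, "failure": -0.5}


def lexicon_score(tokens):
    """Mean score of the tokens found in the lexicon; 0.0 (neutral) if none match."""
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    return mean(scores) if scores else 0.0


# Invented tokenized speeches and an invented speech-to-topic assignment.
speeches = {
    "speech_a": ["hope", "love", "future"],
    "speech_b": ["fear", "failure", "hope"],
    "speech_c": ["future", "work"],
}
topic_of = {"speech_a": "topic_1", "speech_b": "topic_2", "speech_c": "topic_1"}

# Aggregation: one overall score per speech.
per_speech = {name: lexicon_score(tokens) for name, tokens in speeches.items()}

# Sentiment of documents in topics: average the speech scores within each topic.
grouped = {}
for name, score in per_speech.items():
    grouped.setdefault(topic_of[name], []).append(score)
per_topic_documents = {topic: mean(scores) for topic, scores in grouped.items()}

# Sentiment of words associated with topics: score each topic's top words.
topic_words = {
    "topic_1": ["hope", "grateful", "future"],
    "topic_2": ["fear", "failure", "risk"],
}
per_topic_words = {topic: lexicon_score(words) for topic, words in topic_words.items()}

print(per_speech)
print(per_topic_documents)
print(per_topic_words)
```

Words absent from the lexicon are simply ignored in this sketch, so a speech with no matching words is scored as neutral rather than as missing.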

3.5 Data Analysis Process

In this section, we outline the data analysis process, describing the software and tools used, the model selection and parameter tuning, and the validation and reliability measures employed to ensure robust and accurate findings.

3.5.1 Software and Tools

The data analysis process employed various software tools and programming languages to conduct the text-mining techniques described in the methodology. The primary software and tools used in this study include:

R: R is a widely used open-source programming language and software environment for statistical computing (R Core Team, 2021). Table 3.3 shows the details of the R version we used.

Table 3.3: Version details of R
Property	Value
platform	x86_64-w64-mingw32
arch	x86_64
os	mingw32
crt	ucrt
system	x86_64, mingw32
status
major	4
minor	2.2
year	2022
month	10
day	31
svn rev	83211
language	R
version.string	R version 4.2.2 (2022-10-31 ucrt)
nickname	Innocent and Trusting

RStudio: RStudio is an integrated development environment (IDE) for R and Python. It comes with a console, a syntax-highlighting editor with direct code execution, and tools for graphing, history, debugging, and workspace management (Posit team, 2023). Table 3.4 shows the details of the RStudio version we used.

Table 3.4: Version details of RStudio
Version	2023.03.1+446
Release Name	Cherry Blossom
OS (Mode)	desktop

R was used for Latent Dirichlet Allocation (LDA), word frequency analysis (including unigrams, bigrams, trigrams, and skipgrams), and semantic analysis. Several packages were employed to facilitate the analysis, including tidytext (0.4.0.9000) (Silge & Robinson, 2021) for text manipulation and analysis, quanteda (3.3.0) (Benoit et al., 2021) for text processing and analysis, tm (0.7-11) (Feinerer, Hornik, & Meyer, 2021) for creating and manipulating text documents, text2vec (0.6.3) (Selivanov & Feuerriegel, 2021) for text modeling and analysis, ggplot2 (3.4.0) (Wickham, 2021) for data visualization, tidyverse (1.3.2) (Wickham & RStudio, 2021) for data wrangling and manipulation, and memoiR (1.2-2) (Marcon, 2022), which offers templates for publishing properly organized documents in HTML and PDF. The complete list of R packages is in Appendix III.

The use of these software tools and programming languages, along with their associated packages, enabled the efficient and accurate execution of the text-mining techniques outlined in the methodology, ultimately providing valuable insights into the commonalities among great leaders’ speeches.

3.6 Ethical Considerations

In conducting this research, it was imperative to address ethical concerns carefully to ensure that the study was conducted responsibly and with integrity. The following ethical considerations were taken into account during the research process:

Privacy and Confidentiality: The speeches and commencement addresses used in this study are publicly available and delivered by well-known public figures. Nonetheless, it was crucial to respect the privacy of the individuals whose speeches were analyzed. There is no personal information or sensitive content in the data, analysis and subsequent discussions.

Data Source Attribution: To maintain transparency and give proper credit, all speeches and commencement addresses were appropriately cited, with clear references to the original sources. This practice ensures the acknowledgment of the intellectual property of the authors and respects their rights to their work.

Data Manipulation and Bias: The text-mining techniques employed in this research were designed to be objective and unbiased. We carefully chose our methods and validated the results to minimize the potential for bias or misinterpretation in the analysis. Furthermore, we acknowledged and discussed any limitations in the dataset and the potential biases that could arise from the selection of speeches and commencement addresses.

Research Integrity: Throughout the research process, we adhered to principles of research integrity, ensuring that the study was conducted honestly and transparently. The methodology and data analysis process were clearly outlined, and any assumptions or potential limitations were acknowledged and discussed.

3.7 Limitations and Assumptions

In conducting this research, several limitations and assumptions were identified that may have influenced the results and interpretations. Recognizing and addressing these factors contributes to the transparency and validity of the study. The following limitations and assumptions were considered:

Limited Sample Size and Dataset Accessibility: This research was conducted based on a dataset of 32 speeches, which limits the scope of the analysis and the generalizability of the findings. During the research process, a larger dataset containing 300 commencement addresses was discovered but could not be accessed despite efforts to obtain it from the dataset’s owner. Consequently, the results and conclusions drawn from this study are constrained by the relatively small dataset and may not be representative of commencement addresses at large.

Speech Selection: The speeches and commencement addresses selected for this study represent a diverse group of influential public figures. However, the selection process may have unintentionally introduced biases or excluded other important speeches. The results and conclusions should be interpreted with the understanding that they are based on the specific dataset utilized in this study and may not be generalizable to all leaders’ speeches.

Language and Cultural Context: The speeches included in this research are exclusively in English, which may limit the ability to detect commonalities in speeches delivered in other languages or cultural contexts. Furthermore, cultural nuances and rhetorical styles may differ across various countries and regions, which could affect the interpretation of the commonalities identified in this study.

Text-mining Techniques: The text-mining techniques employed in this research, including natural language processing, topic modeling, sentiment analysis, and word frequency analysis, have inherent limitations. The accuracy and interpretability of the results depend on the quality of the data, the choice of algorithms, and the parameter tuning. Additionally, the techniques may not always capture the subtlety and complexity of human language, leading to potential misinterpretations or oversimplifications.

Subjectivity in Interpretation: While the text-mining techniques used in this study were designed to be objective and unbiased, some level of subjectivity may still be present in the interpretation of the results. Our preconceptions and biases could potentially influence the understanding and presentation of the findings.

Temporal Context: The speeches and commencement addresses analyzed in this research span a range of time periods, and the societal context and events at the time of their delivery may have influenced their content. The analysis may not fully account for the impact of temporal context on the commonalities identified in the speeches.

By acknowledging and addressing these limitations and assumptions, the study aims to enhance the transparency, validity, and reliability of the research. Moreover, recognizing these factors can help inform future research in this area, guiding the development of more robust and comprehensive methodologies to explore the commonalities among famous individuals’ speeches and their connection to dimensions of wisdom leadership.
