From bcfbfebb83cb209eb6ea00561ebf4ca7a5d31b59 Mon Sep 17 00:00:00 2001
From: odelmarcelle
Date: Fri, 26 Jun 2020 15:42:53 +0200
Subject: [PATCH] Update bins of sentences

---
 docs/articles/isa.html                              | 177 +++++++++---------
 .../figure-html/unnamed-chunk-28-1.png              | Bin 0 -> 25632 bytes
 vignettes/isa.Rmd                                   | 108 ++++++++---
 3 files changed, 172 insertions(+), 113 deletions(-)
 create mode 100644 docs/articles/isa_files/figure-html/unnamed-chunk-28-1.png

diff --git a/docs/articles/isa.html b/docs/articles/isa.html
index cb14f9b..681d4b3 100644
--- a/docs/articles/isa.html
+++ b/docs/articles/isa.html
@@ -162,7 +162,7 @@

The sentometrics package introduces simple functions to quickly compute the sentiment of texts within a corpus. This easy-to-use approach does not preclude more advanced analysis, and the sentometrics functions remain a solid choice for cutting-edge research. This tutorial shows how to go beyond the basic sentometrics settings in order to analyse the intratextual sentiment structure of texts.

-Intratextual Sentiment Structure

+Intratextual sentiment structure

Does the position of positive and negative words within a text matter? That’s the question investigated by Boudt & Thewissen, 2019 in their research on the sentiment implied by CEO letters. Based on a large dataset of letters, they analyze how sentiment-bearing words are positioned within the text. They find that CEOs tend to emphasize sentiment at the beginning and the end of their letter, in the hope of leaving a positive impression on the reader.

Their results confirm generally accepted linguistic theories saying that readers remember best the first (primacy effect) and the last (recency effect) portions of a text, and that the end of the text contributes the most to the reader’s final feeling.

One can wonder whether other types of texts follow a similar structure. Indeed, the world is full of different text media, from Twitter posts to news articles, and most of them are written less carefully than CEO letters. Let’s investigate one of these together with the help of the sentometrics package!

@@ -173,7 +173,7 @@

As part of this tutorial, you will learn how to:

@@ -194,13 +194,13 @@

## 
##  -1   1 
## 605 344

The variable s indicates whether the news is more positive or negative, based on an expert’s opinion. We are going to try to predict this value at the end of the tutorial.

-

We can already prepare a sento_corpus and a sento_lexicon for our future sentiment computation. For the sento_corpus, we will also create a dummyFeature filled with 1’s. Since sentiment computations are multiplied by the features of a sento_corpus, we want this dummy feature to observe the whole corpus’s sentiments. This dummyFeature is created by default whenever there’s no feature at the creation of the sento_corpus.

+

We can already prepare a sento_corpus and a sento_lexicon for our future sentiment computation. For the sento_corpus, we will also create a dummyFeature filled with 1’s. Since sentiment computations are multiplied by the features of a sento_corpus, we want this dummy feature so as to capture the sentiment of the whole corpus. This dummyFeature is created by default whenever no features are provided at the creation of the sento_corpus.

Finally, we remove the feature s from the sento_corpus, as we do not need it for sentiment computation.

usnews2Sento <- sento_corpus(usnews2) # note that the feature 's' is automatically re-scaled from {-1;1} to {0;1}
 usnews2Sento <- add_features(usnews2Sento, data.frame(dummyFeature = rep(1, length(usnews2Sento))))
 
docvars(usnews2Sento, "s") <- NULL # removing the feature
-

We will use a single lexicon for this analysis, the combined Jockers & Rinker lexicon, obtained from the lexicon package. However, we will prepare a second and different version of this lexicon where the sentiments assigned to words are all positive, regardless of their original signs. This second lexicon will be useful to better detect the sentiment intensity conveyed.

+

We will use a single lexicon for this analysis, the combined Jockers & Rinker lexicon, obtained from the lexicon package. However, we will prepare a second, different version of this lexicon where the sentiment values assigned to words are all positive, regardless of their original signs. This second lexicon will be useful to better detect the sentiment intensity conveyed.

We use the data.table operator [] to create the second lexicon in a very efficient way. Most sentometrics objects are based on data.table, which allows us to perform complex data transformations. If this is the first time you are seeing the data.table way of using [], we recommend having a look at its Introduction vignette and enjoying this powerful tool!

lex <- lexicon::hash_sentiment_jockers_rinker
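The rest of this code chunk is elided by the diff. As a sketch (assumed, not the vignette’s exact code), the second lexicon can be built with data.table’s [] and both lexicons bundled with sento_lexicons(); lexAbsolute is a hypothetical intermediate name, while baseLex and absoluteLex are taken from the sentiment column names appearing in the outputs below.

```r
# Sketch: absolute-value version of the lexicon via data.table's `[]`.
# `lexAbsolute` is a hypothetical name; baseLex/absoluteLex match the
# column names of the sentiment outputs shown later in this tutorial.
lexAbsolute <- lex[, .(x, y = abs(y))]  # same words, all scores made positive
sentoLexicon <- sento_lexicons(list(baseLex = lex, absoluteLex = lexAbsolute))
```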
 
@@ -230,10 +230,10 @@ 

A review of sentiment computation with sentometrics

compute_sentiment() is at the base of sentiment analysis with sentometrics. That’s also the function we are going to use to analyse intratextual sentiment. This requires, however, playing with the function’s most advanced features. Before doing that, let us review the different computation settings to really understand what’s going on.

-
+

-Default computation - from words to document sentiments

-

When using the default settings (i.e., only specifying the how argument), the sentiment for each word within a text will be determined according to the provided lexicons. These word sentiments are then aggregated using the method defined by the how argument, aggregating up to the document level to form a sentiment value for the document.

+Default computation - from words to document sentiment

+

When using the default settings (i.e., only specifying the how argument), the sentiment of each word within a text will be determined according to the provided lexicons. These word sentiment values are then aggregated using the method defined by the how argument, aggregating up to the document level to form a sentiment value for the document.

sentiment <- compute_sentiment(usnews2Sento, sentoLexicon, how = "proportional")
 head(sentiment)
##           id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature
@@ -243,11 +243,11 @@ 

## 4: 830981681 1972-01-28        158           0.025316456                0.09493671
## 5: 830981684 1973-02-15        174          -0.004022989                0.03160920
## 6: 830981702 1973-05-31        227           0.009251101                0.06784141

-

In this case, the how = "proportional" simply sum words’ sentiments then divide it by the number of words in a document. The different settings for how can be accessed using the get_hows() function. We are going to present the use of a more complex setting at the end of this tutorial.

+

In this case, how = "proportional" simply sums the word sentiment values and then divides the total by the number of words in the document. The different settings for how can be accessed using the get_hows() function. We are going to present the use of a more complex setting at the end of this tutorial.

-
+

-Setting do.sentence = TRUE - from words to sentences sentiments

+Setting do.sentence = TRUE - from words to sentence sentiment

A drastic change in the behaviour of compute_sentiment() can be induced by specifying do.sentence = TRUE in the function call. If TRUE, compute_sentiment() will no longer return a sentiment value for each document, but one for each sentence. Sentiment values within each sentence are still computed using the method provided in the how argument, but the function stops there.

sentiment <- compute_sentiment(usnews2Sento, sentoLexicon, how = "proportional", do.sentence = TRUE)
 head(sentiment)
@@ -258,13 +258,13 @@

## 4: 830981632           4 1971-01-12         33  0.01666667 0.04696970
## 5: 830981632           5 1971-01-12         16 -0.04687500 0.07812500
## 6: 830981632           6 1971-01-12         24  0.04166667 0.06250000
-

The new column sentence_id in the output is used to identify the sentences of a single document. This result can be used as-is for analysis at the sentence level, or sentences sentiments can be aggregated to obtain documents sentiments, as in the default setting. One way to aggregate sentences sentiments up to documents sentiments is to use the aggregate() method of sentometrics.

+

The new column sentence_id in the output is used to identify the sentences of a single document. This result can be used as-is for analysis at the sentence level, or sentence sentiment can be aggregated to obtain document sentiment, as in the default setting. One way to aggregate sentence sentiment up to document sentiment is to use the aggregate() method of sentometrics.

Trick with bins in a list, do.sentence and tokens

-

Analyzing the sentiment of individual sentences is already a nice approach to observe intra-document sentiment, but sometimes it is better to define a custom container for which sentiments are going to be computed. This is the approach used by Boudt & Thewissen, 2019, where they define bins, equal-sized containers of texts. The idea is to divide a document into equal-sized portion and to analyze each of them independently. Let’s say we decide to split a document of 200 words into 10 bins. To do so, we are going to store the first 20 words in the first bin, the words 21 to 40 in the second bin, and so on… This way, each bin will account for 10% of the text. By repeating the procedure for all texts of a corpus, we can easily compare specific text portions (e.g., the first 10%) between multiples documents.

+

Analyzing the sentiment of individual sentences is already a nice approach to observe intra-document sentiment, but sometimes it is better to define a custom container for which sentiment is going to be computed. This is the approach used by Boudt & Thewissen, 2019, where they define bins, equal-sized containers of text. The idea is to divide a document into equal-sized portions and to analyse each of them independently. Let’s say we decide to split a document of 200 words into 10 bins. To do so, we are going to store the first 20 words in the first bin, the words 21 to 40 in the second bin, and so on… This way, each bin will account for 10% of the text. By repeating the procedure for all texts of a corpus, we can easily compare specific text portions (e.g., the first 10%) between multiple documents.

Let’s split our documents into sets of bins. The first step is to obtain a character vector of tokens for each document. This is done easily with the tokens function from the quanteda package (remember that sentometrics objects are also based on quanteda, leaving us free to use most functions from that package).

usnews2Toks <- tokens(usnews2Sento, remove_punct = TRUE)
 usnews2Toks <- tokens_tolower(usnews2Toks)  # changing all letters to lowercase is optional but recommended
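The bin construction itself falls in an elided part of this hunk. A sketch of how usnews2Bins (used later as the tokens argument of compute_sentiment()) could be built, assuming parallel::splitIndices as in the new section at the end of this diff:

```r
# Sketch (assumed construction): split each document's tokens into 10
# consecutive, roughly equal-sized bins of words.
nBins <- 10
usnews2Bins <- lapply(usnews2Toks, function(toks) {
  lapply(parallel::splitIndices(length(toks), nBins), function(idx) toks[idx])
})
```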
@@ -331,13 +331,13 @@

-Exposing Intratextual Sentiment Structure with bins
+Exposing intratextual sentiment structure with bins

-

In their analysis of CEO letters, Boudt & Thewissen, 2019 identified an intratextual sentiment structure: CEOs would deliberately emphasize sentiments at the beginning and end of the letter, and pay attention to leave out a positive message and the end. Our dataset of news articles is radically different from these letters so we don’t expect to find a similar structure. However, based on our knowledge of news, we can formulate a hypothesis: news articles tend to use strong sentiments in their headlines to attract readers’ eyes. Let’s investigate this using our bins!

+

In their analysis of CEO letters, Boudt & Thewissen, 2019 identified an intratextual sentiment structure: CEOs would deliberately emphasize sentiment at the beginning and end of the letter, and take care to leave a positive message at the end. Our dataset of news articles is radically different from these letters, so we don’t expect to find a similar structure. However, based on our knowledge of news, we can formulate a hypothesis: news articles tend to use strong sentiment in their headlines to attract readers’ eyes. Let’s investigate this using our bins!

Absolute sentiment

-

We expect that the first bin in each article presents on average more sentiment than in the rest of the text. Since news can either be positive or negative, it will easier to identify sentiment intensity using the absolute value lexicon prepared earlier. This way, we avoid the cancelling effect between positive and negative sentiments. Simply plotting the mean sentiment values for each bin across documents can give us some insight on the intratextual structure. Once again, we rely on data.table’s [] operator to easily group sentiment values per sentence_id (remember, these represent the bin number!). In addition to this, a boxplot can be useful to ensure that the mean sentiments are not driven by extreme outliers.

+

We expect that the first bin in each article presents on average more sentiment than the rest of the text. Since news can be either positive or negative, it will be easier to identify sentiment intensity using the absolute value lexicon prepared earlier. This way, we avoid the cancelling effect between positive and negative sentiment. Simply plotting the mean sentiment values for each bin across documents can give us some insight on the intratextual structure. Once again, we rely on data.table’s [] operator to easily group sentiment values per sentence_id (remember, these represent the bin number!). In addition to this, a boxplot can be useful to ensure that the mean sentiment values are not driven by extreme outliers.

par(mfrow = c(1, 2))
 
 plot(sentiment[, .(s = mean(`absoluteLex--dummyFeature`)), by = sentence_id], type = "l",
@@ -351,14 +351,14 @@ 

Herfindahl-Hirschman Index

-

Another way to study the intratextual sentiment structure is to compute the Herfindahl-Hirschman Index across all documents. This is a popular index of concentration, mainly used in measuring competition between firms on a given market. A value close to 0 indicates large dispersion between bins while a value of 1 indicated that all sentiments are found in a single bin. The formula to compute the index of a single document is:

+

Another way to study the intratextual sentiment structure is to compute the Herfindahl-Hirschman Index across all documents. This is a popular index of concentration, mainly used to measure competition between firms on a given market. A value close to 0 indicates large dispersion between bins, while a value of 1 indicates that all sentiment is found in a single bin. The formula to compute the index of a single document is:

\[H = \sum_{b=1}^{B} s_b^2\] where \(b\) indexes the bins and \(s_b\) is the proportion of the document’s sentiment found in bin \(b\).
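As a quick numerical check of the two extremes, with 10 bins:

```r
# Sentiment spread evenly over 10 bins versus fully concentrated in one bin.
s_even <- rep(1/10, 10)
sum(s_even^2)  # minimal concentration for 10 bins: 0.1
s_conc <- c(1, rep(0, 9))
sum(s_conc^2)  # maximal concentration: 1
```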

Using data.table, we can easily compute the index for the whole set of documents.

herfindahl <- sentiment[, .(s = `absoluteLex--dummyFeature`/sum(`absoluteLex--dummyFeature`)), by = id]
 herfindahl <- herfindahl[, .(h = sum(s^2)), by = id]
 mean(herfindahl$h)
## [1] 0.1445487
-

A result that shows there is concentration toward some bins! Note that this result is heavily dependent on the number of bins considered. Only index values computed with the same number of bins should be compared. Let’s show the index’s value if sentiments were uniformly positioned within the text:

+

This result shows that there is concentration toward some bins! Note that this result is heavily dependent on the number of bins considered. Only index values computed with the same number of bins should be compared. Let’s show the index’s value if sentiment were uniformly positioned within the text:

x <- data.table(id = sentiment$id, s = rep(1, nrow(sentiment)))
 
 herfindahl <- x[, .(s = s/sum(s)), by = id]
@@ -370,7 +370,7 @@ 

Computing sentiment with different weights

-

The sentometrics comes with a lot of different weightings methods to compute sentiment and aggregate them into document sentiments or even time series. These weightings methods can be accessed with the functions get_hows.

+

The sentometrics package comes with a lot of different weighting methods to compute sentiment and aggregate it into document sentiment or even time series. These weighting methods can be accessed with the function get_hows().

## $words
 ## [1] "counts"                 "proportional"           "proportionalPol"       
@@ -383,8 +383,8 @@ 

## 
## $time
## [1] "equal_weight" "almon"        "beta"         "linear"       "exponential"  "own"

-

So far, we’ve been using the proportional method from the $words set. The $words set contains the valid options for the hows argument of compute_sentiment(). The other two sets are used within the aggregate() function, to respectively aggregate sentences sentiment into documents or document sentiments into time series.

-

With our earlier computation of sentiments using do.sentences = TRUE, we computed sentiments for sentences and bins. Now, for our next application, we need to aggregate these sentences and bins sentiments into documents sentiments. One option is to aggregate() using one of the methods shown above. Note the use of do.full = FALSE to stop the aggregation at the document level (otherwise, it would directly aggregate up to a time series).

+

So far, we’ve been using the proportional method from the $words set. The $words set contains the valid options for the how argument of compute_sentiment(). The other two sets are used within the aggregate() function, to aggregate sentence sentiment into document sentiment and document sentiment into time series, respectively.

+

With our earlier computation of sentiment using do.sentence = TRUE, we computed sentiment for sentences and bins. Now, for our next application, we need to aggregate these sentence and bin sentiment values into document sentiment. One option is to aggregate() using one of the methods shown above. Note the use of do.full = FALSE to stop the aggregation at the document level (otherwise, it would directly aggregate up to a time series).

docsSentiment <- aggregate(sentiment, ctr_agg(howDocs = "equal_weight"), do.full = FALSE)
 
 lapply(list(sentiment = sentiment, docsSentiment = docsSentiment), head)
@@ -449,14 +449,14 @@

Application to news sentiment prediction

Let’s now put all of this in a concrete example. We’ve been using a modified dataset usnews2 since the beginning because we wanted to have a variable identifying whether the document is positive or negative. Our goal is now to try to predict this value.

-

To do so, we will consider 4 different approaches, in the form of four different weighting methods. We will study which weighting is the best to predict document’s sentiments. The four weighting methods will be:

+

To do so, we will consider four different approaches, in the form of four different weighting methods. We will study which weighting is the best at predicting the documents’ sentiment. The four weighting methods will be:

  • The default weighting based on word frequencies, regardless of the position.
  • A U-shaped weighting of words, where words at the beginning or end of the text are given more weight.
  • -
  • A sentence-weighting, where word sentiments are proportionally weighted up to a sentence sentiment level, then sentences are aggregated with an equal weighting to obtain the document sentiment.
  • -
  • The bin based approach, where word sentiments are proportionally weighted up to a bin sentiment level, then bins are aggregated with our custom weights: the first bin given half the weight and the other bins sharing the rest.
  • +
  • A sentence weighting, where word sentiment values are proportionally weighted up to a sentence sentiment level, then sentences are aggregated with equal weights to obtain the document sentiment.
  • +
  • The bin-based approach, where word sentiment values are proportionally weighted up to a bin sentiment level, then bins are aggregated with our custom weights: the first bin is given half the weight and the other bins share the rest.
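The custom bin weights from the last item can be written down directly (a sketch; binWeights is a hypothetical name, assuming the 10 bins used earlier):

```r
# First bin gets half the total weight; the other 9 bins share the rest equally.
nBins <- 10
binWeights <- c(0.5, rep(0.5 / (nBins - 1), nBins - 1))
sum(binWeights)  # the weights sum to 1
```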
-

The U-shaped weighting is something we haven’t seen before. This is a weighting method for words as per get_words() that gives more weight to the beginning and end of a text. Its exact formulation can be found at the end of the Sentometrics vignette, along with the other available weighting. This weighting scheme can be visualized as follows:

+

The U-shaped weighting is something we haven’t seen before. This is a weighting method for words, as we can learn from get_hows(). This scheme gives more weight to the beginning and end of a text. Its exact formulation can be found at the end of the Sentometrics vignette, along with the other available weightings. This weighting scheme can be visualized as follows:

Qd <- 200 # number of words in the documents
 i <- 1:Qd
 
@@ -465,7 +465,7 @@ 

plot(ushape, type = 'l', ylab = "Weight", xlab = "Word position index", main = "U-shaped weight scheme")
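The computation of ushape falls in an elided part of this diff. Purely for illustration (an assumed stand-in, not the package’s exact formula, which is given in the Sentometrics vignette), a normalized quadratic centered on the text’s midpoint reproduces the U shape:

```r
Qd <- 200                       # number of words in the document
i <- 1:Qd                       # word position indices
ushape <- (i - (Qd + 1) / 2)^2  # quadratic distance from the text's midpoint
ushape <- ushape / sum(ushape)  # normalize so the weights sum to one
```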

-

Let’s compute sentiments with the four different weighting schemes. We will store the results in a list, sentimentValues.

+

Let’s compute sentiment with the four different weighting schemes. We will store the results in a list, sentimentValues.

sentimentValues <- list()
 
 sentimentValues$default <- compute_sentiment(usnews2Sento, sentoLexicon, how = "proportional")
@@ -474,30 +474,18 @@ 

sentimentValues$bins <- compute_sentiment(usnews2Sento, sentoLexicon, tokens = usnews2Bins, how = "proportional", do.sentence = TRUE)

-lapply(sentimentValues, head, n = 3)

+lapply(sentimentValues[c(1,3)], head, n = 3)

## $default
 ##           id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature
 ## 1: 830981632 1971-01-12        192          -0.010156250                0.10130208
 ## 2: 830981642 1971-08-04        243           0.036831276                0.08539095
 ## 3: 830981666 1971-08-24        326           0.007515337                0.03849693
 ## 
-## $uShaped
-##           id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature
-## 1: 830981632 1971-01-12        192         -0.0345837756                0.13075232
-## 2: 830981642 1971-08-04        243          0.0264033780                0.09045472
-## 3: 830981666 1971-08-24        326          0.0006524499                0.02734571
-## 
 ## $sentences
 ##           id sentence_id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature
 ## 1: 830981632           1 1971-01-12         28           -0.09285714                0.12142857
 ## 2: 830981632           2 1971-01-12         37            0.01081081                0.01081081
-## 3: 830981632           3 1971-01-12          6           -0.01666667                0.15000000
-## 
-## $bins
-##           id sentence_id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature
-## 1: 830981632           1 1971-01-12         20           -0.11250000                0.11250000
-## 2: 830981632           2 1971-01-12         19           -0.01842105                0.06052632
-## 3: 830981632           3 1971-01-12         19            0.02105263                0.02105263
+## 3: 830981632           3 1971-01-12          6           -0.01666667                0.15000000

Before going further, we need to aggregate the last two results into a document-level sentiment measure. We are going to aggregate sentences using the aggregate() function, while we will repeat the same operation as before to compute the bin aggregation with the custom weights.

sentimentValues$sentences <- aggregate(sentimentValues$sentences, ctr_agg(howDocs = "equal_weight"), do.full = FALSE)
 
@@ -517,7 +505,7 @@ 

## 1: 830981632 1971-01-12        194          -0.004965374                0.09939751
## 2: 830981642 1971-08-04        243           0.035614035                0.08657895
## 3: 830981666 1971-08-24        336           0.006670419                0.03841824

-

Finally, what remains to do is test our results against the variable s from usnews2. Since we know the number of positive and negative news in s, we can quickly and in a naive way measure the accuracy by ordering the documents by sentiment values.

+

Finally, what remains to do is to test our results against the variable s from usnews2. Since we know the number of positive and negative news items in s, we can quickly, if naively, measure the accuracy by ordering the documents by their sentiment values.

table(usnews2$s)
## 
 ##  -1   1 
@@ -525,75 +513,96 @@ 

Thus, we classify the 605 documents with the lowest sentiment in each measure as negative, and the remaining documents as positive.

Let’s start by adding the s variable to the existing measures by merging each of them with usnews2. The use of lapply allows us to perform the operation on all measures at once.

sentimentValues <- lapply(sentimentValues, function(x) merge.data.frame(x, usnews2[, c("id","s")]))
-lapply(sentimentValues, head, n = 3)
-
## $default
-##          id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature  s
+
+head(sentimentValues$default)

+
##          id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature  s
 ## 1 830981632 1971-01-12        192          -0.010156250                0.10130208 -1
 ## 2 830981642 1971-08-04        243           0.036831276                0.08539095 -1
 ## 3 830981666 1971-08-24        326           0.007515337                0.03849693  1
-## 
-## $uShaped
-##          id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature  s
-## 1 830981632 1971-01-12        192         -0.0345837756                0.13075232 -1
-## 2 830981642 1971-08-04        243          0.0264033780                0.09045472 -1
-## 3 830981666 1971-08-24        326          0.0006524499                0.02734571  1
-## 
-## $sentences
-##          id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature  s
-## 1 830981632 1971-01-12        202          -0.016162828                0.11392764 -1
-## 2 830981642 1971-08-04        251           0.049212232                0.09111569 -1
-## 3 830981666 1971-08-24        349           0.009439859                0.04566347  1
-## 
-## $bins
-##          id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature  s
-## 1 830981632 1971-01-12        194          -0.004965374                0.09939751 -1
-## 2 830981642 1971-08-04        243           0.035614035                0.08657895 -1
-## 3 830981666 1971-08-24        336           0.006670419                0.03841824  1
+## 4 830981681 1972-01-28        158           0.025316456                0.09493671 -1
+## 5 830981684 1973-02-15        174          -0.004022989                0.03160920 -1
+## 6 830981702 1973-05-31        227           0.009251101                0.06784141  1

Since we used merge.data.frame, we need to convert the objects back to data.table and then we can order each of these tables.

sentimentValues <- lapply(sentimentValues, as.data.table) # converting back to data.table
 
 sentimentValues <- lapply(sentimentValues, function(x) x[order(`baseLex--dummyFeature`)]) # order based on the baseLex sentiment values
 
-lapply(sentimentValues, head, n = 3)
-
## $default
-##           id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature  s
+head(sentimentValues$default)
+
##           id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature  s
 ## 1: 830981961 1976-02-20        123           -0.06707317                0.10691057 -1
 ## 2: 842616972 2011-11-23        206           -0.05412621                0.12305825 -1
 ## 3: 842616769 2010-11-20        186           -0.05322581                0.08978495 -1
-## 
-## $uShaped
-##           id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature  s
-## 1: 842613535 1991-05-02        213           -0.07650705                0.10549752  1
-## 2: 830981961 1976-02-20        123           -0.07607828                0.09651269 -1
-## 3: 842615597 2003-11-24        202           -0.07382476                0.10713842 -1
-## 
-## $sentences
-##           id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature  s
-## 1: 830981961 1976-02-20        125           -0.07707298                0.12142702 -1
-## 2: 830984376 1987-12-17        225           -0.06662902                0.10809238 -1
-## 3: 842614104 1994-02-18        205           -0.06074249                0.09806308 -1
-## 
-## $bins
-##           id       date word_count baseLex--dummyFeature absoluteLex--dummyFeature  s
-## 1: 830981961 1976-02-20        127           -0.07186235                0.10620783 -1
-## 2: 842616769 2010-11-20        186           -0.05663281                0.08476454 -1
-## 3: 842616972 2011-11-23        208           -0.04818296                0.11309524 -1
+## 4: 842614104 1994-02-18        195           -0.05256410                0.08384615 -1
+## 5: 830984835 1988-11-20        175           -0.04942857                0.07228571  1
+## 6: 842617451 2014-12-17        159           -0.04874214                0.08270440 -1

Finally, we compute the accuracy by counting the number of times the value of s is -1 in the first 605 documents and the number of times the value is 1 in the last 344 documents. We obtain a balanced accuracy measure by averaging the true negative rate and the true positive rate.

index <- table(usnews2$s)[[1]]
 
 rates <- cbind(trueNegativeRate = sapply(sentimentValues, function(x){sum(x[1:index, s == -1]) / sum(x[, s == -1])}),
                truePositiveRate = sapply(sentimentValues, function(x){sum(x[(1 + index):nrow(x), s == 1]) / sum(x[, s == 1])}))
 
-cbind(rates, balancedAccuracy = (rates[,1] + rates[,2]) / 2 )
+cbind(rates, balancedAccuracy = (rates[, 1] + rates[, 2]) / 2 )
##           trueNegativeRate truePositiveRate balancedAccuracy
 ## default          0.7256198        0.5174419        0.6215308
 ## uShaped          0.7289256        0.5232558        0.6260907
 ## sentences        0.7256198        0.5174419        0.6215308
 ## bins             0.7272727        0.5203488        0.6238108

In this case, the U-shaped weighting performs best, but we can already see the improvement brought by our custom weights in comparison with the default settings. In a supervised learning setting, it can be useful to optimize a custom weighting scheme on a training dataset. An example of such a model can be found in the paper by Boudt & Thewissen, 2019, where bin weights are optimized to predict firm performance.

-

That’s the end of this tutorial. Want to go further? Have a try creating weird bins! They actually don’t have to be of equal size, their specification is up to anyone. Also, keep in mind that we have only covered news articles in this tutorial, which is not representative of all type of texts, feel free to investigate how sentiments are positioned within different types of documents.

+
+

+Hierarchical aggregation - bins of sentences

+

As we learned throughout this tutorial, we can always define more complex methods to compute and aggregate sentiment. The reason why we use different aggregation levels such as bins or sentences is that looking at individual words does not capture the semantic structure of the text. The most appropriate way to compute sentiment is arguably through sentences, as a sentence usually conveys a single statement.

+

Earlier, we implemented the bin approach by creating equal-sized containers of words. Each bin then contained a similar number of words. This naive split had the effect of cutting some sentences between two bins. From a semantic point of view, this is not desirable. Hence, we’re going to define here a new bin approach that respects sentence integrity: bins of sentences.

+

This approach is similar to the previous one, but instead of dividing the texts into equal-sized containers of words, we are going to divide them into equal-sized containers of sentences. This means that each bin will contain approximately the same number of sentences.

+

To implement it, we will need to play a bit with data.table operations to aggregate from sentences to bins of sentences. The first step is to compute sentence sentiment using compute_sentiment(). Then, we’re going to add a column to the resulting sentiment object. This additional column will contain information about the future bin in which each sentence will be aggregated. This is a mapping from sentences to bins of sentences.

+

The following operation creating bin_id is slightly complex. The best way to understand it is to follow the logic from the innermost part of the script up to the final apply(). The innermost function here is splitIndices, which is used to split the sentence_id values of each document into equal-sized vectors. At the second level, the sapply() function determines to which split vector each sentence_id belongs and returns a boolean vector for each. Finally, the last apply() calls the function which() on each of these vectors, resulting in the correct bin indices.

+
sentiment <- compute_sentiment(usnews2Sento, sentoLexicon, how = "proportional", do.sentence = TRUE)
+nBins <- 5
+
+sentiment <- sentiment[, cbind(bin_id = apply(
+                                 sapply(parallel::splitIndices(max(sentence_id), nBins),
+                                        '%in%', x = sentence_id),
+                                 which,
+                                 MARGIN = 1
+                                 ),
+                               .SD), by = id]
+
+sentiment[id == 830981632, 1:6]
+
##           id bin_id sentence_id       date word_count baseLex--dummyFeature
+## 1: 830981632      1           1 1971-01-12         28           -0.09285714
+## 2: 830981632      1           2 1971-01-12         37            0.01081081
+## 3: 830981632      2           3 1971-01-12          6           -0.01666667
+## 4: 830981632      2           4 1971-01-12         33            0.01666667
+## 5: 830981632      3           5 1971-01-12         16           -0.04687500
+## 6: 830981632      4           6 1971-01-12         24            0.04166667
+## 7: 830981632      4           7 1971-01-12         24            0.07708333
+## 8: 830981632      5           8 1971-01-12         17           -0.18529412
+## 9: 830981632      5           9 1971-01-12         17            0.05000000
+

With this result, we can now use the new column bin_id for grouping. We cannot use the sentometrics functions here, as they are not built to take a bin_id column into account. Instead, we use a data.table operation similar to the one we used to compute the bins aggregation with custom weights. This time, however, we simply use the mean() function, so that each bin of sentences receives the average sentiment value of its constituent sentences.

+
sentiment <- sentiment[, c(word_count = sum(word_count), sentence_count = length(sentence_id), lapply(.SD, mean)),
+                                             by = .(id, date, bin_id),
+                                             .SDcols = tail(names(sentiment), -5)]
+head(sentiment[, 1:6])
+
##           id       date bin_id word_count sentence_count baseLex--dummyFeature
+## 1: 830981632 1971-01-12      1         65              2           -0.04102317
+## 2: 830981632 1971-01-12      2         39              2            0.00000000
+## 3: 830981632 1971-01-12      3         16              1           -0.04687500
+## 4: 830981632 1971-01-12      4         48              2            0.05937500
+## 5: 830981632 1971-01-12      5         34              2           -0.06764706
+## 6: 830981642 1971-08-04      1         60              2            0.03981481
+
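As a quick sanity check on the aggregation (using only the values printed in the two tables above), the bin 1 value of document 830981632 is the plain average of its two sentence sentiments:

```r
# Sentences 1 and 2 of document 830981632 form bin 1; their mean matches
# the aggregated value -0.04102317 reported above
mean(c(-0.09285714, 0.01081081))
```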

Finally, we can re-create the graphs used for our initial analysis of the intratextual sentiment structure, this time using bins of sentences. In this case, there is not much difference from the previous analysis. However, using bins of sentences paves the way to more complex and semantically accurate analyses.

+
par(mfrow = c(1, 2))
+
+plot(sentiment[, .(s = mean(`absoluteLex--dummyFeature`)), by = bin_id], type = "l",
+     ylab = "Mean absolute sentiment", xlab = "Bin of sentences")
+
+boxplot(sentiment$`absoluteLex--dummyFeature` ~ sentiment$bin_id, ylab = "Absolute sentiment", xlab = "Bin of sentences",
+        outline = FALSE, range = 0.5)
+

+

That’s the end of this tutorial. Want to go further? Try creating other kinds of bins! They don’t actually have to be of equal size; their specification is entirely up to you. Also, keep in mind that we have only covered news articles in this tutorial, which are not representative of all types of texts. Feel free to investigate how sentiment is positioned within other types of documents.
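As a starting point for such experiments, here is a sketch of one unequal scheme (a hypothetical helper, not part of sentometrics): the first sentence, say a headline, gets its own bin, and the remaining sentences are split into two equal bins:

```r
# Hypothetical unequal binning: bin 1 = first sentence only,
# bins 2 and 3 share the remaining sentences equally
headlineBins <- function(nSentences) {
  if (nSentences < 3) return(rep(1L, nSentences))  # too short to split
  rest <- seq_len(nSentences - 1)
  restBins <- apply(sapply(parallel::splitIndices(nSentences - 1, 2),
                           '%in%', x = rest), 1, which)
  c(1L, 1L + restBins)
}
headlineBins(9)
# 1 2 2 2 2 3 3 3 3
```

The resulting vector could replace bin_id in the grouping step above, giving the headline its own sentiment value.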

+

Acknowledgements

diff --git a/docs/articles/isa_files/figure-html/unnamed-chunk-28-1.png b/docs/articles/isa_files/figure-html/unnamed-chunk-28-1.png
new file mode 100644
index 0000000000000000000000000000000000000000..bfe331059948452f7b0283051c59594331c42e67
GIT binary patch
(binary PNG data, 25632 bytes, omitted)

diff --git a/vignettes/isa.Rmd b/vignettes/isa.Rmd
index bbc756e..9f787b4 100644
--- a/vignettes/isa.Rmd
+++ b/vignettes/isa.Rmd
@@ -17,7 +17,7 @@ Tutorial contributed by [Olivier Delmarcelle](mailto:delmarcelle.olivier@gmail.c
The **`sentometrics`** package introduces simple functions to quickly compute the sentiment of texts within a corpus.
This easy-to-use approach does not prevent more advanced analysis, and the **`sentometrics`** functions remain a solid choice for cutting-edge research. This tutorial will present how to go beyond the basic **`sentometrics`** settings in order to analyse the intratextual sentiment structure of texts. -### Intratextual Sentiment Structure +### Intratextual sentiment structure Does the position of positive and negative words within a text matter? That's a question investigated by [Boudt & Thewissen, 2019](https://doi.org/10.1111/fima.12219) during their research regarding sentiment implied by CEO letters. Based on a large dataset of letters, they analyze how sentiment-bearing words are positioned within the text. They find that CEOs tend to emphasize sentiment at the beginning and the end of their letter, in the hopes of leaving a positive impression to the reader. @@ -30,7 +30,7 @@ One can wonder whether other types of texts follow a similar structure? Indeed, As part of this tutorial, you will learn how to: * Decompose your texts into *bins* (equal-sized containers of words) or sentences. -* Compute sentiments with a variety of weighting schemes. +* Compute sentiment with a variety of weighting schemes. * Create and use your own weighting scheme for a classification task. ## Preparation @@ -55,7 +55,7 @@ table(usnews2$s) The variable `s` indicates whether the news is more positive or negative, based on an expert's opinion. We are going to try to predict this value at the end of the tutorial. We can already prepare a `sento_corpus` and a `sento_lexicon` for our future sentiment computation. -For the `sento_corpus`, we will also create a `dummyFeature` filled with 1's. Since sentiment computations are multiplied by the features of a `sento_corpus`, we want this dummy feature to observe the whole corpus's sentiments. This `dummyFeature` is created by default whenever there's no feature at the creation of the `sento_corpus`. 
+For the `sento_corpus`, we will also create a `dummyFeature` filled with 1's. Since sentiment computations are multiplied by the features of a `sento_corpus`, we want this dummy feature to observe the whole corpus's sentiment. This `dummyFeature` is created by default whenever there's no feature at the creation of the `sento_corpus`. Finally, we remove the feature `s` from the `sento_corpus`, as we do not need it for sentiment computation. @@ -66,7 +66,7 @@ usnews2Sento <- add_features(usnews2Sento, data.frame(dummyFeature = rep(1, leng docvars(usnews2Sento, "s") <- NULL # R-removing the feature ``` -We will use a single lexicon for this analysis, the combined Jockers & Rinker lexicon, obtained from the **`lexicon`** package. However, we will prepare a second and different version of this lexicon where the sentiments assigned to words are all positive, regardless of their original signs. This second lexicon will be useful to better detect the sentiment intensity conveyed. +We will use a single lexicon for this analysis, the combined Jockers & Rinker lexicon, obtained from the **`lexicon`** package. However, we will prepare a second and different version of this lexicon where the sentiment assigned to words are all positive, regardless of their original signs. This second lexicon will be useful to better detect the sentiment intensity conveyed. We used the `data.table` operator `[]` to create the second lexicon in a very efficient way. Most **`sentometrics`** objects are based on `data.table` and this allows to perform complex data transformations. If this is the first time you are seeing the `data.table` way of using `[]`, we recommend you to have a look at their [Introduction vignette](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) and enjoy this powerful tool! @@ -82,18 +82,18 @@ lapply(sentoLexicon, head) `compute_sentiment()` is at the base of sentiment analysis with **`sentometrics`**. 
That's also the function we are going to use to analyse intratextual sentiment. This requires, however, to play with the most advanced features of the function. Before doing that, let us review the different computation settings to really understand what's going on. -### Default computation - from words to document sentiments +### Default computation - from words to document sentiment -When using the default settings (i.e., only specifying the `how` argument), the sentiment for each word within a text will be determined according to the provided lexicons. These word sentiments are then aggregated using the method defined by the `how` argument, aggregating up to the document level to form a sentiment value for the document. +When using the default settings (i.e., only specifying the `how` argument), the sentiment for each word within a text will be determined according to the provided lexicons. These word sentiment are then aggregated using the method defined by the `how` argument, aggregating up to the document level to form a sentiment value for the document. ```{r} sentiment <- compute_sentiment(usnews2Sento, sentoLexicon, how = "proportional") head(sentiment) ``` -In this case, the `how = "proportional"` simply sum words' sentiments then divide it by the number of words in a document. The different settings for `how` can be accessed using the `get_hows()` function. We are going to present the use of a more complex setting at the end of this tutorial. +In this case, the `how = "proportional"` simply sum words' sentiment then divide it by the number of words in a document. The different settings for `how` can be accessed using the `get_hows()` function. We are going to present the use of a more complex setting at the end of this tutorial. 
-### Setting `do.sentence = TRUE` - from words to sentences sentiments +### Setting `do.sentence = TRUE` - from words to sentences sentiment A drastic change in the behaviour of `compute_sentiment()` can be induced by specifying `do.sentence = TRUE` in the function call. If true, the output of `compute_sentiment` will no longer return a sentiment value for each document, but each sentence. Sentiment values within each sentence are still computed using the method provided in the `how` argument, but the function stops there. @@ -102,11 +102,11 @@ sentiment <- compute_sentiment(usnews2Sento, sentoLexicon, how = "proportional", head(sentiment) ``` -The new column `sentence_id` in the output is used to identify the sentences of a single document. This result can be used as-is for analysis at the sentence level, or sentences sentiments can be aggregated to obtain documents sentiments, as in the default setting. One way to aggregate sentences sentiments up to documents sentiments is to use the `aggregate()` method of **`sentometrics`**. +The new column `sentence_id` in the output is used to identify the sentences of a single document. This result can be used as-is for analysis at the sentence level, or sentences sentiment can be aggregated to obtain documents sentiment, as in the default setting. One way to aggregate sentences sentiment up to documents sentiment is to use the `aggregate()` method of **`sentometrics`**. ### Trick with *bins* in a list, `do.sentence` and `tokens` -Analyzing the sentiment of individual sentences is already a nice approach to observe intra-document sentiment, but sometimes it is better to define a custom container for which sentiments are going to be computed. This is the approach used by [Boudt & Thewissen, 2019](https://doi.org/10.1111/fima.12219), where they define *bins*, equal-sized containers of texts. The idea is to divide a document into equal-sized portion and to analyze each of them independently. 
Let's say we decide to split a document of 200 words into 10 *bins*. To do so, we are going to store the first 20 words in the first *bin*, the words 21 to 40 in the second *bin*, and so on... This way, each *bin* will account for 10% of the text. By repeating the procedure for all texts of a corpus, we can easily compare specific text portions (e.g., the first 10%) between multiples documents. +Analyzing the sentiment of individual sentences is already a nice approach to observe intra-document sentiment, but sometimes it is better to define a custom container for which sentiment are going to be computed. This is the approach used by [Boudt & Thewissen, 2019](https://doi.org/10.1111/fima.12219), where they define *bins*, equal-sized containers of texts. The idea is to divide a document into equal-sized portion and to analyse each of them independently. Let's say we decide to split a document of 200 words into 10 *bins*. To do so, we are going to store the first 20 words in the first *bin*, the words 21 to 40 in the second *bin*, and so on... This way, each *bin* will account for 10% of the text. By repeating the procedure for all texts of a corpus, we can easily compare specific text portions (e.g., the first 10%) between multiples documents. Let's split our documents into sets of *bins*. The first step is to obtain a vector of characters for each document. This is done easily with the `tokens` function from the **`quanteda`** (remember that **`sentometrics`** objects are also based on **`quanteda`**, letting us free to use most functions from this package). @@ -154,13 +154,13 @@ head(sentiment) In this case, the `sentence_id` simply refers to the number of the *bin*. Let's now see what we can do with the *bins* we just computed. 
-## Exposing Intratextual Sentiment Structure with *bins* +## Exposing intratextual sentiment structure with *bins* -In their analysis of CEO letters, [Boudt & Thewissen, 2019](https://doi.org/10.1111/fima.12219) identified an intratextual sentiment structure: CEOs would deliberately emphasize sentiments at the beginning and end of the letter, and pay attention to leave out a positive message and the end. Our dataset of news articles is radically different from these letters so we don't expect to find a similar structure. However, based on our knowledge of news, we can formulate a hypothesis: news articles tend to use strong sentiments in their headlines to attract readers' eyes. Let's investigate this using our *bins*! +In their analysis of CEO letters, [Boudt & Thewissen, 2019](https://doi.org/10.1111/fima.12219) identified an intratextual sentiment structure: CEOs would deliberately emphasize sentiment at the beginning and end of the letter, and pay attention to leave out a positive message and the end. Our dataset of news articles is radically different from these letters so we don't expect to find a similar structure. However, based on our knowledge of news, we can formulate a hypothesis: news articles tend to use strong sentiment in their headlines to attract readers' eyes. Let's investigate this using our *bins*! ### Absolute sentiment -We expect that the first *bin* in each article presents on average more sentiment than in the rest of the text. Since news can either be positive or negative, it will easier to identify sentiment intensity using the absolute value lexicon prepared earlier. This way, we avoid the cancelling effect between positive and negative sentiments. Simply plotting the mean sentiment values for each *bin* across documents can give us some insight on the intratextual structure. Once again, we rely on `data.table`'s `[]` operator to easily group sentiment values per `sentence_id` (remember, these represent the *bin* number!). 
In addition to this, a boxplot can be useful to ensure that the mean sentiments are not driven by extreme outliers. +We expect that the first *bin* in each article presents on average more sentiment than in the rest of the text. Since news can either be positive or negative, it will easier to identify sentiment intensity using the absolute value lexicon prepared earlier. This way, we avoid the cancelling effect between positive and negative sentiment. Simply plotting the mean sentiment values for each *bin* across documents can give us some insight on the intratextual structure. Once again, we rely on `data.table`'s `[]` operator to easily group sentiment values per `sentence_id` (remember, these represent the *bin* number!). In addition to this, a boxplot can be useful to ensure that the mean sentiment are not driven by extreme outliers. ```{r,fig.width = 12, fig.height = 5} par(mfrow = c(1, 2)) @@ -176,7 +176,7 @@ We can see that the first two *bins* of articles tend to show a larger absolute ### Herfindahl-Hirschman Index -Another way to study the intratextual sentiment structure is to compute the Herfindahl-Hirschman Index across all documents. This is a popular index of concentration, mainly used in measuring competition between firms on a given market. A value close to 0 indicates large dispersion between *bins* while a value of 1 indicated that all sentiments are found in a single *bin*. The formula to compute the index of a single document is: +Another way to study the intratextual sentiment structure is to compute the Herfindahl-Hirschman Index across all documents. This is a popular index of concentration, mainly used in measuring competition between firms on a given market. A value close to 0 indicates large dispersion between *bins* while a value of 1 indicated that all sentiment are found in a single *bin*. 
The formula to compute the index for a single document is: $$H = \sum_{b=1}^{B} s_b^2$$ where $b$ indexes the $B$ *bins* and $s_b$ is the proportion of the document's sentiment found in *bin* $b$. @@ -189,7 +189,7 @@ herfindahl <- herfindahl[, .(h = sum(s^2)), by = id] mean(herfindahl$h) ``` -A result that shows there is concentration toward some *bins*! Note that this result is heavily dependent on the number of *bins* considered. Only index values computed with the same number of *bins* should be compared. Let's show the index's value if sentiments were uniformly positioned within the text: +A result that shows there is concentration toward some *bins*! Note that this result is heavily dependent on the number of *bins* considered; only index values computed with the same number of *bins* should be compared. Let's show the index's value if sentiment were uniformly positioned within the text: ```{r} x <- data.table(id = sentiment$id, s = rep(1, nrow(sentiment))) @@ -201,15 +201,15 @@ mean(herfindahl$h) ## Computing sentiment with different weights -The **`sentometrics`** comes with a lot of different weightings methods to compute sentiment and aggregate them into document sentiments or even time series. These weightings methods can be accessed with the functions `get_hows`. +The **`sentometrics`** package comes with many different weighting methods to compute sentiment and aggregate it into document sentiment or even time series. These weighting methods can be accessed with the function `get_hows()`. ```{r} get_hows() ``` -So far, we've been using the `proportional` method from the `$words` set.
The `$words` set contains the valid options for the `hows` argument of `compute_sentiment()`. The other two sets are used within the `aggregate()` function, to aggregate, respectively, sentence sentiment into documents and document sentiment into time series. -With our earlier computation of sentiments using `do.sentences = TRUE`, we computed sentiments for sentences and *bins*. Now, for our next application, we need to aggregate these sentences and *bins* sentiments into documents sentiments. One option is to `aggregate()` using one of the methods shown above. Note the use of `do.full = FALSE` to stop the aggregation at the document level (otherwise, it would directly aggregate up to a time series). +With our earlier computation of sentiment using `do.sentence = TRUE`, we computed sentiment for sentences and *bins*. Now, for our next application, we need to aggregate this sentence and *bin* sentiment into document sentiment. One option is to `aggregate()` using one of the methods shown above. Note the use of `do.full = FALSE` to stop the aggregation at the document level (otherwise, it would directly aggregate up to a time series). ```{r message=FALSE} docsSentiment <- aggregate(sentiment, ctr_agg(howDocs = "equal_weight"), do.full = FALSE) @@ -223,6 +223,7 @@ But as we have seen, some *bins* are more likely to present strong sentiment val This is exactly the situation where we would like to test a specific weighting scheme! Say that instead of giving 10% importance to each *bin* in the document sentiment computation, we would give only about 5% importance to the first one and share the rest between the remaining *bins*. Sadly, **`sentometrics`** does not directly provide us with the tool for this kind of computation, so we will need to create our own weighting scheme and aggregate by hand. Luckily, the use of `data.table` makes these customisations painless.
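To make the equal-weight document aggregation concrete, here is a minimal sketch using plain `data.table` on a toy table (the `id` and `s` columns and their values are made up for illustration; they are not the tutorial's objects). It mirrors what `howDocs = "equal_weight"` does: every sentence of a document contributes equally to the document's sentiment.

```{r}
library(data.table)

# Hypothetical toy data: two documents with sentence-level sentiment
toy <- data.table(id = c("d1", "d1", "d2"), s = c(0.2, -0.4, 0.1))

# Equal-weight aggregation: a simple mean per document
toy[, .(docSentiment = mean(s)), by = id]
```

This is only a conceptual sketch; in practice `aggregate()` handles all lexicon and feature columns for us.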
First, we define our customized weights for *bins*: + ```{r} w <- rep(1 / (nBins - 0.5), nBins) w[1] <- w[1] * 0.5 @@ -261,15 +262,15 @@ class(docsSentiment) Let's now put all of this in a concrete example. We've been using a modified dataset `usnews2` since the beginning because we wanted to have a variable identifying whether the document is positive or negative. Our goal is now to try to predict this value. -To do so, we will consider 4 different approaches, in the form of four different weighting methods. We will study which weighting is the best to predict document's sentiments. +To do so, we will consider four different approaches, in the form of four different weighting methods. We will study which weighting is best at predicting the documents' sentiment. The four weighting methods will be: * The default weighting based on word frequencies, regardless of position. * A U-shaped weighting of words, where words at the beginning or end of the text are given more weight. -* A sentence-weighting, where word sentiments are proportionally weighted up to a sentence sentiment level, then sentences are aggregated with an equal weighting to obtain the document sentiment. -* The *bin* based approach, where word sentiments are proportionally weighted up to a *bin* sentiment level, then *bins* are aggregated with our custom weights: the first *bin* given half the weight and the other *bins* sharing the rest. +* A sentence weighting, where word sentiment is proportionally weighted up to a sentence sentiment level, then sentences are aggregated with equal weights to obtain the document sentiment. +* The *bin*-based approach, where word sentiment is proportionally weighted up to a *bin* sentiment level, then *bins* are aggregated with our custom weights: the first *bin* is given half the weight and the other *bins* share the rest. -The U-shaped weighting is something we haven't seen before.
This is a weighting method for words as per `get_words()` that gives more weight to the beginning and end of a text. Its exact formulation can be found at the end of the [Sentometrics vignette](https://doi.org/10.2139/ssrn.3067734), along with the other available weighting. This weighting scheme can be visualized as follows: +The U-shaped weighting is something we haven't seen before. This is a weighting method for words, as we can learn from `get_hows()`. This scheme gives more weight to the beginning and end of a text. Its exact formulation can be found at the end of the [Sentometrics vignette](https://doi.org/10.2139/ssrn.3067734), along with the other available weightings. This weighting scheme can be visualized as follows: ```{r} Qd <- 200 # number of words in the documents @@ -281,7 +282,7 @@ ushape <- ushape/sum(ushape) plot(ushape, type = 'l', ylab = "Weight", xlab = "Word position index", main = "U-shaped weight scheme") ``` -Let's compute sentiments with the four different weighting schemes. We will store the results in a list, `sentimentValues`. +Let's compute sentiment with the four different weighting schemes. We will store the results in a list, `sentimentValues`. ```{r} sentimentValues <- list() @@ -292,7 +293,7 @@ sentimentValues$sentences <- compute_sentiment(usnews2Sento, sentoLexicon, how = sentimentValues$bins <- compute_sentiment(usnews2Sento, sentoLexicon, tokens = usnews2Bins, how = "proportional", do.sentence = TRUE) -lapply(sentimentValues, head, n = 3) +lapply(sentimentValues[c(1, 3)], head, n = 3) ``` Before going further, we need to aggregate the last two results to a document-level sentiment measure. We will aggregate sentences using the `aggregate()` function, while we repeat the same operation as before to compute the *bins* aggregation with the custom weights.
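The custom *bins* aggregation described earlier boils down to a per-document weighted mean. Here is a minimal sketch with made-up values (the `bin_id` and `s` columns below are toy data, not the tutorial's objects), using five *bins* where the first *bin* carries half the weight of the others:

```{r}
library(data.table)

# Toy data: one document with five bin-level sentiment values
toy <- data.table(id = "d1", bin_id = 1:5, s = c(0.5, 0.1, 0.0, -0.2, 0.1))

# Custom weights: first bin gets half the weight of the others; weights sum to 1
w <- rep(1 / (5 - 0.5), 5)
w[1] <- w[1] * 0.5

# Weighted mean per document: each bin's sentiment times its weight
toy[, .(docSentiment = sum(s * w[bin_id])), by = id]
```

Down-weighting the first *bin* dampens the strong headline sentiment relative to a plain mean, which is exactly the behaviour we motivated above.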
@@ -307,7 +308,7 @@ sentimentValues$bins <- sentimentValues$bins[, c(word_count = sum(word_count), l lapply(sentimentValues[3:4], head, n = 3) ``` -Finally, what remains to do is test our results against the variable `s` from `usnews2`. Since we know the number of positive and negative news in `s`, we can quickly and in a naive way measure the accuracy by ordering the documents by sentiment values. +Finally, all that remains is to test our results against the variable `s` from `usnews2`. Since we know the number of positive and negative news items in `s`, we can quickly and naively measure the accuracy by ordering the documents by their sentiment values. ```{r} table(usnews2$s) @@ -319,7 +320,8 @@ Let's start by adding the `s` variable to the existing measures by merging each ```{r} sentimentValues <- lapply(sentimentValues, function(x) merge.data.frame(x, usnews2[, c("id","s")])) -lapply(sentimentValues, head, n = 3) + +head(sentimentValues$default) ``` Since we used `merge.data.frame`, we need to convert the objects back to `data.table`, and then we can order each of these tables. @@ -329,7 +331,7 @@ sentimentValues <- lapply(sentimentValues, as.data.table) # converting back to d sentimentValues <- lapply(sentimentValues, function(x) x[order(`baseLex--dummyFeature`)]) # order based on the baseLex sentiment values -lapply(sentimentValues, head, n = 3) +head(sentimentValues$default) ``` We then compute the accuracy by counting the number of times the value of `s` is -1 in the first 605 documents and the number of times the value is 1 in the last 344 documents. We obtain a balanced accuracy measure by combining the true negative rate and the true positive rate.
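Balanced accuracy is simply the average of the true negative rate and the true positive rate, which stops the larger class from dominating the score. A quick sketch with made-up counts (the numbers below are illustrative only, not the tutorial's results):

```{r}
# Hypothetical counts, for illustration only
trueNegativeRate <- 500 / 605  # negatives correctly placed in the first block
truePositiveRate <- 240 / 344  # positives correctly placed in the last block

# Balanced accuracy: unweighted mean of the two class-specific rates
balancedAccuracy <- (trueNegativeRate + truePositiveRate) / 2
balancedAccuracy
```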
@@ -340,12 +342,60 @@ index <- table(usnews2$s)[[1]] rates <- cbind(trueNegativeRate = sapply(sentimentValues, function(x){sum(x[1:index, s == -1]) / sum(x[, s == -1])}), truePositiveRate = sapply(sentimentValues, function(x){sum(x[(1 + index):nrow(x), s == 1]) / sum(x[, s == 1])})) -cbind(rates, balancedAccuracy = (rates[,1] + rates[,2]) / 2 ) +cbind(rates, balancedAccuracy = (rates[, 1] + rates[, 2]) / 2) ``` In this case, the U-shaped weighting performs best, but we can already see the improvement brought by our custom weights in comparison with the default settings. In a supervised learning setting, it can be useful to optimize a custom weighting scheme on a training dataset. An example of such a model can be found in the paper of [Boudt & Thewissen, 2019](https://doi.org/10.1111/fima.12219), where *bins* weights are optimized to predict firm performance. -That's the end of this tutorial. Want to go further? Have a try creating weird *bins*! They actually don't have to be of equal size, their specification is up to anyone. Also, keep in mind that we have only covered news articles in this tutorial, which is not representative of all type of texts, feel free to investigate how sentiments are positioned within different types of documents. +## Hierarchical aggregation - *bins* of sentences + +As we have learned throughout this tutorial, we can always define more complex methods to compute and aggregate sentiment. The reason we use different aggregation levels such as *bins* or sentences is that looking at words alone does not capture the semantic structure of the text. The most appropriate way to compute sentiment is arguably through sentences, as a sentence usually conveys a single statement. + +Earlier, we implemented the *bins* approach by creating equal-sized containers of words. Each *bin* then contained a similar number of words. This naive split had the effect of cutting some sentences between two *bins*. From a semantic point of view, this is not desirable.
Hence, we're going to define here a new *bins* approach that respects sentence integrity: *bins* of sentences. + +This approach is similar to the previous one, but instead of dividing the texts into equal-sized containers of words, we are going to divide them into equal-sized containers of sentences. This means that each *bin* will contain approximately the same number of sentences. + +To implement it, we will need to play a bit with `data.table` operations to aggregate from sentences to *bins* of sentences. The first step is to compute sentence sentiment using `compute_sentiment()`. Then, we're going to add a column to the resulting sentiment object. This additional column will indicate the future *bin* into which each sentence will be aggregated. This is a mapping from sentences to *bins* of sentences. + +The following operation creating `bin_id` is slightly complex. The best way to understand it is to follow the logic from the innermost part of the script up to the final `apply()`. The innermost function here is `splitIndices`, which is used to split the `sentence_id` values of each document into roughly equal-sized vectors. At the second level, the `sapply()` function determines to which split vector each `sentence_id` belongs and returns a boolean vector for each split. Finally, the last `apply()` calls the function `which()` on each of these vectors, resulting in the correct *bin* indices. + +```{r} +sentiment <- compute_sentiment(usnews2Sento, sentoLexicon, how = "proportional", do.sentence = TRUE) +nBins <- 5 + +sentiment <- sentiment[, cbind(bin_id = apply( + sapply(parallel::splitIndices(max(sentence_id), nBins), + '%in%', x = sentence_id), + which, + MARGIN = 1 + ), + .SD), by = id] + +sentiment[id == 830981632, 1:6] +``` + +With this result, we can now use the new column `bin_id` for grouping. We cannot use the **`sentometrics`** functions here, as they are not built to take a `bin_id` column into account.
Instead, we use a `data.table` operation similar to what we did to compute the *bins* aggregation with custom weights. This time, however, we will simply use the `mean()` function, meaning that each *bin* of sentences will contain the average sentiment value of its constituent sentences. + +```{r} +sentiment <- sentiment[, c(word_count = sum(word_count), sentence_count = length(sentence_id), lapply(.SD, mean)), + by = .(id, date, bin_id), + .SDcols = tail(names(sentiment), -5)] +head(sentiment[, 1:6]) +``` + +Finally, we can re-create the graphs used for our initial analysis of the intratextual sentiment structure, but using *bins* of sentences. In this case, there is not much difference from the previous analysis. However, using *bins* of sentences paves the way for more complex and semantically accurate analyses. + +```{r,fig.width = 12, fig.height = 5} +par(mfrow = c(1, 2)) + +plot(sentiment[, .(s = mean(`absoluteLex--dummyFeature`)), by = bin_id], type = "l", + ylab = "Mean absolute sentiment", xlab = "Bin of sentences") + +boxplot(sentiment$`absoluteLex--dummyFeature` ~ sentiment$bin_id, ylab = "Absolute sentiment", xlab = "Bin of sentences", + outline = FALSE, range = 0.5) +``` + +That's the end of this tutorial. Want to go further? Try creating even weirder *bins*! They don't actually have to be of equal size; their specification is entirely up to you. Also, keep in mind that we have only covered news articles in this tutorial, which are not representative of all types of texts, so feel free to investigate how sentiment is positioned within different types of documents. ## Acknowledgements