The Sentiment Analysis module in Communalytic is designed to detect the polarity of posts in your dataset; that is, to determine if posts in your dataset express neutral, negative, or positive sentiments. The module can conduct sentiment analysis on text in the following languages: English, French, German and Russian using one or more of the following three popular sentiment analysis libraries: VADER (EN), TextBlob (EN, FR, DE) and Dostoevsky (RU).
Posts in French or German will only be analyzed by TextBlob. Posts in Russian will only be analyzed by Dostoevsky. Posts in English will be analyzed by both VADER and TextBlob. Researchers with a predominantly English-language dataset will have the option to inspect conflicting polarity scores generated by these two sentiment analysis libraries and decide which library is more accurate for their dataset. Posts in other languages are currently not supported in Communalytic and will be skipped.
These lexicon-based libraries were selected for the following reasons:
- They have been empirically validated and used by many other researchers.
- They have been shown to work across different domains.
- They are sufficiently fast and easy to integrate within a web application like Communalytic.
All three libraries analyze text based on a set of predefined dictionaries and rules and then they assign a polarity score between -1 and 1 to the text:
- Polarity scores close to 0 (usually above -0.05 and below 0.05) represent neutral sentiments.
- Negative polarity scores (-0.05 or lower) represent negative sentiments.
- Positive polarity scores (0.05 or above) represent positive sentiments.
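The thresholds listed above can be expressed as a small function. This is only an illustrative sketch, not Communalytic's own code; the function name `label_from_polarity` and the `neutral_band` parameter are assumptions made for this example.

```python
def label_from_polarity(score, neutral_band=0.05):
    """Map a polarity score in [-1, 1] to a sentiment label.

    Scores at or above +neutral_band are positive, scores at or below
    -neutral_band are negative, and everything in between is neutral.
    """
    if score >= neutral_band:
        return "positive"
    if score <= -neutral_band:
        return "negative"
    return "neutral"


# Scores exactly on a boundary fall into the positive or negative class.
for score in (-0.6, -0.05, 0.0, 0.049, 0.05, 0.8):
    print(score, label_from_polarity(score))
```

Keeping the width of the neutral band as a parameter makes it easy to try other cut-offs later.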
Which polarity scores should you use in your study: VADER’s or TextBlob’s?
As noted earlier, Communalytic can conduct sentiment analysis on text in the following languages: English, French, German and Russian. Posts in French or German will only be analyzed by TextBlob. Posts in Russian will only be analyzed by Dostoevsky. Posts in other languages are currently not supported in Communalytic and will be skipped.
Posts in English will be analyzed by both VADER and TextBlob. Researchers with a predominantly English-language dataset will have the option to inspect conflicting polarity scores generated by these two sentiment analysis libraries and decide which library is more accurate for their dataset. In general, if your dataset consists of mostly English-language posts, we suggest using VADER’s scores since VADER has been shown to slightly outperform TextBlob (in at least one study). Also, in comparison to TextBlob, VADER is more widely cited, with over 7k citations versus about 2.6k citations for TextBlob. Furthermore, in a separate study that compared 24 different sentiment analysis methods (not including TextBlob), VADER was among the top three sentiment analysis algorithms for classifying the sentiments of social media posts.
However, the accuracy of any algorithm can vary from dataset to dataset. If your dataset consists of mostly English-language posts, we also suggest examining all or a sample of the polarity scores produced by both libraries to cross-validate the results. This exercise will help you determine which of the two sentiment analysis algorithms is more accurate at detecting the polarity of posts in your particular dataset. To do this, download the dataset after the sentiment analysis is complete, and then use either Excel or Google Sheets to manually review cases where VADER and TextBlob disagree, especially cases where posts were assigned opposite polarity scores.
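If you prefer a script to a spreadsheet, the same review list can be built with a few lines of Python. The sketch below is an assumption-laden illustration: it relies on the two score columns named later in this tutorial (‘vader_sentiment_compound’ and ‘textblob_polarity’), uses a hypothetical `post_id` column and an in-memory sample instead of a real download, and applies the default ±0.05 thresholds.

```python
import csv
import io


def label(score, band=0.05):
    """Translate a polarity score into a sentiment label."""
    if score >= band:
        return "positive"
    if score <= -band:
        return "negative"
    return "neutral"


def find_disagreements(rows):
    """Split rows where the two libraries disagree into two lists.

    Returns (opposite, other): 'opposite' holds positive-vs-negative
    conflicts, 'other' holds conflicts that involve a neutral label.
    """
    opposite, other = [], []
    for row in rows:
        try:
            vader = float(row["vader_sentiment_compound"])
            textblob = float(row["textblob_polarity"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without two numeric scores, e.g. "N/A"
        vader_label, textblob_label = label(vader), label(textblob)
        if vader_label == textblob_label:
            continue
        if {vader_label, textblob_label} == {"positive", "negative"}:
            opposite.append(row)
        else:
            other.append(row)
    return opposite, other


# Tiny in-memory stand-in for the downloaded CSV (post_id is hypothetical).
sample_csv = io.StringIO(
    "post_id,vader_sentiment_compound,textblob_polarity\n"
    "a,0.62,-0.40\n"
    "b,0.30,0.00\n"
    "c,-0.50,-0.20\n"
    "d,N/A,N/A\n"
)
opposite, other = find_disagreements(csv.DictReader(sample_csv))
print(len(opposite), "opposite-polarity posts;", len(other), "other disagreements")
```

To run this on a real export, replace `sample_csv` with an opened file, for example `open("my_dataset.csv", newline="", encoding="utf-8")`, where the file name is a placeholder.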
How to use the Sentiment Analysis Module in Communalytic
The step-by-step instructions below will show you how to use the Sentiment Analysis module in Communalytic.
Step 1: To start the Sentiment Analysis module, go to the My Datasets page and click on the meter icon under the column called “Sentiment Analysis”.
For this tutorial, we’ll use a small sample dataset of 2567 posts and replies from a public subreddit dedicated to Marvel Studios and the Marvel Cinematic Universe called r/Marvelstudios.
Step 2: On the next screen, click “Start Analysis”.
Step 3: Once the analysis has begun, you can monitor the progress using the progress bar screen. You may close this window and visit it later to review/download the results.
Step 4: Once the analysis is complete, the results will be displayed in a summary table showing the counts of negative, neutral and positive posts based on the calculated polarity scores. Following Bonta et al. (2019), polarity scores close to 0 (above -0.05 and below 0.05) represent neutral sentiments, negative polarity scores (-0.05 or lower) represent negative sentiments, and positive polarity scores (0.05 or above) represent positive sentiments.
Note: Depending on your use case and research questions, you can set your own thresholds to determine how to translate polarity scores (from -1 to +1) into sentiment labels (neutral, negative or positive). Here is an example of how you can calculate the “optimal” thresholds with your own dataset (See “Step #5: Evaluate the sentiment analysis results”).
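One simple way to choose your own thresholds is to hand-label a small sample of posts and pick the cut-off that agrees best with those labels. The sketch below illustrates that idea only; it is not the procedure from the example referenced above, and the function names, candidate values and sample data are all made up for illustration.

```python
def label(score, band):
    """Translate a polarity score into a label using a symmetric neutral band."""
    if score >= band:
        return "positive"
    if score <= -band:
        return "negative"
    return "neutral"


def best_band(scored, candidates):
    """Return (band, accuracy) for the candidate that best matches hand labels.

    `scored` is a list of (polarity_score, hand_label) pairs. Ties are
    resolved in favour of the earliest candidate in the list.
    """
    best = None
    for band in candidates:
        correct = sum(1 for score, gold in scored if label(score, band) == gold)
        accuracy = correct / len(scored)
        if best is None or accuracy > best[1]:
            best = (band, accuracy)
    return best


# Hypothetical hand-labeled sample: (polarity score, label chosen by a human coder).
hand_labeled = [
    (0.70, "positive"), (0.12, "neutral"), (0.03, "neutral"),
    (-0.08, "neutral"), (-0.45, "negative"), (0.30, "positive"),
]
band, accuracy = best_band(hand_labeled, [0.05, 0.10, 0.15, 0.20])
print(f"best neutral band: ±{band} (accuracy {accuracy:.2f})")
```

A real evaluation would use a much larger labeled sample and, ideally, a held-out set so that the chosen threshold is not tuned to a handful of posts.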
The polarity scores are also displayed as a distribution chart (see below). VADER’s scores are shown in green and TextBlob’s scores in purple. From this chart we can see that positive sentiment is the most common: the distribution of polarity scores is concentrated on the right (positive) side, and the largest share of posts (~50%) was assigned a polarity score of 0.05 or above.
In addition, for posts in English, the results page provides a so-called Confusion Matrix showing both agreement and disagreement counts across sentiment labels as determined by VADER and TextBlob (see below).
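To make the idea of this matrix concrete, the sketch below builds a similar agreement table from pairs of VADER and TextBlob scores. It is an illustration under assumptions (made-up scores, default ±0.05 thresholds, hypothetical function names) and does not reproduce how Communalytic computes its own table.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")


def label(score, band=0.05):
    """Translate a polarity score into a sentiment label."""
    if score >= band:
        return "positive"
    if score <= -band:
        return "negative"
    return "neutral"


def agreement_table(pairs):
    """Count posts for each (VADER label, TextBlob label) combination.

    Rows follow VADER's label and columns follow TextBlob's label, both in
    the order given by LABELS. Diagonal cells are agreements.
    """
    counts = Counter((label(vader), label(textblob)) for vader, textblob in pairs)
    return [[counts[(row, col)] for col in LABELS] for row in LABELS]


# Made-up (VADER score, TextBlob score) pairs for five posts.
pairs = [(0.62, 0.40), (0.30, 0.00), (-0.50, -0.20), (0.62, -0.40), (0.0, 0.0)]
table = agreement_table(pairs)
for name, row in zip(LABELS, table):
    print(f"{name:>8}", row)

agreements = sum(table[i][i] for i in range(len(LABELS)))
print("agreements:", agreements, "of", len(pairs))
```

Off-diagonal cells in the corners (negative vs. positive) are the opposite-polarity cases most worth reviewing by hand.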
Step 5: To download the dataset as a CSV file, navigate to the sidebar and click on “Download Dataset”.
Step 6: After the CSV file has been downloaded, you will be able to access and see different scores generated by the two sentiment analysis libraries.
In addition to storing the two polarity scores with values between -1 and +1 (‘textblob_polarity’ and ‘vader_sentiment_compound’), the file will also include separate values for VADER’s neutral, negative and positive scores (values between 0 and 1).
These three scores are included in the downloaded file in the interest of completeness and in case some researchers find them useful. As per VADER’s documentation, these scores represent “ratios for proportions of text that fall in each [sentiment] category” and are based on the “raw categorization of each lexical item (e.g., words, emoticons/emojis, or initialisms) into positive, negative, or neutral classes”. The main limitation is that these ratios “do not account for the VADER rule-based enhancements”; as a result, we suggest relying on VADER’s normalized, weighted composite score (‘vader_sentiment_compound’), especially when comparing the results with the related ‘textblob_polarity’ score generated by TextBlob.
Posts written in an unsupported language (not English, French, German or Russian) will have “N/A” under the sentiment analysis-related columns. As a reminder, TextBlob can analyze posts in English, French and German. VADER can only analyze posts in English, and Dostoevsky can only analyze posts in Russian.
Step 7: (Optional) Depending on your research, you can also choose to visualize and explore the polarity scores using Communalytic’s built-in Network Visualizer. To do this, go to the Network Analysis module and generate a network.
Once the network is generated, click on the “Visualize Network” button.
Important: If you have already created a network from the data in your dataset prior to running the Sentiment Analysis, you will need to remove the previously generated network file by clicking on the “Reset Network File” button first.
Step 8: The Sentiment Analysis filter is available under the Node and Edge Filter tab in the Control Panel on the right side of the screen. Use this filter to hide edges outside a specified range of VADER’s or TextBlob’s polarity scores as shown below.
To adjust the Sentiment Analysis filter, slide the squares to the desired range between -100 (likely negative posts) and 100 (likely positive posts). This filter will then hide posts (represented as edges in the network) outside the specified range of the polarity scores.
You may notice that while the actual polarity scores are between -1 and 1, for the network visualization we scale this range to [-100, 100]. This is to make it easier for users to select a desired range with the available interface elements.
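The relationship between the two ranges is a multiplication by 100. The sketch below shows how a slider range would translate back to polarity scores; the function names are hypothetical, and rounding to whole numbers is an assumption about the interface, not a documented detail.

```python
def to_slider(score):
    """Scale a polarity score from [-1, 1] to the [-100, 100] slider range.

    Rounding to a whole number is an assumption made for this sketch.
    """
    return round(score * 100)


def is_visible(score, low, high):
    """Return True if a post's scaled score falls inside the slider range."""
    return low <= to_slider(score) <= high


# With the slider set to [50, 100], only clearly positive posts remain visible.
for score in (-0.40, 0.05, 0.62):
    print(score, to_slider(score), is_visible(score, 50, 100))
```

For example, the default neutral band of -0.05 to 0.05 corresponds to slider values from -5 to 5.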