The Topic Analysis Module can automatically identify and group together social media posts that are semantically similar (i.e., similar in meaning). It can spot latent topics in a dataset, i.e., abstract topics that may not be directly observable from just reading the posts.
The module is entirely web-based and requires no special programming or coding experience. It is designed to help researchers make sense of their social media dataset without having to scroll through endless Excel files, read and review every post, or even have prior knowledge about the dataset’s content.
Creating and Clustering Embeddings #
The Topic Analysis Module uses a sentence-transformers model to represent posts as vector embeddings. These embeddings capture the semantic meaning of the text and can be used for various natural language processing tasks, such as sentence similarity, clustering, and retrieval. For more information on embeddings, see here and here.
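As a toy illustration of how embeddings support sentence similarity, the sketch below compares made-up embedding vectors using cosine similarity. The vectors and their dimensionality are invented for illustration only; real models such as Voyage-3 produce embeddings with hundreds of dimensions.

```python
import numpy as np

# Toy 4-dimensional "embeddings" for three posts. The values are made up:
# real embedding models output much higher-dimensional vectors.
post_a = np.array([0.9, 0.1, 0.0, 0.2])
post_b = np.array([0.8, 0.2, 0.1, 0.3])   # semantically close to post_a
post_c = np.array([0.0, 0.9, 0.8, 0.1])   # a different topic

def cosine_similarity(u, v):
    # Values near 1 mean the vectors point in the same direction
    # (very similar meaning); values near 0 mean unrelated.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(post_a, post_b))  # high (near 1)
print(cosine_similarity(post_a, post_c))  # much lower
```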
Communalytic analyzes posts using VoyageAI multilingual embedding models (Voyage-3 in Communalytic EDU and Voyage-Multilingual-2 in Communalytic PRO). Voyage-3 is a general-purpose model optimized for multilingual retrieval. Voyage-Multilingual-2 is a more advanced model shown to outperform similar models; it was tested on texts in 27 languages: Arabic, Bengali, Czech, Danish, Dutch, English, French, Georgian, German, Greek, Hungarian, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Polish, Portuguese, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Urdu, and Vietnamese.
Since embeddings are vectors in a high-dimensional space, Communalytic uses a dimension-reduction technique called UMAP to reduce the embeddings to three dimensions for visualization purposes. The final step is to group posts/embeddings located close to each other in this 3D space using a clustering algorithm. Communalytic currently supports the following clustering options: HDBSCAN, KMeans, and Gaussian Mixture. See more details in the “How To” section below.
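The overall pipeline (embed, reduce to 3D, cluster) can be sketched as follows. This is a minimal illustration, not Communalytic's internal code: synthetic random vectors stand in for real VoyageAI embeddings, PCA is used as a stand-in for UMAP so the sketch runs without the umap-learn package, and KMeans is used as the clustering step.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Simulate 512-dimensional embeddings for 200 posts. In Communalytic,
# these would come from a VoyageAI embedding model instead.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(200, 512))

# Communalytic reduces embeddings to 3 dimensions with UMAP; PCA is used
# here only as a stand-in so the example has no umap-learn dependency.
coords_3d = PCA(n_components=3).fit_transform(embeddings)

# Group nearby points in the 3D space (KMeans is one supported option).
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords_3d)
print(coords_3d.shape, len(set(labels)))
```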
Visualizing Embeddings #
Once embeddings are projected into a 3D space and grouped based on semantic similarity, they are visualized using a built-in visualizer – the 3D Semantic Similarity Map. Below is a sample visualization. Each dot in the Map represents a post, and colors are automatically assigned based on the selected clustering algorithm.
To examine latent topics within your dataset, you can manually review sample posts in each cluster and assign a descriptive label. To help you with this process, the 3D Semantic Similarity Map has a feature to automatically suggest a label based on one of three LLMs (llama-3.1-8b-instruct, mistral-7b-instruct-v.2-lora, or bart-large-CNN).
How To #
1) Log in to your Communalytic account and click on the “Topic Analysis” icon located next to the dataset you want to analyze (found under “My Datasets”):
2) Click the “Visualize Embeddings” button to generate and visualize embeddings using the default clustering settings. If you are dissatisfied with the resulting visualization based on the default settings (e.g., getting too few or too many clusters), you can change the clustering algorithm or adjust its settings as outlined below.
Communalytic currently supports three clustering algorithms:
- Fast HDBSCAN (the default) is best for data with varying densities and outliers but requires parameter tuning.
- KMeans is efficient and easy to use but assumes spherical clusters and is sensitive to outliers.
- Gaussian Mixture offers flexibility in cluster shapes with probabilistic memberships but assumes Gaussian distributions.
Choosing the “right” clustering algorithm depends on the nature of your data and the specific requirements of your analysis. If you have an expectation about the approximate number of clusters based on your familiarity with the dataset’s content, or wish to set it manually, we suggest using either KMeans or Gaussian Mixture. Otherwise, use Fast HDBSCAN to determine the “optimal” number of clusters automatically.
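For readers curious what the two fixed-count options look like in code, here is a minimal scikit-learn sketch. Synthetic 3D coordinates stand in for the projected embeddings; this is an illustration of the general technique, not Communalytic's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Synthetic 3D "post" coordinates with 4 known groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=0)

# If you expect roughly 4 topics, KMeans takes the cluster count directly:
km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Gaussian Mixture also takes a fixed component count, but assigns
# probabilistic ("soft") memberships that you can inspect per post:
gm = GaussianMixture(n_components=4, random_state=0).fit(X)
gm_probs = gm.predict_proba(X)      # per-post probability for each cluster
gm_labels = gm_probs.argmax(axis=1)  # hard assignment: most likely cluster
```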
When using Fast HDBSCAN for clustering, you can adjust the following parameters to control the number of resulting clusters:
- Epsilon controls the distance for neighborhood consideration, impacting how close points need to be to form a cluster. Hint: Values closer to 1 will produce fewer clusters. We suggest starting with 0.1 (the default) and then gradually increasing it by 0.1 until satisfied with the number of resulting clusters. (You will know that the value is too high if you end up with a single cluster.)
- Minimum Cluster Size sets a threshold for the minimum number of posts required to form a cluster. This helps to avoid creating clusters with too few posts, capturing overly granular topics that are time-consuming to review and label manually. Hint: This parameter largely depends on the size of your dataset. For smaller datasets (fewer than 1k posts), use the default value of 10. However, if you are getting too many clusters (over 100), consider increasing this value.
- Minimum Sample Size determines the density requirement for a point to be considered a core point and thus part of a cluster. Hint: Setting higher values will result in more outliers (posts that are not assigned to any cluster), but this is not a problem if the goal is to identify groupings of strongly similar posts. However, if the aim is to reduce the number of isolates, lower this value to 10 (the default) or below. This will assign ‘borderline’ posts (semantically speaking) to the most relevant cluster.
Below are some additional considerations when setting Minimum Cluster Size and Minimum Sample Size for HDBSCAN:
Desired Cluster Configuration | Minimum Cluster Size | Minimum Sample Size
--- | --- | ---
More clusters (highly specific) | Small (10-50) | Small (1-10)
Fewer clusters (generalized clusters with some specificity) | Large (>50) | Small (1-10)
Very general clusters (more posts labelled as ‘outliers’) | Large (>50) | Large (>10)
Iteratively trying out different parameter values is crucial in guiding the clustering algorithm toward an optimal number of clusters, striking a balance between too many clusters representing overly granular topics and too few, overly abstract ones. See more details about the Parameter Selection for HDBSCAN here.
3) Once the process starts, you will see a progress bar. Since generating embeddings and projecting them into a 3D space is computationally intensive, Communalytic analyzes only three datasets in parallel. If three analyses are already running when you start a new one, your request will be placed in a queue and will begin automatically when it is your turn. You can close the browser and check the progress later.
4) When the data processing is done, you will see a screen with three buttons: “Open Visualization”, “Change Clustering Parameters”, and “Download Embeddings & Clusters”.
- The “Open Visualization” button will open the 3D Semantic Similarity Map within your browser, allowing you to visualize embeddings in a 3D space.
- The “Change Clustering Parameters” button allows you to adjust the clustering parameters to fine-tune the current visualization.
- The “Download Embeddings & Clusters” button will create and help you download your embeddings and cluster labels for the complete dataset as a CSV file. Note: Due to the potentially large size of the output CSV file, it will be exported as a ZIP file.
5) The 3D Semantic Similarity Map is an interactive visualization that lets you examine the resulting map from different angles and zoom levels within your browser, without the need to download any additional software. For the best user experience, we recommend using one of the latest browsers (e.g., Chrome, Edge, or Firefox).
Any posts not containing text (such as those featuring only photos or videos) will be excluded from the map. This means that the number of records visualized in the map may be smaller than the total dataset. Retweets are also excluded from Twitter data to focus on unique topics.
The 3D view is controlled with a mouse or touch screen. Below is the list of navigation options:
- Left-click and drag to rotate the camera.
- Right-click and drag to pan the camera.
- Scroll the mouse wheel to zoom in and out.
- Hover over a dot to preview the corresponding post.
To increase or decrease the number of clusters, go to the “Adjust Clustering Parameters” section in the right-side panel, change the clustering parameters, and click the “Apply Changes” button.
6) In the 3D Semantic Similarity Map, manually review sample posts in each cluster and assign a descriptive label following the steps below:
- Use the drop-down menu in the “Examine & Label Clusters” panel to select and zoom in on a specific cluster.
- Under the “Posts in the Selected Cluster” panel, preview up to 1k posts within a given cluster using the “<” and “>” buttons.
- In the text box under “Label Cluster”, enter a short description of the selected cluster (up to 100 characters), then click “Save Cluster Label” to associate the label with all posts in the selected cluster. This label will be stored as part of your dataset and can be renamed later.
7) To help you with this process, you can use one of three LLMs (llama-3.1-8b-instruct, mistral-7b-instruct-v.2-lora, or bart-large-CNN) to automatically suggest a label. To use this feature, select a cluster and click the “Suggest a label” button. Communalytic will generate a summary description using a sample of posts from the selected cluster. The sample size is set to 10% of the cluster, with a minimum of 10 posts and a maximum of 100 posts.
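The sampling rule translates into a one-line calculation. The helper name and the exact rounding behaviour below are assumptions for illustration; the 10% / minimum-10 / maximum-100 rule itself comes from the description above.

```python
def label_sample_size(cluster_size: int) -> int:
    """Posts sampled when suggesting a cluster label: 10% of the cluster,
    clamped to a minimum of 10 and a maximum of 100 posts.
    (Hypothetical helper; the exact rounding used by Communalytic
    is not documented, so integer division is assumed here.)"""
    return max(10, min(100, cluster_size // 10))

print(label_sample_size(50))    # small cluster -> the minimum of 10
print(label_sample_size(400))   # 10% of the cluster -> 40
print(label_sample_size(5000))  # capped at the maximum of 100
```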
During this feature’s preview phase, the usage cap is set at up to 30 calls per day in Communalytic EDU and up to 100 calls per day in Communalytic PRO. The cap will be adjusted based on actual resource utilization and the cost of using an external LLM service for data processing.
In addition to using Communalytic’s built-in 3D visualization, you can explore your dataset and visualize it with external visualization tools such as Nomic Atlas. This third-party tool allows users to represent and explore embeddings in a 2D space, with features complementary to Communalytic’s built-in visualization, such as semantic search and automatic topic labelling at multiple levels of granularity. You can learn more about this option here.