How to Generate & Visualize Embeddings

The Topic Analysis Module can automatically identify and group together social media posts that are semantically similar. It can spot latent topics in a dataset (i.e., abstract topics that may not be directly observable from just reading the posts).

The module is entirely web-based and requires no special programming or coding experience. It is designed to help researchers make sense of their social media dataset without having to scroll through endless Excel files, read and review every post, or even have prior knowledge of the dataset's content.

Creating Embeddings

To represent social media posts as embeddings, the module uses sentence-transformer models from Hugging Face (a platform and community repository for sharing machine learning models). These models transform human-readable text, such as social media posts, into computer-readable vectors of numbers known as embeddings. For more information on embeddings, see here and here.

Please note:

  • The EDU version of Communalytic uses the GloVe model (trained on English texts), while the PRO version uses the more robust and multilingual MiniLM-L12-v2 model.
  • These models were primarily trained on English data and may not perform as well on datasets in other languages.
  • As these models are intended for encoding sentences or short paragraphs, embeddings will be generated using only the first 256 words from each post.
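The 256-word cutoff can be illustrated with a minimal sketch. The source does not say how Communalytic counts words, so splitting on whitespace is an assumption, and the function name first_n_words is hypothetical.

```python
def first_n_words(text: str, n: int = 256) -> str:
    """Return at most the first n whitespace-separated words of text.

    Assumption: a "word" is any run of non-whitespace characters.
    The module's actual word-counting rule is not documented here.
    """
    words = text.split()
    return " ".join(words[:n])


# A hypothetical long post made of 300 words.
long_post = " ".join(f"word{i}" for i in range(300))
truncated = first_n_words(long_post)
print(len(truncated.split()))  # 256
```

In practice this means that anything after the 256th word of a long post does not influence where the post lands on the topic map.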

Visualizing Embeddings

To visualize the embedding vectors, the module uses Nomic Atlas, a third-party tool that lets users explore multi-dimensional embeddings on a 2D map. It turns the vectors into an interactive topic map with automatic topic labels added to each grouping of posts. Posts that are located close to each other in the multi-dimensional space are considered semantically similar (i.e., similar in their meaning). Once semantically similar posts are grouped and visualized, researchers can use the resulting interactive topic map to explore their dataset and examine latent topics.
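The idea that nearby vectors indicate similar meaning can be sketched with cosine similarity, a common way to compare embeddings. This is an illustration only: the source does not state which distance measure Nomic Atlas uses, and the three-dimensional vectors below are made-up toy values (real embeddings have hundreds of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as 0
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings: post_a and post_b point in a similar direction,
# post_c points elsewhere.
toy_embeddings = {
    "post_a": [0.9, 0.1, 0.0],
    "post_b": [0.8, 0.2, 0.1],
    "post_c": [0.0, 0.1, 0.9],
}

sim_ab = cosine_similarity(toy_embeddings["post_a"], toy_embeddings["post_b"])
sim_ac = cosine_similarity(toy_embeddings["post_a"], toy_embeddings["post_c"])
print(sim_ab > sim_ac)  # True
```

Here post_a and post_b would be treated as semantically closer than post_a and post_c, which is the relationship a topic map tries to preserve when it places dots near each other.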

Below is a visualization of a sample dataset collected and processed by Communalytic and then imported into Nomic Atlas. Each dot in a topic map represents a post, and colors are automatically assigned to posts based on their semantic similarity. Posts that are similar in their meaning are automatically grouped, and each grouping of posts (cluster) is then automatically assigned a label that nominally summarizes the main topics discussed in that cluster. Researchers can use the suggested label for each cluster of semantically similar posts, or they can override the suggested labels, manually explore the posts, and relabel the clusters as desired.

More technical details about how Nomic Atlas works are available here.

How To

1) Log in to your Communalytic account and click on the “Topic Analysis” icon located next to the dataset you want to analyze (found under “My Datasets”):

2) On the next page, click the “Calculate Embeddings” button to generate embeddings using the content of posts in your dataset. This process will represent your data in a multidimensional semantic space. Depending on the size of the selected dataset, the analysis may take anywhere from a few minutes to several hours to complete due to the computational complexity. You can safely close the browser and return later to check the progress status.


3) Once the embeddings are calculated, you can visualize them using Nomic Atlas. To do so, create a free Nomic Atlas account at https://atlas.nomic.ai/ and generate a Nomic Atlas API token at https://atlas.nomic.ai/cli-login. (You only need to generate a token once.)

4) Copy and paste the generated token into the input box in Communalytic (under Topic Analysis).

The entered token will be saved automatically in your Communalytic account for future use with other datasets.

5) Click the “Visualize with Nomic Atlas” button to start the process of transforming and transferring your dataset (embeddings + post content + toxicity/sentiment analysis scores, if available) into Nomic Atlas. 

A free Nomic Atlas account can only store up to 250,000 posts; as a result, if your dataset is larger than this, only the first 250k posts will be visualized.
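As a rough illustration of the 250,000-post limit, the sketch below splits a dataset into the posts that fit on the map and the posts that would be left out. The list of posts and the function name split_at_limit are hypothetical; only the limit itself comes from the text above.

```python
FREE_ACCOUNT_LIMIT = 250_000  # limit for a free Nomic Atlas account, per the text


def split_at_limit(posts, limit=FREE_ACCOUNT_LIMIT):
    """Return (posts that fit within the limit, posts beyond the limit)."""
    return posts[:limit], posts[limit:]


# A hypothetical dataset of 300,000 posts.
posts = [f"post {i}" for i in range(300_000)]
kept, omitted = split_at_limit(posts)
print(len(kept), len(omitted))  # 250000 50000
```

If your dataset is close to or above the limit, it may be worth checking its size before starting the visualization so you know how much of it will appear on the map.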

Note for Communalytic EDU users

You can only visualize one dataset from Communalytic EDU with Nomic Atlas at a time. If you’ve already visualized another dataset using your Communalytic EDU account, the Nomic Atlas visualization for the previous dataset will be replaced.

This means that if you go back to a dataset where you had previously conducted a topic analysis, the Nomic Atlas link for that dataset will no longer be visible. However, you can still recreate the visualization using the embeddings stored in Communalytic for that dataset.

Note for Communalytic PRO users

In Communalytic PRO, you can store and visualize multiple datasets using your Nomic Atlas account simultaneously.

However, if you have reached the limit of 250k records in your Nomic Atlas account, you won’t be able to see new maps in Communalytic unless you complete one of the following steps:

  • Delete some of your existing maps in your Nomic Atlas account (under Settings) as shown below:
  • Alternatively, enter and use another token from a different Nomic Atlas account in Communalytic (under Topic Analysis) as shown below:

6) After Communalytic successfully generates your Nomic Atlas map, you can access it by using the link provided on the Topic Analysis page (as highlighted below). The link is not publicly listed, but anyone who has it can open the map, which lets you share it with others. You can then explore the different latent topics discussed in your dataset.

When data from Twitter is analyzed, retweets (RTs) are excluded so that the analysis focuses on unique topics. This means that the number of records visualized in the Nomic Atlas map may be smaller than the total number of records in the dataset. In addition, any posts that do not contain text (such as those featuring only photos or videos) are also excluded from the map. For example, in the screenshot below, even though the dataset contains over 300k records (see the left side panel), the number of records visualized in the Nomic Atlas map is about one third of that, due to the removal of RTs and any empty (non-text) posts.
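The exclusion rules described above can be sketched as a simple filter. This is not Communalytic's actual code: the field names "text" and "is_retweet" and the function name posts_for_map are assumptions made for illustration.

```python
def posts_for_map(posts):
    """Keep only posts that are not retweets and that contain some text.

    Assumes each post is a dict with hypothetical keys "text" and "is_retweet".
    """
    kept = []
    for post in posts:
        if post.get("is_retweet", False):
            continue  # retweets are skipped to focus on unique content
        text = post.get("text") or ""
        if not text.strip():
            continue  # posts with no text (e.g., photo-only) are skipped
        kept.append(post)
    return kept


# Hypothetical sample records.
sample = [
    {"id": 1, "text": "New study on urban transit", "is_retweet": False},
    {"id": 2, "text": "RT @user: New study on urban transit", "is_retweet": True},
    {"id": 3, "text": "", "is_retweet": False},
    {"id": 4, "text": "   ", "is_retweet": False},
    {"id": 5, "text": "Transit fares are going up", "is_retweet": False},
]

kept = posts_for_map(sample)
print([p["id"] for p in kept])  # [1, 5]
```

In this toy sample, only two of the five records remain, which mirrors why a map can show far fewer records than the dataset contains.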

Automated labelling/tagging: Depending on the size of the dataset, the process of generating labels/tags by Nomic Atlas might take a few minutes. If you do not see any labels/tags appearing on top of the visualization, allow some time (approximately 5-10 minutes) and then refresh the page. In order to ensure the accuracy of the automatically assigned labels/tags, we recommend clicking on multiple dots/posts within each cluster to review their content. This will help you determine whether some labels/tags are either too specific or too broad.

Please refer to the following tutorial for detailed instructions on using and interpreting the resulting Nomic Atlas maps: How to Explore Embeddings in Nomic Atlas – Communalytic