Learning about color from language – Communications Psychology

Communications Psychology volume 3, Article number: 60 (2025)
Certain colors are strongly associated with certain adjectives (e.g. red is hot, blue is cold). Some of these associations are grounded in visual experiences such as seeing glowing red embers. Surprisingly, despite having no visual experience, many congenitally blind people show very similar color associations which are likely learned through language. We show that these associations are indeed embedded in the statistical structure of language. We apply a projection method to word embeddings trained on corpora of spoken and written language to identify color-adjective associations as they are represented in English. These projections were predictive of color-adjective associations reported by blind and sighted English speakers. The most predictive projections were generated by embeddings derived from a corpus of fiction, which outperformed even the state-of-the-art large language model, GPT-4. By augmenting the training corpora in various ways we discover the types of sentences most responsible for conveying the color-adjective associations to the models. We find that word embedding models learn these associations from indirect (second-order) co-occurrences, and that when prompted, people are able to identify some of the words that are most informative for associating colors with specific adjectives. Learning through linguistic co-occurrences is one way word meanings can be continually aligned across language users despite large variations in perceptual experience.
Much of what we know about the world we learn through personal sensory experience. This can lead to a presumption that people who lack certain sensory experiences cannot have the knowledge these experiences impart. For example, it was long thought that blind people have no knowledge or understanding of color (see e.g.,¹). Locke, for example, writes of a “studious blind man” who “bragged one day, that he now understood what scarlet signified … it was like the sound of a trumpet.” According to Locke, “Just such an [erroneous] understanding [one will have who gets the idea] only from a definition, or other words made use of to explain it”².
Behavioral studies, however, show that despite having no direct visual experience, congenitally blind people possess richly structured visual knowledge³. For example, despite never seeing colors, some congenitally blind people judge the similarity and difference of color words in ways nearly identical to sighted people (refs. ⁴,⁵; cf. ⁶). Their judgments of words referring to other visual properties such as “sparkle” and “shimmer” are likewise practically indistinguishable from those of sighted people⁷. Congenitally blind people’s judgments of what colors are typical of various objects are broadly similar to those of sighted people⁸, and they have similar intuitions about which objects have consistent colors (e.g., fire trucks, traffic lights, police uniforms) and which do not (e.g., lunch boxes, cars) (ref. ⁹; cf. ¹⁰). Recently, Saysani and colleagues¹¹ examined patterns of associations between colors and various adjectives (e.g., red-ripe, blue-cold, black-heavy), again showing substantial overlap between sighted and congenitally blind people’s judgments.
Taken together, these findings provide an empirical counterpoint to philosophical speculation about the empty nature of visual knowledge in the absence of direct perceptual experience and suggest that languages are far richer repositories of perceptual information than has been generally acknowledged¹²,¹³. But where in language is such information and what form does it take? We examined whether color-adjective associations provided by blind and sighted people are predicted by word-embedding models. Note that word embeddings are not the only way to recover color semantics from text. Simpler methods, like comparing how often colors appear with certain adjectives in the same sentence, can also predict how people associate colors and adjectives (see SI section 1 for details). For example, one could look at how much more likely “red” is to appear in sentences with “hot” versus “cold”. However, we chose word embeddings for their versatility in handling various augmentations and their ability to capture higher-order associations beyond first-order relationships. We later show that first-order relationships are not necessary for making color learnable from text.
Prediction-based word embedding models such as word2vec instantiate–at scale–the dictum of distributional semantics that the meaning of a word “is the company it keeps”¹⁴. While there is certainly more to word knowledge than its relationship to other words, there is good reason to think that people are able to extract semantic information from large-scale distributional patterns in broadly similar ways to the predictive models we use here for generating word embeddings¹⁵,¹⁶,¹⁷,¹⁸,¹⁹. We then examined what type of language best predicts these judgments. Is the relevant information most represented in a particular genre of language such as spoken language or fiction? Next, we asked what kinds of co-occurrences are most important. For example, if blind people rate “red” as a hotter color than “green”, is it because they are exposed to phrases like “red hot coals” or “green fruits are unripe” that directly link the two words? Or might the associations stem from higher-order co-occurrences such that color words with more similar associations (e.g., the “hotter” colors) tend to have more similar contexts?
In Experiment 1, we predicted blind and sighted people’s color-adjective associations from word embedding models trained on different corpora of English. In Experiment 2, we examined what type of language most contributes to predicting human color-adjective associations. In Experiment 3, we augmented text corpora to remove possible sources of color-adjective associations to determine what kinds of statistics are most important for learning color-adjective associations from language. In Experiment 4, we examined the formation of color-adjective associations during training to understand what kinds of sentences are most informative for producing human-like associations.
All experiments received ethics approval from the University of Wisconsin-Madison’s Minimal Risk Research Institutional Review Board (approval number 2020-0683) as the experiments pose minimal risk to participants. All participants provided informed consent prior to participating. None of the experiments were preregistered.
Saysani and colleagues¹¹ measured association strength between colors and various adjectives by using a semantic differential task. On each trial, participants were asked to position a color word (e.g. “blue”) on a semantic dimension anchored by a pair of antonyms (e.g. “hot” and “cold”). Here, we use semantic projections²⁰, a method that has shown strong correlations with human judgments across a wide range of categories and semantic dimensions, to predict participants’ ratings from the distributional statistics of these colors and adjectives in large text corpora. To implement this, we use word embeddings to define an axis for each semantic dimension (e.g., from hot to cold) by subtracting the respective word vectors. Instead of using the raw difference vector as in²⁰, we modify their approach by normalizing the semantic axis by its L2-norm before computing the inner product, which is mathematically equivalent to using cosine similarity. We then project each color’s word embedding onto this axis, computing the cosine similarity between the color vector and the normalized axis vector. For instance, the projection of “blue” on the cold-hot dimension is given by \(\cos(\vec{\mathrm{hot}} - \vec{\mathrm{cold}},\ \vec{\mathrm{blue}})\). This provides us with a relative measure of word similarity, taken along the semantic dimension’s axis, that we can use to predict human ratings of color associations. Using semantic projection, we computed color associations encoded in 300-dimensional embeddings trained on the OpenSubtitles corpus using the fastText implementation of the skipgram algorithm²¹,²².
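The projection can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the three-dimensional vectors are invented for the example, and the helper names `cosine` and `project` are hypothetical.

```python
import math

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def project(color_vec, pole_a_vec, pole_b_vec):
    """Cosine between a color vector and the axis running from pole_b to pole_a."""
    axis = subtract(pole_a_vec, pole_b_vec)
    return cosine(axis, color_vec)

# Toy vectors, invented for illustration; real embeddings have 300 dimensions.
toy = {
    "hot":  [1.0, 0.2, 0.0],
    "cold": [-1.0, 0.1, 0.0],
    "red":  [0.8, 0.5, 0.1],
    "blue": [-0.7, 0.4, 0.2],
}

# Positive scores fall toward "hot", negative scores toward "cold".
red_score = project(toy["red"], toy["hot"], toy["cold"])
blue_score = project(toy["blue"], toy["hot"], toy["cold"])
```

Because the score is a cosine, it is bounded between -1 and 1 and does not change when the color vector is rescaled, which is what makes projections comparable across dimensions.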
Saysani and colleagues recruited 32 participants, all native speakers of English, 20 of whom had normal, trichromatic vision (mean age = 29, range = 21–35). The remaining 12 were congenitally blind with no residual experience of vision (mean age = 39.6, range = 18–69). Saysani and colleagues have made their data accessible through the Open Science Framework²³.
Participants were asked to rate each of nine color terms (red, orange, yellow, green, blue, brown, purple, black, and white) on 17 semantic dimensions, each defined by two antonyms placed at the poles of a seven-point Likert scale (happy-sad, calm-angry, submissive-aggressive, relaxed-tense, exciting-dull, selfless-jealous, active-passive, like-dislike, alive-dead, fast-slow, new-old, unripe-ripe, soft-hard, light-heavy, fresh-stale, clean-dirty, and cold-hot).
We used linear mixed-effects models²⁴ to predict participants’ color-adjective association ratings from word embedding projections. We controlled for the written frequency of the adjectives anchoring each dimension as well as their concreteness estimates. We did not have a priori predictions about their influence, but we included them as nuisance predictors. The model included participant, color, and dimension level as random intercepts. Our random effect structure balances accounting for the nested data structure and model parsimony. An alternative structure including a random intercept for each color-dimension pair yields smaller effect sizes but does not alter the main conclusions: fiction embeddings remain the best predictors of human ratings, and removing common mediators is still the most effective augmentation (discussed in SI Section 2).
We next sought to better understand where in language this information is stored. We began by replicating Experiment 1 with a larger sample of sighted participants. We also asked participants to estimate what ratings other participants were assigning to the same items. Analysis of these other-ratings is presented in SI section 4. In Experiment 2, we analyzed the color-adjective associations using projections computed from embeddings trained on different corpora of English. If some corpora allow for better prediction of human ratings, we can conclude that they contain language that conveys more human-like color-adjective associations. In Experiment 3, we then sought to determine what type of co-occurrence information is most important for being able to predict human ratings. To find out, we trained word embedding models on systematically augmented corpora (e.g. removing all sentences containing both color and adjective terms). To the extent that learning about color-adjective associations from language requires encountering certain types of sentences, the ability of a model trained on a corpus lacking those sentences to predict human ratings should decrease.
To better understand where in language the models may be learning these human-like color-adjective associations, we first examined whether word embeddings learned from different language genres differ in predicting the ratings of blind and sighted participants.
We applied the semantic projection method to 300-dimensional word embeddings trained on several English text corpora: (a) embeddings (pre)trained on Web Common Crawl and English Wikipedia using the fastText implementation of the continuous bag of words (CBOW) model²⁵, (b) embeddings pretrained on the OpenSubtitles corpus using the fastText implementation of the skipgram model²¹, and (c) embeddings trained on each section of the Corpus of Contemporary American English (COCA). We separately trained models on the fiction, news, academic texts, spoken texts, and magazine article components. We cleaned the corpora (removing extraneous whitespace, punctuation, and non-standard characters) and ran 5 iterations of the word2phrase algorithm with a decreasing threshold to convert common bigrams like “New York” into single tokens²⁶. We then ran the fastText implementation of the skipgram algorithm²² with the following parameters: minimum word occurrences = 5; minimum length of subword ngram = 3; maximum length of subword ngram = 6; sampling threshold = 0.0001; learning rate = 0.05; learning rate update rate = 100; dimensions = 300; context window size = 5; epochs = 10; negative samples = 10. As an additional point of comparison, we also elicited responses from GPT-4, a state-of-the-art large transformer language model with many orders of magnitude more parameters and training input. We prompted GPT-4 to rate colors on the same dimensions as our human participants and assessed how well GPT-4 predicts human judgments compared to word embedding models. For each color-dimension rating, we prompted GPT-4 10 times with a 0.8 temperature setting and took the average of the responses.
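For readers who wish to reproduce a comparable training run, the reported skipgram settings can be written as a configuration keyed by the argument names of `fasttext.train_unsupervised` in the fastText Python bindings. This is a sketch under assumptions: the text does not say whether the Python bindings or the command-line tool was used, and the `train_embeddings` wrapper and its `corpus_path` argument are hypothetical.

```python
# Settings reported in the text, keyed by fasttext.train_unsupervised argument names.
SKIPGRAM_PARAMS = {
    "model": "skipgram",
    "minCount": 5,        # minimum word occurrences
    "minn": 3,            # minimum length of subword n-gram
    "maxn": 6,            # maximum length of subword n-gram
    "t": 1e-4,            # sampling threshold
    "lr": 0.05,           # learning rate
    "lrUpdateRate": 100,  # learning rate update rate
    "dim": 300,           # embedding dimensions
    "ws": 5,              # context window size
    "epoch": 10,          # training epochs
    "neg": 10,            # negative samples
}

def train_embeddings(corpus_path, params=SKIPGRAM_PARAMS):
    """Train skipgram embeddings on a cleaned plain-text corpus file (hypothetical wrapper)."""
    import fasttext  # imported lazily; requires the fasttext package to be installed
    return fasttext.train_unsupervised(input=corpus_path, **params)
```

Keeping the settings in one dictionary makes it straightforward to train each corpus section with identical hyperparameters, so that differences in predictiveness can be attributed to the text rather than to the training setup.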
To complement these ratings and better approximate GPT-4’s underlying probability distributions, we also implemented a sampling-based method, where we presented GPT-4 with binary choices about colors (e.g., “Is red cold or hot?”, “Is yellow ripe or unripe?”), sampling 500 responses per color-dimension pair using a temperature setting of 1.0. By treating these responses as Bernoulli trials, we could estimate the model’s probabilistic preferences through the relative frequency of choosing each option.
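The sampling-based estimate amounts to a relative frequency over repeated binary responses. The sketch below illustrates the arithmetic with invented tallies (not GPT-4 output); the standard error is an addition of ours for illustration and is not described in the text.

```python
import math
from collections import Counter

def preference(responses, option):
    """Relative frequency of `option`, with a normal-approximation standard error."""
    counts = Counter(responses)
    n = len(responses)
    p = counts[option] / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se

# Invented tallies for "Is red cold or hot?" across 500 samples.
responses = ["hot"] * 430 + ["cold"] * 70
p_hot, se_hot = preference(responses, "hot")
```

With 500 samples per color-dimension pair, the standard error of the estimated proportion is at most about 0.022, which is reached when the two options are chosen equally often.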
We recruited 118 sighted undergraduate psychology students from the student participant pool at the University of Wisconsin-Madison.
Stimuli and procedure were identical to those used in Experiment 1, with three exceptions:
The experiment was carried out online rather than in person.
We asked 30 participants (mean age = 18.6, age range = 18–22; self-reported gender: 12 men, 18 women) to provide not only their own color-adjective association ratings, but also the color-adjective associations they expected others would provide. See SI section 4 for analysis.
We asked 88 participants to complete a survey about their reading experience, including exposure to fiction and non-fiction text and the Author Recognition Test (ART²⁷). Due to an error, demographic information was not collected during the study. However, demographic data were available for 41 participants (46.6% of the sample) from their participation in unrelated studies in the same academic term. Among these participants, the mean age was 18.4 years (age range: 18–21), with 23 self-reported women, 17 men, and 1 non-binary individual.
Our second attempt to better understand what aspects of language allow simple language models to learn human-like patterns of associations between colors and adjectives involved retraining the models on a corpus that has been augmented in various ways and tracking which augmentations cause the word embeddings to be less predictive of human ratings. Similar approaches of manipulating training data to probe model learning have been productively applied to study gender bias²⁸, word order effects²⁹, and the acquisition of rare syntactic patterns³⁰,³¹. If a particular augmentation causes a drop in the model’s predictiveness, it is likely that the information that was augmented was causally important. These analyses offer clues about the parts of language that are most informative for learning perceptual information through language alone, though there is no guarantee that the same types of sentences are most important for human learning.
We considered two broad hypotheses of what linguistic signals are most responsible for the color-adjective associations being learned by the word embedding models:
First-order co-occurrences: The models learn from encountering a color word and the adjectives that anchor each scale in the same sentence (e.g. “the fire was red hot”).
Second-order co-occurrences: The occurrence of color words and semantic dimension words in similar contexts (i.e. color words and semantic dimension words may not co-occur, but share words that they co-occur with, e.g. “Southern cooking uses green tomatoes” and “Southern cooking uses unripe tomatoes”). Words form second-order linkages with many other words (e.g. almost every word co-occurs with words like “the” and “of”, which means every word has a second-order link to every other word through that linkage), so only certain second-order co-occurrences can be informative for learning color-adjective associations. We test three specific versions of the second-order co-occurrence hypothesis:
A weak version of the semantic neighbor hypothesis: Color-adjective associations could be learned from co-occurrences between a color or adjective and semantic neighbors of adjectives and colors. For example in the sentence “The forest was white with snow”, snow is in the same semantic neighborhood as cold, which might lead to an association between white and cold. We can remove this information by removing sentences that contain either a color word and a semantic neighbor of an adjective, or an adjective and a neighbor of a color.
A strong version of the semantic neighbor hypothesis: The semantic associations of a word could also be learned from seeing how its neighbors are being used in context. This information could be removed by removing sentences that contain a semantic neighbor of color words or adjectives. For example the sentence “There was a chill in the morning” contains the word “chill” which is a semantic neighbor of “cold”.
For both the strong and weak versions of the semantic neighbor hypothesis, we restricted semantic neighborhoods to the 25 closest neighbors, i.e., the 25 words with the smallest cosine distances to the color or adjective.
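Selecting a semantic neighborhood by cosine distance can be sketched as follows. The two-dimensional vectors are invented for illustration, and `nearest_neighbors` is a hypothetical helper; the sketch excludes the target word itself from its own neighborhood, which the text does not specify.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def nearest_neighbors(target, embeddings, k=25):
    """Return the k words closest to `target` by cosine distance, excluding the target."""
    distances = [
        (cosine_distance(embeddings[target], vec), word)
        for word, vec in embeddings.items()
        if word != target
    ]
    return [word for _, word in sorted(distances)[:k]]

# Toy vectors, invented for illustration.
toy = {
    "cold":  [1.0, 0.0],
    "chill": [0.9, 0.1],
    "snow":  [0.8, 0.3],
    "fire":  [-1.0, 0.2],
}
neighbors = nearest_neighbors("cold", toy, k=2)
```

In this toy space the two nearest neighbors of “cold” are “chill” and “snow”, in that order, while “fire” points in nearly the opposite direction and is excluded.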
Mediation by psychologically salient words: It is possible that some color-adjective associations may be mediated by specific words. For example, when rating where to place yellow on the unripe-to-ripe dimension, people may think of a yellow (and therefore ripe) banana. We do not know a priori what these mediator words are, but we can elicit them from participants (see “Methods”). We then remove from the training corpus all sentences containing commonly mentioned mediators.
We recruited a secondary sample of 30 sighted participants through Amazon’s Mechanical Turk (MTurk) crowdsourcing platform (mean age = 40, age range = 22–70; self-reported gender: 16 men, 13 women, 1 non-binary). These participants did not perform the color-adjective rating task. Instead, they were presented with the color-adjective pairs and asked to generate a word that they associate with both.
To test the first-order co-occurrence hypothesis, we removed from the COCA-fiction corpus any sentence containing both a color word and one of our semantic dimension words.
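A first-order filter of this kind can be sketched as a simple set intersection per sentence. The tokenization shown here is a simplifying assumption, only a subset of the dimension words is listed for brevity, and the function names are hypothetical.

```python
COLORS = {"red", "orange", "yellow", "green", "blue", "brown", "purple", "black", "white"}
# A subset of the dimension words, for brevity.
DIMENSION_WORDS = {"hot", "cold", "ripe", "unripe", "clean", "dirty"}

def tokens(sentence):
    """Lowercase and split on whitespace after dropping periods and commas (an assumption)."""
    return set(sentence.lower().replace(".", " ").replace(",", " ").split())

def has_first_order_cooccurrence(sentence):
    words = tokens(sentence)
    return bool(words & COLORS) and bool(words & DIMENSION_WORDS)

def remove_first_order(sentences):
    return [s for s in sentences if not has_first_order_cooccurrence(s)]

corpus = [
    "The fire was red hot.",
    "Southern cooking uses green tomatoes.",
    "There was a chill in the morning.",
]
filtered = remove_first_order(corpus)
```

Only the first sentence is dropped: it contains both a color word and a dimension word, whereas the second contains a color word alone and the third contains neither.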
To test the semantic neighborhood hypothesis, we removed from the COCA-fiction corpus any sentence containing one of the 25 nearest neighbors of each semantic dimension word and each color word. The second-order co-occurrence hypothesis proved to be infeasible to test directly because the number of sentences containing shared words is vastly larger than the number of sentences containing first-order co-occurrences. Indiscriminately removing all of these shared words (and the sentences they occur in) reduces the size of the training corpus by an order of magnitude.
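The difference between the weak and strong neighbor filters can be made concrete with two predicates over a tokenized sentence. This is an illustrative sketch: the neighbor sets are invented (for instance, “ivory” as a neighbor of “white”), and the function names are hypothetical.

```python
def tokenize(sentence):
    return set(sentence.lower().strip(".").split())

def weak_match(words, colors, adjectives, color_neighbors, adjective_neighbors):
    """A color with a neighbor of an adjective, or an adjective with a neighbor of a color."""
    return bool(
        (words & colors and words & adjective_neighbors)
        or (words & adjectives and words & color_neighbors)
    )

def strong_match(words, color_neighbors, adjective_neighbors):
    """Any semantic neighbor of a color or an adjective, regardless of what else appears."""
    return bool(words & (color_neighbors | adjective_neighbors))

# Invented word sets for illustration.
colors, adjectives = {"white"}, {"cold"}
color_neighbors, adjective_neighbors = {"ivory"}, {"snow", "chill"}

s1 = tokenize("The forest was white with snow.")
s2 = tokenize("There was a chill in the morning.")

weak_s1 = weak_match(s1, colors, adjectives, color_neighbors, adjective_neighbors)
weak_s2 = weak_match(s2, colors, adjectives, color_neighbors, adjective_neighbors)
strong_s2 = strong_match(s2, color_neighbors, adjective_neighbors)
```

The first sentence matches the weak filter because it pairs a color with a neighbor of an adjective. The second matches only the strong filter, since it contains a neighbor of “cold” but no color word, which is why the strong version removes far more of the corpus.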
To test the salient word mediation hypothesis, we presented participants with color-adjective pairs and asked them to respond with a word that is well described by that color and the adjective (e.g. “What is something that is red and fast?”). Each participant completed 102 trials, rating three colors and 34 adjectives (order was blocked by color). Each color-adjective combination was labeled by 7–13 participants. We computed modal name agreement (fraction of participants providing the modal label, M = 0.20, SD = 0.12) and Simpson’s diversity (M = 0.04, SD = 0.08; for details see³²) for the mediator words participants provided for each color-adjective pair. Out of 153 color-dimension pairs, 27 do not have a mediator word for both ends of a given dimension (e.g., no agreed word for ‘passive-brown’ and ‘active-brown’). Meanwhile, 9 color-dimension pairs had the same mediator words for both ends of a dimension. For example, ‘bird’ was listed for both ‘passive-blue’ and ‘active-blue’, and ‘cat’ was listed for both ‘happy-brown’ and ‘sad-brown’. We then removed from the COCA-fiction corpus any sentence containing a mediator provided by at least two participants.
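Modal name agreement and the two-participant inclusion threshold can be sketched with a counter over the labels given for one color-adjective pair. The responses are invented, the sketch assumes one label per participant, and it applies the threshold within a single pair, which the text does not state explicitly.

```python
from collections import Counter

def modal_agreement(labels):
    """Fraction of participants who gave the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def shared_mediators(labels, min_participants=2):
    """Labels given by at least `min_participants` participants."""
    counts = Counter(labels)
    return {word for word, n in counts.items() if n >= min_participants}

# Invented responses to "What is something that is white and cold?"
labels = ["snow", "snow", "ice", "snow", "milk", "ice", "igloo", "snow"]
agreement = modal_agreement(labels)
mediators = shared_mediators(labels)
```

Here four of eight respondents gave the modal label, so agreement is 0.5, and the words retained as mediators are “snow” and “ice”.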
To test the effect of the different augmentations of the training corpus, we again use the embedding projections to predict participant ratings, but rather than collecting new semantic differential ratings, we pool the ratings from Experiments 1 and 2 (including the additional fiction exposure sample, for a total of 138 sighted participants and 12 blind participants). The effect size of the COCA-fiction projections differed somewhat across the first three experiments, appearing larger in Experiment 1 than in Experiment 2. These differences are likely due to sampling error. Pooling participant ratings provides a more robust estimate of the overall effect size (in sighted participants, at least) and the effect size for the embeddings trained on the augmented corpora.
In the final experiment, we sought to validate the key result of Experiment 3—that the linguistic information most important for learning color-adjective associations from language comes from sentences containing words mediating the relationship between a color and an adjective. To do this, we recorded which sentences moved the learned representations of colors and adjectives closer to the model’s final, post-training representation. For example, people rate “blue” as being more “cold” than “hot”. Which sentences in the model’s training are most responsible for moving “blue” and “cold” closer together?
To measure the impact each individual sentence in the training corpus had on the embedding projections, we modified the word2vec²⁶ implementation included in gensim³³. The modified word2vec implementation computes and records the embedding projections of interest after every training cycle (i.e. reading a training sentence, computing and back-propagating the error, and computing the updated embeddings). We then used the final embedding projections (after training is completed) as a reference and calculated how much each training sentence reduced the relative distance (between the previous projection and the final projection).
Because the model is trained for multiple epochs (i.e. it “sees” every training sentence multiple times) and the learning rate decays with each new sentence (see the original word2vec paper for details²⁶), an uncorrected measure of training sentence impact would simply report that the first training sentences are the most impactful (i.e. the first moves from a random initialization to some sort of ordered state would represent the largest shifts in embedding projections). We therefore corrected for the decaying learning rate by dividing the magnitude of the movement towards the final projection by the current learning rate and corrected for the lack of initial structure by only examining the final epoch, when the internal structure of the semantic space has largely stabilized and words no longer move around much inside the space unless their embedding is the one being updated.
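The learning-rate correction can be illustrated on a recorded trace of a single projection. This sketch is not the modified gensim code: it assumes the projection is a scalar logged before and after each sentence, measures distance as an absolute difference, and uses invented numbers throughout.

```python
def sentence_impacts(projections, learning_rates):
    """
    projections[i] is the projection before sentence i and projections[i + 1] the
    projection after it; the last entry is the final, post-training projection.
    Returns the reduction in distance to the final projection per unit learning rate.
    """
    final = projections[-1]
    impacts = []
    for i, lr in enumerate(learning_rates):
        before = abs(projections[i] - final)
        after = abs(projections[i + 1] - final)
        impacts.append((before - after) / lr)
    return impacts

# Invented trace of one projection across four sentence updates.
projections = [0.10, 0.14, 0.13, 0.19, 0.20]
learning_rates = [0.004, 0.003, 0.002, 0.001]
impacts = sentence_impacts(projections, learning_rates)
most_informative = max(range(len(impacts)), key=impacts.__getitem__)
```

In this trace the first and third sentences move the projection by similar raw amounts, but the third does so at half the learning rate and therefore receives the larger corrected impact. The second sentence moves the projection away from its final value and receives a negative score.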
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Using word embeddings trained on the OpenSubtitles corpus, we were able to predict color-adjective ratings (e.g., placing yellow closer to ripe than unripe) with a standardized effect size of 0.6 (95% CI = [0.51, 0.67], p < 0.001) for sighted participants and 0.21 (95% CI = [0.11, 0.29], p < 0.001) for blind participants. Although these effect sizes are substantial, they do not tell us what is driving the associations. Could it be that the color-adjective ratings we are examining depend predominantly on linguistic exposure even in sighted people? Some of the color-adjective combinations such as jealous–green are indeed purely metaphorical, with no obvious perceptual basis except for occasional drawings and cartoons inspired by the metaphor itself. However, this is unlikely to be true for the color-adjective pairs we tested as a whole. To better quantify the extent to which different color-adjective pairs derive from perceptual vs. linguistic experience, we recruited a separate group of sighted participants (n = 47, mean age = 39, age range = 23–71; self-reported gender: 19 men, 28 women) and asked them if they thought they had learned each color-adjective association (to the extent that it exists) from language or from visual experience. To respond, participants placed a dot inside a diamond shape (see Fig. S4), with the top and bottom corners representing visual and linguistic experience, the right corner representing both types of experience and the left corner representing neither. We make no claims about the psychometric properties or validity of this probe in assessing the ground truth of how participants actually learned a given association. On average, 49% of the color-adjective associations were rated as being based more on direct visual experience than on language (M = 0.50). Average ratings are shown on the x-axis of Fig. 1.
Of note, there was a small, but significant correlation between source of knowledge (as rated by sighted participants) and the difference in the strength of ratings between sighted and blind participants (Pearson r = −0.18, 95%CI = [−0.28, −0.08], p < 0.001). Sighted participants rated pairs they deemed to be learned more from visual experience (e.g. black/dirty) as more strongly associated compared to blind participants. Conversely, blind participants rated pairs deemed to be primarily learned from language (e.g., green/jealous) as more strongly associated compared to sighted participants.
This figure combines ratings from Experiments 1 and 2.
If blind people obtain their knowledge of color-adjective associations entirely from language while sighted people rely also on their direct perceptual experience, one would expect patterns of word co-occurrences to be more predictive of the responses of blind than of sighted participants. We find the opposite pattern. Where there are differences, it is sighted people’s ratings that are best predicted by patterns of word co-occurrence. One possibility is that the responses of blind participants are more varied than those of sighted people, as has been noted by⁵. Consistent with this finding, we find blind participants to have lower intraclass correlations (ICC (2,k), i.e., average-measure reliability³⁴) (ICC = 0.88, 95% CI = [0.85, 0.9]) compared to the sighted participants in Experiments 1 and 2 (ICC = 0.97, 95% CI = [0.96, 0.98]). The lower ICC among the blind has the effect of making their responses less predictable in general. Another possibility is that although language and visual experience sometimes pull in different directions (for example, language pulls green toward jealousy), as a whole learning from either input stream leads to similar representations. If true, sighted people are essentially getting more input and the greater predictability of their ratings is the result of the greater stability/convergence that comes from learning with more input.
Our sighted participants’ ratings were very similar to those in the original study¹¹: Pearson r = 0.89, 95% CI = [0.82, 0.96], p < 0.001, averaged across color/dimension pairs. There was, as before, a strong correlation between the ratings of the blind and sighted participants, Pearson r = 0.75, 95% CI = [0.66, 0.84], p < 0.001, see Fig. 2. We will use these aggregated data from this point forward.
Blind participant ratings are on the vertical axis; sighted participant ratings are on the horizontal axis. Several color-adjective associations are annotated for illustration. The diagonal represents perfect agreement between blind and sighted participants. This figure combines ratings from Experiments 1 and 2.
All the corpora we examined yielded word embeddings that were significantly predictive of the ratings made by blind and sighted participants (Fig. 3 and Table 1). The corpus that was most predictive of the human ratings (i.e. had the largest standardized effect size) was the fiction section of the Corpus of Contemporary American English (COCA-fiction) (see Fig. 4 for the Pearson correlation between the COCA-fiction embedding projections and color ratings from blind and sighted participants for each dimension). Word embedding models trained on COCA-fiction performed better than models trained on much larger corpora: COCA-fiction (at 120 million tokens) outperformed OpenSubtitles (750 million tokens) and Common Crawl (600 billion tokens), though because this Common Crawl embedding was trained with a different algorithm, this particular comparison should be taken with a grain of salt. That said, it is clear–especially when we consider the performance of the state-of-the-art GPT-4 model–that larger corpora do not necessarily lead to larger effect sizes. When we compare size-matched sections of the COCA corpus, COCA-fiction performed better than the other sections in predicting both sighted and blind ratings. The next best language model for predicting sighted people’s judgments was OpenSubtitles. GPT-4’s directly elicited ratings (rather than binary choices) were the second most predictive of blind ratings. We acknowledge that our methods of directly eliciting ratings and binary choices may not fully capture GPT-4’s latent knowledge, but given GPT-4’s closed-source nature, these methods were our best available approximations. It is important to note that absence of evidence (in this case, GPT-4’s lower performance) is not evidence of absence, and there remains a possibility that more creative prompting strategies may reveal better performance.
All models use the projection method described in Exp. 1, except GPT-4 which was prompted directly. Error bars represent 95% confidence intervals.
r_group is the mean Pearson correlation of each person with their group mean without that person in the group.
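The leave-one-out statistic can be sketched as follows. The ratings are invented, the function names are hypothetical, and the sketch assumes every participant rated every item.

```python
import math

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def r_group(ratings):
    """Mean correlation of each participant with the mean of all other participants."""
    correlations = []
    for i, person in enumerate(ratings):
        others = [r for j, r in enumerate(ratings) if j != i]
        others_mean = [sum(col) / len(col) for col in zip(*others)]
        correlations.append(pearson(person, others_mean))
    return sum(correlations) / len(correlations)

# Invented ratings: three participants by four items on a seven-point scale.
ratings = [
    [1, 3, 5, 7],
    [2, 3, 6, 6],
    [1, 4, 4, 7],
]
value = r_group(ratings)
```

Excluding each person from the group mean before correlating avoids inflating the statistic, since a participant's own ratings would otherwise contribute to both sides of the correlation.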
We also investigated the relationship between reading fiction and people’s responses on the color-adjective association task. We asked participants to fill out a survey to assess their exposure to fiction texts, including how many hours per week they spend reading fiction and nonfiction text and a series of questions meant to gauge reading motivation. Participants also completed the Author Recognition Test (ART)²⁷, a better-validated measure of print exposure (though one that is still likely too coarse for accurate estimates of fiction reading). We found no evidence that fiction exposure modulated color-adjective associations (see Fig. S4 and Table 2). This finding suggests that while fiction corpora capture systematic patterns in color-adjective associations, these associations may not be primarily learned through direct exposure to written fiction.
Figure 5 shows effect sizes when predicting sighted and blind participants’ ratings from the COCA fiction corpus augmented in various ways, showing both the results when the corpora were equated in size through downsampling (B) and when they were not (A). See also Table 3 for all statistics in Result 3. Compared to the original non-augmented corpus (standardized coefficient of 0.57, 95% CI = [0.52, 0.62], for sighted participants), removing first-order co-occurrences did not meaningfully reduce the effect size of the COCA-fiction word embedding predictions (0.51, 95% CI = [0.46, 0.55]). The weak nearest-neighbor augmentation removed sentences containing a color or adjective and a semantic neighbor. This removed about 3% of the corpus, resulting in a somewhat smaller effect size (0.45, 95% CI = [0.4, 0.5]). The strong nearest-neighbor augmentation removed any sentence containing a semantic neighbor of the colors and adjectives (removed about 58% of the corpus) did not diminish the effect size more than the weak nearest neighbor augmentation, at least for sighted participants (0.37, 95% = [0.32, 0.42]). Finally, removing mediating words (e.g., “snow” as mediating white and cold – see Table 4 for additional examples) removed about 36% of the data but led to the most significant reduction of the effect size (0.12, 95% = CI [0.09, 0.15]). To make a more balanced comparison between augmentations, we re-ran the experiment with all corpora downsampled (randomly, without replacement) to the size of the smallest modified corpus, and it did not meaningfully change our findings (Fig. 5B). The effectiveness of different manipulations shows a similar pattern for blind participants, albeit on a smaller scale. In addition, the effect of various augmentations on blind participants’ ratings decreases more gradually compared to sighted participants. We discuss this difference in detail in SI section 3.
A Corpora differ in size because the various augmentations removed different amounts of text. B All corpora were reduced to the size of the smallest corpus after perturbation to ensure corpus size did not affect results. Error bars represent 95% confidence intervals.
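The corpus manipulations described above can be illustrated with a minimal sketch. It assumes a tokenized, lowercased corpus held in memory; the helper names (`has_first_order`, `has_mediator`, `downsample`) and the toy sentences are hypothetical and are not taken from the authors' analysis code. The sketch removes sentences by rule and then equates corpus sizes by random sampling without replacement.

```python
import random

def has_first_order(tokens, colors, adjectives):
    """True if the sentence contains at least one color AND one adjective."""
    words = set(tokens)
    return bool(words & colors) and bool(words & adjectives)

def has_mediator(tokens, mediators):
    """True if the sentence contains any mediator word."""
    return bool(set(tokens) & mediators)

def downsample(sentences, n, seed=0):
    """Randomly keep n sentences, sampling without replacement."""
    return random.Random(seed).sample(sentences, n)

# Toy corpus (hypothetical sentences, already tokenized and lowercased)
corpus = [
    ["the", "red", "hot", "coals", "glowed"],
    ["snow", "covered", "the", "cold", "ground"],
    ["the", "white", "snow", "fell"],
    ["she", "opened", "the", "door"],
    ["the", "blue", "car", "stopped"],
]
colors = {"red", "white", "blue"}
adjectives = {"hot", "cold"}
mediators = {"snow"}

# Augmentation 1: drop sentences with a direct color-adjective co-occurrence
no_first_order = [s for s in corpus if not has_first_order(s, colors, adjectives)]
# Augmentation 2: drop sentences that contain a mediator word
no_mediators = [s for s in corpus if not has_mediator(s, mediators)]

# Equate corpus sizes by downsampling every variant to the smallest one
target = min(len(corpus), len(no_first_order), len(no_mediators))
equated = {
    "original": downsample(corpus, target),
    "no_first_order": downsample(no_first_order, target),
    "no_mediators": downsample(no_mediators, target),
}
```

In a real pipeline, an embedding model would then be retrained on each equated variant and its predictions compared against participants' ratings; that step is omitted here.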
The finding that removing first-order co-occurrences had no measurable effect is consistent with the learning objective of the model we used: word embedding models, specifically the skipgram models used here, are trained to predict the context a word occurs in. Strict first-order co-occurrence therefore does little to drive embedding similarity. Although first-order co-occurrence is not causally involved in the embedding models’ learning of color semantics, it may still reliably predict color-adjective associations. See SI section 1 for discussion.
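The difference between first- and second-order co-occurrence can be made concrete with a toy example. The sketch below is a simple count-based illustration, not the skipgram model used in the study, and the sentences are invented. In it, “red” and “hot” never appear in the same sentence, yet they share all of their contexts, so their context vectors are highly similar.

```python
import math
from collections import Counter, defaultdict

# Invented toy corpus: "red" and "hot" never co-occur directly
sentences = [
    ["red", "embers", "glow"],
    ["hot", "embers", "glow"],
    ["red", "flames", "rise"],
    ["hot", "flames", "rise"],
    ["blue", "ice", "melts"],
    ["cold", "ice", "melts"],
]

# For each word, count the other words that appear in the same sentence
contexts = defaultdict(Counter)
for sent in sentences:
    for i, word in enumerate(sent):
        for j, other in enumerate(sent):
            if i != j:
                contexts[word][other] += 1

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# First-order: how often "hot" appears in a sentence with "red"
first_order = contexts["red"]["hot"]
# Second-order: similarity of the contexts in which each word appears
sim_red_hot = cosine(contexts["red"], contexts["hot"])
sim_red_cold = cosine(contexts["red"], contexts["cold"])
```

Here the first-order count for “red” and “hot” is zero, while their context similarity is maximal and the similarity between “red” and “cold” is zero. Shared contexts such as “embers” play the role of the mediating words discussed in the text.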
Removal of sentences containing mediators (words that participants reported to be associated with both a color and a given adjective) was strikingly effective at eliminating much of the signal the models were using to learn human-like associations between colors and adjectives. This result is all the more striking since the number of mediators generated by at least two participants (the threshold for inclusion in our corpus filtering procedure) was just 242, fewer than one mediator per color-adjective pair, on average. The effectiveness of the mediator augmentation suggests that most of the color-adjective associations are being conveyed by these mediators. Experiment 4 was designed to further validate this conclusion.
The sentences that most informed the final embedding projections were more likely to contain the adjectives than the color words. For example, among the 1000 sentences most responsible for moving the representation of “blue” toward its final position on the hot-cold dimension, there were 447 occurrences of “cold”, 326 occurrences of “hot”, and 303 occurrences of “blue”. A possible explanation for the higher frequency of adjectives in informative sentences is that color words tend to be more frequent than the adjectives: the average Zipf frequencies³⁵ of color words and adjectives in our dataset are 5.02 and 4.66, respectively. The greater frequency of color words diffuses their importance, making their presence in a sentence less informative about specific associations than the presence of the relatively rarer adjective words.
Every sentence in the top 1000 contained at least one of these words, and only a few contained more than one. This pattern held for the top 1000 most informative sentences across all color/adjective pairs. The median number of adjectives was 611, the median number of colors was 418, and the median number of sentences in each top 1000 with neither a color nor an adjective was 0. This result provides converging evidence that the sentences that most influence the learned representations are not those containing first-order co-occurrences like “red hot coals”.
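To give a sense of what the Zipf frequencies reported above imply, the following sketch converts them to occurrences per million words. It assumes the basic definition of the Zipf scale, the base-10 logarithm of frequency per billion words (equivalently, log10 of frequency per million words plus 3), and ignores any smoothing that a particular frequency norm may apply. The function names are hypothetical.

```python
import math

def zipf_from_count(count, corpus_size):
    """Zipf value: log10 of occurrences per billion words (no smoothing)."""
    return math.log10(count / corpus_size * 1e9)

def per_million_from_zipf(zipf):
    """Invert the Zipf scale to occurrences per million words."""
    return 10 ** (zipf - 3)

# Average Zipf values reported in the text
color_pm = per_million_from_zipf(5.02)      # roughly 105 per million words
adjective_pm = per_million_from_zipf(4.66)  # roughly 46 per million words

# A difference of 0.36 Zipf units corresponds to a ratio of 10 ** 0.36
ratio = color_pm / adjective_pm             # roughly 2.3
```

Under these assumptions, the color words are on average a little more than twice as frequent as the adjectives.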
We next examined whether the most influential sentences were more likely to contain words that are psychologically salient mediators between colors and adjectives (e.g., “snow” as mediating the association between “white” and “cold”). As shown in Fig. 6, the most influential sentences were much more likely to contain words that people mentioned as being associated with both the color and the adjectives anchoring each dimension.
Note that for all colors, the relative prevalence of related mediator words increases sharply at the 99th percentile of training sentence impact. Error bars on points other than the 99th percentile are too small to be visible in this figure.
Overall, our results show that simpler language models learn color-adjective associations from second-order co-occurrence relationships, indirect relationships mediated by words that are associated with both a color and an adjective. The higher prevalence of mediator words (generated by participants for each color-adjective pair) in the most informative training sentences (see Fig. 6) suggests that our human participants are also tracking specific second-order relationships that are highly informative for learning color-adjective associations.
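The comparison of mediator prevalence between high- and low-impact sentences can be sketched as follows. The impact scores here are invented placeholders, since the impact measure itself is not reproduced in this section, and the helper names are hypothetical. The sketch ranks sentences by impact, splits off a top fraction, and compares how often each group contains a mediator word.

```python
def mediator_rate(sentences, mediators):
    """Fraction of sentences containing at least one mediator word."""
    if not sentences:
        return 0.0
    hits = sum(1 for tokens in sentences if set(tokens) & mediators)
    return hits / len(sentences)

def split_by_impact(scored, top_fraction=0.01):
    """Split (impact, tokens) pairs into the top fraction by impact and the rest."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    top = [tokens for _, tokens in ranked[:k]]
    rest = [tokens for _, tokens in ranked[k:]]
    return top, rest

# Invented (impact, sentence) pairs for the white / cold association
scored = [
    (0.90, ["white", "snow", "fell"]),
    (0.80, ["cold", "snow", "drifted"]),
    (0.10, ["the", "door", "opened"]),
    (0.05, ["white", "paper", "tore"]),
]
mediators = {"snow"}

# With only four sentences, use the top half rather than the top 1%
top, rest = split_by_impact(scored, top_fraction=0.5)
top_rate = mediator_rate(top, mediators)
rest_rate = mediator_rate(rest, mediators)
```

In this toy case every high-impact sentence contains the mediator and no low-impact sentence does; with real data the contrast would be a matter of degree.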
Some types of semantic knowledge can be learned through direct sensory experience (e.g., one can see that an elephant is large) while other types might require explicit instruction (e.g., that electrons are smaller than atoms). But some portion of our semantic knowledge may need neither direct experience nor explicit instruction, instead being learnable from the statistical structure of natural language itself. Blind people’s knowledge of color, an aspect of the world they cannot directly experience, offers a strong test of this hypothesis. While it is impossible to rule out that blind people’s knowledge of typical colors may come from direct instruction, such direct instruction is highly unlikely when it comes to color-adjective associations (how fast is yellow, how relaxed is red?). The finding that blind people’s color-adjective associations are strikingly similar to those of sighted people¹¹ suggests that they are learning this information from language itself. But how and from what kind of language?
Here, we show that the color-adjective associations captured in semantic differential tasks are present in the distributional statistics of English, as color-adjective ratings of both sighted and blind participants were predicted by patterns of word co-occurrence. Of course these patterns are only present in language because most people can see color. Once present in the language, these patterns can then convey knowledge to individuals who do not have direct experience of color. In short, language allows blind people to align their color semantics with those of sighted people. Although the alignment is not perfect, in the absence of language, such alignment would presumably not exist at all.
Color-adjective associations are recoverable from a variety of corpora, but they are best represented in a corpus of fiction. Why did the fiction corpus outperform other genres? We do not have a definitive answer, but we can offer some speculations. First, despite their size, the Common Crawl and the spoken-language transcripts (COCA-spoken) contain a large amount of formulaic language and sentence fragments that may be ineffective for learning these types of color-adjective relationships. Corpora based on news articles and academic text are much more structured, yet relatively narrow in scope. The fiction corpus has both broad semantic scope and rich, well-formed sentences. Another factor that may be responsible for the high performance of the fiction corpus is the use of idiomatic expressions such as someone turning blue (when they are cold), green (with jealousy), or red (with anger or embarrassment), which may be used less frequently or less consistently in, e.g., spoken language and academic writing.
Prior claims that language conveys only limited color knowledge (e.g.,⁹,³⁶) were based on direct, first-order co-occurrence statistics. Probing the internal state of the distributional model revealed, however, that the model instead acquires color-adjective associations by tracking second-order co-occurrence relationships: indirect color-adjective relationships mediated by a third word that is related to both the color and the adjective (e.g., “glowing” mediating the relationship between “red” and “hot”). Past attempts to have models learn color associations from these second-order co-occurrences yielded mixed results¹²,³⁷. Here, we used a vector projection method more analogous to the semantic differential task performed by the participants. While we cannot be certain that human participants track these same second-order co-occurrences, the model demonstrates the feasibility of learning color associations exclusively from this statistical information. One direction for future work is to explore whether human participants are also able to learn the semantics of novel words purely from higher-order co-occurrence relationships in language.
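The idea behind the vector projection method can be illustrated with a short sketch. It assumes the simplest form of semantic projection: a semantic dimension is defined as the difference between the embeddings of two anchor adjectives, and a color word is scored by its scalar projection onto that axis. The three-dimensional vectors are invented for illustration, and details of the procedure actually used in the study (such as normalization or the choice of anchors) are not reproduced here.

```python
import math

def subtract(u, v):
    """Element-wise difference of two vectors."""
    return [a - b for a, b in zip(u, v)]

def dot(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def project(word_vec, pole_a, pole_b):
    """Scalar projection of word_vec onto the axis running from pole_b to pole_a."""
    axis = subtract(pole_a, pole_b)
    return dot(word_vec, axis) / math.sqrt(dot(axis, axis))

# Invented 3-dimensional embeddings; real embeddings have hundreds of dimensions
vectors = {
    "hot":  [1.0, 0.0, 0.2],
    "cold": [-1.0, 0.0, 0.2],
    "red":  [0.6, 0.5, 0.1],
    "blue": [-0.4, 0.7, 0.1],
}

# Positive scores fall toward "hot", negative scores toward "cold"
red_score = project(vectors["red"], vectors["hot"], vectors["cold"])
blue_score = project(vectors["blue"], vectors["hot"], vectors["cold"])
```

With these toy vectors, “red” receives a positive score and “blue” a negative one, mirroring the way a participant would place each color on a hot-cold rating scale in the semantic differential task.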
This distinction between first- and higher-order co-occurrence relationships is critical when we consider learning from language statistics, because first-order co-occurrences are often misleading. It is generally pragmatically useful to mention an object’s color (or any other property) when it is atypical, not when it is expected. People remark on the white whale or the black sheep because these colors are unusual, making reliance on such first-order associations potentially misleading³⁶. Arguing against the possibility of people learning from first-order co-occurrences, Kim et al.³,⁹ suggested that blind people’s judgments on tasks probing visual knowledge require making explicit inferences. However, word embedding models, which do not have any specialized mechanisms for making inferences, are nevertheless able to learn the type of word associations necessary to predict blind people’s judgments.
Blind people may have color knowledge that substantially overlaps with that of sighted people, but they do not necessarily use or weigh this knowledge in the same way. A repeated finding is that blind participants discount or fail to consider visual information when it is not directly required to complete a task⁹,¹⁰. This is consistent with our finding in Experiment 1 that blind participants rated more “visual” associations as weaker relative to sighted people (who conversely rated more “linguistic” associations as relatively weaker) and helps explain some of the variation between sighted and blind participants.
This work has several important limitations. First, although our results demonstrate that color-adjective associations can be learned from language statistics, not all semantic dimensions showed equal predictability. This variation remains under-explored. It may stem from differences in how concrete the adjective pairs are, how closely they relate to perceptual experience, or how consistently they appear in linguistic contexts. Investigating these factors can shed light on the ways in which different aspects of semantic knowledge are encoded in language.
Second, while we identified mediator words that help bridge color and adjective concepts, these mediators were elicited only from sighted participants. We also observed that removing these mediators led to the largest performance drop specifically in modeling sighted participants’ responses. Because sighted people have access to visual knowledge, we cannot conclude that such mediator-based associations are learned from language. A more rigorous approach to testing language’s causal role would involve eliciting mediator words from blind individuals and examining whether removing these mediators similarly reduces the model’s performance for blind participants.
How can people with no direct experience of the visual world come to possess rich visual knowledge³,⁷,⁹? Our findings suggest that the answer is language, specifically its distributional structure, in which a rich repository of information about the perceptual world is embedded. This information can be accessed through associative learning mechanisms. In addition to helping explain the similarities in responses between blind and sighted people, our results suggest that the semantic alignment between sighted people may itself stem in part from learning from the distributional structure of natural language.
The data used for the analyses reported in this paper are available as an OSF repository at https://osf.io/namqg.
The analysis code is available as an OSF repository at https://osf.io/namqg.
Hume, D. A treatise of human nature (John Noon, London, 1740).
Locke, J. An essay concerning human understanding (Thomas Bassett, London, 1690).
Kim, J. S., Elli, G. V. & Bedny, M. Knowledge of animal appearance among sighted and blind adults. Proc. Natl. Acad. Sci. 116, 11213–11222 (2019).
Marmor, G. S. Age at onset of blindness and the development of the semantics of color names. J. Exp. Child Psychol. 25, 267–278 (1978).
Saysani, A., Corballis, M. C. & Corballis, P. M. Colour envisioned: Concepts of colour in the blind and sighted. Vis. Cognition 26, 382–392 (2018).
Shepard, R. N. & Cooper, L. A. Representation of colors in the blind, color-blind, and normally sighted. Psychol. Sci. 3, 97–104 (1992).
Bedny, M., Koster-Hale, J., Elli, G., Yazzolino, L. & Saxe, R. There’s more to “sparkle” than meets the eye: Knowledge of vision and light verbs among congenitally blind and sighted individuals. Cognition 189, 105–115 (2019).
Lenci, A., Baroni, M., Cazzolli, G. & Marotta, G. Blind: A set of semantic feature norms from the congenitally blind. Behav. Res. Methods 45, 1218–1233 (2013).
Kim, J. S., Aheimer, B., Manrara, V. M. & Bedny, M. Shared understanding of color among sighted and blind adults. Proc. Natl. Acad. Sci. 118, e2020192118 (2021).
Connolly, A. C., Gleitman, L. R. & Thompson-Schill, S. L. Effect of congenital blindness on the semantic representation of some everyday concepts. Proc. Natl Acad. Sci. 104, 8241–8246 (2007).
Saysani, A., Corballis, M. C. & Corballis, P. M. Seeing colour through language: Colour knowledge in the blind and sighted. Visual Cognition 29, 63–71 (2021).
Lewis, M., Zettersten, M. & Lupyan, G. Distributional semantics as a source of visual knowledge. Proc. Natl. Acad. Sci. 116, 19237–19238 (2019).
Lupyan, G. & Zettersten, M. Does vocabulary help structure the mind? (Minnesota Symposia on Child Psychology: Human Communication: Origins, Mechanisms, and Functions, 2021).
Firth, J. R. A synopsis of linguistic theory, 1930-1955. Studies in Linguistic Analysis 1–32 (1957).
Lenci, A. Distributional models of word meaning. Annu. Rev. Linguist. 4, 151–171 (2018).
Günther, F., Rinaldi, L. & Marelli, M. Vector-space models of semantic representation from a cognitive perspective: A discussion of common misconceptions. Perspect. Psychological Sci. 14, 1006–1033 (2019).
Boleda, G. Distributional semantics and linguistic theory. Annu. Rev. Linguist. http://arxiv.org/abs/1905.01896 (2020).
Andrews, M., Vigliocco, G. & Vinson, D. P. Integrating experiential and distributional data to learn semantic representations. Psychol. Rev. 116, 463–498 (2009).
Lazaridou, A., Marelli, M. & Baroni, M. Multimodal word meaning induction from minimal exposure to natural text. Cogn. Sci. 41 Suppl 4, 677–705 (2017).
Grand, G., Blank, I. A., Pereira, F. & Fedorenko, E. Semantic projection recovers rich human knowledge of multiple object features from word embeddings. Nat. Human Behav. 6, 975–987 (2022).
Van Paridon, J. & Thompson, B. subs2vec: Word embeddings from subtitles in 55 languages. Behav. Res. Methods 53, 1–27 (2020).
Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017).
Corballis, P. M., Saysani, A. & Corballis, M. Seeing colour through language: colour knowledge in the blind and sighted osf.io/fz2at (2020).
Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1–48 (2015).
Grave, E., Bojanowski, P., Gupta, P., Joulin, A. & Mikolov, T. Learning word vectors for 157 languages. arXiv [Preprint] https://arxiv.org/abs/1802.06893 (2018).
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. arXiv [Preprint] https://arxiv.org/abs/1301.3781 (2013).
Acheson, D. J., Wells, J. B. & MacDonald, M. C. New and updated tests of print exposure and reading abilities in college students. Behav. Res. Methods 40, 278–289 (2008).
Maudslay, R. H., Gonen, H., Cotterell, R. & Teufel, S. It’s all in the name: Mitigating gender bias with name-based counterfactual data substitution. arXiv preprint arXiv:1909.00871 (2019).
Sinha, K. et al. Masked language modeling and the distributional hypothesis: Order word matters pre-training for little. arXiv preprint arXiv:2104.06644 (2021).
Kallini, J., Papadimitriou, I., Futrell, R., Mahowald, K. & Potts, C. Mission: Impossible Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Vol. 1, 14691–14714 (2024).
Misra, K. & Mahowald, K. Language models learn rare phenomena from less rare phenomena: The case of the missing aanns. arXiv preprint arXiv:2403.19827 (2024).
Simpson, E. H. Measurement of diversity. Nature 163, 688–688 (1949).
Řehůřek, R. & Sojka, P. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, http://is.muni.cz/publication/884893/en (2010).
Koo, T. K. & Li, M. Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15, 155–163 (2016).
Brysbaert, M., Mandera, P. & Keuleers, E. The word frequency effect in word processing: An updated review. Curr. Direct. Psychol. Sci. 27, 45–50 (2018).
Ostarek, M., Van Paridon, J. & Montero-Melis, G. Sighted people’s language is not helpful for blind individuals’ acquisition of typical animal colors. Proc. Natl. Acad. Sci. 116, 21972–21973 (2019).
Bergey, C., Morris, B. C. & Yurovsky, D. Children hear more about what is atypical than what is typical. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 42 (2020).
This research was supported by NSF BCS grant 2020969, awarded to Gary Lupyan. This publication was supported by the Princeton University Library Open Access Fund, awarded to Qiawen Liu. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. We would like to thank Armin Saysani and Michael and Paul Corballis for making available the semantic differential data we included in our analysis of Experiments 1–3.
These authors contributed equally: Qiawen Liu, Jeroen van Paridon.
Department of Psychology, University of Wisconsin-Madison, Madison, WI, 53706, USA
Qiawen Liu, Jeroen van Paridon & Gary Lupyan
Department of Computer Science, Princeton University, Princeton, NJ, 08540, USA
Qiawen Liu
J.v.P., Q.L. and G.L. conceptualized study; J.v.P., Q.L. and G.L. designed experiments; Q.L. and J.v.P. collected data; J.v.P. and Q.L. developed novel methods and analyzed data; J.v.P., Q.L. and G.L. wrote manuscript.
Correspondence to Qiawen Liu.
The authors declare no competing interests.
Communications Psychology thanks Katherine Erk and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Marike Schiffer. A peer review file is available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Liu, Q., van Paridon, J. & Lupyan, G. Learning about color from language. Commun Psychol 3, 60 (2025). https://doi.org/10.1038/s44271-025-00230-9
Received: 23 January 2023
Accepted: 12 March 2025
Published: 14 April 2025
Version of record: 14 April 2025
DOI: https://doi.org/10.1038/s44271-025-00230-9
Communications Psychology (Commun Psychol)
ISSN 2731-9121 (online)
© 2026 Springer Nature Limited