Gender and culture bias in letters of recommendation for computer science and data science masters programs

In this section we analyze the LOR features across our study groups. We first compare the LORs by applicant gender and identify the statistically significant features among the variables outlined in Table 1, and we examine whether any observed differences hold for both male and female recommenders. We next conduct a parallel analysis in which the roles are swapped: we compare the LORs by recommender gender while controlling for applicant gender. Lastly, we present the findings across three culture groups (i.e., US, China, and India).

Comparing LORs by applicant gender while controlling for recommender gender

Table 3 summarizes the differences between LORs written for female and male applicants, with a further breakdown by recommender gender. The results include only those features for which the differences by applicant gender are statistically significant at the 95% confidence level.

Without considering recommender gender, the results for “All Recommenders” show six features with statistically significant differences between the male and female applicant groups. Female applicants receive longer letters (4%) with higher positivity (11%) and specificity (7%), and with less anger (\(-2\)%). However, the tones of these letters are less polite (\(-5\)%) and sadder (4%).

We are interested in learning whether these differences originate from the male or the female recommenders. To determine this, we split the study samples by recommender gender. The results in the corresponding “Female Recommenders” and “Male Recommenders” blocks show that both female and male recommenders provide more specific and more positive letters for female applicants. However, this difference is more pronounced among female recommenders, who show 16% higher positivity and 12% higher specificity scores for female applicants than for male applicants, while the respective differences for male recommenders are only 10% and 6%. Since positivity and specificity are among the key features influencing the overall recommendation, our findings suggest that female applicants receive stronger letters than their male counterparts, especially from female recommenders. Although not directly comparable, our results agree with other studies that show women often receive more favorable LORs [15].
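The stratification step described above can be illustrated with a minimal sketch. The records, the score values, and the function name `applicant_gap_by_recommender` are hypothetical and chosen only for illustration; they are not taken from the study data, and the sketch computes only a raw difference in means rather than the significance tests used in the study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (recommender gender, applicant gender, positivity score).
# These values are invented for illustration and are NOT from the study.
records = [
    ("F", "F", 0.80), ("F", "F", 0.90), ("F", "M", 0.70), ("F", "M", 0.60),
    ("M", "F", 0.75), ("M", "F", 0.65), ("M", "M", 0.60), ("M", "M", 0.70),
]


def applicant_gap_by_recommender(rows):
    """Return {recommender gender: mean(female applicants) - mean(male applicants)}."""
    scores = defaultdict(lambda: defaultdict(list))
    for recommender, applicant, value in rows:
        scores[recommender][applicant].append(value)
    # Compare applicant groups separately within each recommender stratum.
    return {
        recommender: mean(by_applicant["F"]) - mean(by_applicant["M"])
        for recommender, by_applicant in scores.items()
    }


gaps = applicant_gap_by_recommender(records)
for recommender, gap in sorted(gaps.items()):
    print(f"{recommender} recommenders: female-minus-male gap = {gap:+.2f}")
```

In this toy data the gap is larger within the female-recommender stratum, which mirrors the kind of pattern the stratified blocks of Table 3 are designed to reveal.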

Another notable observation is that female applicants tend to receive 7% longer letters than male applicants from female recommenders, a phenomenon that is not observed among male recommenders. This length difference is particularly notable because prior research has demonstrated that LOR length is positively related to its favorability [29] and psychological experiments have shown that longer LORs are written when the object of the letter is viewed more favorably [30]. Interestingly, a study of LORs written by school counselors for college applicants [31] shows that female recommenders wrote longer letters than their male counterparts for male applicants, while both groups wrote similar length LORs for female applicants, thus showing a different gender bias compared to our study. Our findings also indicate that the LORs for female applicants are associated with a less polite tone, observed across all three control groups in Table 3. Prior studies have not focused on politeness in writing styles and we feel this pattern is worthy of further research.

Table 3 Comparison of LOR features by applicant gender while controlling for recommender gender.
Table 4 Comparison of LOR features by recommender gender while controlling for applicant gender.
Table 5 LOR feature comparison between applicants from the United States (US), China (CH), and India (IN).

Comparing LORs by recommender gender while controlling for applicant gender

This section investigates the differences between the LORs written by female and male recommenders. The results, presented in Table 4, include only those features for which the differences by recommender gender are statistically significant at the 95% confidence level.

Without considering applicant gender, the results for “All Applicants” show that ten features exhibit statistically significant differences between the female and male recommenders. Compared to the letters from male recommenders, LORs from female recommenders are longer (\(5\%\)), more positive (\(6\%\)), and more specific (\(6\%\)), and their emotional content has less anger (\(-2\%\)), less fear (\(-2\%\)), more joy (\(1\%\)), and less sadness (\(-1\%\)). Female recommenders also tend to write with a less sad (\(-1\%\)) and more excited (\(7\%\)) tone, and their overall sentiment score is 3% higher than that of their male counterparts.

When we limit the applicants to “Female Applicants,” we observe that five out of the ten features from the “All Applicants” block remained statistically significant, i.e., length (\(8\%\)), positivity (\(8\%\)), specificity (\(8\%\)), anger emotion (\(-3\%\)), together with excited (\(9\%\)) and sad (\(-7\%\)) tones. Furthermore, the differences in these features are more pronounced than those in the aggregated group. On the other hand, if we limit the applicant pool to “Male Applicants,” then there are only marginal differences in fear (\(-2\%\)) and joy (\(1\%\)) emotions, but more substantial differences in excited (\(6\%\)) and sad (\(-12\%\)) tones, and overall sentiment (\(4\%\)).

A closer examination of the two sets of variables in the female and male applicant blocks reveals that all recommenders used a more excited and less sad tone for female applicants. However, the two recommender groups show unique differences in specificity, positivity, and LOR length. In particular, only female recommenders provided higher scores to female applicants, and this discrepancy is observed across all three categories. This finding that female recommenders write stronger letters for female applicants than their male counterparts is consistent with the results presented in the previous section. Lastly, female recommenders showed more positive sentiment (4%) than male recommenders for the male applicants; however, this pattern is not observed for the female applicant group.

Comparing LORs across culture groups

This section analyzes the differences in LOR characteristics across applicants from the United States (US), China (CH), and India (IN), which collectively cover most of our applicants. No other countries contribute a significant number of applicants.

We first identify variables whose group means are statistically different across the country-specific groups. To this end, we applied a one-way Analysis of Variance (ANOVA) test [32] to each variable, which is an extension of Student’s t-test to more than two study groups. Specifically, ANOVA compares the means among the groups and determines whether any of those means are statistically significantly different from each other. Formally, for our application, it tests the null hypothesis:

$$\begin{aligned} H_0: \mu^i_{US} = \mu^i_{CH} = \mu^i_{IN} \end{aligned}$$

where \(\mu^i_{US}\), \(\mu^i_{CH}\), and \(\mu^i_{IN}\) denote the average value of variable \(x_i\) over applicants in the US, China, and India subgroups, respectively. We also applied Bartlett’s test [33] to check the homogeneity-of-variances assumption of the ANOVA analysis. Since a statistically significant ANOVA result only indicates that at least one group differs from the others, we next performed the Tukey HSD (“Honestly Significant Difference”) post-hoc test [34] to identify where the significant differences lie among our culture groups.
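To make concrete what the one-way ANOVA described above compares, the following minimal sketch computes the F statistic by hand as the ratio of between-group to within-group variance. The function name `one_way_anova_f` and the three toy samples are hypothetical and are not from the study; the sketch stops at the F statistic, since the p-value would come from the F distribution with \((k-1, N-k)\) degrees of freedom, which a statistics library would normally supply.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy values for one variable x_i in three groups (NOT study data).
us = [1.0, 2.0, 3.0]
ch = [2.0, 3.0, 4.0]
india = [5.0, 6.0, 7.0]

f_stat, df_b, df_w = one_way_anova_f([us, ch, india])
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

A large F value indicates that the group means differ by more than the within-group spread would suggest, which is the evidence against \(H_0\); it does not say which group differs, hence the post-hoc test.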

Figure 1 Characteristic features across country groups.

Table 5 presents the statistically significant features (Column 1), thresholded at the 95% confidence level for the ANOVA, Bartlett’s, and Tukey HSD tests. The group means and p-values for each variable are displayed in columns 2–4 and 5–9, respectively. The last three columns present the pairwise p-values generated by the Tukey HSD test, from which we can infer the country group with a distinctive mean. For example, the pairwise p-values for specificity for “All Applicants” are 0.001, 0.001, and 0.8487 for US vs. CH, US vs. IN, and CH vs. IN, respectively. Consequently, this feature is significant for US vs. CH and US vs. IN, but not statistically significant for CH vs. IN, suggesting that specificity is a distinctive feature for the US group. Figure 1 shows the inter-country comparison for a selected set of features for the “All Applicants” group, with three scales for clarity.
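The reading of the pairwise p-values described above can be expressed as a small rule: a group is distinctive when it differs significantly from both other groups while those two do not differ from each other. The sketch below applies that rule to the specificity p-values quoted in the text; the function name `distinctive_group` and the 0.05 threshold constant are illustrative assumptions matching the stated 95% level.

```python
ALPHA = 0.05  # significance threshold corresponding to the 95% level


def distinctive_group(pairwise_p, alpha=ALPHA):
    """Return the group that differs from both others while those two do not
    differ from each other, or None if no such group exists."""
    groups = sorted({g for pair in pairwise_p for g in pair})
    for candidate in groups:
        own = [p for pair, p in pairwise_p.items() if candidate in pair]
        rest = [p for pair, p in pairwise_p.items() if candidate not in pair]
        if all(p < alpha for p in own) and all(p >= alpha for p in rest):
            return candidate
    return None


# Pairwise Tukey HSD p-values for specificity ("All Applicants"), as quoted in the text.
specificity_p = {("US", "CH"): 0.001, ("US", "IN"): 0.001, ("CH", "IN"): 0.8487}
result = distinctive_group(specificity_p)
print(result)
```

When all three pairwise p-values fall below the threshold, the function returns `None`, which corresponds to a feature that separates all country groups rather than singling out one of them.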

An analysis of the results for the “All Applicants” block in Table 5 yields the following key observations:

  • The LORs for the US group are distinguished by low specificity and positivity.

  • The Chinese group stands out for its low impolite score for the underlying tone.

  • The India group is marked by more positive sentiment, lower sadness and disgust emotions, and a more excited tone.

  • Three variables (length, joy, fear) are statistically different for all country groups.

  • The US group has the shortest LORs (2163 characters on average), followed by China (2456) and India (2715). Not only are all differences statistically significant, but they are also substantial: US vs. CH (12.7%), US vs. IN (22.6%), and CH vs. IN (10.0%).
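The percentage differences in LOR length quoted in the last observation can be checked with a short calculation. The text does not state which formula was used; the sketch below assumes a symmetric percent difference (the absolute difference divided by the mean of the two values), which is an inference that happens to reproduce all three quoted figures. The function name `symmetric_pct_diff` is illustrative.

```python
def symmetric_pct_diff(a, b):
    """Absolute difference relative to the mean of the two values, in percent.

    Assumed formula: it is not stated in the text, but it matches the quoted numbers.
    """
    return abs(a - b) / ((a + b) / 2) * 100


# Mean LOR lengths (characters) for the "All Applicants" group, as quoted in the text.
mean_length = {"US": 2163, "CH": 2456, "IN": 2715}

pairs = [("US", "CH"), ("US", "IN"), ("CH", "IN")]
diffs = {
    pair: round(symmetric_pct_diff(mean_length[pair[0]], mean_length[pair[1]]), 1)
    for pair in pairs
}

for (a, b), pct in diffs.items():
    print(f"{a} vs. {b}: {pct}%")
```

Because this measure is symmetric, it does not depend on which country is treated as the baseline, unlike a relative change computed against one group's mean.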

Our findings for LOR length are consistent with another study of a graduate STEM population, which showed that postdoctoral geoscience fellowship LORs originating from US recommenders were statistically significantly shorter than LORs from all other regions [1]. Interestingly, the pattern we found for LOR length is replicated for positivity: India has the highest values, followed by China, and the US has the lowest values for both LOR length and positivity.

There is a general lack of prior cross-cultural results for the other LOR characteristics, but the low impoliteness score for China conforms to the more general observation of Chinese linguistic politeness, which is a result of Chinese tradition and the teachings of Confucianism that emphasize propriety and humility [35]. Overall, the US group exhibits the highest “impolite” score, with the India group a close second. One might expect a reverse pattern for politeness, but this is not observed, as the US and China groups both have relatively high scores while India has a significantly lower politeness score. This discrepancy can be explained by the NLU definition of “polite,” listed in the Methods section as “displaying rational, goal-oriented behavior”; in this case “polite” and “impolite” are not antonyms.

We next delve deeper into the variations within each gender group (“Female Applicants” and “Male Applicants”) across the three countries. Our results indicate that most observations from the previous analysis hold for both gender groups, although there are some exceptions. Additionally, in certain instances, these observations are more pronounced for one gender group. For example, the US group has the lowest specificity for both gender groups, but the difference between the US group and the China and India groups is more pronounced for the male applicants than for the female applicants. This pattern continues for positivity, with statistically significant results observed only for male applicants. Furthermore, the low impolite score previously found for the China group is only statistically significant for the male applicants. The high positive sentiment previously observed for the India group holds for both applicant genders, and the short LORs found for the US group also hold for both genders.

By comparing the gender-based differences across countries, we observe that the impact of gender is not always consistent. One notable difference is in LOR length, which we believe is one of the most important LOR features. Specifically, in the US and China groups, male applicants tend to receive slightly shorter LORs compared to their female counterparts. In contrast, for the India group, male applicants tend to receive significantly longer LORs, with an average of 2928 characters compared to 2419 characters for female applicants. A similar pattern is observed for specificity, where the male applicants in the US and China groups have lower LOR specificity values than their female counterparts. Conversely, in the India group, the male applicants receive higher specificity scores (1.8539 vs. 1.7031). Thus, for both features, the US and China groups assign the higher values to the female applicants, while the India group assigns the higher values to the male applicants. These findings highlight the varying impact of gender across different countries, which is consistent with the results from the Women’s Workplace Equity Index [36]. This index measures gender inequities on equality under the law in seven categories. Under the “finding a job” category, which is most relevant to this study, the US and China have similar scores, 57.1 and 53.3, respectively, while India has a much lower score of 37.5 [36]. These gender differences across countries are worthy of additional study.