Yesterday The Wall Street Journal broke news of an informal study run by a Facebook engineer, which concluded that code submitted by women engineers at the company was rejected 35% more often than code submitted by male engineers. Facebook spokespeople responded that the original study and the data it relied on were “incomplete,” and that their own follow-up study, which attempted to account for differences in how men and women are distributed across job levels, failed to find evidence of bias.

Uncovering gender bias in research

The effort required to conclusively establish or refute the presence of gender bias inside a company is immense. Arguably the cleanest way to isolate gender bias has been in lab experiments run by academics looking at differences in the evaluation of work product (particularly in male-dominated fields). Essentially, research participants are asked to evaluate individuals (from material on paper, online, or on film) who are identical except for their gender. Differences in the evaluations of these fictitious candidates, which scientists have documented repeatedly, can therefore be attributed to gender bias (see here and here).

Outside the lab, in the messy real world, it’s close to impossible to find identical work produced by men and women in order to expose the presence of gender bias. Some researchers in sociology and economics have overcome this problem through audit studies, which send fake resumes to real employers and measure responses (examples here, here, here, and here). Others have explored situations where real individuals are evaluated in the absence of information about their gender. Perhaps the most famous example is the placement of screens obscuring the gender of orchestra musicians. Researchers found that women were evaluated more favorably with the screen than without it, a difference they attributed to gender bias.

Isolating bias within companies

On a shorter timeline and with real company data, it’s rare to have the opportunity to isolate the effects of gender bias as precisely as this academic research has. Anyone who has managed people knows that work product is complex and difficult to compare across individuals. Given the sheer amount of effort it would take to identify code that is, if not identical, then at least comparable, neither Facebook study appears to have been able to test what happens to a woman who submits almost identical code to that of her male co-worker. So we observers are left with a question that is unanswerable based on current information: were the women engineers at Facebook subject to gender bias in the evaluation of their code, or were they producing lower quality work?

Given the research that documents bias in similar settings (and the stories we regularly hear from real women), it’s hard to hear about the results from this internal study and not conclude they result from gender bias. Still, let’s just assume, for the sake of argument, that women engineers in this study on average produced lower quality code, and that every piece of code that got rejected or accepted truly deserved it. (Suspend your disbelief for a moment and bear with me.) If that were true, what would it mean?

A host of alternative explanations

One easy way to explain gender differences in code quality, and the explanation that Facebook itself is using as it continues to comment on this story, is that women at Facebook are disproportionately junior (with respect to job level, not tenure). Facebook argued that adding job level as a control variable in their analysis wiped away the effect of gender on code rejections, and that this can be explained by the fact that women engineers are more likely to be junior. As a crucial aside, there are reasons to be concerned about any conclusion that rests on adding a control that may be strongly correlated with the main variable of interest (in this case, potentially job level with gender; see here, here, and here).
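To make the concern about correlated controls concrete, here is a minimal sketch in Python. Every number in it is a made-up assumption chosen for illustration; none of it comes from Facebook or from either study. In this toy world reviewers are perfectly unbiased and rejection depends only on job level, yet a raw gender gap still appears because women are concentrated at the junior level.

```python
# Hypothetical illustration only: all shares and rates below are invented.

# Assumed share of each gender at each job level (women skew junior).
level_share = {
    "women": {"junior": 0.8, "senior": 0.2},
    "men": {"junior": 0.4, "senior": 0.6},
}

# Assumed rejection rate per level. It is the same for both genders,
# so the within-level gender gap is zero by construction.
reject_rate = {"junior": 0.30, "senior": 0.10}


def overall_rejection(gender):
    """Overall rejection rate: level-specific rates weighted by level share."""
    return sum(
        share * reject_rate[level]
        for level, share in level_share[gender].items()
    )


women_rate = overall_rejection("women")
men_rate = overall_rejection("men")

# Relative gap before controlling for job level.
raw_relative_gap = women_rate / men_rate - 1

print(f"women: {women_rate:.2f}, men: {men_rate:.2f}")
print(f"raw relative gap: {raw_relative_gap:.0%}")
```

Comparing within a level makes the gap disappear, which is what "controlling for job level" does. But the control only relocates the question: the sketch says nothing about why the level shares differ by gender in the first place, and if that difference is itself a product of bias, the control hides it rather than rules it out.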

If women engineers are in fact more junior at Facebook, the question becomes: why is that so? Facebook may not be willing to hire or invest in senior women, or may be unable to retain women as they rise. Perhaps the company fails to advance women, leaving them without the opportunities to improve their skillset that are available to men.

Or perhaps job level is not as good an indicator of code quality as Facebook says (see the concern with this type of control above). Maybe what’s driving differences in code quality is instead a lack of access to the feedback and mentorship necessary to grow, or more limited access to the type of insider information that could help women improve their output. Or perhaps women are pushed toward different types of projects than men are, projects that are, for whatever reason, more subject to code rejections.

All of this is possible even in the absence of gender bias in the evaluation of code submissions. And all of these possibilities are cause for alarm.

Asking different questions

Establishing definitive proof of bias in real world situations like this is difficult. The loss of statistical significance doesn’t prove beyond doubt that gender bias was absent, any more than the presence of statistical significance would prove that it was present; statistical significance can’t account for the uncertainty that comes from improper measurement or unsophisticated study design. If companies want to find bias, they will, and if they don’t, they won’t (see here). Obsessing over this question alone (is bias driving this particular outcome or isn’t it?) can obscure some of the broader questions leaders should be asking: Are our hiring processes fair? What about promotion? Distribution of assignments? How do we know? What data are we relying on to make these decisions, and are these the right data? How do our employees feel? Do they believe they can succeed here? Why or why not? Why do people leave? Why do they stay? How do we help them grow? If employees have ideas for how we can improve, who do they share them with? Do they feel they can share without negative repercussions? Are we asking the hard questions or the easy ones? Do we like all of our answers? (Just to list a few.)
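On the narrower point about statistical significance, a small sketch shows why a non-significant result is weak evidence that no gap exists. It uses a standard two-proportion z-test written with Python's standard library; the counts are hypothetical and are not drawn from either Facebook study. The same gap in rejection rates (26% versus 18%) is tested once on a small sample and once on a large one.

```python
import math


def two_proportion_p_value(rej_a, n_a, rej_b, n_b):
    """Two-sided p-value for a difference in rejection rates (pooled z-test)."""
    p_a, p_b = rej_a / n_a, rej_b / n_b
    pooled = (rej_a + rej_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # For a standard normal, the two-sided tail probability is erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: identical rates (26% vs. 18%), different sample sizes.
small_p = two_proportion_p_value(13, 50, 9, 50)
large_p = two_proportion_p_value(1300, 5000, 900, 5000)

print(f"small sample p-value: {small_p:.3f}")
print(f"large sample p-value: {large_p:.3g}")
```

With 50 submissions per group the gap does not reach the conventional 0.05 threshold; with 5,000 per group it does, although the underlying rates are identical. Significance reflects how much data was collected and how it was sliced as much as it reflects whether a gap exists, and neither result says anything about whether the measurement itself was sound.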

Asking and acting on these questions, rather than pointing to a loss of statistical significance after adding a control variable, is the better test of a company’s efforts to eradicate gender bias.