INDEX
Explanations
references to various groups or categories of people and their characteristics
Negative Logits
individuals    -0.70
människor      -0.66
mensen         -0.66
people         -0.65
individuals    -0.63
eseorang       -0.61
itself         -0.61
Itself         -0.59
itself         -0.59
somebody       -0.59
Positive Logits
whom           0.98
whom           0.77
whose          0.63
who            0.58
whose          0.57
opposing       0.54
Whom           0.54
who            0.52
professions    0.52
Whom           0.51
Activations Density 0.690%