INDEX
Explanations
words related to relationships between different variables or concepts
phrases that indicate a connection or correlation between concepts
Negative Logits
OGR       -0.79
tsky      -0.78
nell      -0.74
stal      -0.72
entimes   -0.72
kl        -0.72
fl        -0.71
di        -0.70
cca       -0.70
gn        -0.69
Positive Logits
sexes         0.77
disparate     0.70
observable    0.66
scarce        0.63
genders       0.62
criminality   0.62
two           0.61
ãĤ¬           0.61
between       0.61
urst          0.59
Activations Density: 0.030%