INDEX
Explanations
references to specific proper nouns, particularly names associated with crime or notable individuals
Negative Logits
SIGN        -0.72
iments      -0.71
undai       -0.69
poons       -0.69
emade       -0.65
disadvant   -0.65
ombs        -0.64
ottest      -0.64
igree       -0.64
aston       -0.63
Positive Logits
brush       1.47
grou        1.03
cliffe      0.90
sonian      0.87
lings       0.87
sage        0.84
ful         0.80
Advice      0.79
vana        0.79
fulness     0.79
Activations Density 0.006%