INDEX
Explanations
specific mentions of the word "gorillas"
references to gorillas
Negative Logits
Hath          -0.76
çĦ            -0.72
pree          -0.68
NC            -0.65
ALE           -0.64
Adv           -0.64
Disability    -0.63
ŀ             -0.63
nder          -0.62
tie           -0.62
Positive Logits
illas         1.48
terday        1.09
unta          0.90
cules         0.88
xon           0.83
uca           0.83
ques          0.83
ervatives     0.82
ervative      0.79
emonium       0.78
Activations Density: 0.006%