INDEX
Explanations
names of political or public figures
references to specific individuals or names
Negative Logits
Drop        -0.67
BW          -0.65
Slaughter   -0.64
Ther        -0.64
Rath        -0.63
stalls      -0.62
Fancy       -0.61
Suicide     -0.61
Sov         -0.61
Stall       -0.61
Positive Logits
ennis       3.20
ribune      1.70
anish       1.69
kick        1.58
erek        1.52
aniel       1.33
iscovery    1.27
ENN         1.26
ampa        1.24
ixon        1.21
Activations Density: 0.040%