INDEX
NEGATIVE LOGITS
Token         Logit
liam          -0.79
atari         -0.77
Democr        -0.69
ugi           -0.65
seldom        -0.64
unden         -0.62
Millenn       -0.61
ende          -0.61
zhen          -0.60
perpetually   -0.60
POSITIVE LOGITS
Token         Logit
nor           1.18
wrongdoing    1.07
harmed        1.03
anything      1.02
anybody       0.93
whatsoever    0.91
threatening   0.89
anything      0.89
any           0.88
anyone        0.87
Activations Density 0.360%