Negative Logits
monitors      -0.26
SAFE          -0.26
harb          -0.25
soon          -0.25
Monitor       -0.25
Monitor       -0.24
himself       -0.24
ä¸ĩåħ¬éĩĮ     -0.23
Prem          -0.23
-options      -0.23
Positive Logits
ering         0.31
åĹŁ           0.28
åı£           0.28
itudes        0.27
ifies         0.25
è¯įæĿ¡        0.25
æĮ¤           0.25
Sk            0.24
æ¡ij           0.24
eria          0.24
Activations Density 0.004%