INDEX
NEGATIVE LOGITS
代办
-0.34
scre
-0.27
bothered
-0.27
sticker
-0.26
昕
-0.26
atty
-0.26
hel
-0.25
巾
-0.25
oppose
-0.25
angu
-0.24
POSITIVE LOGITS
experiment
0.29
Experiment
0.29
Experiment
0.28
experiment
0.26
Narr
0.26
再去
0.26
演绎
0.25
赛季
0.25
实验
0.24
Narr
0.24
Activations Density 0.014%