INDEX
Explanations
phrases indicating capability or potential actions
Negative Logits
aldo       -0.16
AVA        -0.14
elho       -0.14
adele      -0.14
/comment   -0.14
Tele       -0.14
рев        -0.13
ności      -0.13
arching    -0.13
batim      -0.13
Positive Logits
-Ass       0.16
ima        0.15
Mattis     0.14
panse      0.14
aise       0.14
ojis       0.14
saja       0.14
URNS       0.14
azen       0.14
nil        0.13
Activations Density 0.152%