INDEX
Explanations
phrases indicating strength or robustness
Negative Logits
buz            -0.16
zem            -0.15
zend           -0.15
ushi           -0.15
ogg            -0.15
rganization    -0.15
STANCE         -0.15
ikan           -0.14
åĸ             -0.14
upert          -0.14
Positive Logits
holds          0.17
_weak          0.16
Strong         0.16
/we            0.15
assen          0.15
,strong        0.15
ASSES          0.15
strong         0.15
hitters        0.15
-strong        0.14
Activations Density 0.018%