Explanations
words indicating limitation or exclusivity
Negative Logits
Token       Logit
contri      -0.18
ownership   -0.15
zan         -0.14
anner       -0.14
anners      -0.13
oggles      -0.13
ancel       -0.13
ÙĪØ³ÛĮ      -0.13
already     -0.13
igger       -0.13
Positive Logits
Token       Logit
HIR         0.17
缼          0.17
brains      0.16
only        0.15
hey         0.15
Broken      0.15
égorie      0.15
interested  0.14
really      0.14
toler       0.14
Activations Density: 0.062%