INDEX
Explanations
phrases indicating a generalization or simplification of complex ideas
Negative Logits
272      -0.18
orem     -0.18
ore      -0.17
orc      -0.16
446      -0.16
simples  -0.15
сп       -0.14
ween     -0.14
();++    -0.14
Stub     -0.14
Positive Logits
identical  0.23
-таки      0.17
lesh       0.16
unchanged  0.16
就是       0.16
же         0.15
imposs     0.15
speaking   0.15
ignored    0.15
raison     0.15
Activations Density 0.053%