INDEX
Explanations
phrases indicating knowledge or awareness
Negative Logits
alance            -0.07
ephy              -0.06
urch              -0.06
icher             -0.06
Localization      -0.06
ismus             -0.05
Everyday          -0.05
iceps             -0.05
urai              -0.05
zell              -0.05
Positive Logits
Masc              0.07
ardo              0.07
lys               0.07
iture             0.07
.scalablytyped    0.07
اÙĦسÙħ             0.07
CADE              0.07
ligt              0.07
_java             0.07
FRING             0.06
Activations Density 0.017%