Explanations
words that indicate strong affirmations or agreement
Negative Logits
Token      Logit
ierarchy   -0.17
NV         -0.17
-vars      -0.16
osto       -0.15
contre     -0.15
ÑĢаÑĩ       -0.14
edor       -0.14
Conc       -0.14
.BUTTON    -0.14
ront       -0.14
Positive Logits
Token      Logit
indh       0.16
ãĥ«ãĥĪ     0.15
/MIT       0.14
lius       0.14
ruž        0.14
.routing   0.14
unan       0.14
Chance     0.14
(CG        0.14
azzi       0.14
Activations Density 0.000%