Explanations
terms related to various types of "yes" or affirmative expressions
Negative Logits
efe       -0.20
aland     -0.16
affen     -0.15
brick     -0.15
olly      -0.15
mmo       -0.15
pering    -0.15
uten      -0.14
otland    -0.14
PERT      -0.14
Positive Logits
isseur    0.21
ymous     0.20
oit       0.18
ises      0.18
xious     0.17
elle      0.17
ise       0.16
Longer    0.15
iu        0.15
longer    0.15
Activations Density: 0.027%