INDEX
Explanations
adjectives and quantifiers that express degree or quantity
Negative Logits
raig      -0.17
etwork    -0.16
DO        -0.15
ahrain    -0.15
noop      -0.14
vil       -0.14
Rever     -0.14
nette     -0.14
juan      -0.13
lobs      -0.13
Positive Logits
Maur          0.17
urdy          0.16
(Attribute    0.14
uros          0.14
JT            0.14
kla           0.14
dden          0.13
živ           0.13
addin         0.13
bury          0.13
Activations Density: 1.118%