INDEX
Explanations
phrases indicating negation or the absence of something
Negative Logits
Token         Logit
arse          -0.15
öh            -0.14
hlas          -0.14
hn            -0.13
ali           -0.13
-utils        -0.13
brids         -0.13
hypotheses    -0.13
ETS           -0.12
uhan          -0.12
Positive Logits
Token         Logit
denying       0.26
question      0.23
way           0.23
disput        0.23
reason        0.23
need          0.21
room          0.21
deny          0.20
guarantee     0.20
telling       0.20
Activations Density 0.048%