Explanations
phrases indicating negation or expressing skepticism about commonly held beliefs
Negative Logits
Token      Logit
undry      -0.16
endor      -0.15
umer       -0.15
lew        -0.15
visor      -0.14
sinon      -0.14
-addons    -0.14
ány        -0.14
uat        -0.14
agua       -0.14
Positive Logits
Token          Logit
true           0.21
true           0.21
fine           0.17
True           0.16
True           0.15
interesting    0.15
instruct       0.15
overs          0.15
perfectly      0.15
partially      0.15
Activations Density 0.279%
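
The density figure above can be read as the share of tokens on which the feature fires at all. The following is a minimal sketch under that assumption; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the dashboard or from any particular library.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means (active tokens) / (all tokens).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    # Count tokens on which the feature is considered active.
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 tokens, 28 of which activate the feature.
sample = [0.0] * 9972 + [1.5] * 28
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.279% would mean the feature is active on roughly 28 of every 10,000 tokens, which suggests a sparse feature.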