INDEX
Explanations
negative assertions or contradictions
negations and negative expressions
Negative Logits
decency         -0.62
itiz            -0.59
selves          -0.58
camer           -0.57
civilisation    -0.56
ÙĴ              -0.56
ewitness        -0.56
velt            -0.55
etimes          -0.54
lycer           -0.54
Positive Logits
shy             1.28
exactly         0.97
necessarily     0.96
hesitated       0.93
amused          0.88
alone           0.87
icably          0.86
yet             0.85
orious          0.85
thrilled        0.82
Activations Density: 0.210%