INDEX
Explanations
confidence and certainty in statements
Negative Logits
Token         Logit
iless         -0.71
thood         -0.70
ories         -0.69
psey          -0.66
matically     -0.63
cially        -0.63
idas          -0.62
vati          -0.60
ilaterally    -0.60
inth          -0.59
Positive Logits
Token         Logit
surprises     0.70
Rampage       0.70
admire        0.67
plenty        0.66
delight       0.65
delighted     0.65
âĶĢ           0.65
adore         0.63
grinning      0.63
displeasure   0.62
Activations Density: 3.198%