INDEX
Explanations
phrases related to likelihood or potentiality
phrases expressing perceptions or opinions
Negative Logits
ilts      -0.76
zb        -0.76
estern    -0.75
ffen      -0.71
ainers    -0.70
cised     -0.66
ogi       -0.65
opers     -0.65
ests      -0.64
ourse     -0.62
Positive Logits
innocuous      1.03
awfully        1.03
oddly          0.98
destined       0.93
strangely      0.93
unlikely       0.92
tailor         0.91
unstoppable    0.91
to             0.90
harmless       0.90
Activations Density 0.061%
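
As a rough illustration of how a density figure like the one above could be read, the sketch below assumes that activation density means the fraction of sampled tokens on which the feature fires (activation above zero). That definition, the function name activation_density, the threshold parameter, and the sample data are all assumptions for illustration, not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumed definition: a token counts as "active" when the feature's
    activation on it is strictly greater than the threshold.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 tokens, of which 6 have a nonzero activation.
sample = [0.0] * 9994 + [0.5, 1.2, 0.3, 2.0, 0.7, 0.1]
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.060%"
```

Under this reading, a density of 0.061% would correspond to the feature firing on roughly 6 of every 10,000 tokens.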