INDEX
Explanations
adjectives or phrases emphasizing fairness or honesty
phrases emphasizing clarity, honesty, and fairness
Negative Logits
merit               -0.70
berra               -0.66
Bur                 -0.65
ajo                 -0.64
Ger                 -0.63
Han                 -0.61
aleb                -0.61
Bel                 -0.60
Comb                -0.60
overflow            -0.59
Positive Logits
externalActionCode  0.80
WATCHED             0.73
Yourself            0.71
quished             0.71
ioned               0.68
uzz                 0.66
Pryor               0.65
.--                 0.64
onomic              0.63
someday             0.63
Activations Density: 0.197%