INDEX
Explanations
phrases expressing opinions or viewpoints
Negative Logits
answ       -0.71
mentioned  -0.68
ramids     -0.67
then       -0.64
aunted     -0.63
cies       -0.63
ersen      -0.62
rote       -0.62
ITH        -0.62
leaf       -0.61
Positive Logits
synonymous     0.95
belonging      0.92
unbeat         0.85
credible       0.83
unfit          0.83
indispensable  0.82
embod          0.81
illegitimate   0.80
unethical      0.79
unworthy       0.78
Activations Density 1.001%