INDEX
Explanations
phrases indicating truthfulness or fairness
phrases that express honesty and truthfulness
Negative Logits
chairs     -0.71
urated     -0.62
gotten     -0.61
into       -0.61
boro       -0.60
colored    -0.60
chair      -0.59
ende       -0.59
worth      -0.57
bled       -0.57
Positive Logits
however     1.02
though      1.00
tho         0.77
adays       0.76
nown        0.73
there       0.72
meanwhile   0.70
pter        0.65
it          0.64
neither     0.64
Activations Density: 0.143%