INDEX
Explanations
discrepancies between appearance or assertion and reality
phrases that highlight the contrast between perception and reality
Negative Logits
uyomi    -0.70
inel     -0.65
erity    -0.63
luster   -0.63
rosso    -0.62
EDITION  -0.62
aunted   -0.62
ador     -0.61
tackle   -0.60
iculty   -0.60
Positive Logits
indeed    0.90
untrue    0.81
actually  0.79
meant     0.77
intended  0.76
true      0.72
actually  0.71
REDACTED  0.70
existed   0.68
factual   0.68
Activations Density 0.973%