INDEX
Explanations
phrases related to lying or deception
Negative Logits
ugal          -0.82
sylv          -0.74
ilation       -0.69
specificity   -0.66
iles          -0.65
runs          -0.65
night         -0.63
arthy         -0.62
oso           -0.61
Signature     -0.59
Positive Logits
dormant       0.88
awake         0.87
uten          0.81
yss           0.78
pard          0.77
bling         0.77
utenant       0.74
detector      0.72
lie           0.71
asleep        0.70
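
Token lists like the ones above are commonly produced by projecting a feature's output direction through the model's unembedding matrix and sorting the resulting per-token scores. The sketch below illustrates that idea on toy data only. The function name `top_logit_tokens`, the two-dimensional `direction`, and the `unembed` rows are hypothetical, and this page does not say how its own values were computed.

```python
def top_logit_tokens(direction, unembed, vocab, k=2):
    """Return (most negative, most positive) tokens for a feature direction.

    direction: list of floats, the feature's output direction (hypothetical).
    unembed:   one row of floats per vocabulary token (hypothetical).
    vocab:     token strings, in the same order as the rows of `unembed`.
    """
    # Score each token by the dot product of its unembedding row with the direction.
    scores = [sum(d * w for d, w in zip(direction, row)) for row in unembed]
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]          # lowest scores first
    positive = ranked[::-1][:k]    # highest scores first
    return negative, positive


# Toy values chosen for illustration; they are not taken from any real model.
vocab = ["lie", "awake", "night", "runs"]
unembed = [[0.8, 0.2], [0.9, 0.1], [-0.6, 0.1], [-0.5, 0.2]]
direction = [1.0, -0.5]

negative, positive = top_logit_tokens(direction, unembed, vocab)
print([tok for tok, _ in negative])  # ['night', 'runs']
print([tok for tok, _ in positive])  # ['awake', 'lie']
```

Sorting once and slicing both ends keeps the negative and positive lists consistent with each other, which matters when two tokens have nearly equal scores.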
Activations Density 0.017%
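
An activation density such as 0.017% is usually read as the fraction of token positions on which the feature fires at all. The following is a minimal sketch under that assumption. The function name `activation_density`, the zero threshold, and the toy activation list are all hypothetical and are not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 17 of 100,000 token positions.
toy_activations = [0.0] * 99_983 + [1.5] * 17
density = activation_density(toy_activations)
print(f"{density:.3%}")  # 0.017%
```

At this density the feature is silent on roughly 9,998 of every 10,000 tokens, so any explanation of it rests on a small number of activating examples.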