Explanations
phrases related to dishonesty or deceit
references to deception or dishonesty
Negative Logits
Token        Logit
ugal         -0.81
runs         -0.70
sylv         -0.66
hens         -0.65
night        -0.62
Flavoring    -0.61
orsi         -0.61
uries        -0.59
icals        -0.59
}}}          -0.59
Positive Logits
Token        Logit
awake        1.04
uten         0.90
detector     0.90
dormant      0.90
utenant      0.78
pard         0.76
silently     0.76
quietly      0.75
asleep       0.72
yss          0.72
Activations Density: 0.026%