Explanations
phrases related to deception or false statements
instances of the word "lying," which indicates dishonesty or deceit
Negative Logits
Token       Logit
Ultra       -0.86
obs         -0.79
ISO         -0.78
ilation     -0.74
joining     -0.70
FN          -0.70
iles        -0.69
aldi        -0.69
Specific    -0.67
ORE         -0.65
Positive Logits
Token       Logit
horizont    0.78
lying       0.78
lie         0.72
liar        0.71
utenant     0.71
pills       0.70
siege       0.70
skelet      0.70
acies       0.70
lied        0.70
Activations Density 0.008%
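
For readers unfamiliar with the density figure, here is a minimal sketch of one common way such a number could be computed: the percentage of token positions at which the feature's activation exceeds a threshold. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen for illustration; the dashboard itself does not state how its value was derived.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of positions where the activation exceeds threshold.

    `activations` is a flat list of per-token activation values for one feature.
    Treating "active" as "strictly greater than threshold" is an assumption.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: the feature fires on 8 of 100,000 token positions.
sample = [0.0] * 99_992 + [1.3] * 8
density = activation_density(sample)
print(f"{density:.3f}%")  # prints 0.008%
```

Under this reading, a density of 0.008% would mean the feature is active on roughly 8 tokens in every 100,000, which fits a feature tied to a narrow set of words such as "lying" and "liar".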