INDEX
Explanations
phrases describing hidden agendas or deceptive actions
instances of deceptive language or misrepresentation
Negative Logits
Sob         -0.66
chenko      -0.62
gra         -0.61
bent        -0.60
loads       -0.60
Survey      -0.59
ozy         -0.59
cedes       -0.58
below       -0.58
polled      -0.58
Positive Logits
innocuous   0.92
invincible  0.79
innocence   0.78
OPA         0.76
UL          0.74
benign      0.72
harmless    0.66
rud         0.63
neutrality  0.62
regn        0.61
Activations Density: 0.741%