INDEX
Explanations
words related to dishonesty, specifically focusing on the concept of lying
words and phrases related to lying and dishonesty
Negative Logits
ugal      -0.80
joining   -0.71
orsi      -0.66
allows    -0.64
iles      -0.63
hens      -0.63
Alto      -0.62
aldi      -0.61
runs      -0.61
okemon    -0.61
Positive Logits
detector    1.12
uten        1.11
bling       0.89
utenant     0.88
ulent       0.84
detectors   0.82
pard        0.78
ge          0.74
telling     0.73
deceive     0.73
Activations Density 0.031%
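
One way to read the density figure is as the share of sampled token positions on which the feature fires at all. The sketch below assumes that definition. The function name, the zero threshold, and the sample data are hypothetical and are not taken from the dashboard itself.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature is active. The dashboard does not state its exact definition.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 31 of which activate the feature.
sample = [0.0] * 99_969 + [1.5] * 31
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.031% would mean the feature is active on roughly 3 out of every 10,000 tokens, which fits a sparse feature tied to a narrow topic such as lying.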