Explanations
instances of falsehoods and deception
instances of dishonesty or lying
Negative Logits
Switch         -0.82
pour           -0.77
Interstitial   -0.75
irez           -0.74
jee            -0.74
wcs            -0.73
eric           -0.73
omer           -0.72
largeDownload  -0.71
rian           -0.71
Positive Logits
omission       1.22
falsely        0.92
innocence      0.91
conceal        0.88
incrim         0.86
accusation     0.84
accusations    0.83
dece           0.81
identities     0.80
witness        0.80
Activations Density: 0.201%