Explanations
phrases related to dishonesty or misleading information
Negative Logits
ULTS                 -0.75
kefeller             -0.72
ILY                  -0.69
NetMessage           -0.69
Pigs                 -0.63
Downloadha           -0.61
externalActionCode   -0.60
jack                 -0.60
FE                   -0.59
ilk                  -0.59
Positive Logits
ation                2.22
ations               2.07
ational              1.83
ative                1.59
ATIONS               1.47
ATION                1.38
atives               1.36
ated                 1.32
ary                  1.30
ators                1.26
Activations Density: 0.058%