INDEX
Explanations
references to deception or falsehoods
Negative Logits
Token        Logit
Woodstock    -0.58
Normally     -0.56
onato        -0.56
PMA          -0.56
Carnaval     -0.55
GPP          -0.55
Ammon        -0.53
Carthag      -0.53
Davidson     -0.53
ECR          -0.53
Positive Logits
Token        Logit
lies         1.91
Lies         1.78
Lies         1.68
lie          1.41
lies         1.41
mentiras     1.27
mentira      1.09
Lie          1.04
lie          1.01
lying        0.97
Activations Density: 0.007%