INDEX
Explanations
instances of hypocrisy and contradictions between stated beliefs and actions
Negative Logits
resembl        -0.15
islav          -0.15
undy           -0.14
resemblance    -0.14
ús             -0.13
ÅĻiv           -0.13
ľ              -0.13
Viá»ĩc         -0.13
usk            -0.13
ÎŃα            -0.13
Positive Logits
conflict       0.52
contrad        0.48
contrary       0.48
contradiction  0.48
conflicts      0.48
contradict     0.46
contr          0.44
CONTR          0.43
Contr          0.43
contradictory  0.42
Activations Density 0.380%