INDEX
Explanations
themes related to contradictions in beliefs and values
Negative Logits
ÅĻiv            -0.17
erez            -0.15
islav           -0.14
Complaint       -0.14
oub             -0.13
.Close          -0.13
ús              -0.12
AREST           -0.12
reta            -0.12
NotFoundError   -0.12
Positive Logits
contrast        0.48
contrary        0.48
opposite        0.47
Contr           0.45
diam            0.45
CONTR           0.45
contr           0.43
counter         0.42
Contr           0.41
opposed         0.41
Activations Density 0.204%
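
The density figure above can be read as the share of sampled token positions on which the feature fires. The sketch below is illustrative only: it assumes "fires" means an activation strictly greater than zero, which is an assumption about how the dashboard defines density, and the function name `activation_density` and the toy data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a feature counts as active when its activation is strictly
    greater than `threshold` (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 1000 token positions, 2 of them active.
toy = [0.0] * 998 + [1.3, 0.7]
density = activation_density(toy)
print(f"{density:.3%}")
```

On this reading, a density of 0.204% would mean the feature is active on roughly 2 of every 1000 sampled tokens.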