Explanations
connections and inconsistencies between actions and beliefs, particularly in the context of claims being made
Negative Logits
łu        -0.15
ennis     -0.14
lieu      -0.13
celed     -0.13
iji       -0.13
utra      -0.13
therm     -0.12
Pyramid   -0.12
ida       -0.12
ử         -0.12
Positive Logits
match     0.50
matches   0.49
align     0.47
match     0.40
-match    0.38
align     0.38
Align     0.38
matched   0.37
MATCH     0.37
matches   0.37
Activations Density 0.526%