INDEX
Explanations
statements expressing moral judgment or inconsistency over time
Negative Logits
ń
-0.16
арч
-0.15
erte
-0.15
erer
-0.14
brero
-0.14
ivol
-0.14
ーン
-0.14
eria
-0.14
اوي
-0.14
μη
-0.14
Positive Logits
今
0.19
continue
0.18
_now
0.17
current
0.17
today
0.17
continues
0.17
today
0.16
今
0.16
.now
0.16
current
0.16
Activations Density 0.145%