INDEX
Explanations
statements related to politics or controversial figures
phrases related to decision-making and consequences
Negative Logits
âĹ        -0.83
âĩ        -0.80
ゴン      -0.75
«         -0.75
¶         -0.75
™:        -0.74
âĹ        -0.73
âĸ        -0.69
ortium    -0.69
ワン      -0.68
Positive Logits
.")       1.70
,'"       1.63
..."      1.61
!'"       1.59
?'"       1.59
',"       1.55
…"        1.54
.'"       1.52
..."      1.45
)."       1.39
Activations Density 1.014%