INDEX
Explanations
phrases related to causality or consequences
instances of emotional or evaluative language
Negative Logits
çīĪ          -0.75
STATS        -0.75
è£ħ          -0.70
gad          -0.69
omorphic     -0.68
racuse       -0.66
quished      -0.64
ãĥīãĥ©       -0.63
cyan         -0.63
messenger    -0.62
Positive Logits
º    0.87
¡    0.85
Ĵ    0.80
ł    0.79
ĵ    0.79
£    0.78
¬    0.73
¼    0.72
¢    0.70
Ķ    0.69
Activations Density 0.372%