INDEX
Explanations
phrases related to negative consequences or issues
occurrences of a specific character or formatting that seems to represent special symbols
Negative Logits
enegger       -0.87
ãģ®éŃĶ        -0.85
gow           -0.76
enburg        -0.74
ements        -0.70
compuls       -0.68
worthiness    -0.66
whichever     -0.63
PTS           -0.63
iator         -0.62
Positive Logits
¹             1.77
³             1.71
¿             1.69
¦             1.64
¬             1.64
µ             1.54
¾             1.54
¥             1.54
¸             1.50
¡             1.46
Activations Density: 0.015%
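
The token strings in the logit lists look like symbols from a GPT-2 style byte-level BPE vocabulary, where every raw byte is shown as a printable stand-in character. The page does not name the tokenizer, so this is an assumption. Under that assumption, the sketch below rebuilds the standard byte-to-character table and maps a displayed token back to its raw bytes. The helper names (`token_to_bytes`, `is_utf8_continuation`) are hypothetical and exist only for this illustration.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style byte -> printable-character table.

    Assumption: the dashboard's model uses this byte-level BPE scheme.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points 256 and above.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: displayed character -> raw byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def token_to_bytes(token: str) -> bytes:
    """Map a displayed vocabulary token back to the bytes it stands for."""
    return bytes(UNICODE_TO_BYTE[ch] for ch in token)


def is_utf8_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx (a UTF-8 continuation byte)."""
    return 0x80 <= byte <= 0xBF


# Tokens copied from the Positive Logits list.
positive_tokens = ["¹", "³", "¿", "¦", "¬", "µ", "¾", "¥", "¸", "¡"]

for tok in positive_tokens:
    raw = token_to_bytes(tok)
    print(tok, raw.hex(), is_utf8_continuation(raw[0]))

# The second Negative Logits token, decoded back to text.
print(token_to_bytes("ãģ®éŃĶ").decode("utf-8"))
```

If the assumption holds, each positive-logit token is a single byte in the UTF-8 continuation range, so the feature would be promoting the trailing bytes of multi-byte characters. That reading would fit the second explanation about special symbols, but it is an inference from the token strings, not something the page states.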