Explanations
politically charged words and phrases; specifically, it seems to highlight strong or forceful statements
repeated characters or stylized characters in text
Negative Logits
Token        Logit
disadvant    -0.86
Gutenberg    -0.83
mathemat     -0.81
misunder     -0.70
inav         -0.69
geries       -0.69
merce        -0.68
carbohyd     -0.67
whiff        -0.64
raviolet     -0.63
Positive Logits
Token        Logit
ï¸ı          1.17
uth          0.95
女           0.90
¯¯           0.88
ï¸           0.87
§            0.87
ution        0.81
ãģ®éŃĶ       0.81
ãĤĭ          0.80
··           0.78
Activations Density 0.387%
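
Several of the positive-logit tokens above (for example ãĤĭ and ãģ®éŃĶ) look like raw byte-level BPE token strings rather than readable text. The sketch below shows one way to turn such strings back into text, assuming the vocabulary uses the GPT-2-style byte-to-character table. The source does not say which tokenizer produced these tokens, so that is an assumption, and the names `bytes_to_unicode`, `CHAR_TO_BYTE` and `decode_token` are illustrative only.

```python
def bytes_to_unicode():
    """Byte -> printable character table, following the GPT-2 byte-level BPE scheme.

    Assumption: the dashboard's tokenizer uses this table; the source does not say.
    """
    # Bytes that are already printable are mapped to their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte is assigned the next free code point from 256 upward.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: displayed character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token):
    """Decode a byte-level token string to text.

    Tokens containing characters outside the table (for example an already
    decoded CJK character) are returned unchanged. Byte sequences that are not
    complete UTF-8 are decoded with replacement characters.
    """
    if any(ch not in CHAR_TO_BYTE for ch in token):
        return token
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_token("ãĤĭ"))         # る
    print(decode_token("ãģ®éŃĶ"))      # の魔
    print(ascii(decode_token("ï¸ı")))  # '\ufe0f' (variation selector-16)
```

Under this assumption, tokens such as ï¸, ¯¯, § and ·· would be incomplete UTF-8 byte sequences, which would explain why they have no readable form on their own.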