Explanations
mentions of people or organizations, potentially in a negative context
the symbol or character representation of certain expressions or emphasis
Negative Logits
Token        Logit
imitation    -0.89
Seym         -0.72
indo         -0.69
anium        -0.69
mathemat     -0.68
accompan     -0.66
constitu     -0.66
fortun       -0.66
arios        -0.65
disadvant    -0.64
Positive Logits
Token        Logit
ï¸ı          1.23
ï¸           0.94
VER          0.85
女           0.85
Balt         0.82
STEM         0.80
sure         0.76
legal        0.75
own          0.74
£            0.73
Activations Density: 0.509%