INDEX
Explanations
contractions where the apostrophe is missing or replaced by unusual characters
instances of negation or expressions of inability
Negative Logits
commons      -0.70
heroin       -0.69
kicker       -0.68
polio        -0.67
pyramid      -0.66
Robin        -0.65
black        -0.64
Lob          -0.63
Wilmington   -0.63
Taliban      -0.62
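The page does not say how these logit values were computed. A common approach for sparse-autoencoder feature dashboards, assumed here, is to project the feature's decoder vector through the model's unembedding matrix and report the most negative and most positive entries. The sketch below uses made-up toy weights; `top_logit_effects`, `w_dec_row` and `w_unembed` are hypothetical names, not a real library API.

```python
import numpy as np


def top_logit_effects(w_dec_row, w_unembed, vocab, k=10):
    """Assumed method: project one feature's decoder direction through the
    unembedding and return the k most negative and k most positive tokens."""
    logits = w_dec_row @ w_unembed  # shape: (vocab_size,)
    order = np.argsort(logits)      # ascending order of logit effect
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return negative, positive


# Toy example with random weights (d_model=4, vocabulary of 5 tokens).
rng = np.random.default_rng(0)
vocab = ["commons", "heroin", "kicker", "âĢ", "âĶĢ"]
w_dec_row = rng.normal(size=4)
w_unembed = rng.normal(size=(4, len(vocab)))

neg, pos = top_logit_effects(w_dec_row, w_unembed, vocab, k=2)
print("negative:", neg)
print("positive:", pos)
```

With real weights, `k=10` would give lists shaped like the negative and positive lists on this page.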
Positive Logits
ï¸ı           1.18
¯¯           1.04
Ì            1.01
âĻ           1.01
iversary     1.01
̶            1.00
âĪ           0.95
âĢ           0.94
âĶĢâĶĢâĶĢâĶĢ     0.94
âĶĢ           0.94
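Most positive-logit tokens above look garbled because they appear to be printed in GPT-2's byte-level BPE notation, where each raw byte is shown as a printable character. This is an assumption: the page does not name the tokenizer. Under that assumption, the sketch below reverses GPT-2's byte-to-unicode table to recover the raw bytes. For example, "âĶĢ" decodes to the box-drawing character U+2500, while "âĢ" is only the first two bytes (E2 80) of characters such as the curly apostrophe ’ (E2 80 99), which fits the "replaced apostrophe" explanation. The helper names `token_to_bytes` and `describe` are hypothetical.

```python
def bytes_to_unicode():
    """GPT-2's reversible mapping from each byte (0-255) to a printable character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:  # non-printable bytes are shifted to code points 256+
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: printable character -> original byte value.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def token_to_bytes(token: str) -> bytes:
    """Convert a byte-level BPE token string back to its raw bytes."""
    return bytes(BYTE_DECODER[ch] for ch in token)


def describe(token: str) -> str:
    """Decode as UTF-8 if complete, otherwise show the partial byte sequence."""
    raw = token_to_bytes(token)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return f"<partial UTF-8: {raw.hex(' ')}>"


for tok in ["ï¸ı", "¯¯", "Ì", "âĻ", "iversary", "âĪ", "âĢ", "âĶĢ"]:
    print(repr(tok), "->", repr(describe(tok)))
```

Tokens that are only part of a multi-byte character (such as "âĢ" or "Ì") cannot be decoded on their own, so the sketch prints their bytes instead.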
Activation Density: 0.180%
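Activation density is typically the fraction of token positions at which the feature fires. A density of 0.180% would mean about 1.8 active positions per 1,000 tokens. The sketch below assumes "fires" means an activation strictly above zero; the threshold and the function name `activation_density` are assumptions, not taken from the page.

```python
import numpy as np


def activation_density(acts: np.ndarray, threshold: float = 0.0) -> float:
    """Fraction of token positions whose activation exceeds the threshold (assumed 0)."""
    return float(np.mean(acts > threshold))


# Toy data: 10,000 token positions, the feature is active on 18 of them.
acts = np.zeros(10_000)
acts[:18] = 1.5
density = activation_density(acts)
print(f"Activation Density: {density:.3%}")
```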