Explanations
references to publications, news sources, and dates
the presence of vertical bar characters in the text
Negative Logits
Token       Logit
rons        -0.83
ifts        -0.83
anski       -0.78
ippi        -0.78
orical      -0.77
anium       -0.77
ifting      -0.77
ory         -0.77
enance      -0.76
raints      -0.75
Positive Logits
Token                               Logit
cffff                               1.02
|--                                 0.93
··                                  0.83
••                                  0.73
thel                                0.72
+---                                0.71
cffffcc                             0.71
Posted                              0.70
////////////////////////////////    0.68
————                                0.68
Activations Density 0.015%