INDEX
Explanations
sentences ending with a specific structure of punctuation
toxic or harmful statements and concepts
Negative Logits
Manhattan   -0.74
Glou        -0.71
Syd         -0.67
Somerset    -0.67
scene       -0.64
Whit        -0.64
Roc         -0.59
reception   -0.59
Shattered   -0.59
RAD         -0.58
Positive Logits
¬      1.00
âĢł    0.90
agree  0.85
Ĵ      0.85
âĹ¼    0.83
¯      0.82
ÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤ  0.82
0.80
§      0.78
ú      0.78
Activations Density 0.230%