INDEX
Explanations
mathematical expressions or formulas
Negative Logits
[toxicity=0]        -0.81
(token not shown)   -0.78
"                   -0.78
}                   -0.77
↵                   -0.73
"                   -0.72
-                   -0.72
(token not shown)   -0.72
_                   -0.72
<eos>               -0.70
Positive Logits
myſelf     1.45
ſelves     1.37
itſelf     1.34
Theſe      1.32
Anſ        1.30
himſelf    1.28
Monfieur   1.23
uſed       1.23
raiſ       1.16
ſind       1.16
Activation Density: 0.567%