INDEX
Explanations
segments related to programming syntax, particularly variable names and function definitions
Negative Logits
Token          Logit
-              -0.81
[toxicity=0]   -0.60
p              -0.60
/              -0.60
“              -0.57
a              -0.56
dymyr          -0.55
or             -0.55
b              -0.55
(              -0.55
Positive Logits
Token          Logit
itſelf         1.64
myſelf         1.60
himſelf        1.55
ſelves         1.47
Anſ            1.47
themſelves     1.45
Reſ            1.39
Conſ           1.37
ſelf           1.36
Majefty        1.34
Activations Density 0.477%