INDEX
Explanations
mathematical expressions and symbols commonly used in formal proofs
Negative Logits
s               -0.68
-               -0.65
}               -0.62
)               -0.62
_               -0.61
[toxicity=0]    -0.58
之              -0.58
                -0.56
"               -0.55
Kell            -0.54
Positive Logits
ſelves          1.02
purpoſe         0.94
raiſ            0.94
—,              0.94
uſ              0.94
ſche            0.93
iſt             0.92
ſelf            0.92
myſelf          0.91
ſtand           0.89
Activations Density 0.611%