Explanations
income, gender, and large numbers
Negative Logits
₂+            0.31
clesiastical  0.30
٢             0.29
Timothy       0.28
KON           0.28
𝟮             0.27
-             0.27
⮚             0.26
squarePos     0.26
:             0.26
Positive Logits
0.54, 0.54, 0.53, 0.46, 0.46, 0.46, 0.46, 0.45, 0.44, 0.44
Activations Density 0.030%