INDEX
Explanations
reconciling opposing concepts
Negative Logits
س  1.13
та  1.02
ti  0.98
ta  0.96
ts  0.93
ty  0.86
ter  0.85
s  0.83
ten  0.82
sh  0.82
Positive Logits
{  0.85
(  0.83
ان  0.82
고  0.80
도  0.79
บุ  0.72
玹  0.66
ั  0.64
もら  0.63
an  0.62
Activations Density 0.003%