INDEX
Explanations
as to why/which/whether/what
NEGATIVE LOGITS
There    1.16
ت    1.13
us    1.07
ل    1.04
т    0.96
νες    0.94
()    0.93
These    0.92
ic    0.91
ر    0.91
POSITIVE LOGITS
'    1.10
ի    0.97
も    0.96
ノ    0.91
of    0.90
インド    0.88
’    0.87
くる    0.84
지    0.82
luğ    0.81
Activations Density: 0.006%