Explanations
actions leading to outcomes
Negative Logits
seseorang    0.97
我    0.94
icing    0.93
ผม    0.93
the    0.88
orting    0.88
溢    0.86
oration    0.84
あなたは    0.83
iding    0.83
POSITIVE LOGITS
נ    1.19
might    1.17
may    1.12
Με    1.07
μο    1.07
κα    1.07
א    1.06
בת    1.06
tends    1.05
must    1.03
Activations Density 0.005%