INDEX
Explanations
explanations of reasons why
Negative Logits
uomo            0.99
kommen          0.95
penatibus       0.95
-------------   0.93
于              0.92
Behold          0.91
Besitz          0.91
u               0.91
Bad             0.89
人都            0.89
POSITIVE LOGITS
reasons         1.23
reason          1.22
why             1.20
Reasons         1.17
warum           1.11
detrás          1.03
razón           1.02
السبب           0.98
bypass          0.97
यंस             0.95
Activations Density 0.272%