Explanations
narratives or phrases that reveal surprising outcomes or conclusions
Negative Logits
Token        Logit
Ợ            -0.61
illoin       -0.60
claimer      -0.59
ślę          -0.56
ztály        -0.55
οπο          -0.55
harapkan     -0.55
amemnon      -0.52
ftagPool     -0.51
ilosop       -0.51
Positive Logits
Token        Logit
ternyata     0.86
actually     0.85
原来         0.82
原來         0.81
bleek        0.76
مشين         0.74
Ternyata     0.73
blijkt       0.72
actually     0.71
Actually     0.71
Activations Density: 0.380%