INDEX
Explanations
statements asserting the existence or truth of a subject or concept
Negative Logits
fraught     -0.15
ąż          -0.14
sui         -0.14
tte         -0.14
ague        -0.14
こちら      -0.13
lier        -0.13
そこ        -0.13
LEGRO       -0.13
etwork      -0.13
Positive Logits
why         0.39
why         0.30
true        0.29
WHY         0.25
true        0.24
为什么      0.23
pourquoi    0.23
Why         0.23
Why         0.23
True        0.21
Activations Density 0.130%