INDEX
Explanations
the letter 'w' in various contexts, indicating a focus on its frequency in the text
Negative Logits
={({     -0.79
.";      -0.75
"},      -0.67
)");     -0.67
XXXXX    -0.67
).)      -0.67
).]      -0.66
ᾶ        -0.66
)');     -0.65
:~       -0.64
Positive Logits
w        2.49
w        2.27
𝐰        1.06
𝙬        0.97
wl       0.94
wh       0.93
ww       0.92
wh       0.90
𝑤        0.89
𝒘        0.88
Activations Density 0.074%