INDEX
Explanations
phrases that reference reading or written text
Negative Logits
ÙĦب	-0.16
åĸ	-0.15
αÏģά	-0.14
struggles	-0.14
erap	-0.13
ct	-0.13
Ov	-0.13
abolic	-0.13
Bread	-0.13
åij	-0.13
Positive Logits
커ìĬ¤	0.17
çŃij	0.16
Chr	0.16
Ral	0.16
Teach	0.16
needle	0.15
.ta	0.14
Ỽ	0.14
澤	0.14
agem	0.14
Activations Density 0.046%