INDEX
Explanations
train large language models
Negative Logits
Token       Logit
483         -0.09
some        -0.09
.tf         -0.08
ucas        -0.08
Recon       -0.08
Levels      -0.08
inati       -0.08
Intern      -0.08
inqu        -0.08
ulia        -0.08
Positive Logits
Token       Logit
situations  0.20
such        0.19
stuff       0.18
这样的      0.18
guys        0.17
moments     0.16
such        0.16
таких       0.16
像          0.16
cases       0.15
Activations Density 0.101%