INDEX
Explanations
terms related to primitive concepts or states
Negative Logits
ojÃŃ       -0.16
rg         -0.15
agged      -0.15
ampa       -0.15
oth        -0.14
scaling    -0.14
íĥĿ        -0.14
DOM        -0.13
iÄĩ        -0.13
esco       -0.13
Positive Logits
SPATH      0.15
PARATOR    0.15
Rocket     0.14
swick      0.14
imon       0.14
/native    0.14
769        0.14
onds       0.14
USTER      0.13
SSF        0.13
Activations Density: 0.009%