Explanations
phrases that indicate relationships between actions and their consequences
Negative Logits
anta
-0.21
ona
-0.17
ante
-0.16
uid
-0.16
ingham
-0.15
uj
-0.15
me
-0.15
ard
-0.14
par
-0.14
la
-0.13
Positive Logits
这一
0.22
这个
0.22
these
0.20
this
0.20
such
0.19
này
0.19
这种
0.19
этого
0.19
ấy
0.19
这样的
0.19
Activations Density 0.303%