Explanations
verbs and their associated actions or states
Negative Logits
orian    -0.17
418      -0.15
irie     -0.15
Aqu      -0.15
nan      -0.15
icari    -0.15
Aqu      -0.14
ctr      -0.14
zan      -0.14
川       -0.14
Positive Logits
rada      0.21
rad       0.20
RAD       0.18
クロ      0.18
rad       0.16
Rad       0.16
Robbins   0.16
radi      0.16
RAD       0.15
dro       0.15
Activations Density 0.027%