Explanations
phrases that indicate analytical or evaluative actions related to theories and processes
Negative Logits
persever    -0.14
능          -0.13
stub        -0.13
dub         -0.13
Oro         -0.13
rias        -0.13
Hier        -0.12
ばかり      -0.12
στά         -0.12
usch        -0.12
Positive Logits
ingham      0.18
okus        0.17
ンブ        0.16
ocus        0.16
ploy        0.14
acas        0.14
éĨ          0.14
asca        0.14
ande        0.14
ourt        0.13
Activations Density 0.126%