INDEX
Explanations
phrases that indicate intention or action
Negative Logits
/Common  -0.15
Happy    -0.15
weg      -0.14
essor    -0.14
er       -0.14
j        -0.13
ج        -0.13
/loose   -0.13
Mult     -0.13
CO       -0.13
Positive Logits
ieder        0.16
Uvs          0.16
IFn          0.15
afari        0.15
KNOWN        0.15
лит          0.14
èm           0.14
interchange  0.14
aan          0.14
atatype      0.14
Activations Density 0.156%