Explanations
numerals and specific words in various languages or scripts
Negative Logits
ه  -0.81
م  -0.67
у  -0.66
ي  -0.65
ο  -0.63
själva  -0.62
\{\\  -0.62
ر  -0.62
е  -0.58
י  -0.56
Positive Logits
featureID  0.54
e  0.51
وفاته  0.48
يكب  0.46
B  0.43
NPs  0.42
IndentedString  0.41
characterised  0.41
J  0.40
moga  0.40
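The page does not say how the negative and positive logit lists are produced. A common approach for sparse-autoencoder feature dashboards, assumed here and not confirmed by this page, is to take the dot product of the feature's decoder direction with each token's unembedding vector and then rank tokens by that score. The pure-Python sketch below uses toy vectors. The names `logit_effects` and `top_and_bottom` and the toy vocabulary are hypothetical, and a real pipeline may apply a final layer norm or other scaling that is omitted here.

```python
from typing import Dict, List, Sequence, Tuple


def logit_effects(decoder_dir: Sequence[float],
                  unembed: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Score each token by the dot product of the feature's decoder
    direction with that token's unembedding vector (assumed method;
    no layer norm or scaling is applied)."""
    return {tok: sum(d * u for d, u in zip(decoder_dir, col))
            for tok, col in unembed.items()}


def top_and_bottom(effects: Dict[str, float], k: int
                   ) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and the k most positive (token, score) pairs."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by score
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return negative, positive


# Toy example: a 2-dimensional residual stream and a 3-token vocabulary (hypothetical).
effects = logit_effects([1.0, -0.5],
                        {"a": [0.2, 0.4], "b": [-0.3, 0.1], "c": [0.5, -0.2]})
neg, pos = top_and_bottom(effects, 1)
```

With real weights, `k=10` would yield two ten-entry lists shaped like the ones above.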
Activation Density: 0.019%
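Activation density is read here as the percentage of tokens in a sample on which the feature's activation is above zero. That definition and the zero threshold are assumptions and are not stated on this page. Under that reading, 0.019% corresponds to roughly 19 activating tokens per 100,000. The function name `activation_density` is hypothetical.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose feature activation exceeds `threshold`
    (assumed definition of activation density)."""
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:
            active += 1
    if total == 0:
        raise ValueError("no activations supplied")
    return 100.0 * active / total


# 19 firing tokens out of 100,000 gives a density of about 0.019 (percent).
density = activation_density([0.0] * 99_981 + [1.0] * 19)
```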