INDEX
Explanations
proper nouns and specific scientific terms
Negative Logits
Token    Logit
ない    -0.88
nya    -0.84
său    -0.76
ness    -0.59
liśmy    -0.59
nostru    -0.59
นั้น    -0.59
ling    -0.58
ned    -0.57
น    -0.57
Positive Logits
Token    Logit
aaaa    0.70
e    0.68
aaa    0.66
aaaaaaaa    0.66
اااا    0.62
aaaaaa    0.61
aaaaa    0.58
a    0.56
aa    0.56
eins    0.55
Activations Density: 1.227%