INDEX
Explanations
references to the concept of "otherness" or external entities
Negative Logits
allon          -0.16
ycz            -0.14
acker          -0.13
iko            -0.13
amar           -0.13
Certificates   -0.13
schemas        -0.13
reputation     -0.13
ses            -0.13
_bw            -0.13
Positive Logits
words           0.46
words           0.39
Words           0.33
Words           0.32
.words          0.30
_words          0.29
other           0.26
WORDS           0.23
other           0.22
(words          0.22
Activations Density: 0.008%