INDEX
Explanations
references to personal identity and self-description
Negative Logits
anda      -0.17
oler      -0.17
ãĥķãĤ      -0.15
ogne      -0.15
aterno    -0.15
awe       -0.15
ollen     -0.15
ands      -0.14
inis      -0.14
imals     -0.14
Positive Logits
797       0.16
part      0.16
.Err      0.14
unto      0.14
inn       0.14
cribe     0.13
PRETTY    0.13
tük       0.13
Winston   0.13
victims   0.13
Activations Density: 0.105%