INDEX
Explanations
references to specific individuals or proper nouns
Negative Logits
i       -0.32
auf     -0.29
iou     -0.27
iens    -0.26
ÛĮ      -0.26
aes     -0.25
aed     -0.24
aan     -0.24
a       -0.24
aat     -0.24
Positive Logits
dest          0.24
venture       0.23
vertisement   0.23
rian          0.23
ia            0.22
deo           0.21
ler           0.21
ewater        0.20
imir          0.20
ja            0.20
Activations Density: 0.027%