INDEX
Explanations
references to specific individuals or names in the text
Negative Logits
rix        -0.17
ustin      -0.16
ictions    -0.15
geç        -0.15
igon       -0.15
hierarchy  -0.15
625        -0.15
uns        -0.15
ąd         -0.15
plain      -0.14
Positive Logits
ashtra  0.23
agh     0.22
ajs     0.21
itur    0.21
ishi    0.20
ames    0.20
angan   0.19
aja     0.19
AMES    0.18
atan    0.17
Activations Density: 0.036%