Explanations
references to individuals in the text
Negative Logits
precis        -0.16
ts            -0.15
gaard         -0.15
rana          -0.15
izador        -0.15
ted           -0.15
lef           -0.14
ogn           -0.14
usercontent   -0.14
stadt         -0.14
Positive Logits
nels          0.33
ification     0.27
hood          0.27
nel           0.27
ified         0.26
nage          0.26
/people       0.25
age           0.25
ae            0.24
nal           0.24
Activations Density: 0.031%