INDEX
Explanations
names of people or places
letters or characters that appear frequently in the text
Negative Logits
censored    -0.77
coli        -0.70
destro      -0.70
hairs       -0.70
envy        -0.70
kernels     -0.68
mosqu       -0.68
proxies     -0.67
mosaic      -0.67
prol        -0.67
Positive Logits
idd         1.04
afer        0.99
inn         0.99
ady         0.99
urd         0.98
itz         0.98
oor         0.97
acker       0.97
alla        0.96
ü           0.95
Activations Density: 0.160%