Explanations
phrases with legal or adversarial connotations
Negative Logits
Token         Logit
prosec        -0.81
princ         -0.77
citiz         -0.76
commissions   -0.75
¥ŀ            -0.74
skelet        -0.74
censored      -0.73
obser         -0.73
lifes         -0.72
newcom        -0.70
Positive Logits
Token         Logit
"[            2.00
"(            1.91
"             1.90
He            1.69
"'            1.66
"...          1.57
Asked         1.49
Instead       1.48
She           1.46
His           1.44
Activations Density: 0.344%