INDEX
Explanations
proper nouns related to a specific entity or individual
mentions of a specific person's name
Negative Logits
dec            -0.69
clean          -0.68
conditioning   -0.65
England        -0.65
pipes          -0.64
Ok             -0.63
pipe           -0.63
smoke          -0.63
cigarettes     -0.62
ESP            -0.62
Positive Logits
alan           4.81
abal           1.20
alos           1.18
atan           1.18
ala            1.11
assian         1.10
aris           1.09
alin           1.08
asma           1.07
anan           1.07
Activations Density: 0.017%