INDEX
Explanations
mentions of smoking-related terms
references to smoking and its effects
Negative Logits
assian       -0.81
tell         -0.70
Defenders    -0.69
translation  -0.69
Vector       -0.67
telling      -0.66
Philipp      -0.65
Nou          -0.65
ousse        -0.65
HCR          -0.64
Positive Logits
cessation    1.31
smoking      1.19
smoked       1.08
smoker       1.08
cigarettes   1.07
smoke        1.03
cigars       1.02
habits       0.96
tobacco      0.95
smokers      0.93
Activations Density 0.015%