INDEX
Explanations
names of celebrities and characters from popular culture
instances of actor names and notable films or shows
Negative Logits
vain         -0.78
cessation    -0.74
gard         -0.70
obstruction  -0.69
carriage     -0.68
appl         -0.67
yss          -0.66
restored     -0.65
manual       -0.63
attentive    -0.63
Positive Logits
advertising  1.45
Probably     1.07
Often        0.97
Based        0.96
Everyone     0.95
Sometimes    0.95
Most         0.94
Honestly     0.94
Few          0.94
ccording     0.93
Activations Density: 0.183%