Explanations
instances of the phrase "can't help" followed by verbs or pronouns indicating involuntary actions or emotions, such as "feeling", "smile", "notice", and "staring"
instances of the word "help" and its variations
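The first explanation can be roughly approximated as a text pattern. The sketch below is illustrative only: `PATTERN` and `following_words` are hypothetical names, and a regular expression is a stand-in for the learned feature, not a description of how it is computed. It assumes the phrase may optionally be followed by "but" (as in "can't help but smile") and returns the word that comes next.

```python
import re

# Hypothetical approximation of the explanation text, not the feature itself:
# "can't help", an optional "but", then the next word (a verb or pronoun).
PATTERN = re.compile(r"\bcan['\u2019]t help(?: but)? (\w+)", re.IGNORECASE)


def following_words(text):
    """Return the lowercased words that directly follow "can't help (but)"."""
    return [m.group(1).lower() for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    examples = [
        "I can't help feeling uneasy.",
        "She can't help but smile.",
        "Can you help me move?",
    ]
    for sentence in examples:
        print(sentence, following_words(sentence))
```

A plain use of "help" without "can't", as in the third sentence, is not matched by this sketch, which mirrors the narrower of the two explanations.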
Negative Logits
Sov       -0.73
Safe      -0.70
lore      -0.70
Home      -0.67
oven      -0.66
••        -0.64
LAN       -0.64
LOS       -0.62
ledged    -0.61
yz        -0.60
Positive Logits
noticing    1.07
wondering   0.90
feeling     0.89
grinning    0.86
but         0.85
smiling     0.82
laughing    0.81
imagining   0.80
slipping    0.78
but         0.76
Activations Density 0.026%