Explanations
happy greetings and phrases
The neuron primarily detects the token "happy" and close variants such as "Happy", i.e., expressions of happiness and positive greetings.
Negative Logits
K    0.85
F    0.76
W    0.76
G    0.74
ক    0.74
E    0.72
R    0.70
M    0.69
ف    0.69
H    0.68
Positive Logits
happy      0.83
feliz      0.79
Happy      0.75
mutlu      0.66
felices    0.61
felicidad  0.60
happy      0.60
happier    0.60
bahagia    0.59
HAPPY      0.57
Activations Density 0.021%
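
As a rough illustration of what a density figure like this might represent, the sketch below assumes that activation density means the percentage of tokens on which the neuron's activation is above zero. The page does not state this definition, and the function name, threshold, and sample data are hypothetical, chosen only so the arithmetic lines up with 0.021%.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of activations strictly above `threshold`.

    Assumption: "density" is the share of tokens on which the neuron
    is active. The dashboard itself does not define the term.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: the neuron is active on 21 of 100,000 tokens.
sample = [1.0] * 21 + [0.0] * (100_000 - 21)
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under that assumption, a density of 0.021% corresponds to roughly 21 active tokens per 100,000, which would fit a neuron that responds to one narrow family of tokens.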