INDEX
Explanations
expressions of enjoyment or positive experiences
Negative Logits
Token      Logit
ine        -0.67
Al         -0.61
Ber        -0.60
Hu         -0.58
Kle        -0.58
Ra         -0.58
netinet    -0.58
T          -0.57
B          -0.57
pmatrix    -0.57
Positive Logits
Token      Logit
enjoy      1.88
Enjoy      1.85
enjoy      1.80
ENJOY      1.76
enjoyed    1.75
enjoyment  1.69
Enjoying   1.68
Enjoy      1.66
enjoys     1.63
enjoying   1.60
Activations Density: 0.037%