INDEX
Explanations
words related to expressions of appreciation and community engagement
Negative Logits
lication    -0.21
iation      -0.20
WARD        -0.19
ackets      -0.19
ughters     -0.18
lesc        -0.18
ortion      -0.18
ward        -0.17
icks        -0.17
ings        -0.16
Positive Logits
e           0.19
point       0.17
eer         0.16
ors         0.16
ports       0.16
otine       0.15
ALE         0.15
emma        0.15
tar         0.15
ply         0.14
Activations Density: 0.052%