INDEX
Explanations
expressions of happiness or satisfaction
expressions of gratitude or happiness
Negative Logits
helicop    -0.76
effic      -0.73
artifacts  -0.72
improve    -0.69
版         -0.67
contam     -0.67
impair     -0.66
Improve    -0.65
cend       -0.65
irrel      -0.65
Positive Logits
glad       0.86
Tid        0.79
ness       0.77
dy         0.76
terday     0.72
ा          0.71
joy        0.71
Sonia      0.70
tid        0.69
imar       0.68
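The dashboard does not say how these logit scores are produced. A common approach in sparse-autoencoder dashboards is direct logit attribution: multiply the feature's decoder vector by the model's unembedding matrix and rank the vocabulary by the result. The sketch below assumes that method. The function names, the tiny vocabulary, and every weight are hypothetical toy values, not this feature's real parameters.

```python
# Minimal sketch (assumption): a feature's logit effect is its decoder
# vector multiplied by the unembedding matrix, then ranked per token.
# All names and numbers here are hypothetical toy values.
from typing import List, Sequence, Tuple

Scored = List[Tuple[str, float]]


def feature_logits(
    decoder_vec: Sequence[float],
    unembed: Sequence[Sequence[float]],  # rows: d_model, columns: vocab
    vocab: Sequence[str],
) -> Scored:
    """Return (token, score) pairs where score = decoder_vec @ unembed."""
    scores = [0.0] * len(vocab)
    for weight, row in zip(decoder_vec, unembed):
        for t, w_u in enumerate(row):
            scores[t] += weight * w_u
    return list(zip(vocab, scores))


def top_and_bottom(pairs: Scored, k: int) -> Tuple[Scored, Scored]:
    """Return the k highest-scoring and the k lowest-scoring tokens."""
    ranked = sorted(pairs, key=lambda p: p[1], reverse=True)
    positive = ranked[:k]
    negative = ranked[::-1][:k]  # most negative first
    return positive, negative


# Toy example: d_model = 2, vocabulary of 4 tokens.
vocab = ["glad", "improve", "contam", "joy"]
unembed = [
    [0.9, 0.1, -0.4, 0.2],
    [0.2, 0.8, 0.6, -0.3],
]
decoder_vec = [1.0, -0.5]

positive, negative = top_and_bottom(feature_logits(decoder_vec, unembed, vocab), k=2)
print("positive:", [tok for tok, _ in positive])  # ['glad', 'joy']
print("negative:", [tok for tok, _ in negative])  # ['contam', 'improve']
```

A real model would use a vocabulary of tens of thousands of tokens and k = 10, matching the ten entries shown in each list above.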
Activation Density: 0.018%
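Activation density is usually read as the share of token positions in a sample on which the feature is active. The page does not state the activation threshold, so the sketch below assumes "active" means an activation strictly above zero; the function names and toy activations are hypothetical. It also converts the reported 0.018% into a more intuitive rate: roughly one firing per 5,556 tokens.

```python
# Minimal sketch (assumption): density = fraction of token positions where
# the feature's activation is strictly above a threshold (default 0).
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions with activation > threshold."""
    total = 0
    firing = 0
    for a in activations:
        total += 1
        if a > threshold:
            firing += 1
    if total == 0:
        raise ValueError("no activations supplied")
    return firing / total


def tokens_per_firing(density_percent: float) -> float:
    """Convert a density in percent into 'fires once every N tokens'."""
    if density_percent <= 0:
        raise ValueError("density must be positive")
    return 100.0 / density_percent


# Toy activations (hypothetical): the feature fires on 2 of 5 tokens.
print(activation_density([0.0, 0.3, 0.0, 0.0, 1.2]))  # 0.4
# The density reported above, 0.018%, is about one firing per 5,556 tokens.
print(round(tokens_per_firing(0.018)))  # 5556
```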