Explanations
expressions of gratitude or thankfulness
Negative Logits
Token           Logit
oto             -0.73
smoking         -0.72
agate           -0.69
calling         -0.68
opers           -0.66
uter            -0.66
change          -0.66
improve         -0.65
inic            -0.65
dump            -0.64
Positive Logits
Token           Logit
acknowled       1.03
giving          0.97
citiz           0.92
pardon          0.91
gements         0.83
acknowledgment  0.78
gratitude       0.76
FUL             0.74
NESS            0.74
ledged          0.74
Activations Density: 9.741%