INDEX
Explanations
expressions of gratitude towards others
Negative Logits
dominates     -0.70
ths           -0.70
Course        -0.68
thing         -0.63
Worse         -0.63
Temperature   -0.60
ighed         -0.59
hibition      -0.59
POL           -0.59
isation       -0.58
Positive Logits
omever        0.86
RIP           0.81
contributors  0.74
involved      0.73
Ü             0.73
rats          0.73
involved      0.71
congr         0.70
listeners     0.70
readers       0.69
Activations Density 0.163%