INDEX
Explanations
time-related phrases or durations
phrases that indicate reasons or justifications
Negative Logits
hammad    -0.63
olen      -0.61
oster     -0.61
antine    -0.59
angelo    -0.57
nings     -0.55
MSN       -0.55
fired     -0.55
jen       -0.54
wd        -0.53
Positive Logits
reasons       1.39
reason        1.27
etheless      0.92
Reasons       0.90
awhile        0.85
certain       0.83
WAY           0.82
reason        0.77
ways          0.77
milliseconds  0.77
Activations Density: 0.402%