INDEX
Explanations
expressions related to personal reflection, questioning, and opinions
NEGATIVE LOGITS
oway         -0.74
ilings       -0.71
iling        -0.70
represented  -0.66
avid         -0.65
senal        -0.64
cers         -0.64
arrivals     -0.63
fter         -0.63
lication     -0.61
POSITIVE LOGITS
raining      1.34
happen       0.99
hurts        0.92
happened     0.90
happens      0.81
happening    0.80
easier       0.79
kinda        0.77
depends      0.74
beh          0.74
Activations Density: 1.693%
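
The page does not say how the density figure is computed. One common reading, assumed here, is the fraction of sampled tokens on which the feature's activation is above zero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and not taken from the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the
    feature fires at all. The source page does not confirm this.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1000 tokens, 17 of which activate the feature.
sample = [0.0] * 983 + [0.5] * 17
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 1.693% would mean the feature fires on roughly 17 of every 1000 sampled tokens.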