Explanations
introspective statements and self-reflection
Negative Logits
Token       Logit
dding       -0.66
mma         -0.62
ichever     -0.61
cknow       -0.61
noticed     -0.59
herent      -0.58
Friendly    -0.57
Columb      -0.56
tted        -0.55
EW          -0.55
Positive Logits
Token       Logit
entails     1.06
alian       0.92
boils       0.91
hurts       0.86
happened    0.85
transpired  0.85
happens     0.84
feels       0.83
takes       0.80
rains       0.80
Activations Density: 0.081%