INDEX
Explanations
phrases related to expressing opinions or beliefs
concepts related to self-reflection and decision-making
Negative Logits
Adds      -0.65
Downing   -0.64
Loading   -0.61
Lub       -0.60
Corpor    -0.59
Drunk     -0.58
Logan     -0.57
Franco    -0.57
major     -0.57
Berry     -0.57
Positive Logits
iety        0.80
hereafter   0.78
aspire      0.77
abouts      0.74
yssey       0.74
catentry    0.72
posium      0.70
thereafter  0.70
thereof     0.69
artney      0.68
Activations Density: 0.339%