INDEX
Explanations
expressions related to discussions or reflections on various topics
expressions of hesitation, caution, or shame regarding personal experiences or opinions
Negative Logits
ngth            -0.80
ynthesis        -0.79
strous          -0.73
vantage         -0.71
ictionary       -0.66
DragonMagazine  -0.66
anooga          -0.65
odder           -0.65
orld            -0.64
uction          -0.64
Positive Logits
how             1.21
what            1.05
admitting       1.04
acknowledging   1.02
whether         1.01
letting         0.99
where           0.96
choosing        0.95
disclosing      0.94
knowing         0.94
Activations Density 0.187%