INDEX
Explanations
phrases related to causing harm or danger
phrases indicating the imposition of burdens or pressures
Negative Logits
experiences   -0.77
events        -0.72
matters       -0.71
revelations   -0.71
anism         -0.70
modes         -0.69
secrets       -0.66
projects      -0.65
anship        -0.65
settings      -0.65
Positive Logits
halt          1.10
lid           1.09
stop          1.04
moratorium    1.03
limit         0.98
disclaimer    0.97
spotlight     0.97
dent          0.96
ceiling       0.96
smile         0.96
Activation Density: 0.067%