INDEX
Explanations
words related to rules, regulations, consequences, and penalties
concepts related to societal rules, norms, and consequences
Negative Logits
Token       Logit
iven        -0.67
ãĥij         -0.65
ãĥķãĤ©      -0.63
ophon       -0.62
ãĤ»         -0.61
ando        -0.60
Siber       -0.59
ãĥĭ         -0.59
SPONSORED   -0.59
`.          -0.59
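Several of the negative-logit tokens (ãĥij, ãĥķãĤ©, ãĤ», ãĥĭ) look like encoding debris. They are consistent with the printable stand-in alphabet used by GPT-2 style byte-level BPE tokenizers, where each raw byte is shown as one visible character. The dashboard does not state which tokenizer produced them, so the sketch below is an assumption rather than the tool's own method; `byte_level_alphabet` and `decode_token` are hypothetical helper names introduced only for illustration.

```python
def byte_level_alphabet():
    """Build the stand-in character -> raw byte map (assumed GPT-2 style scheme)."""
    # Bytes that are already printable are displayed as themselves.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    char_to_byte = {chr(b): b for b in keep}
    keep_set = set(keep)
    # Every remaining byte is displayed as the next unused code point from U+0100.
    offset = 0
    for b in range(256):
        if b not in keep_set:
            char_to_byte[chr(256 + offset)] = b
            offset += 1
    return char_to_byte


CHAR_TO_BYTE = byte_level_alphabet()


def decode_token(token):
    """Turn a displayed token back into text; invalid UTF-8 becomes U+FFFD."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Escaped forms of the displayed tokens, to avoid copy-paste ambiguity.
    samples = [
        "\u00e3\u0125\u0133",                    # ãĥij
        "\u00e3\u0125\u0137\u00e3\u0124\u00a9",  # ãĥķãĤ©
        "\u00e3\u0124\u00bb",                    # ãĤ»
        "\u00e3\u0125\u012d",                    # ãĥĭ
    ]
    for token in samples:
        print(token, "->", decode_token(token))
```

If the assumption holds, these four entries decode to the katakana パ, フォ, セ and ニ, which would mean the feature suppresses Japanese-script tokens alongside the unrelated ASCII fragments. Plain ASCII tokens such as `iven` pass through unchanged.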
Positive Logits
Token       Logit
deserve     0.98
tended      0.96
will        0.91
must        0.89
are         0.87
cannot      0.86
may         0.86
ought       0.85
knows       0.84
tend        0.84
Activations Density 0.676%