Explanations
words related to actions or behaviors
elements related to societal issues and their consequences
Negative Logits
DIV          -0.64
Ort          -0.63
seriousness  -0.60
sonian       -0.58
framework    -0.56
ヴァ         -0.55
strip        -0.55
NP           -0.53
nu           -0.53
irc          -0.52
Positive Logits
when         1.81
when         1.73
WHEN         1.43
When         1.41
When         1.38
whenever     1.38
Whenever     1.05
Whenever     0.99
during       0.94
during       0.89
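Dashboards of this kind commonly build the positive and negative logit lists by projecting the feature's decoder direction onto the model's unembedding matrix and ranking tokens by the result. The page above does not state its method, so the sketch below assumes that projection. The function names `logit_effects` and `top_and_bottom` and the two-dimensional toy vectors are hypothetical, chosen only to illustrate the ranking.

```python
from typing import Dict, List, Tuple

def logit_effects(decoder_dir: List[float],
                  unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Hypothetical helper: dot the feature's decoder direction with each
    token's unembedding column (assumed method, not confirmed by the page)."""
    return {tok: sum(d * u for d, u in zip(decoder_dir, col))
            for tok, col in unembed.items()}

def top_and_bottom(effects: Dict[str, float],
                   k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most positive and k most negative tokens."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by effect
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return positive, negative

# Toy data: a 2-d "residual stream" and four tokens (values are made up).
decoder_dir = [1.0, -0.5]
unembed = {
    "when": [1.5, -0.6],
    "during": [0.7, -0.4],
    "DIV": [-0.4, 0.5],
    "framework": [0.1, 0.2],
}
effects = logit_effects(decoder_dir, unembed)
positive, negative = top_and_bottom(effects, 2)
print(positive)
print(negative)
```

Ranking by a single dot product means a token's position depends only on how well its unembedding aligns with the feature direction, which is why near-duplicate tokens (for example, casing or spacing variants of "when") can all appear near the top.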
Activation density: 0.258%
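Activation density is usually read as the share of sampled tokens on which the feature fires. Assuming "fires" means an activation strictly above zero (the page does not define it), 0.258% corresponds to roughly 258 firing tokens per 100,000. The helper `activation_density` and its `threshold` parameter below are hypothetical illustrations of that arithmetic.

```python
from typing import Sequence

def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Hypothetical helper: percentage of tokens whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)

# Toy example: 258 firing tokens spread across 100,000 sampled tokens.
acts = [0.0] * 100_000
for i in range(258):
    acts[i * 100] = 1.0
print(activation_density(acts))  # 0.258
```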