INDEX
Explanations
phrases describing actions or behaviors of individuals in different situations
actions related to illegal or inappropriate behavior
Negative Logits
endif           -0.79
etheless        -0.78
soever          -0.75
nown            -0.74
atari           -0.74
fortunately     -0.70
ciating         -0.69
Alternatively   -0.67
cking           -0.67
inis            -0.65
Positive Logits
sake            0.96
purposes        0.87
improper        0.83
illegal         0.79
reasons         0.78
violations      0.73
nonviolent      0.72
unpopular       0.72
unlawful        0.69
improperly      0.65
Activations Density: 0.165%