Explanations
words related to exerting force or pressure on someone or something
references to coercion or force applied to individuals
Negative Logits
uded        -0.65
izoph       -0.58
ILA         -0.57
aming       -0.57
von         -0.56
ording      -0.55
artif       -0.55
effect      -0.55
Shap        -0.54
angered     -0.54
Positive Logits
into        1.10
to          1.03
into        0.98
toward      0.96
onward      0.95
onto        0.95
towards     0.92
onwards     0.87
thereto     0.86
To          0.79
Activations Density: 0.164%