INDEX
Explanations
phrases related to policy changes or adjustments
Negative Logits
wine          -0.66
ا             -0.65
waters        -0.65
stereotype    -0.65
Sic           -0.60
Heller        -0.57
dies          -0.57
Nass          -0.56
Sung          -0.56
hung          -0.56
Positive Logits
effected      1.15
drastic       1.07
gradual       0.93
incremental   0.89
occur         0.89
undone        0.89
wrought       0.84
undo          0.83
occurring     0.81
foreseen      0.80
Activations Density 0.429%