INDEX
Explanations
instances of individuals, actions, and events related to controversial or newsworthy topics
Negative Logits
!.              -0.61
}.              -0.58
();             -0.55
!,              -0.54
!).             -0.51
bask            -0.51
lance           -0.51
!'              -0.51
)!              -0.50
};              -0.49
Positive Logits
inappropriately  0.62
unfairly         0.60
"'               0.60
"…               0.57
improperly       0.57
"                0.57
misunderstood    0.53
inappropriate    0.53
unlawfully       0.53
improper         0.53
Activations Density 7.387%