INDEX
Explanations
Mentions of toxicity and “toxic behavior,” especially in moderation or refusal statements.
Negative Logits
часом       -0.09
běh         -0.07
друга       -0.07
++;↵↵       -0.07
evenings    -0.07
evening     -0.07
Clothing    -0.06
lors        -0.06
آلة         -0.06
ринку       -0.06
Positive Logits
пост        0.06
difficile   0.06
Schwe       0.06
菲          0.06
Feinstein   0.06
+'_         0.06
Contract    0.06
ASC         0.06
=str        0.06
(float      0.05
Activations Density 0.009%