INDEX
Explanations
negative descriptors and insults related to individuals or groups
Negative Logits
Token         Logit
TagMode       -0.55
              -0.49
ánd           -0.47
ChildIndex    -0.42
דו            -0.42
iciens        -0.41
\|\           -0.40
isel          -0.38
kend          -0.38
dataSet       -0.38
Positive Logits
Token               Logit
crap                0.88
للمعارف             0.87
StructEnd           0.86
ThroughAttribute    0.86
protoc              0.85
ProtoMessage        0.84
bullshit            0.83
morons              0.82
发表于              0.82
idiotic             0.81
Activations Density 0.454%
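
As a rough illustration of the density figure, the sketch below assumes that activation density means the fraction of sampled tokens on which the feature fires (activation above zero). That definition, the helper name `activation_density`, and the sample data are all assumptions made for illustration; none of them come from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumed definition: a token counts as "active" when its activation
    is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 5 active tokens out of 1,000 sampled tokens.
sample = [0.0] * 995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(sample)
print(f"{density:.3%}")  # 0.500%
```

Under this reading, a density of 0.454% would correspond to the feature firing on roughly 4 or 5 tokens out of every 1,000 sampled.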