Explanations
harmful or triggering content
Negative Logits
Token             Value
unitas            0.88
asymmetries       0.86
heresy            0.83
discontinuities   0.80
বর্তন             0.76
(blank)           0.75
discontinuity     0.74
Killed            0.73
pernicious        0.72
deus              0.72
Positive Logits
Token             Value
integral          0.80
deemed            0.69
considered        0.68
emotionally       0.65
indicative        0.65
categorized       0.62
capable           0.62
greatly           0.62
both              0.62
apporter          0.61
Activations Density: 0.263%