INDEX
Explanations
input filtering and validation
NEGATIVE LOGITS
safety          0.46
safety          0.44
Safety          0.42
dangereux       0.42
prone           0.41
dangerous       0.40
ിച്ചി            0.40
безопасность    0.40
SAFETY          0.39
susceptible     0.39
POSITIVE LOGITS
purification    0.73
purifier        0.68
Purification    0.67
purifying       0.65
Sanit           0.63
FILTER          0.61
purify          0.61
Filter          0.61
filtration      0.60
filter          0.59
Activations Density 0.004%