INDEX
Explanations
words related to safety and precision
terms related to quality, safety, and fairness in various contexts
Negative Logits
ocene      -0.78
ften       -0.78
athlet     -0.72
alian      -0.69
ittee      -0.69
cone       -0.68
hell       -0.67
utra       -0.66
ourke      -0.65
anwhile    -0.65
Positive Logits
ness             0.93
amounts          0.86
nesses           0.85
doses            0.79
circumstances    0.77
explanations     0.76
medical          0.76
levels           0.75
quality          0.75
situations       0.75
Activations Density: 0.507%