INDEX
Explanations
references to safety events and practices in a professional context
Negative Logits
locator    -0.17
kowski     -0.17
pitch      -0.16
stin       -0.15
ÃľR        -0.15
obili      -0.14
angkan     -0.14
pitch      -0.14
ech        -0.14
695        -0.14
Positive Logits
safety     0.35
Safety     0.34
Safety     0.31
hazards    0.26
hazard     0.26
Unsafe     0.25
Hazard     0.25
unsafe     0.25
afety      0.24
safer      0.23
Activations Density 0.040%