Explanations
references to things or actions being incorrect, inappropriate, or harmful
Negative Logits
casters    -0.73
anned      -0.72
ranging    -0.68
ivism      -0.68
lishes     -0.67
ury        -0.67
thood      -0.66
rs         -0.66
cit        -0.64
zeb        -0.63
Positive Logits
amount     0.90
side       0.87
way        0.86
kind       0.84
thing      0.83
piece      0.75
direction  0.75
solution   0.73
person     0.72
number     0.72
Activations Density 6.764%