Explanations
behavior-related words and phrases
references to behavior and its various contexts
Negative Logits
Token      Logit
endiary    -0.75
mand       -0.72
inite      -0.71
anmar      -0.69
racted     -0.67
sonian     -0.67
enegger    -0.67
inka       -0.67
vu         -0.66
ondo       -0.65
Positive Logits
Token         Logit
modification  1.06
behaviors     1.05
behavior      1.01
avior         0.97
behaviours    0.97
aviour        0.96
uation        0.95
patterns      0.95
behavi        0.92
uate          0.92
Activations Density 0.049%