Craig Balding
@craigbalding.com
Cyber Security and AI, Brit in Budapest.
I would start with labeled datasets, then later generate synthetic ones that fit a specific scenario. Let me know if this helps.

www.threatprompt.com/post/8-label...
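To make the sequencing concrete, here is a minimal sketch (not from the linked post; the fraud features and perturbation scale are illustrative assumptions): train on a small labeled set first, then derive scenario-specific synthetic samples from the labeled examples.

```python
# Sketch: start with a small labeled dataset, then generate synthetic samples
# for a specific scenario by perturbing the labeled examples.
# The feature layout (amount, hour) and noise scale are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Step 1: labeled data first - (transaction amount, hour of day), 1 = fraud.
X = np.array([[12.0, 14], [30.0, 10], [900.0, 3], [15.0, 16], [750.0, 2]])
y = np.array([0, 0, 1, 0, 1])
clf = LogisticRegression().fit(X, y)

# Step 2: later, synthesize extra samples for a scenario you care about
# (night-time, high-value fraud) by jittering the labeled fraud rows.
fraud_rows = X[y == 1]
synthetic = fraud_rows + rng.normal(0.0, [50.0, 1.0], size=fraud_rows.shape)

# Sanity-check the model against the synthetic scenario.
print(clf.predict(synthetic))
```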
December 13, 2024 at 11:27 AM
Three example beginner project ideas:

Healthcare: Build a simple AI model to detect unusual access to patient data (a minimal sketch follows after this list)

Finance: Train an AI model to spot patterns in fraudulent transactions using public datasets

Manufacturing: Create a basic AI project to predict maintenance issues from machine sensor data
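As a minimal sketch of the healthcare idea (the access-log features and data here are synthetic and purely illustrative), an isolation forest can flag unusual access sessions:

```python
# Minimal sketch: flag unusual access to patient records.
# Features and data are synthetic; column choices are illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Simulated "normal" access: daytime hours, a handful of records per session.
normal = np.column_stack([
    rng.normal(13, 2, 500),   # hour of access
    rng.poisson(3, 500),      # records viewed in the session
])

# A few suspicious sessions: middle of the night, bulk record access.
suspicious = np.array([[3, 40], [2, 55], [4, 35]])

model = IsolationForest(contamination=0.01, random_state=0)
model.fit(normal)

print(model.predict(suspicious))   # anomalous sessions should come back as -1
print(model.predict(normal[:5]))   # typical sessions should mostly be 1
```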
December 13, 2024 at 8:42 AM
No dummy, add up the numbers for both Tech AND Marketing...

Great marketing guys! ;-)
December 11, 2024 at 9:29 PM
• Capability Retention: Even when jailbroken, agents maintained full performance in executing complex multi-step tasks.

The benchmark's 110 tasks (covering fraud, cybercrime, and harassment) demonstrate how synthetic tools can safely mimic real-world misuse.

How are you limiting AI agent risk?
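To illustrate the synthetic-tool idea (this is a simplified sketch, not the benchmark's actual harness), a fake tool can simply record what the agent tried to do instead of executing it:

```python
# Sketch of a "synthetic tool": it looks real to the agent but only records
# the attempted call, so misuse can be evaluated without real-world side effects.
# Names below are illustrative.
from dataclasses import dataclass, field

@dataclass
class SyntheticTool:
    name: str
    calls: list = field(default_factory=list)

    def __call__(self, **kwargs):
        # Record the attempted action instead of executing it.
        self.calls.append(kwargs)
        return {"status": "ok", "note": f"simulated {self.name}"}

# Example: a fake email tool an agent might try to misuse.
send_email = SyntheticTool("send_email")
send_email(to="target@example.com", body="(simulated content)")

# The evaluator inspects the recorded calls to score compliance vs. refusal.
print(send_email.calls)
```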
December 11, 2024 at 11:00 AM
• Malicious Compliance: LLMs like Mistral Large 2 refused only 1.1% of harmful requests, revealing critical gaps in safety mechanisms.
• Jailbreak Vulnerabilities: Simple, universal jailbreaks increased GPT-4o’s compliance with harmful tasks from 48.4% to 72.7%, while refusal rates dropped sharply…
December 11, 2024 at 11:00 AM
- False positives: Increased refusal rates on benign prompts (e.g., 4% to 39% on OR-Bench) - see the sketch below.
- False negatives: Vulnerable to multi-prompt attacks - jailbroken within 3 hours.

It's currently unclear if AI circuit breakers can keep pace with evolving attack strategies.
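For context, the false-positive figure is just a refusal rate over a benign prompt set. A minimal sketch of that measurement, with a placeholder model call and a naive refusal check (both are assumptions, not any benchmark's real harness):

```python
# Sketch: how an over-refusal (false positive) rate is computed.
# `query_model` and the refusal check are placeholders for a real harness.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")

def is_refusal(response: str) -> bool:
    return response.lower().startswith(REFUSAL_MARKERS)

def over_refusal_rate(benign_prompts, query_model) -> float:
    refusals = sum(is_refusal(query_model(p)) for p in benign_prompts)
    return refusals / len(benign_prompts)

# Example with a stubbed model that refuses 2 of 5 benign prompts.
canned = ["Sure, here's how.", "I can't help with that.", "Of course.",
          "I'm sorry, but no.", "Happy to explain."]
prompts = [f"benign prompt {i}" for i in range(5)]
print(over_refusal_rate(prompts, lambda p: canned[prompts.index(p)]))  # 0.4
```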
December 10, 2024 at 11:00 AM
HITL done right enhances security AND process quality.
December 9, 2024 at 6:48 PM
3. Identify what to surface for key decisions: AI reasoning, inputs, security rules & thresholds.
4. Design for HITL: UX, logging, and metrics matter (see the sketch after this list).
5. Train the human: AI ops + domain expertise = effective oversight.
6. Iterate: Test, learn, adapt...
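A minimal sketch of steps 3 and 4 - surface the reasoning, inputs and threshold, escalate low-confidence or high-risk actions to a human, and log it all (the threshold and field names are illustrative assumptions):

```python
# Minimal HITL decision gate: surface the AI's reasoning, inputs, and the
# security threshold, escalate risky or low-confidence actions to a human,
# and log everything. Threshold and field names are illustrative assumptions.
import json, logging
from dataclasses import dataclass, asdict

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("hitl")

CONFIDENCE_THRESHOLD = 0.85  # assumed policy threshold; tune per workflow

@dataclass
class AgentDecision:
    action: str          # what the agent wants to do
    reasoning: str       # why (surfaced to the reviewer)
    inputs: dict         # evidence the agent used
    confidence: float    # agent's self-reported confidence
    high_risk: bool      # flagged by security rules

def review(decision: AgentDecision, ask_human) -> bool:
    """Return True if the action may proceed."""
    log.info("decision=%s", json.dumps(asdict(decision)))
    if decision.high_risk or decision.confidence < CONFIDENCE_THRESHOLD:
        approved = ask_human(decision)   # surface reasoning + inputs in the UX
        log.info("human_approved=%s", approved)
        return approved
    return True                          # auto-approve low-risk, high-confidence

# Example: a high-risk action always goes to the human reviewer.
d = AgentDecision("disable_account", "3 failed MFA attempts", {"user": "jdoe"}, 0.91, True)
print(review(d, ask_human=lambda dec: False))   # reviewer declines -> False
```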
December 9, 2024 at 6:48 PM
What does success look like?

1. Assess agent value: Band-aid, true asset, or unmitigable risk? (Critical for regulated orgs or those serving vulnerable users.)
2. Map processes: Chart workflows and benchmark AI performance in different settings...
December 9, 2024 at 6:48 PM
Protect your local LLMs - know where they are, harden the hosts, limit access and monitor guardrails for misuse.
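A minimal sketch of the "know where they are" step: probe the ports that popular local LLM servers commonly default to (the port list is an assumption - verify against your own inventory and only scan with authorisation):

```python
# Sketch: find local LLM endpoints by probing ports that popular local
# servers commonly default to. The port list is an assumption; verify it
# against your own inventory and get authorisation before scanning.
import socket

COMMON_LLM_PORTS = {
    11434: "Ollama (default)",
    8080: "llama.cpp server (default)",
    1234: "LM Studio (default)",
}

def probe(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if the TCP port is open."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

def find_llm_hosts(hosts):
    for host in hosts:
        for port, label in COMMON_LLM_PORTS.items():
            if probe(host, port):
                yield host, port, label

for host, port, label in find_llm_hosts(["127.0.0.1"]):
    print(f"{host}:{port} looks like {label} - harden, restrict, and monitor it")
```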
December 8, 2024 at 7:55 AM
"Living off your local LLM" enables real-time attack script creation within your internal network.

Feed the LLM response to an interpreter and execute without leaving a trace.
December 8, 2024 at 7:55 AM