Craig Balding
@craigbalding.com
Cyber Security and AI, Brit in Budapest.
I would start with labeled datasets, then later generate synthetic ones that fit a specific scenario. Let me know if this helps.

www.threatprompt.com/post/8-label...
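To make the sequencing concrete, here is a minimal sketch (not from the linked post; the fraud features and perturbation scale are illustrative assumptions): train on a small labeled set first, then derive scenario-specific synthetic samples from the labeled examples.

```python
# Sketch: start with a small labeled dataset, then generate synthetic samples
# for a specific scenario by perturbing the labeled examples.
# The feature layout (amount, hour) and noise scale are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Step 1: labeled data first - (transaction amount, hour of day), 1 = fraud.
X = np.array([[12.0, 14], [30.0, 10], [900.0, 3], [15.0, 16], [750.0, 2]])
y = np.array([0, 0, 1, 0, 1])
clf = LogisticRegression().fit(X, y)

# Step 2: later, synthesize extra samples for a scenario you care about
# (night-time, high-value fraud) by jittering the labeled fraud rows.
fraud_rows = X[y == 1]
synthetic = fraud_rows + rng.normal(0.0, [50.0, 1.0], size=fraud_rows.shape)

# Sanity-check the model against the synthetic scenario.
print(clf.predict(synthetic))
```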
December 13, 2024 at 11:27 AM
Three example beginner project ideas:

Healthcare: Build a simple AI model to detect unusual access to patient data (a minimal sketch follows after this list)

Finance: Train an AI model to spot patterns in fraudulent transactions using public datasets

Manufacturing: Create a basic AI project to predict maintenance issues from machine sensor data
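As a minimal sketch of the healthcare idea (the access-log features and data here are synthetic and purely illustrative), an isolation forest can flag unusual access sessions:

```python
# Minimal sketch: flag unusual access to patient records.
# Features and data are synthetic; column choices are illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Simulated "normal" access: daytime hours, a handful of records per session.
normal = np.column_stack([
    rng.normal(13, 2, 500),   # hour of access
    rng.poisson(3, 500),      # records viewed in the session
])

# A few suspicious sessions: middle of the night, bulk record access.
suspicious = np.array([[3, 40], [2, 55], [4, 35]])

model = IsolationForest(contamination=0.01, random_state=0)
model.fit(normal)

print(model.predict(suspicious))   # anomalous sessions should come back as -1
print(model.predict(normal[:5]))   # typical sessions should mostly be 1
```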
December 13, 2024 at 8:42 AM
No dummy, add up the numbers for both Tech AND Marketing...

Great marketing guys! ;-)
December 11, 2024 at 9:29 PM
• Capability Retention: Even when jailbroken, agents maintained full performance in executing complex multi-step tasks.

The benchmark's 110 tasks (covering fraud, cybercrime, and harassment) demonstrate how synthetic tools can safely mimic real-world misuse.

How are you limiting AI agent risk?
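To illustrate the synthetic-tool idea (this is a simplified sketch, not the benchmark's actual harness), a fake tool can simply record what the agent tried to do instead of executing it:

```python
# Sketch of a "synthetic tool": it looks real to the agent but only records
# the attempted call, so misuse can be evaluated without real-world side effects.
# Names below are illustrative.
from dataclasses import dataclass, field

@dataclass
class SyntheticTool:
    name: str
    calls: list = field(default_factory=list)

    def __call__(self, **kwargs):
        # Record the attempted action instead of executing it.
        self.calls.append(kwargs)
        return {"status": "ok", "note": f"simulated {self.name}"}

# Example: a fake email tool an agent might try to misuse.
send_email = SyntheticTool("send_email")
send_email(to="target@example.com", body="(simulated content)")

# The evaluator inspects the recorded calls to score compliance vs. refusal.
print(send_email.calls)
```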
December 11, 2024 at 11:00 AM
• Malicious Compliance: LLMs like Mistral Large 2 refused only 1.1% of harmful requests, revealing critical gaps in safety mechanisms.
• Jailbreak Vulnerabilities: Simple, universal jailbreaks increased GPT-4o’s compliance with harmful tasks from 48.4% to 72.7%, while refusal rates dropped sharply…
December 11, 2024 at 11:00 AM
- False positives: Increased refusal rates on benign prompts (e.g., 4% to 39% on OR-Bench) - see the sketch below.
- False negatives: Vulnerable to multi-prompt attacks - jailbroken within 3 hours.

It's currently unclear if AI circuit breakers can keep pace with evolving attack strategies.
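For context, the false-positive figure is just a refusal rate over a benign prompt set. A minimal sketch of that measurement, with a placeholder model call and a naive refusal check (both are assumptions, not any benchmark's real harness):

```python
# Sketch: how an over-refusal (false positive) rate is computed.
# `query_model` and the refusal check are placeholders for a real harness.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")

def is_refusal(response: str) -> bool:
    return response.lower().startswith(REFUSAL_MARKERS)

def over_refusal_rate(benign_prompts, query_model) -> float:
    refusals = sum(is_refusal(query_model(p)) for p in benign_prompts)
    return refusals / len(benign_prompts)

# Example with a stubbed model that refuses 2 of 5 benign prompts.
canned = ["Sure, here's how.", "I can't help with that.", "Of course.",
          "I'm sorry, but no.", "Happy to explain."]
prompts = [f"benign prompt {i}" for i in range(5)]
print(over_refusal_rate(prompts, lambda p: canned[prompts.index(p)]))  # 0.4
```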
December 10, 2024 at 11:00 AM
HITL done right enhances security AND process quality.
December 9, 2024 at 6:48 PM
3. Identify what to surface for key decisions: AI reasoning, inputs, security rules & thresholds.
4. Design for HITL: UX, logging, and metrics matter (see the sketch after this list).
5. Train the human: AI ops + domain expertise = effective oversight.
6. Iterate: Test, learn, adapt...
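A minimal sketch of steps 3 and 4 - surface the reasoning, inputs and threshold, escalate low-confidence or high-risk actions to a human, and log it all (the threshold and field names are illustrative assumptions):

```python
# Minimal HITL decision gate: surface the AI's reasoning, inputs, and the
# security threshold, escalate risky or low-confidence actions to a human,
# and log everything. Threshold and field names are illustrative assumptions.
import json, logging
from dataclasses import dataclass, asdict

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("hitl")

CONFIDENCE_THRESHOLD = 0.85  # assumed policy threshold; tune per workflow

@dataclass
class AgentDecision:
    action: str          # what the agent wants to do
    reasoning: str       # why (surfaced to the reviewer)
    inputs: dict         # evidence the agent used
    confidence: float    # agent's self-reported confidence
    high_risk: bool      # flagged by security rules

def review(decision: AgentDecision, ask_human) -> bool:
    """Return True if the action may proceed."""
    log.info("decision=%s", json.dumps(asdict(decision)))
    if decision.high_risk or decision.confidence < CONFIDENCE_THRESHOLD:
        approved = ask_human(decision)   # surface reasoning + inputs in the UX
        log.info("human_approved=%s", approved)
        return approved
    return True                          # auto-approve low-risk, high-confidence

# Example: a high-risk action always goes to the human reviewer.
d = AgentDecision("disable_account", "3 failed MFA attempts", {"user": "jdoe"}, 0.91, True)
print(review(d, ask_human=lambda dec: False))   # reviewer declines -> False
```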
December 9, 2024 at 6:48 PM
What does success look like?

1. Assess agent value: Band-aid, true asset, or unmitigable risk? (Critical for regulated orgs or those serving vulnerable users.)
2. Map processes: Chart workflows and benchmark AI performance in different settings...
December 9, 2024 at 6:48 PM
Protect your local LLMs - know where they are, harden the hosts, limit access and monitor guardrails for misuse.
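A minimal sketch of the "know where they are" step: probe the ports that popular local LLM servers commonly default to (the port list is an assumption - verify against your own inventory and only scan with authorisation):

```python
# Sketch: find local LLM endpoints by probing ports that popular local
# servers commonly default to. The port list is an assumption; verify it
# against your own inventory and get authorisation before scanning.
import socket

COMMON_LLM_PORTS = {
    11434: "Ollama (default)",
    8080: "llama.cpp server (default)",
    1234: "LM Studio (default)",
}

def probe(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if the TCP port is open."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

def find_llm_hosts(hosts):
    for host in hosts:
        for port, label in COMMON_LLM_PORTS.items():
            if probe(host, port):
                yield host, port, label

for host, port, label in find_llm_hosts(["127.0.0.1"]):
    print(f"{host}:{port} looks like {label} - harden, restrict, and monitor it")
```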
December 8, 2024 at 7:55 AM
"Living off your local LLM" enables real-time attack script creation within your internal network.

Feed the LLM response to an interpreter and execute without leaving a trace.
December 8, 2024 at 7:55 AM