Can AI actually do science? A hard new test

Vol. 01 · No. 61 · Thursday 2 July 2026 · Daily

Can AI actually do science: not recite facts, but do the messy judgement work of real research? OpenAI just built a brutal test to find out: 129 real, deliberately messy biology problems, each worth 20 to 40 hours of an expert's time. Its best model solved under a third. The unsettling part isn't the failure rate; it's the speed. A few months ago, the top score was under 5%.

The Six Signals

One story from each frontier — what happened, then what's in it for you and who stands to gain.

IN THIS ISSUE

01 · AI & ML — Can AI actually do science? A brutal new test keeps score
02 · Robotics — Robots are learning to feel, not just see
03 · Biotech — Scientists say they built a living cell from scratch
04 · Energy — China built the giant magnet a fusion plant needs
05 · Space — One fuel, two engines: a tiny thruster aimed at Mars
06 · Quantum — Quantum's error problem is now a free download


1 · AI & ML — Can AI Actually Do Science? A Brutal New Test Keeps Score

OpenAI released GeneBench-Pro, a hard new test of whether AI can do the judgement work of real science — spotting bad data, picking the right method, and reaching a conclusion a researcher could act on — across 129 messy, research-grade biology problems. Reviewers reckon each would take a human expert 20 to 40 hours.
The best model (OpenAI's own GPT-5.6, in its most careful mode) solved 31.5%. A few months ago, the top score was under 5% — and OpenAI thinks the test could be maxed out by year-end.
This isn't a recall quiz: it grades judgement on deliberately messy data, which is where the line falls between "AI helps a scientist" and "AI runs the analysis." Tellingly, today's models often spot a flaw in the data but fail to act on it.

What's in it for me? The AI you lean on for analysis is getting dramatically better at real, messy problems — but "better" still means wrong about two times in three on hard science. The practical takeaway: let AI draft the analysis, keep a human on the "is this actually right?" call, and judge any tool on messy real data, not a clean demo.
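
The "two times in three" figure follows directly from the numbers in this story. A quick back-of-envelope check in Python, offered as a sketch: the variable names are ours, "under 5%" is treated as an upper bound, and the rough count of solved problems assumes every problem is weighted equally, which the source does not state.

```python
# Figures quoted in the story above; variable names are illustrative only.
best_score = 0.315       # top model's solve rate
earlier_ceiling = 0.05   # "under 5%" a few months ago, so an upper bound
problems = 129           # size of the test

miss_rate = 1 - best_score                     # share of problems not solved
min_fold_gain = best_score / earlier_ceiling   # improvement is at least this large
approx_solved = round(best_score * problems)   # rough count, assuming equal weighting

print(f"missed: {miss_rate:.1%}")
print(f"improvement: at least {min_fold_gain:.1f}x")
print(f"roughly {approx_solved} of {problems} problems solved")
```

Because the earlier score is only known to be below 5%, the true jump could be larger than the printed figure.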

Who benefits: researchers, drug-hunters and data teams who could get lab-grade analysis on tap — and OpenAI, which now sets the yardstick; the pressure lands on anyone who trusts a slick AI answer without checking it.

Source: OpenAI · ⚑ The score jump is real, but OpenAI graded its own test and the underlying paper is a bioRxiv preprint (not yet peer-reviewed). 30 Jun.

2 · Robotics — Robots Are Learning to Feel, Not Just See

A wave of new robotics research is chasing the same prize: giving robots a real sense of touch. A peer-reviewed study (dubbed TouchWGNN, 26 Jun) helps a robot track an object it can't fully see while handling it; fresh papers focus on capturing human hand skill and copying it into robot fingers.
Today's robots mostly "see," then fumble the moment fingers make contact — grabbing something soft, threading a cable, folding a cloth. Contact and touch are where the next big jump in ability lives.
Tellingly, the whole field is pushing there at once — which usually comes right before a wave of new real-world skill.

What's in it for me? The robots meant to help in homes, warehouses and hospitals are useless until they can handle real, floppy, slippery, half-hidden stuff — and that's exactly what this work targets. A simple test for any robot demo: skip the backflips and watch whether it can reliably pick up a soft or hidden object. That, not acrobatics, is the tell.

Who benefits: makers of robot hands and grippers, and anyone waiting on robots that do fiddly real-world jobs; the honest caveat is that some of this is early-stage research — a couple of these papers haven't been peer-reviewed yet.

Source: Frontiers in Robotics & AI · ⚑ TouchWGNN is peer-reviewed (26 Jun); the hand-skill papers (RealDexUMI, YUBI) are preprints, not yet reviewed. A research frontier, framed as a wave rather than one event.

3 · Biotech — Scientists Say They Built a Living Cell From Scratch

A University of Minnesota team says it assembled a working cell — nicknamed the "SpudCell" — from 150–200 non-living molecules, and that it can feed, grow and split on its own for about five generations, running on a genome smaller than anyone thought life could manage. They released the recipe openly.
The dream: instead of editing life that already exists, build a blank cell from parts and one day program it like software — to make drugs, materials or living sensors on demand.
The giant asterisk: it's an unreviewed preprint that the top journal Cell has already rejected (one reviewer said the SpudCells "were not real biology"), and scientists are openly arguing over whether the thing is even alive.

What's in it for me? If it holds up, this is the starting gun for programmable cells — a long-term route to making medicines and materials to order. For now the useful move is a little media literacy: treat "we made life" as a claim to watch, not a result — and see whether it survives peer review before you believe the headline.

Who benefits: synthetic-biology labs and the open-science camp, if it's real; the risk is a splashy claim that doesn't survive scrutiny — which is exactly why the "is it alive?" argument matters.

Source: Science / University of Minnesota · ⚑⚑ bioRxiv preprint, NOT peer-reviewed; the manuscript was rejected by the journal Cell, and experts disagree on whether it counts as alive. 1 Jul.

4 · Energy & Climate — China Built the Giant Magnet a Fusion Plant Actually Needs

At its CRAFT fusion-technology facility in Hefei, China says it finished and tested its largest-ever magnet — plus a second, high-tech magnet — of the kind used to bottle up and squeeze the super-hot gas inside a fusion reactor.
Everyone chasing fusion hits the same wall: not the physics of the hot gas, but building the enormous, reliable magnets on budget and at scale. Passing acceptance tests on full-size magnets says China can now manufacture them in-house.
It's the unglamorous industrial muscle — not another record temperature — that decides who can actually assemble a fusion power plant rather than just run an experiment.

What's in it for me? Fusion is the dream of near-limitless clean power, and this is a real step toward building the machines, not just testing ideas. Want to track who's really ahead? Watch magnet-manufacturing muscle, not headline plasma temperatures.

Who benefits: national fusion programmes and the magnet-and-superconductor supply chain; the honest caveat is this is one country's industrial milestone in a very long race — not power on the grid.

Source: CGTN · ⚑ Reported by state media (29 Jun; work dated 27 Jun), with no independent Western confirmation yet; an engineering milestone, not net energy. Today's one China-centred story.

5 · Space — One Fuel, Two Engines: the Tiny Thruster Built to Reach Mars

MIT engineers showed that a single green fuel can power both kinds of spacecraft engine: chemical thrusters for quick bursts of speed, and ultra-efficient electric thrusters for the long cruise — all from one shared tank. NASA will fly a shoebox-sized test satellite this year.
Until now those two engines needed separate fuels and hardware, adding weight and cost. Sharing one tank lets even small, cheap satellites both sprint and cruise — enough, MIT says, to send a CubeSat to Mars or the asteroid belt.
It's one piece of a wider push for faster deep-space engines — from NASA's record-power plasma thruster to experimental fusion rockets — all aimed at shrinking the months-long slog between planets.

What's in it for me? Cheaper, more capable little spacecraft mean more eyes on storms, crops, wildfires and deep space — and a faster route to the science and materials that trickle back into everyday tech. The milestone to watch is that first shared-fuel test flight later this year.

Who benefits: small-satellite makers, universities and NASA, plus the propulsion startups chasing faster engines; the caveat is this is a lab result heading to its first space test — promising, not proven.

Source: MIT News · ⚑ Published lab study (10 Jun); the first in-space test (a NASA CubeSat) is later in 2026. A frontier round-up, not breaking news.

6 · Quantum — Quantum's Error Problem Just Became a Free Download

IBM released Qiskit Paulice (29 Jun), a free, open-source tool that lets any developer bolt error detection onto a quantum program in a few lines of code, with no PhD and no new hardware. It has been run on circuits of up to 50 qubits.
Today's quantum computers are noisy — they make mistakes constantly. Full error correction (fixing faults live) needs thousands of qubits we don't have yet. Error detection — spotting a bad run and throwing it out — is the cheaper half you can use today.
The quiet shift: making today's shaky quantum machines trustworthy is turning from a physics problem into an ordinary software download.

What's in it for me? The quantum computer you'll one day rent by the hour is getting reliable faster — because cleaning up its errors is becoming a free add-on, not a research project. That's how quantum inches toward real jobs like designing medicines and materials.

Who benefits: developers and companies experimenting on today's quantum machines (and IBM's software ecosystem); the caveat is that this is detection, not correction: it makes results cleaner, but it doesn't yet make quantum computers powerful.

Source: IBM Quantum · ⚑ Open-source software release (29 Jun); error detection (post-selection), not full error correction.

Where Signals Meet

A short science-fiction scene — built only from today's real signals — to show where these frontiers could be heading.

AI & ML × Robotics × Biotech × Energy × Space × Quantum

Field notes from 2047: the year the machines carried us

By 2047, almost everything runs itself, and almost no one remembers how. The labs run themselves too — AI proposes the experiments, then judges the results. The fusion plant outside town needs no fuel deliveries, only its great magnets. A probe built by MIT's grandchildren is already past Jupiter on a thimble of green propellant. In the clinic, a printed cell brews the week's medicine from a recipe someone downloaded. The robots that pack, cook and care no longer merely see — they feel, handling a newborn or a fraying wire with a fingertip's judgement. Even the quantum machines, once hopelessly error-prone, now quietly check their own work.

Maya oversees the city's research institute, or rather, she watches it run. The machines' results are right more often than any human could manage. The unease isn't that they fail. It's that they so rarely do, and that when one finally does, the people who once knew how to catch it stopped looking years ago. We built a world that carries us. The question nobody schedules time for: what happens the day it stumbles, and we've forgotten how to walk?

Built from today's signals: AI that runs the lab (Signal 1) · robots that feel, not just see (Signal 2) · a cell built from a recipe (Signal 3) · fusion power held by giant magnets (Signal 4) · a tiny craft on a long-haul engine (Signal 5) · quantum machines that check their own work (Signal 6).

⚑ Science fiction, not news — a 2040s scenario, not a 2026 product.

Quick answers

What does it mean for AI to "do science"?

It's the difference between an AI that answers questions and one that does the actual work of research. "AI for science" means handing a model a messy, real dataset and a research question and asking it to choose the right analysis, spot bad data, and reach a conclusion a scientist could act on: the judgement part, not just recalling facts. That's what OpenAI's new GeneBench-Pro test tries to measure, and it's the same goal behind US programs like AI Forge (DARPA and the NSF) and FASST (the Energy Department). The current scoreboard is humbling and striking at once: the best model solves fewer than one in three research-grade genetics problems, but that's up from fewer than one in twenty a few months ago. The gap that remains is telling: today's models often notice a flaw in the data but fail to act on it, which is exactly why a human still has to sign off on the result.

What does it mean to "build a living cell from scratch"?

Normally, every living cell is a copy of an earlier one — life comes from life. "Building a cell from scratch" means starting with a pile of non-living ingredients — the molecules that carry instructions, the machinery that reads them, and a protective bubble to hold it all — and coaxing them to behave like a real cell: taking in food, growing, and splitting in two. The claim in today's signal is that a lab did exactly that with a couple of hundred parts and an unusually tiny set of genetic instructions. If it's genuine, it's a landmark, because a made-from-parts cell could one day be programmed like software to manufacture drugs or materials. The catch is huge: the work hasn't passed peer review, a top journal turned it down, and experts don't even agree the result is truly alive — so treat it as a claim to watch, not a settled fact.

Why can't today's quantum computers just fix their own errors?

Today's quantum computers are powerful but jittery: the tiny quantum bits that store information are so delicate that stray heat and vibration constantly nudge them into mistakes. Fixing those mistakes as they happen — "error correction" — is possible in theory, but it takes thousands of extra bits working together to protect a single good one, and today's machines don't have nearly enough. So researchers lean on the cheaper half: "error detection." Instead of repairing a mistake, the computer runs extra checks that flag when a calculation has gone wrong, so you can simply throw that bad run away and keep the good ones. Today's signal is a free software tool that adds those checks automatically. It doesn't make a quantum computer more powerful — it makes the answers you get out of today's shaky machines more trustworthy, which is what's needed while the hardware catches up.
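
To make "throw the bad run away" concrete, here is a toy sketch in Python. It is a classical stand-in, not the Qiskit Paulice API and not a real quantum simulation: the function name, the two-bit redundancy scheme and the 10% flip rate are all our own assumptions for illustration. The intended answer is stored in two noisy bits; if the bits disagree, the check flags the run and it is discarded.

```python
import random

def simulate_post_selection(shots, flip_prob, seed=0):
    """Toy error detection: keep only runs whose two redundant bits agree.

    Hypothetical illustration; every name and number here is an assumption.
    """
    rng = random.Random(seed)
    raw_correct = kept = kept_correct = 0
    for _ in range(shots):
        # The intended answer is 1, stored twice; noise flips each copy independently.
        a = 1 ^ int(rng.random() < flip_prob)
        b = 1 ^ int(rng.random() < flip_prob)
        raw_correct += (a == 1)       # accuracy if we trusted every run
        if a == b:                    # check passes: nothing flagged
            kept += 1
            kept_correct += (a == 1)  # still wrong if both copies flipped
    return {
        "raw_accuracy": raw_correct / shots,
        "kept_fraction": kept / shots,
        "kept_accuracy": kept_correct / kept if kept else None,
    }

result = simulate_post_selection(shots=20000, flip_prob=0.1, seed=1)
print(result)
```

The trade-off is visible in the output: the runs that survive are right more often than the raw runs, but a share of the work is thrown away, and a run where both copies flip slips through unflagged. Detection cleans up results; it does not repair them.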

"Who benefits" names companies logically tied to each story — information to help you follow the money, not investment advice. Health items are general information, not medical advice.
