Synthetic Data

1. What Is Synthetic Data?

In the most common technical sense, synthetic data is data generated by artificial systems rather than measured directly from the world. Models, rules, simulations, and generative systems produce it to approximate the statistical, structural, or semantic properties of real datasets.1

Typical characteristics:

Not collected through direct observation or sensing.
Generated by algorithms, simulations, or procedural rules.
Intended to stand in for or supplement empirical data.
Often used where real data is scarce, sensitive, expensive, or impossible to obtain.

In applied fields, researchers and users often evaluate synthetic data by how closely it reproduces the distributions and correlations of real data, or by how well it supports downstream tasks such as training machine learning models.

This instrumental definition treats synthetic data primarily as a substitute or proxy.
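As an illustration of this instrumental view, the sketch below compares a synthetic sample with a "real" one using the two-sample Kolmogorov-Smirnov statistic, one common measure of distributional closeness. All values are hypothetical and drawn from Gaussians purely for demonstration. The function name `ks_statistic` is an assumption of this example, not part of any workflow described here.

```python
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


rng = random.Random(0)
# Hypothetical "real" measurements and two synthetic stand-ins.
real = [rng.gauss(10.0, 2.0) for _ in range(500)]
close_synthetic = [rng.gauss(10.0, 2.0) for _ in range(500)]
shifted_synthetic = [rng.gauss(14.0, 2.0) for _ in range(500)]

print("KS close:  ", round(ks_statistic(real, close_synthetic), 3))
print("KS shifted:", round(ks_statistic(real, shifted_synthetic), 3))
```

A smaller statistic means the synthetic sample is harder to tell apart from the real one on this single variable. It says nothing about what the data enacts or excludes, which is the concern of the sections that follow.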

Recent scholarship emphasises that such data are meaningful and politically charged rather than neutral stand‑ins, raising questions of responsibility, justice and governance in data ecosystems.

For more‑than‑human and Deep Design Lab (DDL) work, this is not an abstract concern. Environmental data are always partial, mediated and world‑making rather than merely representational, and nonhuman entities are treated as active participants. Within this frame, “data” is already synthetic in an important sense: it is composed, translated and staged. Laser scanning, ecological modelling and game‑based design all enact worlds rather than simply capturing them.

This note therefore reframes synthetic data as a more‑than‑human worlding practice.

2. Alternative Understandings of Synthetic Data

Beyond this pragmatic view, synthetic data can be understood in several distinct and sometimes conflicting ways.

2.1 Synthetic as Model-Based World Making

In simulation sciences, synthetic data is not merely a copy of reality but the outcome of an explicit theory about how the world works. Climate models, agent-based models, traffic simulations, or crowd dynamics generate data that expresses assumptions, simplifications, and causal hypotheses.

Here, synthetic data is epistemic:

It makes theoretical commitments visible.
It encodes causal stories rather than just observations.
It produces futures, counterfactuals, and unrealized possibilities.

The synthetic is not opposed to the real but to the unexamined.

2.2 Synthetic as Infrastructural Artifact

From an STS perspective, synthetic data is an infrastructural object. It emerges from institutional needs such as privacy regulation, benchmarking, or standardization.

In this view:

Synthetic data stabilizes categories.
It normalizes certain behaviours or bodies.
It quietly governs what can be known and optimized.

Synthetic datasets do not just represent the world. They participate in shaping it by defining what counts as valid input.

2.3 Synthetic as Performative Data

We can also understand synthetic data performatively. When practitioners use synthetic data to train systems that act in the world, synthetic data feeds back into reality.

Examples include:

Synthetic faces shaping facial recognition norms.
Synthetic mobility data influencing urban planning.
Synthetic user behaviour shaping platform design.

Here, synthetic data does not follow reality. Reality begins to follow synthetic data.

2.4 Synthetic as Speculative Material

In design research, architecture, and speculative practice, synthetic data can function as a medium for exploration rather than prediction.

Used this way, synthetic data:

Explores possible worlds rather than probable ones.
Foregrounds uncertainty and contingency.
Operates closer to fiction, scenario, or provocation.

This aligns synthetic data with critical design and speculative realism rather than engineering optimization.

3. Connection to More-Than-Human Research

There is a strong and underexplored connection between synthetic data and more-than-human research.

3.1 Decentring the Human Observer

More-than-human research challenges the assumption that data must originate in human perception or intention. Synthetic data extends this decentring by:

Allowing nonhuman agents to generate data.
Modelling environments as active participants.
Treating agency as distributed rather than intentional.

In agent-based models, ecological simulations, or robotic swarms, data emerges from interactions rather than observation. This resonates with more-than-human ontologies.
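A minimal sketch can make this concrete. In the toy agent-based model below, agents take random walks on a wrapped grid, and the only "data" produced is a log of encounters between them. Nothing in that log is observed from the world. The grid size, agent count, movement rule, and the name `simulate_encounters` are all assumptions chosen for illustration.

```python
import random


def simulate_encounters(n_agents=20, grid=10, steps=50, seed=1):
    """Random-walk agents on a wrapped grid; returns encounter records.

    Each record is (step, cell, ids of agents sharing that cell).
    """
    rng = random.Random(seed)
    positions = [(rng.randrange(grid), rng.randrange(grid)) for _ in range(n_agents)]
    records = []
    for step in range(steps):
        moved = []
        for x, y in positions:
            dx, dy = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)])
            moved.append(((x + dx) % grid, (y + dy) % grid))
        positions = moved
        # Group agents by cell; a shared cell counts as an encounter.
        by_cell = {}
        for agent_id, cell in enumerate(positions):
            by_cell.setdefault(cell, []).append(agent_id)
        for cell, ids in by_cell.items():
            if len(ids) > 1:
                records.append((step, cell, tuple(ids)))
    return records


encounters = simulate_encounters()
print(len(encounters), "encounter records generated")
```

Changing the movement rule or the definition of an encounter changes which interactions become data at all, which is the point made above about distributed agency.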

3.2 Synthetic Data as Multispecies Translation

Synthetic data often operates as a translation layer between human epistemic systems and nonhuman processes.

Examples:

Ecological models translating animal behaviour into numerical agents.
Climate simulations translating atmospheric dynamics into grids and parameters.
Urban simulations translating human and nonhuman flows into interacting entities.

These translations are not neutral. They determine which nonhuman agencies become legible and which agencies remain excluded.

3.3 Worlding Rather Than Representing

More-than-human research often shifts from representation to worlding. Synthetic data fits this shift well.

Instead of asking whether synthetic data accurately represents reality, the more-than-human question becomes:

What worlds this data enacts.
Which relations are made possible.
Which beings are amplified or silenced.

Synthetic data participates in composing worlds where humans are only one actor among many.

3.4 Ethical and Political Implications

From a more-than-human perspective, synthetic data raises distinctive ethical questions:

Whose lifeworlds are modelled.
Which forms of agency are simplified or erased.
How nonhuman temporalities and rhythms are compressed into human scales.

These questions differ from standard concerns about bias (see Bias) or accuracy. They concern ontological inclusion and exclusion.

4. Relevance to the Deep Design Lab Research Orientation

A recurring concern in DDL work is that environmental knowledge is always partial, mediated, and world-making rather than representational in a narrow sense. Whether the focus is trees, multispecies urbanism, or computational environments, the work resists the idea that data simply records a pre-existing reality.

Key aspects that matter here:

Designed apparatuses produce knowledge.
Nonhuman entities are active participants, not passive objects.
Computational models are ontological commitments, not neutral tools.
Uncertainty and incompleteness are not failures but conditions of engagement.

Within this frame, data is already synthetic in the sense that human stakeholders, collectives or machines compose, translate, and stage it. Laser scanning itself is not raw capture but a highly structured way to make trees legible to technical systems.

So synthetic data is close to the default condition in more-than-human approaches.

4.1 Are Statistical Extrapolations from Scanned Trees Synthetic Data?

4.1.1 What Is Being Synthesized

When you laser scan a limited number of trees, you obtain:

High resolution geometric and material traces.
Data tied to specific individuals, locations, and moments.
Measurements produced through a specific sensor logic.

When you statistically extrapolate from this sample, you generate:

Trees that were never scanned.
Forms that are not directly observed.
Distributions of branching, volume, biomass, or growth patterns.

Researchers do not measure these extrapolated trees directly. They infer, generate, and model them. That makes them synthetic.

However, they are not fictional in a loose sense. Empirical data, ecological theory, and modelling choices constrain them. This places them in a category that might be called empirically anchored synthetic data.
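One way such extrapolation might look in its simplest form is sketched below. It fits a log-normal distribution to trunk diameters from a handful of scanned trees and then draws diameters for trees that were never scanned. The diameter values, the choice of a log-normal model, and the function names are assumptions made for illustration. Real workflows would involve many more variables and ecological constraints.

```python
import math
import random
import statistics

# Hypothetical trunk diameters (cm) from a small scanned sample.
scanned_dbh_cm = [34.0, 41.5, 52.0, 28.5, 63.0, 47.5, 39.0, 71.0]


def fit_lognormal(values):
    """Estimate log-normal parameters as mean and stdev of log-values."""
    logs = [math.log(v) for v in values]
    return statistics.mean(logs), statistics.stdev(logs)


def generate_trees(values, count, seed=0):
    """Draw diameters for unscanned trees from the fitted distribution."""
    mu, sigma = fit_lognormal(values)
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(count)]


synthetic_dbh_cm = generate_trees(scanned_dbh_cm, count=200)
print("scanned median:  ", statistics.median(scanned_dbh_cm))
print("synthetic median:", round(statistics.median(synthetic_dbh_cm), 1))
```

The generated diameters are anchored in the scanned sample through the fitted parameters, yet every one of them describes a tree that no sensor recorded. The modelling choice (here, a log-normal shape) is part of what the synthetic trees carry.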

4.1.2 Why This Is Not Merely Interpolation

It is tempting to describe this as interpolation or estimation, but that language hides an ontological shift.

Researchers use extrapolated trees to:

Populate simulations.
Estimate ecosystem services.
Drive design decisions.
Visualize futures or absences.

These trees become operative entities. They act within models and influence outcomes. At that point, they function exactly as synthetic data does in machine learning or simulation sciences.

They are not just filling gaps. They are worlding additional trees into existence within the research apparatus.

4.2 Why This Matters for More-Than-Human Research

This is where the connection becomes especially strong.

4.2.1 From Representing Trees to Enacting Forests

In a more-than-human framework, the key question is not whether extrapolated trees are accurate representations of real trees.

The question becomes:

What kinds of trees are allowed to exist in the model.
Which aspects of arboreal life become legible.
Which relations are preserved or erased.

A statistically generated tree is a compressed translation of a more-than-human being into computational form. That translation has consequences.

For example:

Growth becomes geometry.
Health becomes parameters.
Agency becomes interaction rules.

This is not a flaw. It is the condition under which trees can participate in digital environments at all.

4.2.1.1 Synthetic Data as Multispecies Negotiation

Seen this way, extrapolated trees are a negotiated compromise between:

Arboreal lifeworlds.
Sensing technologies.
Statistical reasoning.
Architectural and urban questions.

Synthetic data is the site where this negotiation happens.

Rather than opposing synthetic to real, Roudavski’s work invites us to ask whether the synthesis is careful, situated, and ethically attentive to more-than-human difference.

4.2.1.2 Productive Incompleteness

An important alignment is that these extrapolations are knowingly incomplete.

They do not claim to exhaust what a tree is. They acknowledge:

Missing temporalities.
Unmodeled relations.
Forms of agency that escape quantification.

This aligns with Roudavski’s emphasis on humility, partiality, and openness in environmental modelling. Synthetic data here is not about control but about participation under constraint.

4.2.2 Are Ecological Games Synthetic Data?

4.2.2.1 Epistemic Objects

Mould Racing interprets the Merri Creek game through epistemic objects (colonies, designers, fragmentation, recruitment). Epistemic objects are deliberately incomplete, open research constructs that catalyse collaboration.6 3 4 5

Examples:

Colony: interactions between species, environment, and players are modelled as properties that self‑organise into colonies. Distinct colony shapes and behaviours (rings, clumps, linear formations) prompt players to notice ecological patterns.
Designer: players, as designers, can only influence colonies indirectly, by moving and fertilising cells. This shifts perception from designer‑as‑controller to designer‑as‑steward who sets conditions but cannot control outcomes.
Fragmentation: players racing road‑bound colonies experience linear habitats as vulnerable to blockages, leading to discussions about fragmentation and resilience (edges, gaps, and their effects).
Recruitment: a player racing a fast‑reproducing species finds it exhausting to maintain adequate fertilisation, discovering the importance of recruitment without prior knowledge of the ecological term.

The game thus produces synthetic ecological data, such as colonies, growth patterns, and trajectories, that are grounded in empirical site measurements but extended through simulation and embodied play. These synthetic data, in turn, generate synthetic understandings of ecological concepts among designer‑participants.

4.2.2.2 Boundary Objects

PocketPedal, a smartphone game simulating a two‑kilometre segment of St Kilda Road in Melbourne, is another example.7

Core elements:

Environment: a stylised but recognisable model of the road, including lanes, intersections, tramways, buildings, and landmarks, informed by traffic counts, crash statistics, and design reports.
Agents: automated vehicles of various types with probabilistic behaviour ranging from cautious to aggressive; parked cars that can open doors into the bicycle lane; traffic densities matching real conditions.
Avatar and controls: the player controls a cyclist avatar via simple touch gestures, aiming to reach the city or maximise score without serious collision.
Scoring: the score accumulates with distance; a quality multiplier increases when the player stays in cycling provisions and decreases when leaving them or approaching hazards, making some design configurations feel risky or frustrating.
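The scoring element can be sketched as follows. This is a hypothetical reconstruction of the rule as described in the list above, not the game's actual implementation. The step structure, the multiplier bounds, and the `gain` and `loss` rates are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class RideStep:
    distance_m: float      # distance covered in this step
    in_bike_lane: bool     # rider is inside a cycling provision
    near_hazard: bool      # rider is close to a vehicle or opening door


def score_ride(steps, gain=0.1, loss=0.2, min_mult=0.5, max_mult=3.0):
    """Accumulate distance weighted by a quality multiplier.

    The multiplier rises while the rider stays in cycling provisions and
    falls when the rider leaves them or approaches a hazard.
    """
    multiplier = 1.0
    score = 0.0
    for step in steps:
        if step.in_bike_lane and not step.near_hazard:
            multiplier = min(max_mult, multiplier + gain)
        else:
            multiplier = max(min_mult, multiplier - loss)
        score += step.distance_m * multiplier
    return score


safe_ride = [RideStep(10.0, True, False)] * 5
risky_ride = [RideStep(10.0, False, True)] * 5
print(score_ride(safe_ride), score_ride(risky_ride))
```

Under these assumed rates, two rides of equal length score differently depending on where the rider travels, which is how a rule of this kind could make particular road configurations feel rewarding or punishing.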

The game is embedded in collaborative-design workshops with diverse stakeholders (cyclists, drivers, planners, health workers). These workshops use patterns such as “nested play”, “parallel play”, and “distributed brain”, in which game sessions and social interactions co‑produce shared understandings and exchanges about cycling infrastructure.

Here again, synthetic data is central:

synthetic traffic sequences, collisions, and near‑misses generated from rule‑based models and real statistics;
recorded player trajectories, crash types, and scores gathered during play and discussion.

These datasets support the construction of boundary objects8 that different stakeholders can use and contest simultaneously.

5. A Concise Positioning

Statistical extrapolations from laser scanned trees can be understood as:

Synthetic data grounded in situated measurement.
Speculative yet disciplined worlding.
A method for allowing nonhuman entities to act within computational research.

In this sense, synthetic data is not a secondary or compromised form of data. For more‑than‑human approaches and DDL work, it is a necessary mode of engagement with realities that cannot be fully captured, only carefully synthesised. The central question becomes not whether the data is synthetic but whether the synthesis is careful, situated and ethically attentive to more‑than‑human difference and ecological justice.2

6. Additional Issues

Model collapse: the iterative degradation of synthetic populations.9
Indigenous data sovereignty and governance.10
Data refusal.11
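The first issue in this list, model collapse, can be illustrated with a toy sketch. A Gaussian is repeatedly refitted to a small sample drawn from its own previous fit, so each generation sees only synthetic output from the one before. This is a simplified illustration under assumed settings (sample size, number of generations, a single Gaussian), not a reproduction of any published experiment.

```python
import random
import statistics


def recursive_refit(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the fitted standard deviation at each generation, starting
    from a 'real' population with mean 0 and standard deviation 1.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        sample = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # The next generation only ever sees the previous generation's output.
        mu = statistics.fmean(sample)
        sigma = statistics.pstdev(sample)
        history.append(sigma)
    return history


spread = recursive_refit()
print("first:", spread[0], "last:", spread[-1])
```

With small samples, the fitted spread tends to shrink over generations, so the synthetic population loses the variability of the original. For synthetic populations of trees or other organisms, the analogous risk is that rare forms disappear from the model first.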

7. Possible Projects

Synchrotron Project

Goals of the project: investigate the feasibility and limitations of imaging benchmark objects such as large trees for data-driven design and fabrication of prosthetic habitats for wildlife.
Available data: volumetric scans of large, medium and small woody objects produced by the Australian Synchrotron IMBL beamline.
Relevant skills: volumetric reconstruction and processing, comparative evaluation of imaging modalities, numerical assessment of spatial structures, and possible translation of volumetric outputs into downstream design and fabrication workflows.
Support available: existing internal documentation, technical workflows, examples, and advice. High-performance computing resources.
Relevance to the synthetic data research: demonstrates how empirical measurements constrain understanding of ecological baselines and can be extrapolated beyond descriptions and sample models into generalisations. It also explores the roles of hypotheses and algorithms, including their use and reappropriation across domains.

Related references:

Roudavski, Stanislav, and Julian Rutten. “Towards More-than-Human Heritage: Arboreal Habitats as a Challenge for Heritage Preservation.” Built Heritage 4, no. 4 (2020): 1–17. https://doi.org/10.1186/s43238-020-00003-9.

Intelligent Lighting Networks Project

Goals of the project: develop intelligent lighting networks that are aware of environments and inhabitants, can react to unfolding events, predict future patterns, and optimise performance. Extend this work toward stakeholder-viable deployment that treats urban lighting as an interspecies culture and interspecies design problem.
Available data: case-study datasets and analyses covering site geography, geometry, use, and ecology, plus a virtual environment for simulation, assessments of correspondence between virtual and real experiences, designed site-specific networks, and tested use-case scenarios.
Relevant skills: simulation environment development and evaluation, spatial and ecological data analysis, design of networked system behaviours and controls, and stakeholder-facing research methods for multi-stakeholder deployment contexts.
Support available: documented project notes, funding proposals, presentations, collaboration, and advice.
Relevance to the synthetic data research: provides a concrete case of synthetic data as infrastructural and performative, where simulation outputs can stabilise categories, define what can be known and optimised, and feed back into real-world decisions about lighting and ecological conditions.

Near-Surface Ecologies Project

Goals of the project: develop a workflow that documents and classifies near-surface micro-habitats and their bryophytes, then extracts numerical parameters from scans to inform interventions that support near-surface ecologies, with mosses used as a key indicator species.
Available data: photography and scanning of bryophytes in urban and natural environments, typologies of urban surface conditions, and scan-derived numerical parameters including density, porosity, self-shading, and measures of complexity.
Relevant skills: field documentation and capture workflows (photography, surface scanning, volumetric scanning), classification methods, computational analysis for parameter extraction, and ecological interpretation of what is translated into typologies and parameters.
Support available: a defined project scaffold and an articulated description of the project’s intent.
Relevance to the synthetic data research: operationalises synthetic data as multispecies translation and worlding by turning scans and typologies into synthetic representations that determine what nonhuman agencies become legible or excluded.

Seymour Tree Scans

Goals of the project: extend understanding of arboreal habitat structures through analysis of scans of forest stands.
Available data: terrestrial laser scans of forest stands. Records of hollows including occupancy data. GIS files of the site including trees. Photography, before and after states.
Relevant skills: terrestrial 3D capture processing, point cloud or mesh analysis, quantitative feature extraction, and statistical modelling for extrapolation where scan-informed models may generate trees or distributions beyond those directly scanned.
Support available: monitoring at the site is ongoing, and advice and collaboration are available.
Relevance to the synthetic data research: advances the theme’s concept of empirically anchored synthetic data by moving from measured trees to inferred and statistically generated trees that become operative entities in modelling, monitoring interpretation, and design decision-making.

References

Bryson, Mitch, Feiyu Wang, and James Allworth. “Using Synthetic Tree Data in Deep Learning-Based Tree Segmentation Using LiDAR Point Clouds.” Remote Sensing 15, no. 9 (2023): 2380. https://doi.org/10.3390/rs15092380.

Capasso, Marianna. “Synthetic Data as Meaningful Data. On Responsibility in Data Ecosystems.” Big Data & Society 12, no. 4 (2025): 20539517251386053. https://doi.org/10.1177/20539517251386053.

Carroll, Stephanie Russo, Ibrahim Garba, Oscar L. Figueroa-Rodríguez, Jarita Holbrook, Raymond Lovett, Simeon Materechera, Mark Parsons, et al. “The CARE Principles for Indigenous Data Governance.” Data Science Journal 19, no. 1 (2020). https://doi.org/10.5334/dsj-2020-043.

Fassnacht, Fabian Ewald, Hooman Latifi, and Florian Hartig. “Using Synthetic Data to Evaluate the Benefits of Large Field Plots for Forest Biomass Estimation with LiDAR.” Remote Sensing of Environment 213 (2018): 115–28. https://doi.org/10.1016/j.rse.2018.05.007.

Harms, Philip, Neelakshi Joshi, and Stefan Knauß. “Designing Multispecies Role-Playing Games: From Human-Nature Partnerships towards Multispecies Justice.” Npj Urban Sustainability 5, no. 1 (2025): 68. https://doi.org/10.1038/s42949-025-00257-1.

Holland, Alexander, Philip Gibbons, Jason Thompson, and Stanislav Roudavski. “Modelling and Design of Habitat Features: Will Manufactured Poles Replace Living Trees as Perch Sites for Birds?” Sustainability 15, no. 9 (2023): 7588. https://doi.org/10.3390/su15097588.

Holland, Alexander, Philip Gibbons, Jason Thompson, and Stanislav Roudavski. “Terrestrial Lidar Reveals New Information about Habitats Provided by Large Old Trees.” Biological Conservation 292 (2024): 110507. https://doi.org/10.1016/j.biocon.2024.110507.

Holland, Alexander, and Stanislav Roudavski. “Mobile Gaming for Agonistic Design.” In Fifty Years Later: Revisiting the Role of Architectural Science in Design and Practice: The 50th International Conference of the Architectural Science Association, edited by Jian Zuo, Lyrian Daniel, and Veronica Soebarto, 299–308. Adelaide: Architectural Science Association, 2016. https://doi.org/10.31219/osf.io/vy5dq.

Jacobsen, Benjamin N. “Machine Learning and the Politics of Synthetic Data.” Big Data & Society 10, no. 1 (2023): 20539517221145372. https://doi.org/10.1177/20539517221145372.

LaDeau, Shannon L., Barbara A. Han, Emma J. Rosi-Marshall, and Kathleen C. Weathers. “The next Decade of Big Data in Ecosystem Science.” Ecosystems 20, no. 2 (2017): 274–83. https://doi.org/10.1007/s10021-016-0075-y.

Michener, William K., and Matthew B. Jones. “Ecoinformatics: Supporting Ecology as a Data-Intensive Science.” Trends in Ecology & Evolution 27, no. 2 (2012): 85–93. https://doi.org/10.1016/j.tree.2011.11.016.

Mirra, Gabriele, Alexander Holland, Stanislav Roudavski, Jasper Wijnands, and Alberto Pugnale. “An Artificial Intelligence Agent That Synthesises Visual Abstractions of Natural Forms to Support the Design of Human-Made Habitat Structures.” Frontiers in Ecology and Evolution 10 (2022): 806453. https://doi.org/10.3389/fevo.2022.806453.

Mitrokhov, Konstantin. “Between World Models and Model Worlds: On Generality, Agency, and Worlding in Machine Learning.” AI & Society 40, no. 6 (2025): 5087–99. https://doi.org/10.1007/s00146-024-02086-9.

Ravn, Louis. “Towards Synthetic Data Justice for Development: A Case Study of Synthetic Datasets on Human Trafficking.” Big Data & Society 12, no. 4 (2025): 20539517251381670. https://doi.org/10.1177/20539517251381670.

Roudavski, Stanislav, Alexander Holland, and Julian Rutten. “Data Games for Ecological Design.” In Learning, Prototyping and Adapting, Short Paper Proceedings of the 23rd International Conference on Computer-Aided Architectural Design Research in Asia (CAADRIA), edited by Weixin Huang, Mani Williams, Dan Luo, and YiXin Wu, 115–20. Hong Kong: CAADRIA, 2018. https://doi.org/10.5281/zenodo.1319861.

Rüegg, Janine, Corinna Gries, Ben Bond-Lamberty, Gabriel J. Bowen, Benjamin S. Felzer, Nancy E. McIntyre, Patricia A. Soranno, Kristin L. Vanderbilt, and Kathleen C. Weathers. “Completing the Data Life Cycle: Using Information Management in Macrosystems Ecology Research.” Frontiers in Ecology and the Environment 12, no. 1 (2014): 24–30. https://doi.org/10.1890/120375.

Schäfer, Jannika, Hannah Weiser, Lukas Winiwarter, Bernhard Höfle, Sebastian Schmidtlein, and Fabian Ewald Fassnacht. “Generating Synthetic Laser Scanning Data of Forests by Combining Forest Inventory Information, a Tree Point Cloud Database and an Open-Source Laser Scanning Simulator.” Forestry: An International Journal of Forest Research 96, no. 5 (2023): 653–71. https://doi.org/10.1093/forestry/cpad006.

Shumailov, Ilia, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. “AI Models Collapse When Trained on Recursively Generated Data.” Nature 631, no. 8022 (2024): 755–59. https://doi.org/10.1038/s41586-024-07566-y.

Steinhoff, James, and Sam Hind. “Simulation and the Reality Gap: Moments in a Prehistory of Synthetic Data.” Big Data & Society 12, no. 1 (2025): 20539517241309884. https://doi.org/10.1177/20539517241309884.

Susser, Daniel, and Jeremy Seeman. “Critical Provocations for Synthetic Data.” Surveillance & Society 22, no. 4 (2024): 453–59. https://doi.org/10.24908/ss.v22i4.18335.

Footnotes

Böhlen, Marc. On the Logics of Planetary Computing: Artificial Intelligence and Geography in the Alas Mertajati. Abingdon: Routledge, 2025.
Holland, Alexander, and Stanislav Roudavski. “Mobile Gaming for Agonistic Design.” In Fifty Years Later: Revisiting the Role of Architectural Science in Design and Practice: The 50th International Conference of the Architectural Science Association, edited by Jian Zuo, Lyrian Daniel, and Veronica Soebarto, 299–308. Adelaide: Architectural Science Association, 2016. https://doi.org/10.31219/osf.io/vy5dq.
Roudavski, Stanislav, Alexander Holland, and Julian Rutten. “Mould Racing, or Ecological Design through Located Data Games.” In Proceedings of the 24th International Symposium on Electronic Art (ISEA), edited by Rufus Adebayo, Ismail Farouk, Steve Jones, and Maleshoane Rapeane-Mathonsi, 193–200. Durban: Durban University of Technology, 2018. https://doi.org/10.5281/zenodo.1321341.
Knorr Cetina, Karin. “Objectual Practice.” In The Practice Turn in Contemporary Theory, edited by Theodore R. Schatzki, Karin Knorr Cetina, and Eike von Savigny, 148–97. London: Routledge, 2001.
Rheinberger, Hans-Jörg. Toward a History of Epistemic Things: Synthesizing Proteins in the Test Tube. Stanford: Stanford University Press, 1997.
Holland, Alexander, and Stanislav Roudavski. “Mobile Gaming for Agonistic Design.” In Fifty Years Later: Revisiting the Role of Architectural Science in Design and Practice: The 50th International Conference of the Architectural Science Association, edited by Jian Zuo, Lyrian Daniel, and Veronica Soebarto, 299–308. Adelaide: Architectural Science Association, 2016. https://doi.org/10.31219/osf.io/vy5dq.
Leigh Star, Susan. “This Is Not a Boundary Object: Reflections on the Origin of a Concept.” Science, Technology, & Human Values 35, no. 5 (2010): 601–17. https://doi.org/10.1177/0162243910377624.
Björgvinsson, Erling, Pelle Ehn, and Per-Anders Hillgren. “Agonistic Participatory Design: Working with Marginalised Social Movements.” CoDesign 8, nos. 2–3 (2012): 127–44. https://doi.org/10.1080/15710882.2012.672577.
Shumailov, Ilia, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. “AI Models Collapse When Trained on Recursively Generated Data.” Nature 631, no. 8022 (2024): 755–59. https://doi.org/10.1038/s41586-024-07566-y.
Carroll, Stephanie Russo, Ibrahim Garba, Oscar L. Figueroa-Rodríguez, Jarita Holbrook, Raymond Lovett, Simeon Materechera, Mark Parsons, et al. “The CARE Principles for Indigenous Data Governance.” Data Science Journal 19, no. 1 (2020). https://doi.org/10.5334/dsj-2020-043.
Feminist Data Manifest-No