Polkaholic.io’s 2022 XCM Transfers Now in BigQuery Public Datasets: substrate-etl.polkadot.xcmtransfers
Over 2022, while building Polkaholic.io’s multichain indexer, we got incrementally better at indexing XCM Transfers and building a global asset registry. We found a couple dozen groups (parachains, analysis firms, dapp builders, Parity) using Polkaholic.io XCM Transfers API to fetch recent XCM Transfers systematically to do data analysis. We realized that was necessary to do a few things:
- Build a XCM Global Asset Registry (XCM GAR) and make it as complete as possible with a community-driven solution. Michael is leading this effort in this xcm-global-registry repo. Basically, parachains have all kinds of ways to represent their assets and assets of others, and putting this all together is essential to model XCM activity.
- Reanalyze the past robustly. Historically, we have been dependent on traces and full archive nodes, which we needed to match origination and destination chain activity. We’re adjusting our dependence on these by working even when this is not available, and expect that with XCM v3’s context tracking this will be far simpler and result in high confidence analytics.
- Make it easy to access and query historical XCM Transfers data in a substrate-etl, which exposes all the chain data along with XCM Transfers in a Google BigQuery public dataset.
So, instead of doing hundreds of paginated API calls to Polkaholic API and loading it somewhere, you can just dive into an xcmtransfers table of BigQuery. Using the dataset you can generate historical analyses and derive aggregate insights like Subwallet’s Polkadot Deep Dive Q4 2022 or Messari’s State of Polkadot Q3.
We are happy to report that we have “finalized” a 2022 XCM Transfer analysis for everyone to use in substrate-etl! We say “finalized” because we hope to improve it based on your feedback in the coming weeks.
To access the substrate-etl XCM transfers data, you will need a Google Cloud project (see here, yes, you can get started with a free account) and access these 2 tables:
1. substrate-etl.kusama.xcmtransfers
2. substrate-etl.polkadot.xcmtransfers
The project is substrate-etl — and you will need to add it in BigQuery by clicking on “Add Data” and then “Search for and star a project”
Once you have added the project, search for the “xcmtransfers” table and you should be able to find the schema and preview some records.
A single table holds all the activity for the relay chain and all its parachains, so there are two tables (see the left side above) following the same substrate-etl xcmtransfers schema (see the right side above). Each table is day-partitioned by origination_ts (filtering on this makes table scans faster and cheaper). The Polkaholic.io XCM Transfers API has an “XCMInfo” data structure organizing what happens on the “origination” and “destination” which is in an xcm_info field. For maximum BigQuery queriability, we flattened the XCM Info data structure (origination, destination) into a couple of dozen fields starting with origination_ and destination_ (that is, in snake_case). You can sum, avg, min, max most of the floats and filter / group by all of these fields with familiar SQL operations in BigQuery. We will show a bunch of queries on this dataset and ask some basic questions about 2022.
This query shows a quick summary of what has been sent in Polkadot, which enabled XCM Transfers in May 2022:
SELECT EXTRACT(MONTH From DATE(origination_ts)) as month, count(*) num_records,
round(sum(origination_amount_sent_usd), 2) as total_origination_amount_sent_usd,
round(sum(destination_amount_received_usd), 2) as total_destination_amount_received_usd,
round(avg(origination_amount_sent_usd), 2) as avg_origination_amount_sent_usd,
round(avg(destination_amount_received_usd), 2) as avg_destination_amount_received_usd
FROM `substrate-etl.polkadot.xcmtransfers`
WHERE DATE(origination_ts) >= "2022–01–01" and DATE(origination_ts) <= "2022–12–31"
and destination_execution_status = "success"
group by month
order by month;
Note the filter on destination_execution_status = “success” — To make XCM transfer historical analytics easier, unlike the API, we did not include any records where there was a failure to send an XCM Message from the origination chain or the destination_execution_status was known to be a failure. However, we are including “unknown” cases, where our XCM indexing process could not match the sending xcm transfer to some destination event or balance change.
A lot of different kinds of analytics questions can be asked and answered with queries against substrate-etl.{polkadot,kusama}.xcmtransfers like the above. Here are a few of them, along with some notes:
- How many distinct XCM Transfer senders and beneficiaries were there in 2022 in Polkadot vs Kusama?
Senders: 48,478 in Polkadot and 32,533 in Kusama
Query: (for polkadot; for kusama, replace “polkadot” with “kusama”)
SELECT distinct origination_sender_pub_key
FROM `substrate-etl.polkadot.xcmtransfers`
WHERE DATE(origination_ts) >= "2022-01-01" and DATE(origination_ts) <= "2022-12-31"
and destination_execution_status = "success";
Beneficiaries: 38,188 in Polkadot and 21,937 in Kusama
Query: (for polkadot; for kusama, replace “polkadot” with “kusama”)
SELECT distinct destination_beneficiary_pub_key
FROM `substrate-etl.polkadot.xcmtransfers`
WHERE DATE(origination_ts) >= "2022-01-01" and DATE(origination_ts) <= "2022-12-31"
and destination_execution_status = "success";
Note: we include both origination_sender_pub_key and origination_sender_ss58 (similarly, destination_beneficiary_pub_key and destination_beneficiary_ss58) to support multichain and single chain queries. So, even if there is one account sending with different origination_sender_ss58 addresses across multiple parachains, all these different ss58 addresses have one underlying origination_sender_pub_key — so, to dedup, we use the pub_key version.
2. How many distinct assets were transferred in Polkadot and Kusama and how much was transferred?
The Polkaholic.io indexer uses the XCM Global Asset Registry to map local asset representations into Multilocation, which can be mapped to human-readable symbol that are easy to filter and group on across the entire ecosystem. Even though you can count on evil actors to register imposter symbols in a specific parachain, you can count on parachain engineers (better: parachain goverance) to never register those assets in the parachain registry. So at present, assets have a clean 1:1 mapping with symbol.
Answer: Polkadot had 16 assets; Kusama had 33 assets
Query: (for polkadot; for kusama, replace “polkadot” with “kusama”)
SELECT symbol,
round(sum(origination_amount_sent), 2) as total_origination_amount_sent,
round(sum(destination_amount_received), 2) as total_destination_amount_received,
round(sum(origination_amount_sent_usd), 2) as total_origination_amount_sent_usd,
round(sum(destination_amount_received_usd), 2) as total_destination_amount_received_usd,
round(avg(origination_amount_sent_usd), 2) as avg_origination_amount_sent_usd,
round(avg(destination_amount_received_usd), 2) as avg_destination_amount_received_usd,
count(*) num_records
FROM `substrate-etl.polkadot.xcmtransfers`
WHERE DATE(origination_ts) >= "2022-01-01" and DATE(origination_ts) <= "2022-12-31"
and destination_execution_status = "success"
group by symbol
order by total_origination_amount_sent_usd desc, num_records desc;
You can see that some assets don’t have USD asset values. We try to model this with a combination of Coingecko and Polkadot’s very own defi chains. But to do this well is quite challenging —very often liquidity pools are illiquid, stablecoin (AUSD) got depegged due to error mints, and coingecko doesn’t have a value for every parachain token (e.g TUR, TNKR, etc.). And in a few cases, we have some work to do.
3. How many distinct parachains are sending XCM Transfers in Polkadot vs Kusama?
Answer: 11 on Polkadot and 28 on Kusama
Query: (for polkadot; for kusama, replace “polkadot” with “kusama”)
SELECT origination_para_id, origination_chain_name,
round(sum(origination_amount_sent), 2) as total_origination_amount_sent,
round(sum(destination_amount_received), 2) as total_destination_amount_received,
round(sum(origination_amount_sent_usd), 2) as total_origination_amount_sent_usd,
round(sum(destination_amount_received_usd), 2) as total_destination_amount_received_usd,
round(avg(origination_amount_sent_usd), 2) as avg_origination_amount_sent_usd,
round(avg(destination_amount_received_usd), 2) as avg_destination_amount_received_usd,
count(*) num_records
FROM `substrate-etl.polkadot.xcmtransfers`
WHERE DATE(origination_ts) >= "2022-01-01" and DATE(origination_ts) <= "2022-12-31"
and destination_execution_status = "success"
group by origination_para_id, origination_chain_name
order by total_origination_amount_sent_usd desc, num_records desc;
To make analytics easy for everyone, we model the relay chain with a origination_para_id (and destination_para_id) of 0. We include human readable chain names in origination_chain_name (and destination_chain_name) and origination_id (and destination_id) so no one has to lookup parachain IDs unnecessarily.
4. How many distinct pallet sections/methods were used for sending XCM Transfers in Polkadot vs Kusama?
Answer: 20 on Polkadot, 25 in Kusama
Query: (for polkadot; for kusama, replace “polkadot” with “kusama”)
SELECT origination_section, origination_method, count(*) num_records
FROM `substrate-etl.polkadot.xcmtransfers`
WHERE DATE(origination_ts) >= "2022-01-01" and DATE(origination_ts) <= "2022-12-31"
and destination_execution_status = "success"
group by origination_section, origination_method
order by count(*) desc
5. How many XCM Transfers happened in 2022?
This question is ill-posed because Substrate extrinsics+XCM has immense flexibility in what can be done in a single extrinsic call —eg utility:batchAll can issue multiple XCM transfers (each sending XCM Messages) while xTokens:transferMultiasset can issue multiple assets — and an XCM message can contain an XCM within it — remember, at the end of the day receiving chains are actually processing XCM instructions in a XCVM, not special purpose instructions like “process xTokens:transferMultiasset”. Take this defining example:
If you have a single extrinsic that results in 2 messages where the first message M1 results in sending 2 assets (say, KSM+RMRK) and the second message M2 sends 3 assets (say, KSM+USDT+SDN), how many XCM transfers do you have — 1 (the single extrinsic), 2 (2 messages M1+M2), 4 (KSM+RMRK+USDT+SDN) or 5?
Because a single extrinsic can:
- construct one or more XCM messages;
- have each XCM Message contain one or more assets
we index XCM Transfers in this way:
- having origination_xcm_index vary across each XCM Message initiated by a single extrinsic — so for the above example this is 0 for M1 and 1 for M2.
- having origination_transfer_index vary across each asset within each message — so for the above example this has origination_transfer_index varying (KSM+RMRK) for origination_xcm_index=0 and 0+1+2 for for 5 assets
So for our single extrinsic, there are actually 5 records in the table. Having indexed XCM message sending extrinsics in the above, we can then run a query like this:
SELECT symbol, origination_xcm_index, origination_transfer_index, count(*)
FROM `substrate-etl.kusama.xcmtransfers`
WHERE DATE(origination_ts) >= "2022-01-01" and DATE(origination_ts) <= "2022-12-31"
and destination_execution_status = "success"
and origination_xcm_index > 0 or origination_transfer_index > 0
group by symbol, origination_xcm_index , origination_transfer_index
order by count(*) desc;
5 (modified). How many distinct extrinsicIDs XCM Transfers successfully completed in 2022 across Polkadot and Kusama?
Answer: 157,305 in Polkadot and 140,609 in Kusama
SELECT distinct origination_extrinsic_id
FROM `substrate-etl.polkadot.xcmtransfers`
WHERE DATE(origination_ts) >= "2022-01-01" and DATE(origination_ts) <= "2022-12-31"
and destination_execution_status = "success" ;
Note again that we included “successfully completed” in our modified question as we only have included successfully completed — there . Still, where are many caveats to this result — unlike blockchains, indexers of blockchains get different results based on what they have modeled:
- Polkaholic can’t model what there is no public archive RPC Endpoint for. Our starting point is the same list as polkadot.js/apps, which we also use for XCM Global Asset Registry.
- Polkaholic only indexes what it has an xcm asset model for, which is represented in the XCM Global Asset Registry output. Many parachains have an incomplete asset registry, or incomplete xcm asset registry, and as a side-effect of this XCM Global Asset Registry it is easy to see what is incomplete.
- Polkaholic only knows XCM transfers initiated by structured extrinsics right now — but it is possible to roll your own XCM message and use “polkadotXCM.send” — while we index XCM Messages (and include this in origination_xcm_msg_hash) we did not fully model those yet. While XCM has a seemingly simple instruction set, it is non-trivial to model whether XCM transfers are happening in a generic XCM Message, as opposed to paying for remote execution. It might be too hard to get 100%, but heuristics may be devised to get most of these in practice.
How often are you updating this dataset? What else can be done with this dataset?
We will be updating this dataset with new data every 4–6 hours starting today. (For more real-time, use Polkaholic.io’s XCM Transfers API). If there are updates to XCM Global Asset Registry, new RPC Endpoint, or some data issue reported by the community, we may reprocess the past in bulk, causing destination_execution_status=“unknown” records to become “success”. [As of this moment, we have around 5K “unknown” Polkadot + Kusama records, which we are auditing.] When such an update occurs, instead of querying API hundreds of times, you can just run the same queries very quickly. To make it clear when we have updated xcm_info data (which powers most of the other columns), we have included xcm_info_last_update_time. This should make reasoning about bulk updates easier.
Our examples above were looking across the Polkadot and Kusama ecosystem. There are many many different questions you can ask about your favorite parachain (filtering origination_para_id or destination_para_id), over different time periods, for specific top users, for complex extrinsics, and made even more interesting when looking at complex multichain analysis behavior when combined with other extrinsics. We are finalizing the other public datasets, which you can also see substrate-etl tables and run queries against. XCM transfers are often coupled with defi operations, and doing multichain analysis in BigQuery is quite easy with wildcard operations, but we’ll save that for another post.
Ok, what does the future hold for XCM analysis?
With XCM v3 merged in by Gavin Wood yesterday (we have literally been asking about this since we started building polkaholic.io over the last 15 months!), we can cheerfully say that XCM analysis in 2023 will get better and probably even more exciting as things go well beyond XCM Transfers and outside the Polkadot ecosystem.
We hope XCMv3 will have “context-tracking” to make the matching process super reliable and expect that the GlobalConsensus / network parameters will result in XCM going outside Polkadot / Kusama. We can be sure NFTs will be covered very soon. But it is extremely unlikely that XCM is only about about moving tokens around. We are excited at modeling remote execution well and doing real-time analysis reported through web sockets and with distributed tracing (efforts which we started in 2022 and reduced to practice), and are really looking forward to covering XCM activity outside the Polkadot + Kusama ecosystem. All of this will make for a beautiful interoperable multichain future and are proud to be part of this ecosystem.
In the meantime, we would like your detailed feedback on what is missing in this substrate-etl xcmtransfers dataset:
- do you have assets / extrinsics that are not modelled, or not well modelled in this xcmtransfers dataset?
- do you have an analytics set of questions that is not covered, or not well modelled by this xcmtransfers dataset or substrate-etl more generally?
You can get in touch with us in our Matrix room, find me on Telegram (@sourabhniyogi), or file an issue on Github. We are “finalizing” our preview of the full substrate-etl data where we have all raw data (blocks, extrinsics, events, transfers) generated for every parachain indexed by Polkaholic.io, which we will post about next.
Important: This work is part of Colorful Notion’s Polkaholic.io substrate-etl Treasury Proposal, to be submitted in the coming month.
Acknowledgments: Many thanks to Karim, Pavla and Marek of Parity Data for supporting and encouraging a general substrate-etl effort, Mattias + sub0 workshop participants who encouraged a XCM Global Asset Registry, An of Subwallet and Nick of Messari for probing us with great questions about XCM activity of 2022.