Retrieving historical data from Alpaca/Polygon.io

Hello!

I'm new to QuantRocket and looking to retrieve historical data. So far it seems that only IB is supported for this. Is there a way to pull historical data using either Alpaca or Polygon with the framework? If that's not yet possible, is there a way to sideload custom data?

Thanks a lot!

Check out the Data Library for all available datasets. You are probably looking for the US Stock dataset.

@Brian That doesn't seem to be what we want. We want to use our Alpaca or Polygon key to access the datasets provided by Alpaca and Polygon, which provide historical data, much in the same way IB can be used for this purpose. The US Stock dataset does not have the level of detail we can get from Polygon.

What additional data/level of detail were you looking for, specifically?

1-second resolution for a decade of history across a varied set of stocks.

Shall I take the lack of response to mean QuantRocket is not capable of fulfilling our needs? If that is the case, vendor lock-in on data, and more specifically the lack of Alpaca or Polygon support, will sadly require me to look for a different service and cancel my subscription.

If you find a vendor that will support arbitrary third-party historical data ingestion at 1s resolution for a decade at anything close to QuantRocket's functionality and pricing, let us know.

In other words, without wishing to speak for Brian here, I don't think you quite appreciate how big your request is. For example, my Docker directory with just 1-minute US stock data ingested is 190G. 1s resolution means roughly 60x the bars, so on the order of ~10T. Managing and processing that much data is a different ballgame.

For most traders, minute data seems to provide the best balance of granularity, data manageability, and ability to execute trades. As sub mentions, storage size is a challenge with second data (although the US stock minute bundle is 50-60GB, not 190GB). And more data can mean slower query or computation speeds.

In addition, QuantRocket’s backtesters pair better with minute data. Zipline is hard-wired for minute data, and Moonshot uses cron for scheduling, which has minute granularity.

Second-level data is also harder to execute on. Many stocks are illiquid at that granularity. And there’s less margin for error when it comes to timely order execution with your broker.

A realistic use case for more granular data might be receiving real-time data over a websocket for a small number of symbols in a custom script and sending orders to the blotter. That could be done today. For backtesting, since QuantRocket saves real-time data to a database, you could run the real-time data collection for a few weeks or months and use that for testing and analysis. That won't give you deep backtesting, but usually, the higher the frequency of your trading strategy, the less useful backtesting is anyway, and the more you have to try it in live execution to see if it works.
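For illustration, a minimal sketch of that pattern might look like the following; the feed URL, message schema, and the `submit_order` helper are placeholders rather than actual QuantRocket or vendor APIs (in practice the order would go through the blotter):

```python
# Sketch: consume a real-time websocket feed for a few symbols and react per tick.
# FEED_URL, the message schema, and submit_order() are all placeholders.
import asyncio
import json

import websockets  # pip install websockets

FEED_URL = "wss://example-feed.invalid/stream"  # placeholder vendor endpoint
SYMBOLS = ["SPY", "QQQ"]

def submit_order(symbol, side, qty):
    """Placeholder for order submission via the blotter."""
    print(f"ORDER {side} {qty} {symbol}")

async def run():
    async with websockets.connect(FEED_URL) as ws:
        await ws.send(json.dumps({"action": "subscribe", "symbols": SYMBOLS}))
        async for raw in ws:
            tick = json.loads(raw)
            # Toy signal: buy when a tick prints above a precomputed level.
            if tick.get("price", 0) > 100.0:
                submit_order(tick["symbol"], "BUY", 100)

asyncio.run(run())
```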

Hmm. Why is my Docker directory using 190G of disk? I've only downloaded US stock data and securities listing data from IB and FIGI. The latter covers all IB securities, but it's a table with only ~90,000 rows, so it's negligible compared to the minute data.

We do not need, nor want, to store the entire 10 years at once, though we do need to see windows within that 10-year period to randomize our test sets for backtesting.
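For example, a sketch like this (with an assumed 10-year span and 30-day windows, purely illustrative) is all the window selection we need:

```python
# Illustrative: sample random backtest windows from a 10-year span.
import random
from datetime import date, timedelta

START, END = date(2010, 1, 1), date(2020, 1, 1)
WINDOW = timedelta(days=30)  # assumed window length

def random_window():
    max_offset = (END - START - WINDOW).days
    start = START + timedelta(days=random.randrange(max_offset))
    return start, start + WINDOW

test_windows = [random_window() for _ in range(5)]
```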

The backtesting itself also has no need to run at second resolution, nor do we have any need for trades to execute at second resolution. We trade at intervals as short as 15 minutes on some stocks, and second resolution allows us to execute an order up to a minute sooner than we otherwise could. That gives us a financial advantage, but it doesn't mean we are constantly buying and selling with 1-second lag time.

Real-time data doesn't fill that need either: we do backtest at 1-second resolution, just only over short windows, and the backtester works perfectly fine for that purpose.

The assumption here, which seems to be off, is that our trading strategy itself buys and sells at 1-second resolution, when in fact we just use 1-second resolution to respond more quickly to developing pivot points.

The assumption is not that you are trading at 1s resolution. It's that you need 1s data. Calculate how much space that will take (or ask an existing vendor how big their 1s file is). The vendor can't offer 1s resolution without offering meaningful history across a meaningful universe, even if you personally only use a small portion of it. 10 years of OHLCV data at 1s resolution on a 5,000 instrument universe is over 1 trillion data values.
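Back-of-envelope, assuming ~252 trading days a year and a 6.5-hour regular session:

```python
# Back-of-envelope: 10 years of 1-second OHLCV bars across a 5,000-instrument
# universe, assuming ~252 trading days/year and a 6.5-hour regular session.
instruments = 5_000
years = 10
bars_per_day = int(6.5 * 3600)              # 23,400 one-second bars per session
bars = instruments * years * 252 * bars_per_day
values = bars * 5                           # OHLCV = 5 values per bar
print(f"{bars:,} bars, {values:,} values")  # ~295 billion bars, ~1.5 trillion values
```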

1s resolution for 10 years stored in a fixed-width CSV file (no delimiter) with just the closing price (which is all I use) is about 2 gigs. That's 315 million entries (the number of seconds in a decade) * 8 bytes per entry (floating-point value), or about 2.5 billion bytes in total: roughly 2.52 GB to store 1-second data for 10 years.
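A quick sanity check of that arithmetic (calendar seconds, 8-byte floats, close price only):

```python
# Sanity check: one 8-byte close price per calendar second for a decade.
seconds_per_decade = int(10 * 365.25 * 86_400)   # ~315.6 million seconds
total_bytes = seconds_per_decade * 8             # 8 bytes per float
print(f"{seconds_per_decade:,} entries, {total_bytes / 1e9:.2f} GB")  # ~2.52 GB
```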

Keep in mind I was already storing and working with 1-second data before moving to QR, so I'm well aware of how trivial it is to handle. Even QR supports it through Interactive Brokers; it's just that IB, unlike Polygon, is brutally slow.

Bear in mind I neither want nor need a complete 5,000-instrument universe at 1s intervals. I do need the ability to pull down 1s-interval data as needed, usually only for a handful of stocks at any time.

You don't need second data across 5,000 instruments. But you want QR to provide you second data for the instruments of your choosing, which means QR has to have all instruments available, or enable you to ingest arbitrary third-party second data. And then you want backtesting capabilities on second data, and presumably history management. It's already too easy to blow out memory with minute data. It will be 60x easier with second data, hence it will likely require some kind of out-of-core framework, i.e. forget pandas. Then, if second data is available, people will start trying to use it to analyze large universes, so even if you don't need that, it will require support. It's a big jump to get there from here.

But you said it was trivial, so just write your own framework. Zipline is open source. Extend it to handle second data.

What I want is the ability to sideload data of my choosing from whatever source I want (as I can do from IB) without vendor lock-in on QR... yes, memory constraints could be an issue, but that's on me; there is a reason I have massive memory on my systems.

It appears to me QR wants the best of both worlds: to lock me into their data and lock me out of third-party data sources, yet not be obligated to provide the high-quality data of external sources... that's why it's a deal breaker for me; not allowing arbitrary third-party data is just not going to work.

And yes, developing our own is likely how this will play out; we were doing that before trying QR and will likely have to go that route now that we've learned about the vendor lock-in.

"Vendor lockin" is a mischaracterization. QR does a lot more than provide you a data file. It also curates the data, updates it daily, keeps track of security master integration, makes the data available for backtesting, etc. I suspect that's why they don't yet offer the ability to ingest arbitrary third party data, especially at minute resolutions. Again, that's a lot of functionality to support. I also hope 3rd party ingestion will be supported at some point in the future, but even if it is, I'd be surprised if it came with second resolution support.

Having QR automatically manage all the data and broker interaction for me is already worth the license in my opinion. And they're completely upfront about what data support comes along with the license.

You seem to be forgetting that QR already supports 1-second data from IB. It's brutally slow to fetch because IB severely limits fetch rates compared to Polygon, but clearly QR already supports 1-second resolution; it's doing it with IB, after all.

In fact, all that would be needed is to 1) allow Polygon to populate the master database, and 2) allow fetching from Polygon in addition to IB, in the same manner IB allows it. Beyond that, considering 1s resolution is already supported by the underlying tech and proven with IB, it should be perfectly feasible.
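For reference, pulling 1-second bars from Polygon is a single REST call against their v2 aggregates endpoint; the sketch below assumes a valid key in POLYGON_API_KEY, and the ticker and dates are arbitrary examples:

```python
# Sketch: fetch 1-second aggregate bars from Polygon's v2 aggregates endpoint.
# Assumes a valid key in POLYGON_API_KEY; ticker and dates are arbitrary examples.
import os
import requests

API_KEY = os.environ["POLYGON_API_KEY"]
url = ("https://api.polygon.io/v2/aggs/ticker/AAPL"
       "/range/1/second/2020-01-02/2020-01-02")
resp = requests.get(url, params={"apiKey": API_KEY, "limit": 50_000})
resp.raise_for_status()
bars = resp.json().get("results", [])
print(f"fetched {len(bars)} one-second bars")
```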

My guess is that the reason we don't see third-party data ingestion is that, while it would normally be trivial to implement, it is harder to implement in a way that preserves QR's vendor lock-in (that is, allowing it while still being able to prevent free users from using it). It's hard to see any reason other than vendor lock-in that it wouldn't be supported when 1s resolution is already supported from IB.

IB only provides the last 6 months of history for any bar size <= 30s. So yes, the current infrastructure supports it, but only for very short backtesting.

I see I misunderstood what you meant by vendor lock-in. You mean not giving the product away for free. In which case, I am all for vendor lock-in. QR wouldn't exist if it was free. Note that Quantopian no longer exists. It was free.

No, vendor lock-in does not mean "not giving away the product for free"... in this case the lock-in appears to be done with the intent of not giving the product away for free, but that is not what vendor lock-in means.

Vendor lock-in here means they require you to use their data and don't let you ingest third-party data. It is vendor lock-in regardless of whether the product is given away for free. But it appears that to provide third-party ingestion, the creator would have to open up the ability for someone to illegally hack the system and use it for free; this is secondary to the vendor lock-in, not analogous to it.

I don't understand your argument. The free version can expose (or not) any functionality QR chooses to. Everything is hackable with enough effort, since we get the code locally (one of the key reasons for using QR, btw; if you really have alpha, you're crazy to upload raw strategy code to third parties, in my opinion). But it's a lot easier just to pay the license fee. If your trading doesn't cover the cost of a QR license, you probably shouldn't be trading :slight_smile:

Anyway, I think we've pretty much exhausted the subject. Good luck with your trading!

I am only trying to understand the reasoning here; without the owner/developer stating it, I can only speculate. But it does appear that the obfuscation effort is focused around the master DB, and thus the reason third-party ingestion is blocked is that it would expose some of what has been obfuscated.

Regardless of anything else, the crux of the problem is simple: there is no third-party ingestion and no native support for historical Polygon data. Whatever the reason, and whatever else we discuss, for me personally those are must-haves to continue using the service; they aren't available, so I will have to stop using the service and seek other solutions.

This is not meant as an attack on the author or you, or as animosity in any way. It's a simple statement of needs that aren't satisfied here but are satisfied by open-source platforms (though we would have to reinvent some of what QR does to get there with our own system, and that's fine)... so it is what it is. If the author provides third-party ingestion or Polygon historical data in the near future I'd be happy to stick around; if not, best of luck to him, and I wish him all the success in the world.