Sharadar S&P 500 reindexing

Hi - I have upgraded to the new version 2.0 and can access Sharadar data with no problems. I can run get_prices and get data back, and the same with fundamentals. These work fine:

from quantrocket import get_prices

# daily US stock closes since June 2020
prices = get_prices('usstock-1d', start_date='2020-06-01', fields='Close')
closes = prices.loc['Close']

I'm following along with the docs/guide and things fall over when I do the reindexing call:

are_in_sp500 = get_sharadar_sp500_reindexed_like(closes)

It just sits at busy forever - more than half an hour so far - and I have tried a couple of times after restarting Docker. It still seems to be doing something, as terminal 2 is still pumping things out, but it never completes or gets back to idle. I'd like to get that reindexing to work and then create a universe of S&P 500 names, or a filter, against which to try some of the fundamental data. I had this working in the previous version, excluding the Sharadar S&P 500 data, which is new. flightlog is spitting this out on repeat:

    quantrocket_blotter_1|recycling spooler after 90 tasks
    quantrocket_blotter_1|OOOPS the spooler is no more...trying respawn...
    quantrocket_blotter_1|spawned the uWSGI spooler on dir /var/tmp/uwsgi/spool with pid 65
    quantrocket_houston_1|172.18.0.22 - - [29/Jul/2020:14:39:07 +0000] "GET /master/securities.csv?vendors=ibkr&fields=ibkr_ConId HTTP/1.1" 400 137 "-" "-"
    quantrocket_houston_1|172.18.0.22 - - [29/Jul/2020:14:39:07 +0000] "GET /master/securities.csv?vendors=alpaca&fields=alpaca_AssetId HTTP/1.1" 400 139 "-" "-"
    quantrocket_houston_1|172.18.0.22 - - [29/Jul/2020:14:39:07 +0000] "GET /account/balances.csv?latest=True&fields=Account&fields=Broker&fields=Paper HTTP/1.1" 400 77 "-" "-"
    quantrocket_houston_1|172.18.0.3 - - [29/Jul/2020:14:39:07 +0000] "GET /ibg1/gateway HTTP/1.1" 200 22 "-" "python-requests/2.22.0"
    quantrocket_houston_1|172.18.0.22 - - [29/Jul/2020:14:39:07 +0000] "GET /ibgrouter/gateways?status=running HTTP/1.1" 200 3 "-" "-"
    quantrocket_houston_1|172.18.0.3 - - [29/Jul/2020:14:40:07 +0000] "GET /ibg1/gateway HTTP/1.1" 200 22 "-" "python-requests/2.22.0"
    quantrocket_houston_1|172.18.0.22 - - [29/Jul/2020:14:40:07 +0000] "GET /ibgrouter/gateways?status=running HTTP/1.1" 200 3 "-" "-"

Any help well received.
D

There's a performance issue with that function which will be patched in the next release. You can go ahead and use the patched version now. First, in a JupyterLab terminal, download the patched version to your codeload directory:

curl 'https://raw.githubusercontent.com/quantrocket-llc/quantrocket-client/dd68c484361ca56540d8541b5960e2d9b94c3380/quantrocket/fundamental.py' -o /codeload/sp500_patch.py

Then, instead of importing like this:

from quantrocket.fundamental import get_sharadar_sp500_reindexed_like

Import the patched version:

from codeload.sp500_patch import get_sharadar_sp500_reindexed_like

And use it as normal. The next time you update the software, you can go back to using the normal import.

Thanks - that was driving me nuts; I thought I had screwed something up. Much appreciated.

Hi - I have got that to work fine, but now I'm running into some other issues. I'm doing the reindexing to get a table of prices for sids in the S&P 500 on each day. I'm doing that by downloading the sp500_changes file and then creating a list of all sids ever to have been in the index.

import pandas as pd
from quantrocket import get_prices
from quantrocket.fundamental import download_sharadar_sp500
from quantrocket.master import create_universe
from codeload.sp500_patch import get_sharadar_sp500_reindexed_like

# download the full history of S&P 500 constituent changes
download_sharadar_sp500(filepath_or_buffer="sp500_changes.csv")
sp500_changes = pd.read_csv("sp500_changes.csv", parse_dates=["DATE"])

# current members: sids whose most recent action is "added"
latest_changes = sp500_changes.drop_duplicates(subset="Sid", keep="last")
current_members = latest_changes[latest_changes.ACTION == "added"]
current_members_sid = current_members['Sid'].to_list()

# every sid ever to have been in the index, deduplicated
ever_in_sp500 = sp500_changes['Sid'].unique().tolist()
print(len(current_members))
print(len(current_members_sid))
print(len(ever_in_sp500))

# one-time step: create the universe referenced below
create_universe("everinsp500", sids=ever_in_sp500)

prices = get_prices('usstock-1d', universes='everinsp500', start_date='2000-01-01', fields='Close')
closes = prices.loc['Close']
closes.to_csv('closes.csv')

# mask the closes so only days of actual index membership remain
are_in_sp500 = get_sharadar_sp500_reindexed_like(closes)
daily_SP500_members = closes.where(are_in_sp500)
daily_SP500_members.to_csv('daily_SP500_members.csv')
are_in_sp500.to_csv('boolspx.csv')
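
As a quick sanity check (plain pandas, nothing QuantRocket-specific), the count of non-null names per day should hover near 500:

# count how many sids have a close on each day of membership
members_per_day = daily_SP500_members.notnull().sum(axis=1)
print(members_per_day.describe())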

I get some problems where I'm not getting enough names coming out. The S&P part is fine; it looks like it goes wrong where the 'usstock-1d' price set is missing prices for some sids for lengthy stretches of the period, for some reason. The Sharadar one does have them - see below - but I would rather use usstock-1d, as that is total return with splits/dividends while Sharadar isn't.

Any idea why usstock-1d isn't complete as I see it - or am I doing something pretty dumb along the way here? The two queries below show different start dates for AMT prices, which I think is the cause:

# Sharadar prices: data starts in 2000
amtprices = get_prices('sharadar-us-stk-1d', sids='FIBBG000B9XYV2', start_date='2000-01-01', fields='Close')
amtprices.head(4)

# usstock prices: data starts much later for this sid
amtprices = get_prices('usstock-1d', sids='FIBBG000B9XYV2', start_date='2000-01-01', fields='Close')
amtprices.head(4)

Thank you for any help,
D

The usstock dataset only goes back to 2007, while the Sharadar prices go back to the 1990s. Each dataset has its own start date, documented in the Data Library.
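
If it helps to check a specific security, a quick sketch along these lines (first_valid_index is plain pandas) compares the first date with data for the same sid in each dataset:

from quantrocket import get_prices

# earliest date with data for AMT in each dataset
for code in ['sharadar-us-stk-1d', 'usstock-1d']:
    prices = get_prices(code, sids='FIBBG000B9XYV2', start_date='1990-01-01', fields='Close')
    print(code, prices.loc['Close']['FIBBG000B9XYV2'].first_valid_index())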

Thanks - that part I understand. It's that for some names - AMT for example - the data isn't there going back to 2007; it has some later start dates/missing lengthy periods, which means that when doing the reindexing on close prices to map to the full SPX, it falls over. AMT data below from usstock-1d as I see it. First entries are May 2011 for a few days, then a big hole, yet the stock has been listed since the year 2000. In the Sharadar dataset it is more complete, with a 2000 start date. So reindexing using usstock-1d only returns something like 460 names in 2011.

Field  Date        FIBBG000B9XYV2
Close  2011-05-06  46.444
Close  2011-05-09  47.0288
Close  2011-05-10  47.5783
Close  2011-05-11  47.4276
Close  2011-05-12  47.2415
Close  2011-05-13  46.9136
Close  2011-05-16  46.5237
Close  2011-05-17  46.2579
Close  2012-01-03  52.1154
Close  2012-01-04  52.8687
Close  2012-01-05  53.4801
Close  2012-01-06  53.3915

amtprices = get_prices('usstock-1d', sids='FIBBG000B9XYV2', start_date='2000-01-01', fields='Close')
amtprices.head(4)
amtprices.to_csv('amt.csv')

Thank you for reporting this. Currently investigating...

After digging into this, what’s going on is that, as a result of certain types of corporate restructuring events, the usstock and Sharadar datasets in some cases map data to different sids, causing usstock data not to line up perfectly with Sharadar data (even though both datasets have the data).

Comparing how well the two align over time, they line up perfectly in recent history but diverge by up to 5% as you go back to 2007. This is because the further back you go, the more of these divergence-causing corporate events have occurred. In these restructurings the security typically receives a new ISIN and FIGI, and this is reflected in usstock. Examples:

  • AMT converted to a REIT in 2011. Usstock has the pre- and post-event data under two different sids, whereas Sharadar has linked the pre- and post-event data together (a lookup sketch follows this list).
  • AON moved its domicile from US to UK in 2012, resulting in new identifiers. Again, usstock has this under two different sids, Sharadar under one.
  • For YHOO, usstock maps the data to the identifier for Altaba (AABA), the Yahoo spinoff created after Verizon acquired part of Yahoo in 2017; Sharadar maps to the identifier for the pre-acquisition Yahoo.
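
If you want to see the split for yourself in the securities master, a sketch like the following should surface both entities (this assumes the pre-conversion listing is still searchable under the symbol AMT, and that the Delisted/DateDelisted fields are available as documented):

import io
import pandas as pd
from quantrocket.master import download_master_file

# query the securities master for everything with symbol AMT
f = io.StringIO()
download_master_file(f, symbols="AMT", fields=["Symbol", "Name", "Delisted", "DateDelisted"])
securities = pd.read_csv(f, index_col="Sid")
print(securities)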

In these examples I’d give the edge to how Sharadar has done it, at least in the first two. For back-office purposes the pre- and post-event entities are truly different, which is why new identifiers are issued, but for backtesting purposes linking the two entities together is preferable.

I think usstock’s handling of these situations can be improved and brought more in line with Sharadar’s. Still, almost by definition, Sharadar price data will probably always align better with other Sharadar data.

If your only reason for using usstock is that it’s total return, have you considered using the Sharadar price data but applying the dividend adjustments yourself in pandas? The sketch below shows one way and could be put in a helper function. Would that solution work for you?
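
Something along these lines (a minimal sketch, assuming the sharadar-us-stk-1d Dividends field reports the cash dividend on its ex-date, zero or NaN otherwise, and that Close is split- but not dividend-adjusted):

from quantrocket import get_prices

prices = get_prices('sharadar-us-stk-1d', universes='everinsp500', start_date='2000-01-01', fields=['Close', 'Dividends'])
closes = prices.loc['Close']
dividends = prices.loc['Dividends']

# per-day ratio: on an ex-date, (prior close - dividend) / prior close; otherwise 1
ratios = (1 - dividends / closes.shift()).fillna(1)

# each day's adjustment factor is the product of all LATER ratios:
# a reverse cumulative product, shifted back one day
adj_factors = ratios.iloc[::-1].cumprod().iloc[::-1].shift(-1).fillna(1)

# dividend-adjusted (total return) closes
adj_closes = closes * adj_factors

This anchors the adjusted series to the most recent price (the usual back-adjustment convention), so recent prices match the unadjusted series and earlier prices are scaled down.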

Thank you for such a thorough answer - equity data sure is messy.

I will implement the Sharadar dividend adjustments and use that. Keeping things within one data provider seems cleaner generally. Thank you again.

Version 2.2.0 is now available and includes the performance improvement to get_sharadar_sp500_reindexed_like. The release notes also discuss data corrections that have been made to the US Stock dataset per the discussions in this ticket (i.e. linking together pre- and post-corporate restructuring price series for various securities like AMT and AON).
