Realtime Financial Market Data Visualization and Analysis

Introduction

In this project, I developed a financial data processing and visualization platform using Apache Kafka, Apache Cassandra, and Bokeh. I used Kafka for realtime stock price and market news streaming, Cassandra for historical and realtime stock data warehousing, and Bokeh for visualization in web browsers. I also wrote a web crawler to scrape companies' financial statements and basic information from Yahoo Finance, and experimented with various economic data APIs.

Please check out the GitHub repo of this project, and most importantly, visit the platform's website and play with each plot → magiconch.me

Architecture

There are currently 3 tabs on the webpage:

  • Stock: Streaming & Fundamental
    • Single stock's candlestick, basic company & financial information;
    • Realtime S&P500 price during trading hours (fake data during non-trading hours)
  • Stock: Comparison
    • 2 user-selected stocks' prices, and their statistical summary & correlation
    • 5,10,30-day moving average of adjusted close price
  • Economy
    • Geomap of various economic data by state
    • Economic indicators' plots
    • Latest market news

Here is the architecture of the platform.

Please check each tab's screenshot:

Tab 1:

Tab 2:

Tab 3:

1 Stock: Streaming & Fundamental

1.1 Data Source

  • Alpha Vantage: provides free APIs for realtime and historical stock data.
    • Problem: rate limiting of 5 calls per minute and 500 calls per day for a free account.
  • Yahoo Finance: I wrote a web crawler to get each company's summary profile and fundamental information, such as financial statements, from Yahoo Finance.

1.2 ETL

1. Historical data

In the first tab, both historical data and streaming data are presented. Before each trading day, I fetched each stock's historical data from Alpha Vantage to pick up the newest daily prices and volumes.

The raw data comes back as strings, so the fields need to be cast to numeric types. I also standardized the 'time' column to "%Y-%m-%d %H:%M:%S", because that is the datetime format Cassandra recognizes.
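As a sketch, the casting and timestamp normalization for one row might look like the following (the helper name `transform_row` is mine; the numbered field keys are the ones Alpha Vantage's TIME_SERIES_DAILY_ADJUSTED endpoint returns):

```python
from datetime import datetime

def transform_row(symbol: str, ts: str, row: dict) -> dict:
    """Cast Alpha Vantage string fields to floats and normalize the
    timestamp to the '%Y-%m-%d %H:%M:%S' format Cassandra accepts."""
    # Daily timestamps come back as '2020-03-02'; pad to a full datetime.
    if len(ts) == 10:
        ts = ts + " 00:00:00"
    # Validate the format by round-tripping through datetime.
    parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {
        "symbol": symbol,
        "time": parsed.strftime("%Y-%m-%d %H:%M:%S"),
        "open": float(row["1. open"]),
        "high": float(row["2. high"]),
        "low": float(row["3. low"]),
        "close": float(row["4. close"]),
        "adjusted_close": float(row["5. adjusted close"]),
        "volume": float(row["6. volume"]),
        "dividend_amount": float(row["7. dividend amount"]),
        "split_coefficient": float(row["8. split coefficient"]),
    }
```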

Each stock has three tables in Apache Cassandra that store its OHLCV data at daily, 1-minute, and 18-second frequencies. Cassandra has its own query language, the Cassandra Query Language (CQL), which is quite similar to SQL.

CREATE TABLE IF NOT EXISTS SYMBOL_historical (
    TIME timestamp,
    SYMBOL text,
    OPEN float,
    HIGH float,
    LOW float,
    CLOSE float,
    ADJUSTED_CLOSE float,
    VOLUME float,
    dividend_amount float,
    split_coefficient float,
    PRIMARY KEY (SYMBOL, TIME));
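A minimal loading sketch with the cassandra-driver package might look like this; the keyspace name `stock_data`, the helper `build_insert_cql`, and the placeholder values are my assumptions, not the project's actual names:

```python
def build_insert_cql(symbol: str) -> str:
    """Parameterized INSERT for one stock's SYMBOL_historical table."""
    return (
        f"INSERT INTO {symbol}_historical "
        "(TIME, SYMBOL, OPEN, HIGH, LOW, CLOSE, ADJUSTED_CLOSE, "
        "VOLUME, dividend_amount, split_coefficient) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    )

if __name__ == "__main__":
    # Requires the cassandra-driver package and a running Cassandra node.
    from cassandra.cluster import Cluster
    session = Cluster(["127.0.0.1"]).connect("stock_data")  # assumed keyspace
    session.execute(
        build_insert_cql("SPY"),
        ("2020-03-02 00:00:00", "SPY", 100.0, 101.0, 99.0,
         100.5, 100.5, 1_000_000.0, 0.0, 1.0),  # placeholder row
    )
```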

2. Realtime data

I used Apache Kafka to handle the streaming data; the stream plot combines both historical and streaming data for visualization. The topic for this streaming task is 'stock_streaming1'.

This topic has two producers: one sends fake data generated from random numbers during non-trading hours, and the other sends the realtime S&P500 price and volume from Alpha Vantage during trading hours. It also has two consumers, which write the data to the Cassandra database and to a local JSON file.
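A minimal sketch of the non-trading-hours producer, assuming kafka-python and a broker on localhost:9092; the random-walk helper `fake_tick` is my stand-in for the project's fake-data generator:

```python
import json
import random
import time
from datetime import datetime, timezone

def fake_tick(last_price: float) -> dict:
    """Random-walk quote used as the non-trading-hours payload."""
    price = round(last_price + random.uniform(-0.5, 0.5), 2)
    return {
        "time": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
        "symbol": "SPX",
        "price": price,
        "volume": random.randint(1_000, 10_000),
    }

if __name__ == "__main__":
    # Requires kafka-python and a broker at localhost:9092 (assumed).
    from kafka import KafkaProducer
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    price = 3000.0
    while True:
        tick = fake_tick(price)
        price = tick["price"]  # continue the walk from the last price
        producer.send("stock_streaming1", tick)
        time.sleep(1)
```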

NOTE:

Since Alpha Vantage can only be called 5 times per minute, I could either get 1-minute-frequency data by calling the TIME_SERIES_INTRADAY function every minute, or, to get data faster, call the GLOBAL_QUOTE function every 18 seconds to get the latest price and volume while still avoiding rate-limit errors.
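The 18-second polling loop could be sketched like this; `parse_global_quote` reads the numbered keys GLOBAL_QUOTE returns, while the symbol and API key here are placeholders:

```python
import json
import time
from urllib.request import urlopen

AV_URL = ("https://www.alphavantage.co/query?function=GLOBAL_QUOTE"
          "&symbol={symbol}&apikey={key}")

def parse_global_quote(payload: dict) -> dict:
    """Pull price and volume out of a GLOBAL_QUOTE response."""
    quote = payload["Global Quote"]
    return {"price": float(quote["05. price"]),
            "volume": float(quote["06. volume"])}

if __name__ == "__main__":
    # Poll every 18 s: 60/18 ≈ 3.3 calls per minute, under the 5/min cap.
    while True:
        with urlopen(AV_URL.format(symbol="SPY", key="YOUR_API_KEY")) as resp:
            data = json.loads(resp.read())
        print(parse_global_quote(data))
        time.sleep(18)
```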

One other thing to note is the timezone. I am in Pacific Time. To align with Eastern Time trading hours, I need to record the time at which I called GLOBAL_QUOTE, adjust it to Eastern Time, and store it in the database accordingly.

Interestingly, Cassandra stores datetime data in the UTC timezone. When querying data from Cassandra, we need to double-check whether the time has shifted; otherwise, the plot would not seamlessly combine the historical and streaming data.
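With the standard-library zoneinfo module (Python 3.9+), the Pacific-to-Eastern adjustment described above can be sketched as:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def pacific_to_eastern(ts: str) -> str:
    """Convert a local Pacific timestamp string to Eastern Time,
    the timezone of the trading hours being tracked."""
    naive = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    pacific = naive.replace(tzinfo=ZoneInfo("America/Los_Angeles"))
    eastern = pacific.astimezone(ZoneInfo("America/New_York"))
    return eastern.strftime("%Y-%m-%d %H:%M:%S")
```

Using IANA zone names rather than a fixed 3-hour offset keeps the conversion correct across daylight-saving transitions; the same `astimezone` call with `ZoneInfo("UTC")` handles the round trip through Cassandra's UTC storage.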

3. Fundamental data

I wrote a script to crawl the financial statements and summary profile given a company's symbol, and loaded each company's fundamental data into a JSON file.
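A rough sketch of the crawler's skeleton; the profile URL pattern matches Yahoo Finance's public quote pages, but the output path is my assumption, and parsing the returned HTML (e.g. with BeautifulSoup) is omitted because it depends on Yahoo's current page markup:

```python
from urllib.request import Request, urlopen

# Yahoo's quote pages follow this pattern; the markup behind them changes often.
PROFILE_URL = "https://finance.yahoo.com/quote/{symbol}/profile"

def output_path(symbol: str) -> str:
    """One JSON file of fundamentals per company symbol (assumed layout)."""
    return f"fundamentals/{symbol}.json"

if __name__ == "__main__":
    # Yahoo rejects requests without a browser-like User-Agent header.
    req = Request(PROFILE_URL.format(symbol="AAPL"),
                  headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req).read().decode("utf-8")
    print(len(html))  # parse + json.dump to output_path("AAPL") would go here
```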

2 Stock: Comparison

The second tab uses only historical daily-frequency data, extracted from the Cassandra database. It aims to compare two stocks, so I plot the 2 selected stocks' adjusted close and volume, along with 3 technical indicators (the 5-, 10-, and 30-day moving averages), as well as the statistical summary and correlation of the 2 stocks.
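The moving averages and correlation can be computed with pandas; a sketch, assuming each stock's daily rows are already in a DataFrame with an `adjusted_close` column (the helper names are mine):

```python
import pandas as pd

def add_moving_averages(df: pd.DataFrame, windows=(5, 10, 30)) -> pd.DataFrame:
    """Append the N-day moving averages of the adjusted close price."""
    out = df.copy()
    for w in windows:
        out[f"MA{w}"] = out["adjusted_close"].rolling(window=w).mean()
    return out

def price_correlation(a: pd.DataFrame, b: pd.DataFrame) -> float:
    """Pearson correlation of two stocks' adjusted close series."""
    return a["adjusted_close"].corr(b["adjusted_close"])
```

The rolling windows leave the first N−1 rows as NaN, which Bokeh simply skips when drawing the indicator lines.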

You can also click on the legend entries to hide technical indicators.

3 Economy

3.1 Data Source

3.2 ETL

1. Economy data

First, I went through the websites of BEA and FRED and identified the key economic data to present, such as real GDP, CPI, and the unemployment rate. BEA provides economic data by state, while FRED provides nationwide economic indicators. I then decided to combine a geomap built from BEA data, to present differences across states, with time series plots built from FRED data, to present changes in the key economic indicators. The economic data are all stored in JSON files.
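A sketch of pulling one indicator from FRED's observations endpoint and dumping it to a JSON file; GDPC1 is FRED's series id for real GDP, while the API key and output filename are placeholders:

```python
import json
from urllib.request import urlopen

FRED_URL = ("https://api.stlouisfed.org/fred/series/observations?"
            "series_id={series}&api_key={key}&file_type=json")

def to_series(payload: dict) -> list:
    """Turn a FRED observations response into (date, value) pairs,
    skipping the '.' placeholders FRED uses for missing values."""
    return [(obs["date"], float(obs["value"]))
            for obs in payload["observations"] if obs["value"] != "."]

if __name__ == "__main__":
    with urlopen(FRED_URL.format(series="GDPC1", key="YOUR_API_KEY")) as resp:
        payload = json.loads(resp.read())
    with open("real_gdp.json", "w") as f:
        json.dump(to_series(payload), f)
```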

To plot the geomap, I downloaded the geographical boundaries of U.S. states here and fed them into Bokeh.
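A minimal choropleth sketch with Bokeh, assuming the boundary file is GeoJSON whose features carry a NAME property; the file name, property names, and sample value are my assumptions:

```python
import json

def states_to_geojson(boundaries: dict, values: dict) -> str:
    """Merge a per-state value (e.g. real GDP) into the GeoJSON features
    so Bokeh can color each state's patch by that value."""
    for feature in boundaries["features"]:
        name = feature["properties"]["NAME"]
        feature["properties"]["value"] = values.get(name)
    return json.dumps(boundaries)

if __name__ == "__main__":
    # Requires bokeh and the downloaded boundary file (path assumed).
    from bokeh.models import GeoJSONDataSource, LinearColorMapper
    from bokeh.palettes import Viridis256
    from bokeh.plotting import figure, show
    with open("us_states.geojson") as f:
        boundaries = json.load(f)
    source = GeoJSONDataSource(
        geojson=states_to_geojson(boundaries, {"California": 3.1e12}))
    mapper = LinearColorMapper(palette=Viridis256)
    p = figure(title="Real GDP by state")
    # Bokeh reads the polygon coordinates as the implicit xs/ys columns.
    p.patches("xs", "ys", source=source,
              fill_color={"field": "value", "transform": mapper})
    show(p)
```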

2. Business news

NEWS API is a really interesting and useful tool for getting up-to-date news headlines given categories or keywords.
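Fetching business headlines from NEWS API's top-headlines endpoint can be sketched as follows; the API key is a placeholder, and `extract_article` keeps only the fields the platform stores:

```python
import json
from urllib.request import urlopen

HEADLINES_URL = ("https://newsapi.org/v2/top-headlines?"
                 "category=business&country=us&apiKey={key}")

def extract_article(article: dict) -> dict:
    """Keep only the fields worth storing from a NEWS API article."""
    return {
        "publishedAt": article["publishedAt"],
        "title": article["title"],
        "source": article["source"]["name"],
        "description": article["description"],
        "url": article["url"],
    }

if __name__ == "__main__":
    with urlopen(HEADLINES_URL.format(key="YOUR_API_KEY")) as resp:
        payload = json.loads(resp.read())
    for article in payload["articles"]:
        print(extract_article(article))
```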

Here I used Kafka to stream the news data on the topic NEWS and stored it in Cassandra.

CREATE TABLE IF NOT EXISTS NEWS (
    DATE date,
    publishedAt timestamp,
    TITLE text,
    SOURCE text,
    description text,
    url text,
    PRIMARY KEY (DATE, publishedAt))
WITH CLUSTERING ORDER BY (publishedAt ASC);
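The consumer side could be sketched as follows, assuming kafka-python, cassandra-driver, and a keyspace I'll call `stock_data`; `partition_date` derives the DATE partition key from the article's ISO 8601 publishedAt timestamp:

```python
from datetime import datetime

def partition_date(published_at: str) -> str:
    """Derive the DATE partition key from an article's publishedAt
    timestamp (NEWS API returns ISO 8601, e.g. '2020-03-02T14:00:00Z')."""
    return datetime.strptime(published_at,
                             "%Y-%m-%dT%H:%M:%SZ").strftime("%Y-%m-%d")

if __name__ == "__main__":
    # Requires kafka-python, cassandra-driver, and running services (assumed).
    import json
    from kafka import KafkaConsumer
    from cassandra.cluster import Cluster
    session = Cluster(["127.0.0.1"]).connect("stock_data")  # assumed keyspace
    consumer = KafkaConsumer(
        "NEWS", bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")))
    for msg in consumer:
        a = msg.value
        session.execute(
            "INSERT INTO NEWS (DATE, publishedAt, TITLE, SOURCE, "
            "description, url) VALUES (%s, %s, %s, %s, %s, %s)",
            (partition_date(a["publishedAt"]), a["publishedAt"],
             a["title"], a["source"], a["description"], a["url"]))
```

Partitioning by DATE with publishedAt as the clustering column means one day's headlines live together on disk, already sorted for the "latest market news" panel.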