Yr12 Journal 15

In Term 3 Week 2, I was working on my project.
Although this blog post's word count shows more than 1000 words, that count includes the code as well, so the actual blog post is under 1000 words.

Progress

I have to apologise to myself: I underestimated myself. Although I planned to spend three weeks on data crawling, model training, and model implementation, I finished all of it on Saturday afternoon with a cup of coffee. The progress was more successful than I thought it would be, so I feel sorry that I planned to spend such a long time on such simple things.

First of all, let's discuss what I did on Saturday.

Data Crawling

Of course, data science needs data to work, and so does my project. My project requires heaps of accurate market data, and fortunately the Binance crypto market has an API that provides everything I need. It provides:

  • K-line data in a specific range of time at a specific interval

    [
      [
        1499040000000,      // Open time
        "0.01634790",       // Open
        "0.80000000",       // High
        "0.01575800",       // Low
        "0.01577100",       // Close
        "148976.11427815",  // Volume
        1499644799999,      // Close time
        "2434.19055334",    // Quote asset volume
        308,                // Number of trades
        "1756.87402397",    // Taker buy base asset volume
        "28.46694368",      // Taker buy quote asset volume
        "17928899.62484339" // Ignore.
      ]
    ]

    Basically, I can use this to get all the prices, trade counts, volumes, and times in a given range, and use this data to train a model. Alternatively, I can use this API to get recent data, feed it into the model, and get an output price.

  • Current average price of a specific coin relative to another (e.g. ETH to BTC, or even BTC to AUD)

    {
      "mins": 5,
      "price": "9.35751834"
    }

    Basically, I can get the price of the coin at the current time and feed it into a trained model to predict the future price (a quick call sketch follows this list).
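
As a quick sketch, this endpoint can be called with the binance-connector Spot client (the same client whose klines method I use below; the symbol here is just an example):

from binance.spot import Spot

client = Spot()

# current average price for a symbol (GET /api/v3/avgPrice)
avg = client.avg_price("ETHBTC")
print(avg["mins"], avg["price"])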

As we can see, I can get lots of useful, accurate, and complete data for any time period.

Now I need to combine this data over a period of time into a DataFrame so it can be used to train a model.

After overcoming a few challenges (see the Challenges section below), here is the code for the data crawling process:

def get_history_df(coin, year, month):
    ck = []
    bk = []
    for i in range(1, 28):
        # 10hr data
        s = u.get_certain_time(year, month, i, 0, 0)
        e = u.get_certain_time(year, month, i, 10, 0)
        # coin kline
        coin_k = get_k_for_coin(coin, s, e)
        # btc kline
        btc_k = get_k_for_BTC_baseline(s, e)
        ck += coin_k
        bk += btc_k
        # 10hr data
        s = u.get_certain_time(year, month, i, 10, 0)
        e = u.get_certain_time(year, month, i, 20, 0)
        # coin kline
        coin_k = get_k_for_coin(coin, s, e)
        # btc kline
        btc_k = get_k_for_BTC_baseline(s, e)
        ck += coin_k
        bk += btc_k
        # ~3hr data (to the end of the day)
        s = u.get_certain_time(year, month, i, 21, 0)
        e = u.get_certain_time(year, month, i, 23, 59)
        # coin kline
        coin_k = get_k_for_coin(coin, s, e)
        # btc kline
        btc_k = get_k_for_BTC_baseline(s, e)
        ck += coin_k
        bk += btc_k
    # dataframe
    df = u.get_dataframe_from_kline(ck, bk)
    return df

where:

def get_certain_time(year, month, day, hour, min):
    return int(
        round(time.mktime(time.strptime(f"{year}-{month}-{day} {hour}:{min}:00", "%Y-%m-%d %H:%M:%S")) * 1000, 0))


def get_k_for_coin(name, start, end):
    """
    get k value for each minute in the provided gap
    :param name: coin symbol, e.g. "ETH"
    :param start: start time (ms timestamp)
    :param end: end time (ms timestamp)
    :return: as mentioned above
    """
    l = int(round(int(end - start) / 1000 / 60, 0))
    return client.klines(f'{name}BTC', "1m", limit=l, startTime=start, endTime=end)


def get_k_for_BTC_baseline(start, end):
    """
    get k value for each minute in the provided gap
    :param start: start time (ms timestamp)
    :param end: end time (ms timestamp)
    :return: same as get_k_for_coin
    """
    l = int(round(int(end - start) / 1000 / 60, 0))
    return client.klines('BTCAUD', "1m", limit=l, startTime=start, endTime=end)


def get_dataframe_from_kline(coin_kline, btc_kline):
    data = []
    i = 0
    for k in coin_kline:
        data.append((
            timestamp_ms_to_datetime(k[0]).hour,
            numpy.float64(k[1]),
            numpy.float64(k[2]),
            numpy.float64(k[3]),
            numpy.float64(k[5]),
            timestamp_ms_to_datetime(k[6]).hour,
            numpy.float64(k[7]),
            k[8],
            numpy.float64(k[9]),
            numpy.float64(k[10]),
            numpy.float64(btc_kline[i][1]),  # btc open
            numpy.float64(btc_kline[i][2]),  # btc high
            numpy.float64(btc_kline[i][3]),  # btc low
            numpy.float64(btc_kline[i][4]),  # btc close
            numpy.float64(btc_kline[i][5]),  # btc volume
            numpy.float64(btc_kline[i][8]),  # btc trades
            numpy.float64(k[4]),             # coin close price (the label)
        ))
        i += 1
    df = pd.DataFrame(data, columns=('Open Time Hr', 'Open', 'High', 'Low', 'Volume', 'Close Time Hr',
                                     'Quote Asset Volume', 'Trades', 'Buy Base', 'Buy Quote', 'BTC Open',
                                     'BTC High', 'BTC Low', 'BTC Close', 'BTC Volume', 'BTC Trades', 'Close Price'))
    df.dropna(inplace=True)
    return df

By doing this, we now have a DataFrame that contains, for a given range of time:

  • ‘Open Time Hr’
  • ‘Open’
  • ‘High’
  • ‘Low’
  • ‘Volume’
  • ‘Close Time Hr’
  • ‘Quote Asset Volume’
  • ‘Trades’
  • ‘Buy Base’
  • ‘Buy Quote’
  • ‘BTC Open’
  • ‘BTC High’
  • ‘BTC Low’
  • ‘BTC Close’
  • ‘BTC Volume’
  • ‘BTC Trades’
  • ‘Close Price’
These are the factors that I think are correlated with the close price. Here is a sample:
       Open Time Hr      Open      High  ...  BTC Close  BTC Volume  BTC Trades
33853            12  0.069273  0.069325  ...   31804.65     0.10191         4.0
34303            21  0.070211  0.070212  ...   31761.30     0.12436         9.0
20758             1  0.058891  0.058891  ...   30723.81     0.27317        18.0
28130             9  0.065255  0.065255  ...   33826.62     0.19866         6.0
32476            12  0.069774  0.069780  ...   32410.61     0.07023        10.0
...             ...       ...       ...  ...        ...         ...         ...
30427             1  0.067749  0.067758  ...   33662.69     0.28920        11.0
13101            11  0.056726  0.056726  ...   31262.10     0.17775         7.0
13547            18  0.055691  0.055708  ...   31100.57     0.05177         7.0
4982             14  0.055103  0.055122  ...   28014.05     0.20659        16.0
20193            14  0.058244  0.058287  ...   30487.53     0.04745         8.0

Model Training

I used TensorFlow to train my model. I split my data into inputs and outputs:

  • Input has 16 variables (the first 16 factors mentioned above)
  • Output has 1 variable (Close Price)

I also normalised all the data (z-score method) to make it easier to train. To be able to reverse the operation and turn predictions back into real prices, a reverse normalisation is also required:

def get_norm_df(df):
    normalized_df = (df - df.mean()) / df.std()
    return normalized_df


def rev_norm_df(df: DataFrame, o):
    rev = df * o.std() + o.mean()
    return rev

I split my data into 80% training data and 20% testing data, and I also split the inputs from the outputs:

def get_train_dataset(df: DataFrame):
    return df.sample(frac=0.8, random_state=0)


def get_test_dataset(df: DataFrame, train_dataset: DataFrame):
    return df.drop(train_dataset.index)


def get_train_labels(train_dataset: DataFrame):
    keys = train_dataset.keys()
    key = keys[-1]
    return train_dataset.pop(key)


def get_test_labels(test_dataset: DataFrame):
    keys = test_dataset.keys()
    key = keys[-1]
    return test_dataset.pop(key)
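
Putting these helpers together, the preparation step looks roughly like this (df is the DataFrame from the crawling step above):

norm_df = get_norm_df(df)
train_dataset = get_train_dataset(norm_df)
test_dataset = get_test_dataset(norm_df, train_dataset)
# pop removes 'Close Price' from the datasets, leaving the 16 input columns
train_labels = get_train_labels(train_dataset)
test_labels = get_test_labels(test_dataset)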

Now that I have all the data I need, I can create my model:

def create_model(train_dataset: DataFrame):
    # sns.pairplot(train_dataset, diag_kind="kde")
    # plt.show()
    print(f"will use this training dataset:\n{train_dataset}")
    model = keras.Sequential([
        layers.Dense(128, activation='relu', input_shape=(len(train_dataset.keys()),)),
        layers.Dense(16, activation='relu'),
        layers.Dense(1)
    ])

    optimizer = tf.keras.optimizers.RMSprop(0.0005)

    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae', 'mse'])
    return model

I will explain why I chose to make a Sequential model and why I used these parameters in the next blog post, due to the word limit.

As a brief description: I built a model with three layers: an input layer that takes the 16 inputs (i.e. len(train_dataset.keys()) inputs) into 128 units, a hidden layer with 16 units, and an output layer that outputs a single value. I used a learning rate of 0.0005 (I tried for a whole afternoon and found this number gives the most accurate result). The loss function is mean squared error, and I track mean absolute error and mean squared error as metrics.
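
The fit call itself is not shown above; judging by the val_* columns and the epoch index in the history table below, it looked something like this sketch (the epoch count and validation split here are illustrative):

history = model.fit(
    train_dataset, train_labels,
    epochs=100,            # illustrative; the early-stop callback (see Challenges) ends training sooner
    validation_split=0.2,  # produces the val_mae / val_mse columns below
    verbose=0)
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch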

Let's see the result:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 128)               2176

dense_1 (Dense)              (None, 16)                2064

dense_2 (Dense)              (None, 1)                 17

=================================================================
Total params: 4,257
Trainable params: 4,257
Non-trainable params: 0
_________________________________________________________________
            loss           mae           mse  ...       val_mae       val_mse  epoch
28  0.0001337428  0.0076155611  0.0001337428  ...  0.0116641819  0.0003233170     28
29  0.0001296904  0.0075853723  0.0001296904  ...  0.0118185701  0.0002183406     29
30  0.0001223174  0.0075605982  0.0001223174  ...  0.0093790321  0.0002534208     30
31  0.0001188979  0.0074074916  0.0001188979  ...  0.0042574769  0.0000758429     31
32  0.0001404843  0.0073013869  0.0001404843  ...  0.0059910971  0.0001049181     32
Testing set Mean Abs Error:  0.01 Close Price

As we can see, the mean absolute error is 0.01 in normalised price units; this is very accurate!
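
To put that 0.01 into real price terms, the z-score error can be scaled back by the standard deviation of the raw Close Price column, for example:

# df is the raw (un-normalised) DataFrame from the crawling step
mae_price = 0.01 * df['Close Price'].std()
print(f"~{mae_price:.8f} BTC mean absolute error")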

Let's compare the test output and the predicted output:

  • Test output
    [-1.22100359 -1.21980964 -1.20701729 ...  1.44731059  1.46180858 1.52031228]
  • Predicted output
    [-1.2320391 -1.2223128 -1.2007444 ...  1.4353155  1.4469447  1.5193961]

It is amazingly accurate!

Let's have a look at some plots:

  • Mean Absolute Error

  • Mean Square Error

  • Actual Output vs. Prediction

In conclusion, I trained an accurate model.

Model Implementation

I used the model to predict the price of a coin over the next 60 minutes with the following code:

coin = "ETH"
# current BTC price (in AUD)
cur_b = d.get_avg_BTC()
# past data
ck, bk, coin_k, btc_k = d.download_last_hr_data(coin)
# next n min
minute = 60
btc = u.predict_price(model, coin, minute, ck, bk, coin_k, btc_k)
print(f"Prediction in next {minute} mins: 1 {coin} = {btc} BTC")
print(f"Prediction in next {minute} mins: 1 {coin} = {btc * cur_b} AUD")
# cur_p is the coin's current price in AUD (fetched earlier; not defined in this snippet)
print(f"Prediction in next {minute} mins: {((btc * cur_b / cur_p) - 1) * 100}%")
ind = []
for i in range(0, 60):
    btc = u.predict_price(model, coin, i, ck, bk, coin_k, btc_k)
    ind.append(btc * cur_b)
u.plot(coin, ind, range(0, 60))

where:

def download_last_hr_data(coin):
    ck = []
    bk = []
    # 1hr data
    s = u.get_time() - 1000 * 60 * 60  # 1hr ago
    e = u.get_time()
    # coin kline
    coin_k = get_k_for_coin(coin, s, e)
    # btc kline
    btc_k = get_k_for_BTC_baseline(s, e)
    ck += coin_k
    bk += btc_k
    return ck, bk, coin_k, btc_k


def plot(coin, indexes, mins):
    plt.plot(mins, indexes)
    plt.xlabel("Minutes")
    plt.ylabel("Trend")
    plt.title(f"Predicted {coin} Price Trending in the next hour")
    plt.show()
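
predict_price itself is not listed here; in essence it builds the same 17-column frame as in training, normalises it, runs the model, and reverses the normalisation. A simplified sketch (the minute-offset handling and exact feature handling are omitted):

def predict_price(model, coin, minute, ck, bk, coin_k, btc_k):
    # simplified: the real helper also uses `minute` to project the prediction forward
    df = u.get_dataframe_from_kline(ck, bk)        # same 17 columns as training
    norm = u.get_norm_df(df)
    features = norm.drop(columns=['Close Price'])  # the 16 model inputs
    pred_norm = model.predict(features)[-1][0]     # normalised close, latest row
    # reverse the z-score using the raw Close Price statistics
    o = df['Close Price']
    return pred_norm * o.std() + o.mean()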

And here is a sample result:

The trend it predicts is unbelievably accurate; I spent my whole Saturday night testing it:

  • Saturday 8.30pm test
  • Saturday 9.05pm test
  • Saturday 10pm test
  • Saturday 11pm test

Challenges

  • Binance K-line API only returns 1000 entries per request
    To overcome this, I had to split my requests: I requested 600 entries at a time (600 minutes of data, i.e. 10 hours' worth) and made multiple requests to get the 24-hour data for a specific day.
  • Binance K-line API needs the dates to be entered manually
    I made a for loop that iterates from day 1 to day 28 of a month and gets the 24-hour data for each day. (In the future I will let the program figure out exactly how many days there are in a given month.)
  • As TensorFlow does not support training on my Mac's GPU (although I have a Radeon Pro Vega 20 4 GB graphics card), I had to train the model on the CPU, which is really slow. Therefore I reduced the layers and units of my model to increase training speed. I also implemented an early-stop callback: when the loss stays approximately the same for 10 epochs, the training process stops (see the sketch after this list).
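
The early-stop callback mentioned above is the standard Keras one; a minimal sketch (the min_delta threshold is illustrative):

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='loss',   # watch the training loss
    min_delta=1e-5,   # "approximately the same": illustrative threshold
    patience=10)      # stop after 10 epochs without meaningful improvement

model.fit(train_dataset, train_labels,
          epochs=1000, validation_split=0.2,
          callbacks=[early_stop], verbose=0)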

Reflection

  • I learned how to use TensorFlow to train a model on given data, and how to evaluate it and predict outputs.
  • I learned the basics of how the machine learning process works; I will explain this in the next blog post due to the word count limit.
  • I have done everything well this week.

Timeline

I am on track; in fact, I am well ahead of schedule. I finished all the work I planned for the first three weeks. I might need to change my timeline, because I believe I can finish the user interface next week, which leaves a few extra weeks for testing. Hence, I think I will spend around 8 weeks testing the accuracy and validity of the model.

Next week I plan to design a GUI for my project, so users can choose whatever coin they want to train on and predict, and specify the time range of the data used to train the model. I also plan to finish my user documentation and my presentation PowerPoint.