Commit c13a47d7 authored by Alex Fout

Merge branch 'master' of biof-git.colorado.edu:hackathon/brain_age_from_eeg

parents b5a76bdd 01a8cbf4
init:
	pip install -r requirements.txt
\ No newline at end of file
library(neuralnet)
library(caret)
library(AppliedPredictiveModeling)
# read CSV; the first column holds the file names, the remaining 171 hold the features
cr_all <- read.csv("cor_data_cleaned.csv")
cr_row_names <- cr_all[,1]
cr_all <- cr_all[,2:172]
rownames(cr_all) <- cr_row_names
# cr_all now has exactly 171 columns, so take all of their names
cr_col_names <- colnames(cr_all)
#generate input and output column names
cnames_inputs <- paste0(cr_col_names, "i")
cnames_outputs <- paste0(cr_col_names, "o")
# autoencoder: the same 171 features serve as both inputs and targets
train <- cr_all
train <- cbind(train, train)
colnames(train) <- c(cnames_inputs, cnames_outputs)
# Preprocess with caret package
preProcValues <- preProcess(train, method = c("range"))
train2 <- predict(preProcValues, train)
# formula for model: neuralnet expects response ~ covariates, so the outputs go on the left
f <- as.formula(paste(paste(cnames_outputs, collapse = " + "), "~", paste(cnames_inputs, collapse = " + ")))
set.seed(0)
model1 <- neuralnet(f, train2, hidden=c(20,2,20), algorithm = 'rprop+', threshold = 0.1, learningrate = 0.7)
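# the first entry of result.matrix is the training error
# (with the default err.fct = "sse" this is half the sum of squared errors, not the MSE)
print(paste("Error=", model1$result.matrix[1]))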
#plot(model1)
results <- compute(model1, train2[,1:171]) #Run them through the neural network
results$net.result
# activations of the 2-unit bottleneck layer (column 1 is the bias unit)
inl <- as.data.frame(results$neurons[[3]][,2:3])
colnames(inl) <- c("HL1","HL2")
# Read the labels file
labels <- read.csv("labels.csv")
nn_data <- inl
# cr_all$X no longer exists after the first column was dropped; use the saved file names
nn_data$fn <- cr_row_names
# left join
nn_data <- merge(x = nn_data, y = labels, by = "fn", all.x = TRUE)
# detect outliers
library(anomalyDetection)
library(ggplot2) # needed for the ggplot() call below
nn_data$HL1[is.na(nn_data$HL1)] <-0
nn_data$HL2[is.na(nn_data$HL2)] <-0
nn_data$md <- mahalanobis_distance(nn_data[,2:3])
nn_data$anomaly <- nn_data$md > 5
# plot with labeled outliers
p <- ggplot(nn_data, aes(HL1, HL2, color=anomaly)) + geom_point() +
geom_text(aes(label = ifelse(anomaly==TRUE,as.character(fn),''), hjust=0, vjust=0))
p
# Neuro-ID
How reliable is EEG as a consistent identifier for individuals with mild traumatic brain injury (mTBI)?
## Scope
Is EEG test-retest data consistent enough to identify an individual? Do mTBI and concussive events change the effectiveness of the algorithm due to brainwave changes?
Pattern recognition and classification in time series data
## Abstract
Raw brainwave data taken from an electroencephalogram (EEG) system has been shown to be unique for each individual. Developing a machine learning algorithm to
identify a specific person by their raw brainwave data can help to determine if a person’s brainwaves become unidentifiable after a concussion and therefore
indicate changes in the brain.
## Tools
## Methods
### <i> Tools </i>
Python package to analyse EEG Data: https://martinos.org/mne/stable/index.html
## Waves in EEG
### <i> Subjects and Data Acquisition </i>
The 98 subjects in the dataset were college-aged (18-24) male Division I football players. Each player had a minimum
of two EEG recording sessions: one before the season began (baseline) and one at the end of the
season (~6 months later). Some subjects had additional EEG retest sessions to record potential concussions; 20.4% of subjects indicated
concussion symptoms. Each recording lasted 4-15 minutes, with the subject in an eyes-closed
resting state throughout. Artifacts identified in the recordings were excluded from
the data analysis.
### <i> Biosignal Recording and EEG Parameterisation </i>
The EEG recordings were performed with electrodes secured at sites FP1, FP2, F7, F3, Fz, F4, F8, T3, C3, Cz, C4, T4, T5, P3, Pz,
P4, T6, O1, O2 with 19-channel equipment (WAVi). A headset containing the electrodes is placed on the patient, and the electrodes are
examined to ensure quality contact. If contact is unacceptable, conductive gel can be added and the eSocs can be rubbed along
the scalp to exfoliate the electrode location and improve contact. Once contact is deemed
acceptable, an auditory P300 Eyes Closed Protocol is run. The patient is instructed to avoid any synchronized motions and blinks
during the P300 test, as these affect the quality of the data.
![Image of 10-20 array](https://en.wikipedia.org/wiki/10-20_system_(EEG)#/media/File:21_electrodes_of_International_10-20_system_for_EEG.svg)
### <i> Data </i>
The subject data is anonymized with number labels; each number represents a different subject. The letters attached to the numbers
represent the session (i.e. a = first session, b = second session, etc.). Each session has two files: a .raw and a .art.
The .raw files contain the raw waves and the .art files contain artifact data for each session. Use both files to build raw waves.
<br><u>.art file key </u>
Same dimensions as the raw data. Gives a coloring to the EEG data that shows how reliable the data is.
* 0= Reliable (black)
* 1= Not reliable (red)
* 2= May or may not be reliable, to be used or discarded (blue)
### <i> Waves in EEG </i>
<b> Delta:</b> 0-4 Hz
<br><b> Theta:</b> 4-8 Hz
<br><b> Alpha:</b> 8-13 Hz
<br><b> Beta:</b> 13-20 Hz
<br><b> Gamma:</b> 20-40 Hz
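
To make the band boundaries concrete, here is a minimal Python sketch that estimates how much of a signal's power falls into each band using a plain FFT. It assumes a 256 Hz sampling rate (61440 rows over 4 minutes works out to 256 samples per second); `BANDS` and `band_powers` are illustrative names and are not part of this repository.

```python
import numpy as np

FS = 256  # assumed sampling rate in Hz (61440 rows / 240 s)

# Band edges in Hz, copied from the list above.
BANDS = {
    "delta": (0, 4),
    "theta": (4, 8),
    "alpha": (8, 13),
    "beta": (13, 20),
    "gamma": (20, 40),
}


def band_powers(signal, fs=FS):
    """Return the summed spectral power of a 1-D signal in each band."""
    signal = np.asarray(signal, dtype=float)
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    return {
        name: float(power[(freqs >= lo) & (freqs < hi)].sum())
        for name, (lo, hi) in BANDS.items()
    }


# Synthetic check: 4 s of a pure 10 Hz tone should land in the alpha band.
t = np.arange(0, 4, 1.0 / FS)
tone = np.sin(2 * np.pi * 10 * t)
powers = band_powers(tone)
dominant = max(powers, key=powers.get)
```

Each band here includes its lower edge and excludes its upper edge, so a component at exactly 8 Hz counts as alpha rather than theta; that convention is a choice made for this sketch.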
## Data
Data were measured multiple times per subject: once at the start of the season, once at the end of the season, and every time the subject had a concussion.
The data is labeled alphabetically, starting from a, to represent the EEG data collection at different times.
**Example**
If a subject has data labeled as:
* a : measurement at the beginning of the season
* b : measurement after the first concussion
* c : measurement after the second concussion
* d : measurement at the end of the season
Note: Some subjects may not have had any concussions, or may have missing data because they did not show up at the beginning of the season.
**Raw EEG Data:**
<b> Data Files </b>
* 98 subjects
* 19 columns, ~61440 rows (256 Hz sampling rate, 4 minutes of data in total)
**Art file (Artifacts)**
Same dimensions as the raw data. Gives a coloring to the EEG data, that shows how reliable that data is.
* 0 : Black (reliable)
* 1 : Red (Not reliable)
* 2 : Blue (May or May not be reliable)
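
One way to use the artifact codes is as a mask over the raw samples. The sketch below assumes the .raw and .art files have already been loaded into two arrays of identical shape; the loading step is left out because the file layout is not described here. `mask_artifacts` and its `keep_uncertain` flag are hypothetical names chosen for illustration.

```python
import numpy as np

RELIABLE, NOT_RELIABLE, UNCERTAIN = 0, 1, 2  # codes from the .art key above


def mask_artifacts(raw, art, keep_uncertain=False):
    """Return a copy of raw with unreliable samples replaced by NaN."""
    raw = np.asarray(raw, dtype=float)
    art = np.asarray(art)
    if raw.shape != art.shape:
        raise ValueError("raw and art must have the same dimensions")
    bad = art == NOT_RELIABLE
    if not keep_uncertain:
        # Treat "may or may not be reliable" samples as bad by default.
        bad |= art == UNCERTAIN
    cleaned = raw.copy()
    cleaned[bad] = np.nan
    return cleaned


# Tiny synthetic example: 3 samples, 2 channels.
raw = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
art = np.array([[0, 1], [2, 0], [0, 0]])
strict = mask_artifacts(raw, art)
lenient = mask_artifacts(raw, art, keep_uncertain=True)
```

Marking samples as NaN keeps the time axis intact, which matters if epochs are cut later; dropping rows instead would shift everything after the first artifact.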
**P300 Data**
The subjects hear a beep at a specific pitch at regular intervals. At random times, a high-pitched sound is played, which creates a spike in the subject's EEG data.
Each subject is exposed to 40 of these high-pitched sounds at random times.
More specifically, the EEG response to the high-pitched sound is an amplitude increase. The increase can be observed in any frequency band (depending on which frequency band the subject's EEG is in at the time of hearing the high-pitched sound).
## Analysis
### Preprocessing
* Separate out different waves
* Feature extraction using neural network
* FFT : static
* Short-Time FFT : FFT with window
* Wavelet
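
The difference between a static FFT and a short-time FFT can be sketched as follows: instead of one spectrum for the whole recording, a spectrum is computed for each sliding window, so changes over time become visible. This is a NumPy-only illustration under assumed settings (256 Hz sampling, 2 s Hann windows, 1 s step); `short_time_fft` is a hypothetical helper, not a function from this repository.

```python
import numpy as np


def short_time_fft(signal, fs, window_s=2.0, step_s=1.0):
    """Magnitude spectrum of each Hann-windowed segment of a 1-D signal.

    Returns (times, freqs, spec) with spec.shape == (n_windows, n_freqs).
    """
    signal = np.asarray(signal, dtype=float)
    win = int(round(window_s * fs))
    step = int(round(step_s * fs))
    if signal.size < win:
        raise ValueError("signal is shorter than one window")
    starts = np.arange(0, signal.size - win + 1, step)
    taper = np.hanning(win)
    spec = np.array(
        [np.abs(np.fft.rfft(signal[s:s + win] * taper)) for s in starts]
    )
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    times = (starts + win / 2.0) / fs  # centre of each window in seconds
    return times, freqs, spec


# Synthetic signal: 6 Hz (theta) for the first 4 s, then 10 Hz (alpha).
fs = 256
t = np.arange(0, 8, 1.0 / fs)
x = np.where(t < 4, np.sin(2 * np.pi * 6 * t), np.sin(2 * np.pi * 10 * t))
times, freqs, spec = short_time_fft(x, fs)
peak_hz = freqs[spec.argmax(axis=1)]  # dominant frequency per window
```

A static FFT of the same signal would show both peaks without indicating when each occurred, which is the motivation for the windowed variant.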
1. Extract (𝜶, 𝛽, 𝛾,𝜃) Waveform
2. Divide into 2s epochs
3. Feature extraction using neural network
a. FFT: static
b. Short-Time FFT: FFT with window
c. Compute static features for each epoch
d. Channel coherence (each wave)
e. Channel correlation Pearson (each wave)
4. Embedding a lower dimensional manifold
a. PCA
b. LDA
c. Localized linear embedding (tSNE)
5. Analyse distances in reduced space between 2013 & 2014 baselines
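
Steps 2, 3e and 4a above can be sketched in a few lines of NumPy. The example runs on synthetic noise rather than real recordings, assumes a 256 Hz sampling rate, and uses hypothetical helper names (`split_epochs`, `correlation_features`, `pca_project`). With 19 channels there are 19 * 18 / 2 = 171 unique channel pairs, so each epoch yields 171 correlation features.

```python
import numpy as np


def split_epochs(data, fs=256, epoch_s=2.0):
    """Split (n_samples, n_channels) data into (n_epochs, epoch_len, n_channels)."""
    data = np.asarray(data, dtype=float)
    epoch_len = int(round(fs * epoch_s))
    n_epochs = data.shape[0] // epoch_len
    trimmed = data[: n_epochs * epoch_len]  # drop the incomplete tail
    return trimmed.reshape(n_epochs, epoch_len, data.shape[1])


def correlation_features(epoch):
    """Pearson correlation of every channel pair (upper triangle, no diagonal)."""
    corr = np.corrcoef(epoch, rowvar=False)  # channels are columns
    rows, cols = np.triu_indices(corr.shape[0], k=1)
    return corr[rows, cols]


def pca_project(features, n_components=2):
    """Project the rows of features onto their leading principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
recording = rng.standard_normal((61440, 19))  # synthetic stand-in for one session
epochs = split_epochs(recording)
features = np.array([correlation_features(e) for e in epochs])
embedded = pca_project(features)
```

On real data, the rows of `embedded` from two sessions of the same subject could then be compared, for example by the distance between their centroids.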
### Clustering
* Agglomerative
* k-means
* NNMF (Non Negative Matrix Factorization)
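
As an illustration of the k-means option, the following is a minimal NumPy sketch of Lloyd's algorithm on synthetic 2-D points. It is not the clustering code used in this project; `kmeans`, its parameters, and the random initialisation from data points are assumptions made for the example.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct data points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Distance of every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Two tight, well-separated synthetic clusters of 50 points each.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
group_b = rng.normal(loc=10.0, scale=0.1, size=(50, 2))
labels, centers = kmeans(np.vstack([group_a, group_b]), k=2)
```

Empty clusters keep their previous center here, which is one of several possible conventions.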
#### Measures
* Dynamic time warping (distance measure)
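
Dynamic time warping compares two sequences while allowing them to be stretched or compressed along the time axis. The sketch below is the textbook dynamic-programming formulation with an absolute-difference cost; `dtw_distance` is an illustrative name and the cost function is an assumption, since the measure is not specified further here.

```python
import numpy as np


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = a.size, b.size
    # cost[i, j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


same = dtw_distance([1, 2, 3], [1, 2, 3])
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

The running time grows with the product of the two sequence lengths, so for recordings of ~61440 samples it is more practical to apply it to short epochs or downsampled signals.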
### Prediction Model
## Results
The data did not show significant differences in alpha-band channel coherence between concussed and non-concussed subjects. Additional features need
to be analyzed in order to determine if EEG is a reliable identifier for a subject with mTBI.
@@ -18,7 +18,7 @@ def printINFO(v, string):
 def SaveWave2csv(pid, v=False, extension='raw', inOneCSV=False, nfilterCoeff=4001):
     printINFO(v, "Patient ID: {}".format(pid))
-    p = patient.Patient(pid)
+    p = patient.Patient(pid, '')
     if p.pre_test is not None:
         printINFO(v, "Extracting waves for pre_test!")
@@ -95,7 +95,7 @@ def SaveWave2csv(pid, v=False, extension='raw', inOneCSV=False, nfilterCoeff=400
     if p.post_test is not None:
         # save post_end to csv
-        fname = "".join([pid, getLetter(nintermediate+2), '_', wave,'.', extension])
+        fname = "".join([pid, getLetter(nintermediate+1), '_', wave,'.', extension])
         fpath = os.path.join(path,fname)
         printINFO(v,"Saving file: {}".format(fpath))
         p.post_test.waves[wave].to_csv(fpath, index=False)
@@ -130,7 +130,7 @@ if __name__ == '__main__':
     if args.pid:
         pid = [args.pid]
     else:
-        pid = np.arange(1,99)
+        pid = np.arange(55, 99)
     for i in pid:
         SaveWave2csv(str(i), v=args.v, inOneCSV=not(args.csvPerWave), extension=extension, nfilterCoeff=nfilterCoeff)
import numpy as np
from matplotlib import pyplot as plt
from itertools import cycle
import sys
from config import pid_noConcussion, pid_3stepProtocol, pid_testRetest, pid_concussion, feature_functions, epoch_size, \
    embedding_args, pid_testlist, channels, subfolder
from patient import Patient
from embedding import Embedding
colors = cycle(['r', 'b', 'g', 'y', 'c', 'm', 'k'])
def embed_and_plot(emb, examples, all_color=None, linewidth=2):
    pre_post_distances = []
    alpha = 0.2 / np.log(len(examples)) if len(examples) > 1 else 1
    for tup in examples:
        if all_color is None:
            if sys.version_info < (3, 0):
                # for python2 use
                color = colors.next()
            else:
                # for python3 use
                color = next(colors)
        else:
            color = all_color
        pid = tup[0]
        pre_emb = emb.embed(tup[1])
        post_emb = emb.embed(tup[2])
        plt.plot(pre_emb[:, 0], pre_emb[:, 1], linestyle='None', marker="x", color=color, label=str(pid) + "_pre", alpha=alpha)
        plt.plot(post_emb[:, 0], post_emb[:, 1], linestyle='None', marker="o", color=color, label=str(pid) + "_post", alpha=alpha)
        # calculate centroids and plot a line
        pre_cent = centroid(pre_emb)
        post_cent = centroid(post_emb)
        plt.plot([pre_cent[0], post_cent[0]], [pre_cent[1], post_cent[1]], '-', linewidth=linewidth, color=color)
        # record distance
        pre_post_distances.append(np.linalg.norm(post_cent - pre_cent))
    return pre_post_distances
def centroid(data):
    length = len(data)
    x_sum = np.sum(data[:, 0])
    y_sum = np.sum(data[:, 1])
    return np.array([float(x_sum)/length, float(y_sum)/length])
# get training data from un-concussed individuals
n_keep = 1000
train_lists = [pid_concussion, pid_noConcussion]
train_bools = [True, False]
examples_lists = [[], []]
train_examples = []
labels = ["concussion", "noconcussion"]
# for lst, pat_list in zip([pid_noConcussion, pid_3stepProtocol, pid_testRetest, pid_concussion], [noCon_pats, step_pats, retest_pats, con_pats]):
# for lst, pat_list in zip([pid_noConcussion], [noCon_pats]):
i = 0
for pid_list, train_bool in zip(train_lists, train_bools):
    for pid in pid_list:
        print("Processing pid: {}".format(pid))
        p = Patient(pid, subfolder, load_session_raw=False, load_session_examples=True)
        # get examples from pre_test
        pre = post = None
        if p.pre_test is not None:
            pre = p.pre_test.load_examples(subfolder)
            if pre is not None:
                np.random.shuffle(pre)
        if (p.n_concussions == 0):
            # get examples from post_test
            if p.post_test is not None:
                post = p.post_test.load_examples(subfolder)
                if post is not None:
                    np.random.shuffle(post)
        else:
            if p.intermediate_tests[0] is not None:
                post = p.intermediate_tests[0].load_examples(subfolder)
                if post is not None:
                    np.random.shuffle(post)
        if post is not None and pre is not None:
            if train_bool:
                train_examples.append((pid, pre, post))
            examples_lists[i].append((pid, pre, post))
    i += 1
# create training data
train_data = np.vstack([tup[1][:n_keep] for tup in train_examples] + [tup[2][:n_keep] for tup in train_examples])
# create and train embedding
emb = Embedding(**embedding_args)
emb.train(train_data)
# plot both
# visualize embedding
colors = ["r", "b"]
f = plt.figure(figsize=(10, 10))
for label, examples_list, color in zip(labels, examples_lists, colors):
    distances = embed_and_plot(emb, examples_list, all_color=color)
plt.title("pre/post test centroid distance")
plt.xlabel("PC1")
plt.ylabel("PC2")
#plt.legend()
plt.savefig(subfolder + "_pc1vs2", dpi=300, transparent=True)
plt.show()