Data Engineer Intern
Department Overview
We are developing some of the first AI-powered algorithms to transform care for pregnant women at risk for adverse outcomes. Multiple events around the time of delivery can affect the outcome and experience for both the mother and infant. Our lab is focused on creating cutting-edge forecasting and vitals-management algorithms that improve patient safety and predict and prevent adverse pregnancy outcomes.
Job Description
We are looking for a responsible, motivated, and self-directed data engineer intern who is interested in translating medical data into artificial intelligence tasks. You will build components of our learning platform and use in-house tools to extract, transform, and load medical record data into a machine learning-ready database. The Data Engineer Intern is responsible for building Docker modules that automate data pulls, validation, and the creation of machine learning datasets, along with a Pythonic API for efficient interaction with this system, and for automating test cases using Python. Interest in Natural Language Processing (NLP) and Machine Learning is a plus. Attention to detail, high integrity, and excellent communication skills are a must.
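To give a flavor of the "Pythonic API for machine learning datasets" described above, here is a minimal sketch. All names (RecordStore, the vitals table, its columns) are hypothetical illustrations, not the lab's actual in-house tooling, and SQLite stands in for the production database.

```python
import sqlite3

# Hypothetical sketch of a dataset-building API; the class, table,
# and column names are illustrative assumptions, not real tooling.
class RecordStore:
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS vitals "
            "(patient_id TEXT, heart_rate REAL, label INTEGER)"
        )

    def load(self, rows):
        # Bulk-insert atomized records: (patient_id, heart_rate, label).
        self.conn.executemany("INSERT INTO vitals VALUES (?, ?, ?)", rows)
        self.conn.commit()

    def ml_dataset(self):
        # Return (features, labels) ready for a model, skipping rows
        # that fail a trivial validity check (non-positive heart rate).
        cur = self.conn.execute(
            "SELECT heart_rate, label FROM vitals WHERE heart_rate > 0"
        )
        features, labels = [], []
        for heart_rate, label in cur:
            features.append([heart_rate])
            labels.append(label)
        return features, labels
```

A caller would load extracted records once, then pull model-ready arrays on demand, e.g. `X, y = RecordStore().ml_dataset()`.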
Minimum Qualifications
* Strong experience with Python 3.
* Understanding of big data pipelines and databases, including PostgreSQL, MongoDB, and MySQL.
* Experience with ETL procedures.
* Experience with the AWS cloud environment and/or similar interfaces.
* Knowledgeable in managing Ubuntu and Red Hat Linux images.
* Hands-on experience with Docker and Kubernetes.
Responsibilities
* Perform data pulls from text files and remote databases using in-house tools and extract relevant data into atomized records.
* Implement Docker modules in a scalable environment that manages the storage and retrieval of these data.
* Perform manual queries using SQL/Python to collect and analyze large amounts of data.
* Perform data validation and cleaning of different parameters within medical records to prepare data for machine learning tasks.
* Build a Python API for efficient querying of the database and for the easy creation of machine learning datasets.
* Perform automation of test cases using Python.
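To sketch what the validation and test-automation responsibilities above might look like in practice, here is a minimal example. The field names and plausible-range limits are illustrative assumptions, not the project's actual schema, and the tests are written in the plain-assert style runnable under pytest.

```python
# Illustrative validation of one atomized medical record; the field
# names and plausible ranges below are assumptions for this sketch.
PLAUSIBLE_RANGES = {
    "maternal_heart_rate": (30.0, 220.0),
    "systolic_bp": (50.0, 250.0),
}

def validate_record(record):
    """Return a list of validation errors (empty list means clean)."""
    errors = []
    for field, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif not (low <= value <= high):
            errors.append(f"{field}: {value} outside [{low}, {high}]")
    return errors

# Automated test cases of the kind the role calls for (run e.g. under pytest).
def test_clean_record_passes():
    assert validate_record(
        {"maternal_heart_rate": 82.0, "systolic_bp": 118.0}
    ) == []

def test_out_of_range_value_is_flagged():
    errors = validate_record(
        {"maternal_heart_rate": 300.0, "systolic_bp": 118.0}
    )
    assert any("maternal_heart_rate" in e for e in errors)
```

In the real role, checks like these would run inside the Docker modules as part of the ETL pipeline, with the validated output feeding the machine learning database.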
Important
- The candidate must be available to work full-time for at least 4 months or part-time for 8 months.
- The start date is flexible, ideally July 1 or sooner.
- Hiring is contingent upon credentialing from our institution, a process that may take 6 weeks or longer.
- There is an opportunity to be hired into a permanent position.