Mehroos Ali

About

  • Collaborative data engineering and software development professional with substantial knowledge and experience in the analysis, design, development, implementation, migration, convergence, management, and support of large-scale databases, data warehouses, and big data systems. Creates intuitive architectures and frameworks that help organizations effectively capture, store, process, visualize, and analyze huge volumes of structured, semi-structured, unstructured, and streaming heterogeneous data.
  • Proven talent for aligning business strategy and objectives with established analytical paradigms to achieve maximum operational impact with minimum resource expenditure. Results-focused leader with expertise spanning data engineering, software development, business analytics, cross-functional team leadership, and complex problem-solving.
  • I am currently pursuing my Master's in Computer Science at the University of Texas at Dallas, specializing in Intelligent Systems.
  • I interned at Amazon as a Data Engineer, where I gained hands-on experience designing and developing streaming data pipelines.
  • I previously worked as a Data Engineer for Onward Technologies, a global IT service provider in domains such as data analytics, data science, Artificial Intelligence (AI), and Machine Learning (ML). Before that, I worked with Cognizant on their flagship Core Banking and Insurance customer, Suncorp.
  • I am a Microsoft Certified Azure Data Engineer and Databricks Certified Data Engineer Associate.
  • I am interested in Big Data Engineering, Cloud Data Warehousing, DevOps and Full Stack Development.

Education

Master of Science, Computer Science

2021 - 2023

University of Texas at Dallas, Richardson, TX

Relevant Courses: Database Design, Machine Learning, Artificial Intelligence, Natural Language Processing, Big Data Management and Analytics, Design and Analysis of Algorithms, Information Retrieval, Human Computer Interaction

Bachelor of Technology, Mechanical Engineering

2014 - 2018

Motilal Nehru National Institute of Technology, Prayagraj, India

Skills

Python 70%
Java 80%
SQL 90%
Scala 50%
Hadoop (HDFS/MapR/Hive/Sqoop/Oozie/YARN/Presto) 90%
Spark (PySpark)/Spark Streaming 80%
Airflow 50%
Kafka/Kafka Connect 50%
Tableau 50%
Data Modeling (Relational, Dimensional, ER, EER) 80%
Databases (SQL/NoSQL)/Data Warehousing (Facts, Dimensions, Cubes) 80%
Data Pipeline Design (ETL, ELT, Batch, Streaming, CDC) 80%
Data/Delta Lakes 90%
Cloud Platform (Databricks/AWS/GCP/Azure) 70%
Maven/SBT/Gradle 80%
Git/GitHub/GitLab 100%
Web Development (React, React Native, Redux, NodeJS, JS) 70%
Agile Development 100%

Certifications

Microsoft Certified Azure Data Engineer
Databricks Certified Data Engineer Associate

Professional Experience

Data Engineer Intern

May 2023 - Present

Trinity Industries, Dallas, Texas, United States

  • Working on building a change data capture (CDC) solution to migrate data from MS SQL Server to AWS using Databricks, Spark Streaming, Debezium, and Kafka.
  • Designed and implemented streaming data pipelines using Delta Lake to perform upsert (merge) operations on incoming data streams, ensuring real-time data updates and maintaining data consistency.
  • Leveraged Azure Data Factory to seamlessly integrate data across various Azure services, such as Azure Databricks, Azure SQL Database, Azure Blob Storage, and Azure Synapse Analytics.
  • Connected Tableau to AWS Athena and created reports and dashboards for railroad mileage comparison.

Data Engineer Intern

May 2022 - Aug 2022

Amazon, Boulder, Colorado, United States

  • Created data producers and consumers for Kinesis Streams, ensuring uninterrupted end-to-end data flow.
  • Authored ETL scripts using PySpark and AWS Glue APIs, implementing complex transformations and business logic to cleanse and enrich data during the ETL process.
  • Created Jupyter notebooks on AWS EMR clusters to enable exploratory data analysis of large-scale datasets.

Data Engineer

Jan 2021 - Aug 2021

Onward Technologies Limited, Chennai, India

  • Migrated 250 Spark jobs from on-premises Hadoop to Google Cloud Platform, reducing processing time and increasing the computational limit by more than 60%.
  • Designed and implemented a scalable data pipeline to process structured and semi-structured data by integrating 550 million raw records from different data sources using Kafka and PySpark and storing processed data in MongoDB.
  • Authored Airflow DAGs for daily data ingestion and processing from Google Cloud Storage to BigQuery.
  • Wrote Hive queries to parse the raw data and store the refined data in partitioned and bucketed tables.

Data Engineer

Nov 2018 - Jan 2021

Cognizant Technology Solutions India Pvt Ltd, Chennai, India

  • Handled Sqoop parallelism and incremental data loads from Oracle to HDFS and Hive tables for daily data growth.
  • Designed NiFi workflows for data ingestion from various sources such as RDBMS, REST APIs, and Kafka topics.
  • Improved runtime of slow-running Spark jobs by 60% by optimizing Spark SQL joins.
  • Developed a notification-based system using SNS, SQS, Lambda, and DynamoDB and automated its deployment to AWS via GitLab.
  • Stored data from Spark as wide tables in Elasticsearch for real-time aggregation and visualization in Kibana.
  • Involved in integrating back-end systems in NodeJS with the dashboards created using React.

Portfolio Projects

Information Retrieval Search Engine

  • Developed a search engine for desserts (sweets) that crawls and indexes 100,000+ web pages from the internet and builds a web graph.
  • Used the index and the web graph to build two relevance models, PageRank and HITS, to rank the search results.
  • Improved the search results by clustering web pages using flat clustering and two agglomerative clustering methods.
  • Implemented query expansion through pseudo-relevance feedback using the Rocchio algorithm.
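The Rocchio update behind pseudo-relevance feedback can be sketched in Python over sparse term-weight dictionaries. This is a minimal sketch, not the project's code; the parameter values alpha, beta, gamma are the commonly cited defaults, and the vector representation is an assumption.

```python
from collections import defaultdict

def rocchio(query_vec, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query expansion: shift the query vector toward relevant
    documents and away from non-relevant ones. Pseudo-relevance feedback
    simply treats the top-k retrieved documents as relevant."""
    expanded = defaultdict(float)
    for term, w in query_vec.items():
        expanded[term] += alpha * w
    for doc in relevant_docs:
        for term, w in doc.items():
            expanded[term] += beta * w / len(relevant_docs)
    for doc in nonrelevant_docs:
        for term, w in doc.items():
            expanded[term] -= gamma * w / len(nonrelevant_docs)
    # keep only terms with positive weight in the expanded query
    return {t: w for t, w in expanded.items() if w > 0}
```

For example, a query {"chocolate": 1.0} with one relevant document {"chocolate": 0.5, "cake": 0.8} gains the term "cake", while terms seen only in non-relevant documents are dropped.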
Databricks Formula 1 Racing Analysis

  • Created Databricks notebooks to ingest, transform, analyze, and create reports on Formula 1 racing data.
  • Wrote Spark SQL queries to find the dominant drivers and teams for visualization.
  • Scheduled the pipeline using Azure Data Factory (ADF) for monitoring and alerts.
AWS Batch ETL Pipeline

  • Built a functional Python script to load songs and logs data from an S3 bucket.
  • Transformed the data to create fact and dimension tables stored in Redshift.
  • Orchestrated the data pipeline using Airflow DAGs and enforced data quality checks.
Twitter Streaming Analysis

  • Designed and implemented a real-time streaming and classification system for sentiment analysis on Twitter data.
  • Pulled live tweets using NiFi (Twitter API) into a Kafka topic for cleaning, parsing, and filtering with Spark.
  • Applied Stanford CoreNLP to score the sentiment of each tweet and visualized the results using Elasticsearch and Kibana.
Realtime Customer Viewership Analysis

  • Created a data pipeline to unify and consolidate real-time customer web events, weblogs, and profile data into a Hive warehouse for ad-hoc analysis.
ABC Stores Pipeline

  • Created a data pipeline for a retail store called ABC-Stores, using Hadoop for storage and Spark for data processing to produce reports for analytics in Power BI.
  • Scheduled the pipeline for daily batch data using Airflow.
BigQuery Spark-SQL Batch ETL

  • Batch ETL pipeline project on GCP to load and transform daily flight data using Spark to update tables in BigQuery.
  • Scheduled the pipeline for daily batch data using Airflow.
Solving Word Puzzle using Hash Table

  • Read a list of words from a file and stored them in a hash table.
  • Generated a random 2-D array based on user input for rows and columns.
  • Iterated over the 2-D puzzle in 8 directions and checked the hash table for the existence of each word formed.
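The 8-direction scan can be sketched in Python. This is a minimal sketch of the same idea, not the project's code: the grid and word list are passed in rather than generated and read from a file, and Python's built-in set stands in for the hash table.

```python
# All 8 directions: vertical, horizontal, and both diagonals.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]

def find_words(grid, dictionary):
    """From every cell, extend a string in each of the 8 directions and
    test each prefix for membership in the hashed word list."""
    words = set(dictionary)
    max_len = max(map(len, words))  # no word can be longer than this
    rows, cols = len(grid), len(grid[0])
    found = set()
    for r in range(rows):
        for c in range(cols):
            for dr, dc in DIRECTIONS:
                s, i, j = "", r, c
                # stop at the grid edge or the longest dictionary word
                while 0 <= i < rows and 0 <= j < cols and len(s) < max_len:
                    s += grid[i][j]
                    if s in words:
                        found.add(s)
                    i, j = i + dr, j + dc
    return found
```

For example, in the grid [["c","a","t"], ["o","x","y"], ["d","g","z"]] the scan finds "cat" (rightward), "cod" (downward), and "at", but not "dog".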
Tash Shell

  • Created a simple Unix shell using C programming.
  • Implemented built-in commands, parallel commands, and output redirection in the shell.
Maze Solver

  • Created a 2-D maze and solved it using disjoint-set (union-find) operations in Java.
Kruskal's Algorithm

  • Built a graph of Texas cities from a file.
  • Found the minimum spanning tree of the graph by applying Kruskal's algorithm.
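The core of Kruskal's algorithm can be sketched in Python. This is an illustrative sketch, not the project's Java code: cities are abbreviated to integer indices (the original reads them from a file), and the union-find uses path halving.

```python
def kruskal(n, edges):
    """Kruskal's MST over n nodes: sort edges by weight, then add an
    edge whenever its endpoints lie in different union-find components."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst, total = [], 0
    for w, u, v in sorted(edges):  # edges as (weight, u, v)
        ru, rv = find(u), find(v)
        if ru != rv:               # different components: no cycle
            parent[ru] = rv        # union the two components
            mst.append((u, v, w))
            total += w
    return mst, total
```

Sorting dominates the running time, giving O(E log E) overall; the union-find makes each cycle check near-constant.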
File Statistics using MIPS

  • Read a text file using MIPS and produced statistics such as counts of upper-case letters, lower-case letters, digits, other symbols, lines of text, and signed numbers.
Batch Number Conversion

  • Assembly program that reads a number in binary, decimal, or hexadecimal format and converts it into any required format based on user input.
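The parse-then-re-emit conversion logic can be sketched in Python as a high-level counterpart to the assembly program. This sketch handles non-negative values only; it is not the original program.

```python
def convert(value, src_base, dst_base):
    """Parse `value` in the source base, then emit its digits in the
    destination base (2, 10, or 16), most significant first."""
    n = int(value, src_base)  # e.g. int("FF", 16) -> 255
    if n == 0:
        return "0"
    digits = "0123456789ABCDEF"
    out = []
    while n:
        out.append(digits[n % dst_base])  # peel off least significant digit
        n //= dst_base
    return "".join(reversed(out))
```

For example, `convert("1010", 2, 16)` yields "A" and `convert("FF", 16, 10)` yields "255".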
XV6 Lottery Scheduler

  • Replaced the default round-robin scheduler in xv6 with a lottery scheduler.
  • Assigned each running process a slice of the processor in proportion to the number of tickets it holds: the more tickets a process has, the more it runs. At each time slice, a randomized lottery selects the winning process, which then runs for that slice.
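The winner-selection step can be sketched in user-space Python. The actual project modifies the xv6 scheduler in C; this sketch only illustrates the lottery draw, and the process representation is an assumption.

```python
import random

def lottery_pick(processes, rng=random):
    """Pick the next process to run: draw a winning ticket uniformly
    from the total pool, then walk the process list accumulating ticket
    counts until the running total passes the winning ticket."""
    total = sum(tickets for _, tickets in processes)
    winner = rng.randrange(total)  # winning ticket in [0, total)
    count = 0
    for name, tickets in processes:
        count += tickets
        if winner < count:  # winning ticket falls in this process's range
            return name
```

Over many time slices, a process holding 75 of 100 tickets should win roughly three times as often as one holding 25, which is the proportional-share property the scheduler relies on.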
Seeking Tutor Problem

  • Coded a solution to the classic synchronization problem of seeking a tutor.
  • Implemented the solution using concurrent programming in C.
File System Checker

  • C program to read a file system image and check its consistency against a set of 12 rules.
  • When the image is inconsistent, the checker outputs an appropriate error message.
DB Design - ebay.com

  • Data modeling project which involves design of database for ebay.com.
  • Created an Entity-Relationship model for the Ebay database by identifying and analyzing data requirements.
  • Converted the Entity-Relationship model to a relational model by applying mapping and normalization techniques.
  • Wrote DDL and DML statements for the relational model and defined relevant stored procedures and triggers.
Muy Feliz Android Application

  • Created an Android application to help new parents manage their children and personal hobbies.
  • Utilized Redux, a library for managing application state, to improve the scalability and maintainability of the app.
  • Utilized third-party libraries and APIs to add features such as screen routing, calendar, image upload and date-picker.

Recommendations

These are a few recommendations from people I have worked with, such as colleagues, mentors, and friends.

Contact

I am actively looking for full-time/contract opportunities in the fields of data engineering and software development, starting December 2023.

Location:

2248 Dahlia Way, TX, 75080

Call:

+1 214-940-7050