PySpark, Python, SQL, Databricks (including Autoloader), AWS (Lambda, EC2, S3), Docker, OpenTelemetry, Git, VS Code, Shell
Lambda, EC2, S3, IAM, CloudWatch, CloudFormation
A Silicon Valley cloud startup was on a mission to develop a cloud-cost computing feature that would enable companies to consolidate and optimize their cloud and machine-usage expenditures. The ambitious project involved challenges such as:
• Identifying the right mix of data sources to provide a comprehensive view of cloud expenditures.
• Collecting vast amounts of data in real time while ensuring accuracy and consistency.
• Efficiently managing and processing large datasets to deliver live streaming data services.
• Developing data dictionaries and building a robust infrastructure to support data collection, processing, and visualization from the ground up.
• No Pre-existing Systems: Starting from zero, we embraced new technologies such as AWS Lambda and Databricks Autoloader, learning and implementing as we built the system.
• Varied Data Environments: Each data source, from VMware to AWS calls, demanded a customized approach, adapting to different environments and optimizing for both batch and live data streaming.
Leading the process from start to finish, my role involved orchestrating the end-to-end creation of a data science and engineering infrastructure: planning and execution, from the initial setup of data sources and dictionaries to the final stages of live data streaming and visualization, all designed to empower real-time analysis and decision-making.
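Since Autoloader was one of the technologies adopted from scratch, its ingestion pattern is worth sketching. The snippet below is a minimal, hypothetical example — the bucket names, paths, and table names are placeholders, not the project's actual configuration. Auto Loader (the `cloudFiles` source) incrementally discovers new files landing in S3 and streams them into a table.

```python
# Minimal Databricks Autoloader sketch (all paths and names are hypothetical).
# Auto Loader watches a cloud storage path and incrementally ingests new files.

autoloader_options = {
    "cloudFiles.format": "json",  # raw files land as JSON
    "cloudFiles.schemaLocation": "s3://example-bucket/_schemas/cost_events",
    "cloudFiles.inferColumnTypes": "true",  # infer column types from the data
}

def start_cost_event_stream(spark):
    """Build and start a streaming ingest of raw cost events into a table."""
    stream = (
        spark.readStream.format("cloudFiles")  # "cloudFiles" = Auto Loader source
        .options(**autoloader_options)
        .load("s3://example-bucket/raw/cost_events/")
    )
    return (
        stream.writeStream
        .option("checkpointLocation", "s3://example-bucket/_checkpoints/cost_events")
        .trigger(availableNow=True)  # process all available files, then stop
        .toTable("bronze.cost_events")
    )
```

The `availableNow` trigger lets the same stream serve both batch-style catch-up runs and continuous ingestion, which matches the batch-plus-streaming flexibility described above.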
1. Robust and Scalable Data Architecture
I led the initiative to set up scalable data pipelines and storage solutions, ensuring flexibility for both batch and real-time data processing.
2. Modern Data Processing and Learning Curve
We transitioned to PySpark and Databricks, learning and applying modern cloud-based data processing methods to maximize performance and efficiency and to build robust end-to-end data pipelines.
3. Integration of Advanced Data Analytics
Implemented a seamless integration of Databricks tools with AWS technologies to analyze and visualize data effectively. This strategic combination enabled the team to streamline data flows, enhance analytics, and provide actionable insights that drove decision-making and business strategy for the end users.
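To illustrate how the AWS side can feed Databricks in an architecture like this, a Lambda function may land incoming records in S3, where the Autoloader stream picks them up. The handler below is a simplified, hypothetical sketch — the bucket, key layout, and event shape are illustrative, not the production code — with the S3 client made injectable so the logic stays testable.

```python
import json
import time

RAW_BUCKET = "example-bucket"   # hypothetical landing bucket
RAW_PREFIX = "raw/cost_events"  # prefix watched by the ingestion stream

def lambda_handler(event, context=None, s3_client=None):
    """Write each incoming cost record to S3 as a JSON file for downstream ingest."""
    if s3_client is None:  # inside Lambda, fall back to a real boto3 client
        import boto3
        s3_client = boto3.client("s3")
    written = []
    for record in event.get("records", []):
        # Partition keys by source so each feed gets its own S3 prefix.
        key = (
            f"{RAW_PREFIX}/{record['source']}/"
            f"{int(time.time() * 1000)}-{record['id']}.json"
        )
        s3_client.put_object(Bucket=RAW_BUCKET, Key=key, Body=json.dumps(record))
        written.append(key)
    return {"written": written}
```

Keeping the client as a parameter means the routing logic can be unit-tested with a fake S3 client, without deploying to Lambda.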
• Streamlined Data Pipelines: Pioneered the development of end-to-end data pipelines, reducing data retrieval times and enhancing business operations significantly.
• Complex Data Management: Innovatively processed highly nested JSON files, establishing
a system for efficient data structuring and storage.
• Cloud Cost Optimization: Implemented intelligent data processing and storage solutions,
resulting in substantial cloud cost savings.
• Real-time Analytics and Alerts: Enabled live data visualization and monitoring, empowering
proactive cloud resource management with real-time alerts.
• Adoption of Cutting-Edge Technologies: Rapid assimilation and application of new technologies like AWS Lambda and Databricks Autoloader enhanced the startup’s innovative edge.
• Future-Ready Infrastructure: Constructed a versatile infrastructure that is prepared to
adapt to emerging technologies and scale with evolving data processing requirements.
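The nested-JSON handling mentioned above can be sketched briefly. In a Databricks setting this kind of flattening would typically be done in PySpark, but the core idea — recursively promoting nested keys to dotted column names — fits in plain Python; the sample record below is invented for illustration, not actual project data.

```python
def flatten(record, parent_key="", sep="."):
    """Recursively flatten a nested dict into dotted column names,
    e.g. {"cost": {"usd": 3}} -> {"cost.usd": 3}."""
    flat = {}
    for key, value in record.items():
        col = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, col, sep))  # descend into nested objects
        else:
            flat[col] = value
    return flat

# Example: a made-up nested cloud-cost event.
event = {
    "account": {"id": "123", "provider": "aws"},
    "usage": {"service": "ec2", "cost": {"usd": 3.72}},
    "timestamp": "2023-05-01T00:00:00Z",
}
flat_event = flatten(event)
# flat_event["usage.cost.usd"] -> 3.72
```

Once records are flat, each dotted key maps cleanly to a table column, which is what makes efficient structuring and storage of deeply nested files tractable.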
