I want to highlight a new presentation about Data Preparation in Data Science projects:
“Comparison of Programming Languages, Frameworks and Tools for Data Preprocessing and (Inline) Data Wrangling in Machine Learning / Deep Learning Projects”
A key task in creating appropriate analytic models in machine learning or deep learning is the integration and preparation of data sets from various sources such as files, databases, big data stores, sensors, or social networks. This step can take up to 80% of the whole project.
This session compares alternative techniques for preparing data, including extract-transform-load (ETL) batch processing (like Talend, Pentaho), streaming analytics ingestion (like Apache Storm, Flink, Apex, TIBCO StreamBase, IBM Streams, Software AG Apama), and data wrangling (DataWrangler, Trifacta) within visual analytics. Live demos show the various options and their trade-offs using different advanced analytics technologies and open source frameworks such as R, Python, Apache Hadoop, Spark, KNIME, or RapidMiner. The session also discusses how data preparation relates to visual analytics tools (like TIBCO Spotfire) and shows best practices for how the data scientist and the business analyst should work together to build good analytic models.
Key takeaways of this session:
– Learn various options for preparing data sets to build analytic models
– Understand the pros and cons and the targeted persona for each option
– See different technologies and open source frameworks for data preparation
– Understand the relation to visual analytics and streaming analytics, and how these concepts are actually leveraged to build the analytic model after data preparation
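To make the difference between batch preparation and streaming ingestion a bit more concrete, here is a minimal Python sketch. It is not taken from the session or from any of the tools named above. The sample data, the column names, and the functions `clean_record`, `prepare_batch`, and `prepare_stream` are all hypothetical and only illustrate the idea: the same cleaning rule can run once over a complete data set, or record by record as events arrive.

```python
import csv
import io
from typing import Iterable, Iterator, Optional

# Hypothetical raw sensor export; column names and values are made up.
RAW_CSV = """sensor_id,temperature
a1, 21.5
a2,
a3,n/a
a4,19.0
"""


def clean_record(row: dict) -> Optional[dict]:
    """Normalize one record; return None if the reading is unusable."""
    value = (row.get("temperature") or "").strip()
    try:
        temperature = float(value)
    except ValueError:
        return None  # drop missing or non-numeric readings
    return {"sensor_id": row["sensor_id"].strip(), "temperature": temperature}


def prepare_batch(raw: str) -> list:
    """Batch (ETL-style) preparation: load the whole data set, then clean it."""
    rows = list(csv.DictReader(io.StringIO(raw)))
    return [r for r in map(clean_record, rows) if r is not None]


def prepare_stream(events: Iterable[dict]) -> Iterator[dict]:
    """Streaming-style preparation: clean each event as it arrives."""
    for event in events:
        cleaned = clean_record(event)
        if cleaned is not None:
            yield cleaned


batch_result = prepare_batch(RAW_CSV)
# The CSV reader stands in for an event source that yields one record at a time.
stream_result = list(prepare_stream(csv.DictReader(io.StringIO(RAW_CSV))))
```

Both paths share `clean_record`, so they produce the same cleaned records. What changes is when the work happens: up front for the whole data set, or continuously per event. That timing is one of the trade-offs between the options compared in the session.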
Slide Deck
The following shows the slide deck:
Video Recording: Data Preparation vs. (Inline) Data Wrangling
Here is the video recording: