Types of File Formats in Hadoop, with Examples
Hadoop supports various file formats to store data efficiently, ensuring optimized storage and processing. Here's an expanded explanation of each type with examples:
1. TextFile:
- Description: This is a basic, plain-text storage format where each line in the file represents a record.
- Example:
Anamika Singh,Bangalore,35
Shekhar,Delhi,30
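A minimal Python sketch of reading records stored this way. It assumes the fields are comma-separated and appear in the order name, city, age; those column names are an assumption made for illustration, since the example above does not label them.

```python
import csv
import io

# Two records in TextFile style: one line per record, comma-separated fields.
# The column names used below are assumed for illustration.
raw = "Anamika Singh,Bangalore,35\nShekhar,Delhi,30\n"

def parse_records(text):
    """Parse each line into a dict, converting age to an integer."""
    reader = csv.reader(io.StringIO(text))
    return [
        {"name": name, "city": city, "age": int(age)}
        for name, city, age in reader
    ]

records = parse_records(raw)
for record in records:
    print(record)
```

Because every value is stored as text, the reader has to convert types such as the age itself.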
2. SequenceFile:
- Description: SequenceFile is a binary file format that stores key-value pairs. It's designed to be splittable and supports various compression codecs.
- Example (logical view of the key-value pairs; the file itself is binary):
key1 value1
key2 value2
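The idea of storing key-value pairs as binary records can be sketched with Python's struct module. This is a toy length-prefixed layout invented for illustration. It is not the actual SequenceFile on-disk format, which also includes a header, sync markers, and optional compression.

```python
import io
import struct

def write_pairs(stream, pairs):
    """Write each pair as: key length, value length, key bytes, value bytes."""
    for key, value in pairs:
        k, v = key.encode("utf-8"), value.encode("utf-8")
        stream.write(struct.pack(">II", len(k), len(v)))  # two 4-byte big-endian lengths
        stream.write(k)
        stream.write(v)

def read_pairs(stream):
    """Read pairs back until the stream is exhausted."""
    pairs = []
    while True:
        header = stream.read(8)
        if len(header) < 8:
            break
        key_len, value_len = struct.unpack(">II", header)
        key = stream.read(key_len).decode("utf-8")
        value = stream.read(value_len).decode("utf-8")
        pairs.append((key, value))
    return pairs

buffer = io.BytesIO()
write_pairs(buffer, [("key1", "value1"), ("key2", "value2")])
buffer.seek(0)
decoded = read_pairs(buffer)
print(decoded)
```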
3. Avro:
- Description: Avro is a row-oriented data serialization format from the Apache Avro project. An Avro data file stores records in a compact binary encoding together with the schema, written in JSON, that defines their fields and data types. It supports schema evolution: a reader can use a newer schema, for example one that adds a field with a default value, and still read data written with an older one.
- Example (records shown in their JSON representation; the data file itself is binary):
{"name": "Anamika Singh", "age": 35, "city": "Bangalore"}
{"name": "Shiva", "age": 30, "city": "Varanasi"}
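Schema evolution can be illustrated without the Avro library. The sketch below assumes a reader schema that adds a hypothetical `country` field with a default value. The `resolve` helper is a simplified stand-in for Avro's schema resolution: it only handles added fields with defaults and dropped fields.

```python
import json

# Reader schema written in Avro's JSON schema syntax. The "country" field
# and its default are hypothetical additions used to illustrate evolution.
reader_schema = json.loads("""
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "city", "type": "string"},
    {"name": "country", "type": "string", "default": "India"}
  ]
}
""")

def resolve(record, schema):
    """Keep the fields the reader schema knows; fill missing ones from defaults."""
    resolved = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"missing field without default: {name}")
    return resolved

# A record written before the "country" field existed.
old_record = {"name": "Anamika Singh", "age": 35, "city": "Bangalore"}
new_view = resolve(old_record, reader_schema)
print(new_view)
```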
4. Parquet:
- Description: Parquet is a columnar storage file format, optimized for performance and space efficiency. It supports complex nested data structures and provides efficient compression.
- Example: The data in Parquet is binary, so it won't be readable like text formats.
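Although a Parquet file is not human-readable, the columnar idea behind it is easy to sketch. The example below uses plain Python lists and made-up sample rows, not the Parquet library or its actual encoding, to show how a column-oriented layout lets a query touch only the column it needs.

```python
# Row-oriented view: one dict per record (sample data for illustration).
rows = [
    {"name": "Anamika Singh", "age": 35, "city": "Bangalore"},
    {"name": "Shiva", "age": 30, "city": "Varanasi"},
]

def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns

columns = to_columns(rows)

# An aggregate over "age" reads a single list and never touches name or city.
average_age = sum(columns["age"]) / len(columns["age"])
print(columns)
print(average_age)
```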
5. ORC (Optimized Row Columnar):
- Description: ORC is another columnar storage file format similar to Parquet. It's designed for high performance and offers efficient compression, making it suitable for large-scale data processing.
- Example: The data in ORC is stored in a binary format.
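One reason columnar formats such as ORC compress well is that values within a single column tend to repeat. The following sketch applies run-length encoding to a made-up integer column. It is a simplified illustration of the principle, not ORC's actual encoder.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]

def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]

# A made-up "age" column with repeated consecutive values.
ages = [30, 30, 30, 35, 35, 30]
encoded = run_length_encode(ages)
print(encoded)
print(run_length_decode(encoded) == ages)
```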
6. JSON:
- Description: JSON (JavaScript Object Notation) is a human-readable text-based format for data interchange. It's widely used for data transmission between systems.
- Example:
{"name": "Anamika Singh", "age": 35, "city": "New Delhi"}
{"name": "Ram", "age": 30, "city": "Bangalore"}
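A short sketch of reading such records with Python's json module, assuming one JSON object per line (a layout often called JSON Lines), so that each line can be parsed independently:

```python
import json

# One JSON object per line; each line is an independent record.
raw = (
    '{"name": "Anamika Singh", "age": 35, "city": "New Delhi"}\n'
    '{"name": "Ram", "age": 30, "city": "Bangalore"}\n'
)

# Parse every non-empty line on its own.
people = [json.loads(line) for line in raw.splitlines() if line.strip()]

# Field names travel with every record, so no external schema is needed.
names_over_30 = [person["name"] for person in people if person["age"] > 30]
print(people)
print(names_over_30)
```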
7. XML:
- Description: XML (eXtensible Markup Language) is a markup language that encodes data in a text format. It's used for storing and transporting structured data.
- Example:
<person>
  <name>Anamika Singh</name>
  <age>35</age>
  <city>New Delhi</city>
</person>
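A record like this can be parsed with Python's standard xml.etree.ElementTree module. This is a minimal sketch, assuming a single `<person>` element whose children are all simple text fields:

```python
import xml.etree.ElementTree as ET

xml_text = """
<person>
  <name>Anamika Singh</name>
  <age>35</age>
  <city>New Delhi</city>
</person>
""".strip()

root = ET.fromstring(xml_text)

# Map each child tag to its text content.
person = {child.tag: child.text for child in root}

# XML carries no type information here, so convert age explicitly.
person["age"] = int(person["age"])
print(person)
```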
8. Hive RCFile (Record Columnar File):
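- Description: RCFile is a columnar storage file format developed for Apache Hive. It first partitions rows horizontally into row groups and then stores the data within each row group column by column, so a query can skip the columns it does not need.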
- Example: The data is stored in a binary format optimized for Hive queries.
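The layout idea behind RCFile, splitting rows into row groups and then storing each group column by column, can be sketched as follows. The group size of 2 and the sample rows are arbitrary assumptions for illustration; real RCFile row groups are much larger and stored in binary form.

```python
def to_row_groups(rows, group_size):
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # zip(*chunk) transposes the rows of this group into columns.
        groups.append([list(column) for column in zip(*chunk)])
    return groups

# Sample rows: (name, age, city).
rows = [
    ("Anamika Singh", 35, "New Delhi"),
    ("Ram", 30, "Bangalore"),
    ("Shekhar", 30, "Delhi"),
]

row_groups = to_row_groups(rows, group_size=2)
for group in row_groups:
    print(group)
```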
Each of these file formats has its advantages and is suitable for specific use cases. When choosing a file format in Hadoop, consider factors such as data size, query performance, schema flexibility, and compression requirements.
I hope you enjoyed reading about these file formats. If you found this post useful, please like, comment, and share.
Thank You!