
[Error Fix] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbd in position 0: invalid start byte

토마토. 2023. 3. 11. 13:52
import pandas as pd
data = pd.read_csv("고등학교.csv")

Loading a CSV file that contains Korean text with the code above raised the following error.

 

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[1], line 2
      1 import pandas as pd
----> 2 data = pd.read_csv("고등학교.csv")

File c:\Users\.venv\lib\site-packages\pandas\util\_decorators.py:311, in deprecate_nonkeyword_arguments..decorate..wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File c:\Users\.venv\lib\site-packages\pandas\io\parsers\readers.py:586, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    571 kwds_defaults = _refine_defaults_read(
    572     dialect,
    573     delimiter,
   (...)
    582     defaults={"delimiter": ","},
    583 )
    584 kwds.update(kwds_defaults)
--> 586 return _read(filepath_or_buffer, kwds)
...
File c:\Users\.venv\lib\site-packages\pandas\_libs\parsers.pyx:843, in pandas._libs.parsers.TextReader._tokenize_rows()

File c:\Users\.venv\lib\site-packages\pandas\_libs\parsers.pyx:1917, in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbd in position 0: invalid start byte

 

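The error occurs because pandas decodes the file as UTF-8 by default, but the file was saved in a different encoding. UTF-8 itself can represent Korean; the bytes in this particular file are simply not valid UTF-8.

The file here was saved as CP949, an encoding commonly used for CSV files on Korean Windows, so passing encoding='cp949' resolves the error.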
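To see the mismatch concretely, here is a minimal sketch using only the standard library. The sample string and the helper name `can_decode` are made up for illustration; they are not part of the original code.

```python
text = "고등학교"  # sample Korean text, chosen only for illustration

cp949_bytes = text.encode("cp949")  # bytes as a CP949-saved file would store them
utf8_bytes = text.encode("utf-8")   # bytes as a UTF-8-saved file would store them


def can_decode(raw: bytes, encoding: str) -> bool:
    """Return True if `raw` decodes cleanly with `encoding` (hypothetical helper)."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(can_decode(utf8_bytes, "utf-8"))   # True: UTF-8 handles Korean fine
print(can_decode(cp949_bytes, "utf-8"))  # False: CP949 bytes are not valid UTF-8
print(can_decode(cp949_bytes, "cp949"))  # True: the matching codec works
```

The same text produces different bytes under each encoding, which is why the reader has to be told which one the file was written with.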

 

The corrected code is as follows.

import pandas as pd
data = pd.read_csv("고등학교.csv", encoding='cp949')
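
If the encoding of a file is not known in advance, one possible approach is to try a short list of candidates and keep the first one that decodes without error. The sketch below is only an assumption about how that could look: `guess_encoding` is a hypothetical helper, the candidate list is a guess suited to Korean data, and a successful decode does not prove that the encoding is actually the right one.

```python
from pathlib import Path


def guess_encoding(path, candidates=("utf-8", "cp949")):
    """Return the first candidate that decodes the whole file, or None.

    Hypothetical helper: strict UTF-8 is tried first because it rejects
    most non-UTF-8 input, while CP949 accepts a wider range of bytes.
    """
    raw = Path(path).read_bytes()  # reads the entire file; fine for small CSVs
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
        return encoding
    return None
```

The result could then be passed to pandas, for example `pd.read_csv("고등학교.csv", encoding=guess_encoding("고등학교.csv"))`, after checking that it is not None.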