
Parsing Hangul at the Sub-character Level (Unicode Parsing in Python)

by IMCOMKING 2020. 4. 16.
# -*- coding: utf-8 -*-

cho = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ" # len = 19
jung = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ" # len = 21
jong = "ㄱ/ㄲ/ㄱㅅ/ㄴ/ㄴㅈ/ㄴㅎ/ㄷ/ㄹ/ㄹㄱ/ㄹㅁ/ㄹㅂ/ㄹㅅ/ㄹㅌ/ㄹㅍ/ㄹㅎ/ㅁ/ㅂ/ㅂㅅ/ㅅ/ㅆ/ㅇ/ㅈ/ㅊ/ㅋ/ㅌ/ㅍ/ㅎ".split('/') # len = 27
test = cho + jung + ''.join(jong)

hangul_length = len(cho) + len(jung) + len(jong) # 67


def is_valid_decomposition_atom(x):
    return x in test


def decompose(x):
    # x is a code point (int). Returns the jamo string for a precomposed
    # Hangul syllable, or the character itself for anything else.
    in_char = x
    if x < ord('가') or x > ord('힣'):
        return chr(x)
    x = x - ord('가')
    y = x // 28
    z = x % 28
    x = y // 21
    y = y % 21
    # z == 0 means there is no jong; otherwise the jong is jong[z - 1].
    zz = jong[z - 1] if z > 0 else ''
    if x >= len(cho):
        print('Unknown Exception: ', in_char, chr(in_char), x, y, z, zz)
    return cho[x] + jung[y] + zz


def decompose_as_one_hot(in_char, warning=True):
    one_hot = []
    # print(ord('가'), chr(0xac00))
    # [0,66]: hangul / [67,194]: ASCII / [195,245]: standalone hangul jamo (consonants, vowels)
    # [246,249]: special characters / 250: unknown character
    # Total 251 dimensions.
    if ord('가') <= in_char <= ord('힣'):  # 가: 44032, 힣: 55203
        x = in_char - 44032  # in_char - ord('가')
        y = x // 28
        z = x % 28
        x = y // 21
        y = y % 21
        # z == 0 means there is no jong; otherwise the jong is jong[z - 1].
        zz = jong[z - 1] if z > 0 else ''
        if x >= len(cho):
            if warning:
                print('Unknown Exception: ', in_char, chr(in_char), x, y, z, zz)

        one_hot.append(x)
        one_hot.append(len(cho) + y)
        if z > 0:
            one_hot.append(len(cho) + len(jung) + (z - 1))
        return one_hot
    else:
        if in_char < 128:
            result = hangul_length + in_char  # 67~
        elif ord('ㄱ') <= in_char <= ord('ㅣ'):
            result = hangul_length + 128 + (in_char - 12593)  # 195~ # [ㄱ:12593]~[ㅣ:12643] (len = 51)
        # The three symbols other than ♥ were lost in the source text.
        # ★ and ☆ are taken from the test string at the bottom; ♡ is an assumption.
        elif in_char == ord('♡'):
            result = hangul_length + 128 + 51  # 246 # ♡ (assumed)
        elif in_char == ord('♥'):
            result = hangul_length + 128 + 51 + 1  # 247 # ♥
        elif in_char == ord('★'):
            result = hangul_length + 128 + 51 + 2  # 248 # ★
        elif in_char == ord('☆'):
            result = hangul_length + 128 + 51 + 3  # 249 # ☆
        else:
            if warning:
                print('Unhandled character:', chr(in_char), in_char)
            # unknown character
            result = hangul_length + 128 + 51 + 4  # 250, for unknown character

        return [result]


def decompose_str(text):
    # The parameter is named text so that it does not shadow the built-in str.
    return ''.join([decompose(ord(x)) for x in text])


def decompose_str_as_one_hot(text, warning=True):
    # print(text)
    tmp_list = []
    for x in text:
        da = decompose_as_one_hot(ord(x), warning=warning)
        tmp_list.extend(da)
    return tmp_list


if __name__ == '__main__':
    print(decompose_str_as_one_hot('개인적으로 2가 제대로라고 생각하지만'))
    print(decompose_str_as_one_hot('SF계의 최고의 수작. 다시봐도 매우 '))
    print(decompose_str_as_one_hot('개봉당시 최고의 재미를 선사했던'))
    # print(decompose_str('각 맑은 하늘 고운 마음 밟'))
    # print(decompose_str_as_one_hot(''))
    # print(decompose_as_one_hot(0))
    # print(decompose_as_one_hot(127))
    # print(decompose_str_as_one_hot('ㄱㄺㅎㅏㅣ'))
    # print(decompose_str_as_one_hot('★☆'))
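
Note that, despite the name, decompose_str_as_one_hot returns a list of integer indices rather than actual vectors. If real one-hot rows are needed, something like the following sketch could expand them. The helper name `to_one_hot_rows` and the constant `VECTOR_SIZE` are hypothetical and not part of the listing; the size of 251 assumes the index layout used there, where the fallback index for unknown characters is 67 + 128 + 51 + 4 = 250.

```python
VECTOR_SIZE = 251  # assumed: indices 0..250, with 250 used for unknown characters


def to_one_hot_rows(indices, size=VECTOR_SIZE):
    """Expand a list of integer indices into a list of one-hot rows."""
    rows = []
    for i in indices:
        if not 0 <= i < size:
            raise ValueError(f"index {i} is outside [0, {size})")
        row = [0] * size
        row[i] = 1  # exactly one slot is set per index
        rows.append(row)
    return rows


# '개' has initial ㄱ (index 0), medial ㅐ (19 + 1 = 20), and no final,
# so its index list under the layout above is [0, 20].
rows = to_one_hot_rows([0, 20])
print(len(rows), len(rows[0]))
```

Plain lists are used here only to keep the sketch dependency-free; in practice an array library or an embedding lookup on the raw indices would likely be preferable.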

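With the code above, a precomposed Hangul syllable can be split into its initial (choseong), medial (jungseong), and final (jongseong) parts and expressed using 67 jamo slots (19 initials + 21 medials + 27 finals).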
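
To make the arithmetic concrete, here is a minimal standalone sketch (not the author's code) that splits a single syllable into its three indices and cross-checks the result against the standard library's NFD normalization. The helper names `split_indices`, `join_indices`, and `nfd_from_indices` are hypothetical and exist only for this illustration. The constants follow the Unicode layout of precomposed syllables: 19 initials × 21 medials × 28 final slots (27 finals plus "no final"), starting at U+AC00.

```python
import unicodedata

HANGUL_BASE = 0xAC00  # code point of '가', the first precomposed syllable
N_CHO, N_JUNG, N_JONG = 19, 21, 28  # 28 = 27 finals + the "no final" slot


def split_indices(ch):
    """Return (initial, medial, final); final == 0 means no final consonant."""
    offset = ord(ch) - HANGUL_BASE
    if not 0 <= offset < N_CHO * N_JUNG * N_JONG:
        raise ValueError(f"{ch!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(offset, N_JUNG * N_JONG)
    medial, final = divmod(rest, N_JONG)
    return initial, medial, final


def join_indices(initial, medial, final=0):
    """Inverse of split_indices: rebuild the precomposed syllable."""
    return chr(HANGUL_BASE + (initial * N_JUNG + medial) * N_JONG + final)


def nfd_from_indices(initial, medial, final):
    """Build the conjoining-jamo (NFD) form from the three indices."""
    jamo = chr(0x1100 + initial) + chr(0x1161 + medial)
    if final:
        jamo += chr(0x11A7 + final)
    return jamo


i, m, f = split_indices('한')
print(i, m, f)  # 18 0 4, i.e. ㅎ + ㅏ + ㄴ
assert join_indices(i, m, f) == '한'
assert unicodedata.normalize('NFD', '한') == nfd_from_indices(i, m, f)
```

One difference worth noting: NFD yields conjoining jamo (U+1100 block), whereas the listing above emits compatibility jamo such as ㄱ (U+3131 block) and spells compound finals as two letters (for example ㄹㅂ), so the two outputs are not interchangeable as strings even though the indices agree.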


https://gist.github.com/imcomking/085ce7e2088501da8df3b16c4778cb39
